Understanding Divergent String Sort Orders in PostgreSQL
PostgreSQL comparison and sorting behavior for text data depends on the collation setting. Running an ORDER BY on the same column can yield different sequences when the underlying collation changes. The example below demonstrates the phenomenon using a simple table containing letters and a leading-space entry.
CREATE TABLE t1 (s text);
INSERT INTO t1 VALUES (' b'), ('b'), ('a'), ('c');
SELECT * FROM t1 ORDER BY s COLLATE "en_US.utf8";
-- a
-- b
-- b
-- c
SELECT * FROM t1 ORDER BY s COLLATE "C";
-- b
-- a
-- b
-- c
With en_US.utf8, the space-padded b appears after a but before b. The C collation places it first. These differences originate from distintc sorting algorithms.
PostgreSQL’s documentaiton explains that collation controls string ordering. The C collation sorts by raw byte values from the character encoding. For UTF-8, a space is 0x20, lowercase a–c are 0x61–0x63. The leading space has the lowest byte value, so b comes first.
' b' -> 0x20 0x62
a -> 0x61
b -> 0x62
c -> 0x63
The en_US.utf8 collation follows Unicode Collation Algorithm rules. Unicode assigns multilevel weights to each code point, as illustrated in this excerpt from allkeys.txt:
0020 ; [*0209.0020.0002] # SPACE
0061 ; [.20A9.0020.0002] # a
0062 ; [.20C3.0020.0002] # b
0063 ; [.20DD.0020.0002] # c
A weight starting with . (like a, b, c) marks a base character that participates in the first comparison pass. A weight starting with * (the space) is a variable character. In en_US.utf8, variable characters are ignored during the first three passes, which compare base characters only.
Pass 1 ignores the space, so b and b both map to the same weight (20C3). Pass 2 compares the second-level weights (0020 for both after ignoring the space). Pass 3 evaluates the third-level weights (0002). Since all three passes tie, Unicode applies a tie‑breaker: a string that incorporates a variable character sorts before the one without it. Thus b appears directly before b in the final order.
Pass 1: a=20A9, b=20C3, ' b'=20C3 (space ignored), c=20DD
Pass 2: only ' b' and b tie → weights both 0020 (space ignored)
Pass 3: same weights 0002, tie persists
Tie-breaker: variable character presence places ' b' before b
Result: a, ' b', b, c
The C collation never applies these multilevel rules—it only considers raw bytes. This is why results differ between the two collations.
Additional system locale data is usually found under /usr/share/i18n/locales/ and can be managed with locale-gen.