Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Understanding Divergent String Sort Orders in PostgreSQL

Tech May 10 2

PostgreSQL comparison and sorting behavior for text data depends on the collation setting. Running an ORDER BY on the same column can yield different sequences when the underlying collation changes. The example below demonstrates the phenomenon using a simple table containing letters and a leading-space entry.

CREATE TABLE t1 (s text);
INSERT INTO t1 VALUES (' b'), ('b'), ('a'), ('c');

SELECT * FROM t1 ORDER BY s COLLATE "en_US.utf8";
--   a
--    b
--   b
--   c

SELECT * FROM t1 ORDER BY s COLLATE "C";
--     b
--   a
--   b
--   c

With en_US.utf8, the space-padded b appears after a but before b. The C collation places it first. These differences originate from distintc sorting algorithms.

PostgreSQL’s documentaiton explains that collation controls string ordering. The C collation sorts by raw byte values from the character encoding. For UTF-8, a space is 0x20, lowercase ac are 0x610x63. The leading space has the lowest byte value, so b comes first.

' b'  -> 0x20 0x62
a     -> 0x61
b     -> 0x62
c     -> 0x63

The en_US.utf8 collation follows Unicode Collation Algorithm rules. Unicode assigns multilevel weights to each code point, as illustrated in this excerpt from allkeys.txt:

0020  ; [*0209.0020.0002]  # SPACE
0061  ; [.20A9.0020.0002]  # a
0062  ; [.20C3.0020.0002]  # b
0063  ; [.20DD.0020.0002]  # c

A weight starting with . (like a, b, c) marks a base character that participates in the first comparison pass. A weight starting with * (the space) is a variable character. In en_US.utf8, variable characters are ignored during the first three passes, which compare base characters only.

Pass 1 ignores the space, so b and b both map to the same weight (20C3). Pass 2 compares the second-level weights (0020 for both after ignoring the space). Pass 3 evaluates the third-level weights (0002). Since all three passes tie, Unicode applies a tie‑breaker: a string that incorporates a variable character sorts before the one without it. Thus b appears directly before b in the final order.

Pass 1: a=20A9, b=20C3, ' b'=20C3 (space ignored), c=20DD
Pass 2: only ' b' and b tie → weights both 0020 (space ignored)
Pass 3: same weights 0002, tie persists
Tie-breaker: variable character presence places ' b' before b

Result: a, ' b', b, c

The C collation never applies these multilevel rules—it only considers raw bytes. This is why results differ between the two collations.

Additional system locale data is usually found under /usr/share/i18n/locales/ and can be managed with locale-gen.

Tags: postgresql

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.