Why do my strings sort incorrectly?
Why do my strings sort incorrectly?
First, make sure you are using the locale you want to be using. Use SHOW lc_collate
to show the database-wide locale in effect. If you are using per-column collations, check those. If everything is how you want it, then read on.
PostgreSQL uses the libc
library’s locale facilities for sorting strings. So if the sort order of the strings is not what you expect, the issue is likely in the libc
library. You can verify the libc
library’s idea of sorting using the sort
utility on a text file, e.g.,
LC_COLLATE="en_US.UTF-8" sort testfile.txt
If this results in the same order that PostgreSQL gives you, then the problem is outside of PostgreSQL.
PostgreSQL deviates from the libc
behavior in so far as it breaks ties by sorting strings in byte order. This should rarely make a difference in practice, and is usually not the source of the problem when users complain about the sort order, but it could affect cases where, for example, combining and precombined Unicode characters are mixed.
If the problem is in the libc
library, you will have to take it up with your operating system maintainers. Note, however, that while actual bugs in locale definitions of libc
libraries have been known to exist, it is more likely that the libc
library is correct, where “correct” means it follows some recognized international or national standard. Possibly, you are expecting one of multiple equally valid interpretations of a language’s sorting rules.
Common complaint patterns include:
-
Spaces and special characters: The sorting algorithm normally works in multiple passes. First, all the letters are compared, ignoring spaces and punctuation. Then, spaces and punctuation are compared to break ties. (This is a simplification of what actually happens.) It’s not possible to change this without changing the locale definitions themselves (and even then it’s difficult). You might want to restructure your data slightly to avoid this problem. For example, if you are sorting a name field, you could split the field into first and last name fields, avoiding the space in between.
-
Upper/lower case: Locales other than the
C
locale generally sort upper and lower case letters together. So the order will be something like “a A b B c C …” instead of the “A B C … a b c …” that a sort based on ASCII byte values will give. That is correct. -
It is not in ASCII/byte order. No, it’s not, it’s not supposed to be. ASCII is an encoding, not a sort order. If you want this, you can use the
C
locale, but then lose the ability to sort non-ASCII characters according to specific languages.