Collation
The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE
and LC_CTYPE
settings of a database cannot be changed after its creation.
Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are text
, varchar
, and char
. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.
The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression’s collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with ORDER BY
clauses and function or operator calls such as <
. The collation to apply for an ORDER BY
clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as lower
, upper
, and initcap
; by pattern matching operators; and by to_char
and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.
The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE
clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:
- If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
- Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
- If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
The following table shows the collations available for use in PostgreSQL.
Collation | Provider | Description |
---|---|---|
default | libc | The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. |
C | libc | A collations specify “traditional C” behavior, in which only the ASCII letters “A ” through “Z ” are treated as letters, and sorting is done strictly by character code byte values. |
POSIX | libc | A collations specify “traditional C” behavior. |
ucs_basic | libc | SQL standard collation, available for encoding UTF8 . It is equivalent to C and sorts by Unicode code point. |
zh_CN | libc | Chinese (Simplified, China) |
zh_CN.utf8 | libc | Chinese (Simplified, China), available for encoding UTF8 . |
zh_CN.gb2312 | libc | Chinese (Simplified, China), available for encoding GB2312 . |
zh_HK | libc | Chinese (Traditional, Hong Kong SAR, China) |
zh_HK.utf8 | libc | Chinese (Traditional, Hong Kong SAR, China), available for encoding UTF8 . |
zh_TW | libc | Chinese (Traditional, Taiwan) |
zh_TW.utf8 | libc | Chinese (Traditional, Taiwan), available for encoding UTF8 . |
zh_TW.euctw | libc | Chinese (Traditional, Taiwan), available for encoding EUCTW . |
zh_SG | libc | Chinese (Simplified, Singapore) |
zh_SG.utf8 | libc | Chinese (Simplified, Singapore), available for encoding UTF8 . |
zh_SG.gb2312 | libc | Chinese (Simplified, Singapore), available for encoding GB2312 . |
zh-x-icu | ICU | Chinese. Sort by Chinese phonetic order |
zh-Hans-x-icu | ICU | Chinese (Simplified). Sort by Chinese phonetic order |
zh-Hans-CN-x-icu | ICU | Chinese (Simplified, China). Sort by Chinese phonetic order |
zh-Hans-HK-x-icu | ICU | Chinese (Simplified, Hong Kong SAR, China). Sort by Chinese phonetic order |
zh-Hans-MO-x-icu | ICU | Chinese (Simplified, Macau SAR, China). Sort by Chinese phonetic order |
zh-Hans-SG-x-icu | ICU | Chinese (Simplified, Singapore). Sort by Chinese phonetic order |
zh-Hant-x-icu | ICU | Chinese (Traditional). Sort by Chinese radical/stroke order |
zh-Hant-HK-x-icu | ICU | Chinese (Traditional, Hong Kong SAR, China). Sort by Chinese radical/stroke order |
zh-Hant-MO-x-icu | ICU | Chinese (Traditional, Macau SAR, China). Sort by Chinese radical/stroke order |
zh-Hant-TW-x-icu | ICU | Chinese (Traditional, Taiwan). Sort by Chinese radical/stroke order |