Best Practice: Use Unicode Normal Form C or KC

State: In Review

Contacts: Nick Crossley, Arthur Ryman

Scope

This Best Practice affects the encoding of all IRIs and RDF string literals that may contain national language characters and other complex characters.

Problem Description

IRIs and RDF string literals are Unicode strings. However, in some cases the encoding is not unique, i.e. there may be multiple ways to represent the same characters. For example, each of the following sequences (the first two being single-character sequences) represents the same character, the angstrom Å:

U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ANGSTROM SIGN
U+0041 LATIN CAPITAL LETTER A, followed by U+030A COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is the one produced by Normalization Form C (NFC), where the C stands for composition. For more information, see the introduction of UAX #15: Unicode Normalization Forms.
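The canonical equivalence of the three angstrom sequences can be checked with a short sketch using the standard java.text.Normalizer class. The class name CanonicalEquivalenceDemo and the helper toNfc are hypothetical names chosen for illustration only.

```java
import java.text.Normalizer;

// Illustrative sketch: the class and method names are assumptions, not part of any API.
final class CanonicalEquivalenceDemo {
    // The three encodings of the angstrom character listed above.
    static final String PRECOMPOSED = "\u00C5";   // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    static final String ANGSTROM_SIGN = "\u212B"; // U+212B ANGSTROM SIGN
    static final String DECOMPOSED = "A\u030A";   // U+0041 followed by U+030A COMBINING RING ABOVE

    // Convert a string to Normalization Form C.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // A plain comparison treats the encodings as different strings.
        System.out.println(PRECOMPOSED.equals(ANGSTROM_SIGN));        // false
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));           // false

        // After NFC normalization all three become U+00C5.
        System.out.println(toNfc(ANGSTROM_SIGN).equals(PRECOMPOSED)); // true
        System.out.println(toNfc(DECOMPOSED).equals(PRECOMPOSED));    // true
    }
}
```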

There are also cases where different character sequences are compatible: they represent the same abstract character sequence but with different visual appearances or behaviors. For example, the sequence U+0066 U+0069 represents the string "fi" (LATIN SMALL LETTER F followed by LATIN SMALL LETTER I), while the sequence U+FB01 represents the single character 'ﬁ' (LATIN SMALL LIGATURE FI). Converting Unicode text to Normalization Form KC (NFKC) converts the second representation to the first, and the information that a ligature was used is lost.

In summary, NFC removes the distinction between canonically equivalent sequences while preserving the distinction between compatible characters or sequences; NFKC removes the distinction between both canonically equivalent and compatible sequences. NFC conversion is not considered lossy, but NFKC conversion is.
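The difference between the two forms can be sketched with the ligature from the example above. The class name CompatibilityDemo and its helper methods are hypothetical; only java.text.Normalizer is a standard API.

```java
import java.text.Normalizer;

// Illustrative sketch: class and method names are assumptions chosen for this example.
final class CompatibilityDemo {
    static final String LIGATURE = "\uFB01"; // U+FB01 LATIN SMALL LIGATURE FI

    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String toNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // NFC keeps the ligature, because it is only compatibility-equivalent to "fi".
        System.out.println(toNfc(LIGATURE).equals(LIGATURE)); // true

        // NFKC folds the ligature into the two letters f and i.
        System.out.println(toNfkc(LIGATURE).equals("fi"));    // true

        // The folding is lossy: "fi" is already in NFKC, so nothing
        // records that the original text used a ligature.
        System.out.println(toNfkc("fi").equals("fi"));        // true
    }
}
```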

SPARQL does not automatically compensate for these alternate representations. This may lead to some results being unintentionally omitted from a query result. It is therefore important to standardize on a normal form for Unicode encoding, and to write appropriate queries.
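The effect of an exact, code-point-based comparison can be imitated outside SPARQL with an ordinary set lookup. The following is a minimal sketch, not a model of any particular SPARQL engine: it assumes the stored literals are already in NFC, and the class LiteralLookupDemo and its methods are hypothetical names.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: an in-memory set stands in for literals held by a data provider.
final class LiteralLookupDemo {
    // Exact match, comparable to a query that does not normalize its input.
    static boolean containsExact(Set<String> literals, String queryValue) {
        return literals.contains(queryValue);
    }

    // Match after converting the query value to NFC.
    // Assumption: every string in 'literals' is already in NFC.
    static boolean containsNormalized(Set<String> literals, String queryValue) {
        return literals.contains(Normalizer.normalize(queryValue, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        Set<String> literals = new HashSet<>();
        literals.add("\u00C5ngstr\u00F6m"); // stored with precomposed characters (NFC)

        // The same word typed with combining marks (decomposed form).
        String queryValue = "A\u030Angstro\u0308m";

        System.out.println(containsExact(literals, queryValue));      // false: result omitted
        System.out.println(containsNormalized(literals, queryValue)); // true
    }
}
```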

Recommendation

Both data providers and clients writing queries SHOULD use NFKC for resource IRIs, as recommended in RFC 3987, section 7.5, although the query component of an IRI may need to contain unnormalized characters when those characters are themselves what is being queried.

In most cases, data providers and clients writing queries SHOULD use Unicode Normalization Form C (NFC), as recommended in section 3.4 of the RDF Concepts and Abstract Syntax.

Data providers MAY use NFKC in specific cases where all of the following hold: the data might contain ligatures, half-width characters, fractions, and similar compatibility characters; these forms carry no semantic value that you wish to preserve; and you want simple comparisons in client queries to match the logically equivalent characters.

Example

In Java, a string str is converted to NFC using the java.text.Normalizer class with the method call Normalizer.normalize(str, Normalizer.Form.NFC), or to NFKC using Normalizer.normalize(str, Normalizer.Form.NFKC).
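One way to package these calls is a small helper that picks the form and skips strings that are already normalized. This is a sketch under stated assumptions: the class LiteralNormalizer, the method normalizeLiteral, and the foldCompatibility flag are hypothetical names, while Normalizer.normalize and Normalizer.isNormalized are standard methods of java.text.Normalizer.

```java
import java.text.Normalizer;

// Illustrative sketch: the helper and its parameter are assumptions, not a prescribed API.
final class LiteralNormalizer {
    // Returns 'value' in NFC, or in NFKC when foldCompatibility is true.
    static String normalizeLiteral(String value, boolean foldCompatibility) {
        Normalizer.Form form = foldCompatibility
                ? Normalizer.Form.NFKC
                : Normalizer.Form.NFC;
        // isNormalized lets already-normalized input be returned unchanged.
        return Normalizer.isNormalized(value, form)
                ? value
                : Normalizer.normalize(value, form);
    }

    public static void main(String[] args) {
        // NFC: A + COMBINING RING ABOVE is composed into U+00C5.
        System.out.println(normalizeLiteral("A\u030A", false).equals("\u00C5")); // true

        // NFC leaves the fi ligature alone; NFKC folds it into "fi".
        System.out.println(normalizeLiteral("\uFB01", false).equals("\uFB01"));  // true
        System.out.println(normalizeLiteral("\uFB01", true).equals("fi"));       // true
    }
}
```

A data provider could apply such a helper when storing literals, and a client could apply the same helper to values before placing them in a query, so that both sides agree on one form.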
