Best Practice: Use Unicode Normal Form C or KC

State: Approved

Scope

This Best Practice affects the encoding of all IRIs and RDF string literals that may contain national language characters (such as accented letters) and other complex characters (such as ligatures).

Problem Description

IRIs and RDF string literals are Unicode strings. However, in some cases the encoding is not unique: the same character can be represented in more than one way. For example, each of the following sequences (the first two consisting of a single code point) represents the same character, Å (A with ring above, also used as the angstrom symbol):

U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ANGSTROM SIGN
U+0041 LATIN CAPITAL LETTER A, followed by U+030A COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of them is the NFC form of the character, where NFC stands for Normalization Form C and the C stands for composition; normalizing either of the other two sequences to NFC produces the first. For more information, see the introduction of UAX #15: Unicode Normalization Forms.
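The canonical equivalence of the three sequences listed above can be checked with a short Java sketch. The class and method names (CanonicalEquivalenceDemo, toNfc, codePoints) are illustrative assumptions, not part of any standard API; only java.text.Normalizer comes from the Java platform.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {

    // Normalizes a string to NFC (canonical decomposition followed by canonical composition).
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Formats each code point of a string as U+XXXX, separated by spaces, for inspection.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] forms = {
            "\u00C5",   // LATIN CAPITAL LETTER A WITH RING ABOVE
            "\u212B",   // ANGSTROM SIGN
            "A\u030A"   // LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE
        };
        // Print each input sequence next to its NFC form.
        for (String form : forms) {
            System.out.println(codePoints(form) + " -> " + codePoints(toNfc(form)));
        }
    }
}
```

Each of the three inputs should normalize to the single code point U+00C5.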

There are also cases where different character sequences are compatible: they represent the same abstract characters but with different visual appearances or behaviors. For example, the sequence U+0066 U+0069 represents the string "fi" (LATIN SMALL LETTER F followed by LATIN SMALL LETTER I), while the sequence U+FB01 represents the single character 'ﬁ' (LATIN SMALL LIGATURE FI). By converting Unicode text to Normalization Form KC, the second representation is converted to the first, and the information that a ligature was used is lost.

In summary, NFC removes the distinction between equivalent characters, while preserving the distinction between compatible characters or sequences; NFKC removes the distinction between both equivalent and compatible sequences. NFC conversion is not considered lossy, but NFKC conversion is.
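The difference between the two forms can be illustrated with a minimal Java sketch. The class and helper names (CompatibilityDemo, toNfc, toNfkc) are assumptions made for this illustration; the normalization itself is done by java.text.Normalizer.

```java
import java.text.Normalizer;

public class CompatibilityDemo {

    // Canonical normalization only: compatibility characters are left untouched.
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Compatibility normalization: also replaces compatibility characters such as ligatures.
    public static String toNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01eld";  // LATIN SMALL LIGATURE FI followed by "eld"

        // NFC preserves the ligature, so the string is unchanged.
        System.out.println(toNfc(ligature).equals(ligature));
        // NFKC replaces the ligature with the two letters "f" and "i".
        System.out.println(toNfkc(ligature).equals("field"));
        // Canonically equivalent input is handled the same way by both forms.
        System.out.println(toNfkc("\u212B").equals(toNfc("\u212B")));
    }
}
```

Each of the three checks should print true. Because NFKC discards the fact that a ligature was used, the original string cannot be recovered from the normalized one.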

Consider the English word "field". In Unicode this can be written as either of the two compatible forms:

U+0066 U+0069 U+0065 U+006C U+0064
U+FB01 U+0065 U+006C U+0064

SPARQL does not automatically compensate for these alternate representations. This may lead to some results being unintentionally omitted from a query result. For example, given a triple of the form

   ex:resource1 ex:property "\uFB01eld" .

neither of the following queries will match that triple:

   SELECT * WHERE {
      ?resource ex:property "field"
   }

   SELECT * WHERE {
      ?resource ex:property ?str .
      FILTER (STR(?str) = "field")
   }

Normalization of strings is not provided as a standard feature of SPARQL, so it is important to standardize on a normal form for Unicode encoding of literal values in published RDF, and to transform user queries into that same normal form.

Recommendation

Both data providers and clients writing queries SHOULD use NFKC for resource IRIs, as recommended in RFC 3987, section 7.5. The query string part of an IRI may still need to contain unnormalized characters when those characters are themselves the subject of the query.

In most cases, data providers and clients writing queries SHOULD use Unicode Normalization Form C (NFC), as recommended in section 3.4 of the RDF Concepts and Abstract Syntax.

In specific cases where your data might contain ligatures, half-width characters, fractions, and similar complex characters, but these forms do not have any semantic value you wish to preserve in published RDF, you MAY decide to publish your RDF data using NFKC. Doing so would transform the string "\uFB01eld" to "field" in the RDF; if the application also transformed literals in user queries, then the user could query for "\uFB01eld" or "field" and match the RDF data in both cases.
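On the client side, transforming literals in user queries could look like the following Java sketch. The class QueryLiteralNormalizer and its methods are hypothetical helpers written for this illustration, and the hand-rolled escaping covers only backslashes and double quotes; a real application would more likely rely on the parameter binding of its SPARQL library.

```java
import java.text.Normalizer;

public class QueryLiteralNormalizer {

    // Hypothetical helper: converts a user-supplied search term to the
    // normal form that the published RDF data is assumed to use.
    public static String normalizeTerm(String term, Normalizer.Form form) {
        return Normalizer.normalize(term, form);
    }

    // Hypothetical helper: builds a query of the shape shown in the Problem
    // Description, with the literal normalized before it is embedded.
    public static String buildQuery(String term, Normalizer.Form form) {
        String literal = normalizeTerm(term, form)
                .replace("\\", "\\\\")   // escape backslashes first
                .replace("\"", "\\\"");  // then escape double quotes
        return "SELECT * WHERE { ?resource ex:property \"" + literal + "\" }";
    }

    public static void main(String[] args) {
        // Both spellings produce the same query text once normalized to NFKC.
        System.out.println(buildQuery("\uFB01eld", Normalizer.Form.NFKC));
        System.out.println(buildQuery("field", Normalizer.Form.NFKC));
    }
}
```

With data published in NFKC, both calls should yield a query containing the literal "field", so either spelling entered by the user matches the same triple.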

Example

In Java, a string str is converted to NFC using the method call Normalizer.normalize(str, Normalizer.Form.NFC), or to NFKC using Normalizer.normalize(str, Normalizer.Form.NFKC). The Normalizer class is in the java.text package.
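A self-contained version of these calls might look like the sketch below. The class name NormalizeExample, its normalize wrapper, and the sample string are assumptions chosen for illustration; Normalizer.isNormalized is used only to skip the conversion when the input is already in the requested form.

```java
import java.text.Normalizer;

public class NormalizeExample {

    // Returns str in the requested normal form, skipping the conversion
    // when the string is already normalized.
    public static String normalize(String str, Normalizer.Form form) {
        if (Normalizer.isNormalized(str, form)) {
            return str;
        }
        return Normalizer.normalize(str, form);
    }

    public static void main(String[] args) {
        // Sample input: ANGSTROM SIGN + "ngstr" + o with diaeresis + "m", then the fi ligature + "eld".
        String str = "\u212Bngstr\u00F6m \uFB01eld";

        String nfc = normalize(str, Normalizer.Form.NFC);
        String nfkc = normalize(str, Normalizer.Form.NFKC);

        // NFC replaces the angstrom sign with U+00C5 but keeps the ligature.
        System.out.println(nfc.equals("\u00C5ngstr\u00F6m \uFB01eld"));
        // NFKC additionally replaces the ligature with "fi".
        System.out.println(nfkc.equals("\u00C5ngstr\u00F6m field"));
    }
}
```

Both checks should print true.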
