UseOfUnicodeNormalForms < LinkedData

---+ Best Practice: Use Unicode Normal Form C or KC

*State:* In Review

*Contacts:* Nick Crossley, [[https://jazz.net/wiki/bin/view/Main/ArthurRyman][Arthur Ryman]]

---++ Scope 

This Best Practice affects the encoding of all IRIs and RDF string literals that may contain national language characters and other complex characters.

---++ Problem Description

IRIs and RDF string literals are Unicode strings. However, the encoding is not always unique: the same character can sometimes be represented in more than one way. For example, each of the following sequences (the first two consisting of a single code point) represents the same character, the angstrom Å:

   * =U+00C5= LATIN CAPITAL LETTER A WITH RING ABOVE
   * =U+212B= ANGSTROM SIGN
   * =U+0041= LATIN CAPITAL LETTER A, followed by =U+030A= COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information, see the [[http://unicode.org/reports/tr15/#Introduction][introduction of UAX #15: Unicode Normalization Forms]].
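
The canonical equivalence of the three sequences above can be checked with a minimal Java sketch. The class name =CanonicalEquivalenceSketch= and the helper =toNfc= are illustrative names, not part of any library; only =java.text.Normalizer= is a standard API.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceSketch {
    // Hypothetical helper: convert a string to Normalization Form C.
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C5"; // LATIN CAPITAL LETTER A WITH RING ABOVE
        String angstrom = "\u212B";    // ANGSTROM SIGN
        String decomposed = "A\u030A"; // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

        // The raw strings compare as different...
        System.out.println(precomposed.equals(angstrom));   // false
        System.out.println(precomposed.equals(decomposed)); // false

        // ...but all three have the same NFC form, U+00C5.
        System.out.println(toNfc(angstrom).equals(precomposed));   // true
        System.out.println(toNfc(decomposed).equals(precomposed)); // true
    }
}
```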

There are also cases where different character sequences are compatible, or represent the same abstract character sequences but with different visual appearances or behaviors. For example, the sequence =U+0066= =U+0069= represents the string "fi" (LATIN SMALL LETTER F followed by LATIN SMALL LETTER I), while the sequence =U+FB01= represents the single character 'ﬁ' (LATIN SMALL LIGATURE FI). By converting Unicode text to Normalization Form KC, the second representation is converted to the first, and the information that a ligature was used is lost.

In summary, NFC removes the distinction between equivalent characters, while preserving the distinction between compatible characters or sequences; NFKC removes the distinction between both equivalent and compatible sequences. NFC conversion is not considered lossy, but NFKC conversion is.

Consider the English word "field". In Unicode this can be written as either of the two compatible forms:
   * =U+0066 U+0069 U+0065 U+006C U+0064=
   * =U+FB01 U+0065 U+006C U+0064=
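
The difference between NFC and NFKC for these two forms of "field" can be illustrated with a short Java sketch. The class and helper names are illustrative only.

```java
import java.text.Normalizer;

public class CompatibilitySketch {
    // Hypothetical helpers wrapping the standard java.text.Normalizer API.
    public static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static String toNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01eld"; // LATIN SMALL LIGATURE FI followed by "eld"

        // NFC preserves the ligature: the string is unchanged.
        System.out.println(toNfc(ligature).equals(ligature)); // true
        System.out.println(toNfc(ligature).length());         // 4

        // NFKC replaces the ligature with "f" + "i"; the original form is lost.
        System.out.println(toNfkc(ligature).equals("field")); // true
        System.out.println(toNfkc(ligature).length());        // 5
    }
}
```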

SPARQL does not automatically compensate for these alternate representations. This may lead to some results being unintentionally omitted from a query result. For example, given a triple of the form

<verbatim>
   ex:resource1 ex:property "\uFB01eld" .
</verbatim>

neither of the following queries will match that triple:

<verbatim>
   SELECT * WHERE {
      ?resource ex:property "field"
   }

   SELECT * WHERE {
      ?resource ex:property ?str .
      FILTER (STR(?str) = "field")
   }
</verbatim>

Normalization of strings is not provided as a standard feature of SPARQL, so it is important to standardize on a normal form for Unicode encoding of literal values in published RDF, and to transform user queries into that same normal form.
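
One way a client application might transform user input into the agreed normal form before building a query is sketched below. This is only an illustration under stated assumptions: the data is assumed to have been published in NFKC, the =ex:= prefix is assumed to be declared elsewhere (as in the queries above), and the class and method names are hypothetical. A real application would more likely use the query-building facilities of its RDF library rather than string concatenation.

```java
import java.text.Normalizer;

public class QueryLiteralSketch {
    // Hypothetical helper: normalize user text, then escape it as a
    // double-quoted SPARQL string literal.
    public static String toSparqlLiteral(String userText, Normalizer.Form form) {
        String normalized = Normalizer.normalize(userText, form);
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Assumption: the published RDF uses NFKC, so the query literal must too.
    public static String buildQuery(String userText) {
        return "SELECT * WHERE {\n"
             + "   ?resource ex:property "
             + toSparqlLiteral(userText, Normalizer.Form.NFKC) + "\n"
             + "}";
    }

    public static void main(String[] args) {
        // Both inputs produce the same query text, containing "field".
        System.out.println(buildQuery("\uFB01eld"));
        System.out.println(buildQuery("field"));
    }
}
```

With this approach, a user who types either form of the word gets a query containing the same literal, so the match depends only on the data having been published in the same normal form.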

---++ Recommendation

Both data providers and clients writing queries SHOULD use NFKC for resource IRIs, as recommended in [[https://tools.ietf.org/html/rfc3987#section-7.5][RFC 3987, section 7.5]]. The query string part of an IRI is an exception: it might need to contain unnormalized characters if those characters are themselves what is being queried.
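
A rough sketch of this IRI recommendation in Java follows. It is a simplification under stated assumptions, not a complete IRI processor: the hypothetical method =normalizeIri= treats everything from the first =?= onward (query string and any fragment after it) as the part to leave untouched, and does no parsing or validation. Note also that NFKC can map some compatibility characters to ASCII characters that are significant in IRI syntax, so an application may want to validate the result.

```java
import java.text.Normalizer;

public class IriNormalizationSketch {
    // Hypothetical helper: apply NFKC to the part of the IRI before the
    // first '?', and leave the rest (query string onward) as supplied.
    public static String normalizeIri(String iri) {
        int queryStart = iri.indexOf('?');
        if (queryStart < 0) {
            // No query string: normalize the whole IRI.
            return Normalizer.normalize(iri, Normalizer.Form.NFKC);
        }
        String beforeQuery = iri.substring(0, queryStart);
        String queryAndRest = iri.substring(queryStart);
        return Normalizer.normalize(beforeQuery, Normalizer.Form.NFKC) + queryAndRest;
    }

    public static void main(String[] args) {
        // The path is normalized; the ligature in the query string is kept.
        String iri = "http://example.org/\uFB01eld?q=\uFB01";
        System.out.println(normalizeIri(iri).equals("http://example.org/field?q=\uFB01")); // true
    }
}
```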

For RDF string literals, data providers and clients writing queries SHOULD in most cases use Unicode Normalization Form C (NFC), as recommended in [[http://www.w3.org/TR/rdf-concepts/#section-Literals][section 3.4 of the RDF Concepts and Abstract Syntax]].

In specific cases where your data might contain ligatures, half-width characters, fractions, and similar complex characters, but these forms do not have any semantic value you wish to preserve in published RDF, you MAY decide to publish your RDF data using NFKC. Doing so would transform the string "\uFB01eld" to "field" in the RDF; if the application also transformed literals in user queries, then the user could query for "\uFB01eld" or "field" and match the RDF data in both cases. 

---++ Example

In Java, a string =str= is converted to NFC using the method call =Normalizer.normalize(str,Normalizer.Form.NFC)=, or to NFKC using =Normalizer.normalize(str,Normalizer.Form.NFKC)=.
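
The calls above can be wrapped in a small helper that first checks whether the string is already in the desired form. The sketch below uses the standard =Normalizer.isNormalized= method; the class name and the =ensureForm= helper are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

public class NormalFormCheckSketch {
    // Hypothetical helper: return str unchanged if it is already in the
    // requested form, otherwise return its normalized equivalent.
    public static String ensureForm(String str, Normalizer.Form form) {
        return Normalizer.isNormalized(str, form)
                ? str
                : Normalizer.normalize(str, form);
    }

    public static void main(String[] args) {
        // Decomposed A + COMBINING RING ABOVE is not in NFC.
        System.out.println(Normalizer.isNormalized("A\u030A", Normalizer.Form.NFC)); // false
        System.out.println(ensureForm("A\u030A", Normalizer.Form.NFC).equals("\u00C5")); // true

        // The ligature is valid NFC but not valid NFKC.
        System.out.println(Normalizer.isNormalized("\uFB01eld", Normalizer.Form.NFC));  // true
        System.out.println(Normalizer.isNormalized("\uFB01eld", Normalizer.Form.NFKC)); // false
    }
}
```

A check of this kind could also serve as a validation step before publishing RDF, to report literals that are not in the chosen normal form.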

---++ See Also

   * [[http://www.w3.org/TR/WD-charreq][Requirements for String Identity Matching and String Indexing]]
   * [[http://unicode.org/reports/tr15/#Introduction][Unicode Standard Annex #15, Unicode Normalization Forms, Introduction]]
   * [[https://tools.ietf.org/html/rfc3987#section-7.5][RFC 3987, Internationalized Resource Identifiers (IRIs), Section 7.5 URI/IRI Selection]]
   * [[http://www.w3.org/TR/rdf-concepts/#section-Literals][RDF Concepts and Abstract Syntax, Section 3.4 Literals]]


History: r4 - 2015-01-05 - 17:54:16 - Main.ndjc

Copyright © by IBM and non-IBM contributing authors. All material on this collaboration platform is the property of the contributing authors.