Best Practice: Use Unicode Normal Form C or KC
State: In Review
Contacts: Nick Crossley, Arthur Ryman
Scope
This Best Practice affects the encoding of all IRIs and RDF string literals that may contain national language characters and other complex characters.
Problem Description
IRIs and RDF string literals are Unicode strings. However, the encoding of a given string is not always unique: there may be multiple ways to represent the same characters. For example, each of the following sequences (the first two consisting of a single code point) represents the same character, the angstrom Å:
- U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
- U+212B ANGSTROM SIGN
- U+0041 LATIN CAPITAL LETTER A, followed by U+030A COMBINING RING ABOVE
These sequences are called canonically equivalent. The first of these forms is in NFC (Normalization Form C, where the C stands for composition). For more information, see the introduction of UAX #15: Unicode Normalization Forms.
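The canonical equivalence described above can be checked with a short Java sketch. It uses the standard java.text.Normalizer class; the class name CanonicalEquivalenceDemo and the helper toNfc are illustrative names chosen here, not part of any library.

```java
import java.text.Normalizer;

// Illustrative class name; not part of any library.
class CanonicalEquivalenceDemo {
    // The three encodings of the angstrom character listed above.
    static final String PRECOMPOSED = "\u00C5";   // LATIN CAPITAL LETTER A WITH RING ABOVE
    static final String ANGSTROM_SIGN = "\u212B"; // ANGSTROM SIGN
    static final String DECOMPOSED = "A\u030A";   // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // String.equals compares the stored code units, so the raw forms differ.
        System.out.println(PRECOMPOSED.equals(ANGSTROM_SIGN)); // false
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));    // false

        // After NFC, all three forms collapse to the single code point U+00C5.
        System.out.println(toNfc(ANGSTROM_SIGN).equals(PRECOMPOSED)); // true
        System.out.println(toNfc(DECOMPOSED).equals(PRECOMPOSED));    // true
    }
}
```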
There are also cases where different character sequences are compatible, that is, they represent the same abstract character sequence but with different visual appearances or behaviors. For example, the sequence U+0066 U+0069 represents the string "fi" (LATIN SMALL LETTER F followed by LATIN SMALL LETTER I), while the sequence U+FB01 represents the single character 'ﬁ' (LATIN SMALL LIGATURE FI). Converting Unicode text to Normalization Form KC (NFKC) converts the second representation to the first, and the information that a ligature was used is lost.
In summary, NFC removes the distinction between equivalent characters, while preserving the distinction between compatible characters or sequences; NFKC removes the distinction between both equivalent and compatible sequences. NFC conversion is not considered lossy, but NFKC conversion is.
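The difference between the two forms can be illustrated with a minimal Java sketch, again assuming the standard java.text.Normalizer class. The class name CompatibilityDemo and the helpers nfc and nfkc are hypothetical names used only for this illustration.

```java
import java.text.Normalizer;

// Illustrative class name; not part of any library.
class CompatibilityDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01"; // LATIN SMALL LIGATURE FI

        // NFC preserves the ligature: it is only compatibility-equivalent to "fi".
        System.out.println(nfc(ligature).equals(ligature)); // true

        // NFKC replaces the ligature with the two letters, losing the distinction.
        System.out.println(nfkc(ligature).equals("fi")); // true

        // NFKC also applies canonical normalization, as NFC does.
        System.out.println(nfkc("\u212B").equals("\u00C5")); // true
    }
}
```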
SPARQL does not automatically compensate for these alternate representations. This may lead to some results being unintentionally omitted from a query result. It is therefore important to standardize on a normal form for Unicode encoding, and to write appropriate queries.
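The effect of unnormalized data on exact matching can be sketched in Java. This is not a SPARQL engine: the class NormalizedLiteralStore is a hypothetical in-memory stand-in that compares strings exactly, which is assumed here to be a reasonable model of comparing literal terms. It normalizes to NFC both when data is added and when it is looked up.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for exact matching of literal terms.
class NormalizedLiteralStore {
    private final Set<String> literals = new HashSet<>();

    private static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Data provider side: normalize before storing.
    void add(String literal) {
        literals.add(nfc(literal));
    }

    // Client side: normalize the search term before comparing.
    boolean contains(String queryTerm) {
        return literals.contains(nfc(queryTerm));
    }

    public static void main(String[] args) {
        String stored = "A\u030Angstr\u00F6m"; // decomposed first letter
        String query = "\u00C5ngstr\u00F6m";   // precomposed first letter

        // Without normalization, an exact-match lookup misses the value.
        Set<String> raw = new HashSet<>();
        raw.add(stored);
        System.out.println(raw.contains(query)); // false

        // With both sides normalized to NFC, the lookup succeeds.
        NormalizedLiteralStore store = new NormalizedLiteralStore();
        store.add(stored);
        System.out.println(store.contains(query)); // true
    }
}
```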
Recommendation
Both data providers and clients writing queries SHOULD use NFKC for resource IRIs, as recommended in RFC 3987, section 7.5, although the query part of an IRI might need to contain unnormalized characters if those characters are themselves the subject of the query.
In most cases, data providers and clients writing queries SHOULD use Unicode Normalization Form C (NFC), as recommended in section 3.4 of RDF Concepts and Abstract Syntax.
In specific cases where your data might contain ligatures, half-width characters, fractions, and similar complex characters, data providers MAY use NFKC, provided that these forms carry no semantic value you wish to preserve and you want simple comparisons in client queries to match the logically equivalent characters.
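As a rough sketch of this choice, the hypothetical helper below normalizes a literal to NFC by default and to NFKC only when the caller opts in. The class name LiteralNormalizer and the foldCompatibility parameter are assumptions made for this illustration. The sample inputs are a fraction and a half-width katakana letter.

```java
import java.text.Normalizer;

// Hypothetical helper; the names are chosen for this illustration only.
class LiteralNormalizer {
    // NFC by default; NFKC only when compatibility distinctions carry no meaning.
    static String normalizeLiteral(String literal, boolean foldCompatibility) {
        Normalizer.Form form = foldCompatibility
                ? Normalizer.Form.NFKC
                : Normalizer.Form.NFC;
        return Normalizer.normalize(literal, form);
    }

    public static void main(String[] args) {
        String half = "\u00BD";          // VULGAR FRACTION ONE HALF
        String halfWidthKa = "\uFF76";   // HALFWIDTH KATAKANA LETTER KA

        // NFC leaves compatibility characters untouched.
        System.out.println(normalizeLiteral(half, false).equals(half)); // true

        // NFKC rewrites the fraction as digit one, FRACTION SLASH, digit two.
        System.out.println(normalizeLiteral(half, true).equals("1\u20442")); // true

        // NFKC maps the half-width letter to the regular KATAKANA LETTER KA.
        System.out.println(normalizeLiteral(halfWidthKa, true).equals("\u30AB")); // true
    }
}
```

Because NFKC is lossy, a data provider taking this route cannot recover the original form afterwards, so the opt-in is best decided per dataset rather than per query.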
Example
In Java, a string str is converted to NFC using the method call Normalizer.normalize(str, Normalizer.Form.NFC), or to NFKC using Normalizer.normalize(str, Normalizer.Form.NFKC).
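A self-contained version of these calls might look like the following sketch. The class name NormalizeExample and the helper ensureNfc are illustrative. The check with Normalizer.isNormalized is an optional addition that avoids creating a new string when the input is already in NFC.

```java
import java.text.Normalizer;

// Illustrative class name; not part of any library.
class NormalizeExample {
    // Returns str unchanged when it is already in NFC, otherwise its NFC form.
    static String ensureNfc(String str) {
        if (Normalizer.isNormalized(str, Normalizer.Form.NFC)) {
            return str;
        }
        return Normalizer.normalize(str, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String str = "A\u030A"; // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

        String nfc = Normalizer.normalize(str, Normalizer.Form.NFC);
        String nfkc = Normalizer.normalize(str, Normalizer.Form.NFKC);

        System.out.println(nfc.equals("\u00C5"));            // true
        System.out.println(nfkc.equals("\u00C5"));           // true
        System.out.println(ensureNfc(str).equals("\u00C5")); // true
    }
}
```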
See Also