~~EditWYSIWYGAttach~~Printable

r2 - 2014-12-30 - 17:27:57 - Main.rymanYou are here: TWiki >

LinkedData Web > RwattsSandbox > BestPractices > UseOfUnicodeNormalForms

This wiki: The development wiki is a work area where Jazz development teams plan and discuss technical designs and operations for the projects at Jazz.net. Work items often link to documents here. You are welcome to browse, follow along, and participate. Participation is what Jazz.net is all about! But please keep in mind that information here is "as is", unsupported, and may be outdated or inaccurate. For information on released products, consult IBM Knowledge Center, support tech notes, and the Jazz.net library. See also the Jazz.net Terms of Use.

Any documentation or reference material found in this wiki is not official product documentation, but it is primarily for the use of the development teams. For your end use, you should consult official product documentation (infocenters), IBM.com support artifacts (tech notes), and the jazz.net library as officially "stamped" resources.

Best Practice: Use Unicode Normal Form C or KC

State: In Review

Contacts: Nick Crossley, Arthur Ryman

Scope

This Best Practice affects the encoding of all IRIs and RDF string literals that may contain national language characters and other complex characters.

Problem Description

IRIs and RDF string literals are Unicode strings. However, in some cases, the encoding is not unique, i.e. there may be multiple ways to represent the same characters. For example, each of the following sequences (the first two being single-character sequences) represent the same character, the angstrom Å:

U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ANGSTROM SIGN
U+0041 LATIN CAPITAL LETTER A, followed by U+030A COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information, see the introduction of UAX #15: Unicode Normalization Forms.

There are also cases where different character sequences are compatible, or represent the same abstract character sequences but with different visual appearances or behaviors. For example, the sequence U+0066 U+0069 represents the string "fi" (LATIN SMALL LETTER F followed by LATIN SMALL LETTER I), while the sequence U+FB01 represents the single character 'ﬁ' (LATIN SMALL LIGATURE FI). By converting Unicode text to Normalization Form KC, the second representation is converted to the first, and the information that a ligature was used is lost.

In summary, NFC removes the distinction between equivalent characters, while preserving the distinction between compatible characters or sequences; NFKC removes the distinction between both equivalent and compatible sequences. NFC conversion is not considered lossy, but NFKC conversion is.

SPARQL does not automatically compensate for these alternate representations. This may lead to some results being unintentionally omitted from a query result. It is therefore important to standardize on a normal form for Unicode encoding, and to write appropriate queries.

Recommendation

Both data providers and clients writing queries SHOULD use NFKC for resource IRIs, as recommended in RFC3987, section 7.5, though any query string part of the IRI might need to contain unnormalized characters if such characters are those being queried.

In most cases, data providers and clients writing queries SHOULD use Unicode Normalization Form C (NFC), as recommended in section 3.4 of the RDF Concepts and Abstract Syntax.

In specific cases where your data might contain ligatures, half-width characters, fractions, and similar complex characters, but these forms do not have any semantic value you wish to preserve, and you wish to allow simple comparisons to match the logically equivalent characters in client queries, data providers MAY use NFKC.

Example

In Java, a string str is converted to NFC using the method call Normalizer.normalize(str,Normalizer.Form.NFC), or to NFKC using Normalizer.normalize(str,Normalizer.Form.NFKC).

Links to additional information:

Copyright © by IBM and non-IBM contributing authors. All material on this collaboration platform is the property of the contributing authors.
Contributions are governed by our Terms of Use
Ideas, requests, problems regarding TWiki? Send feedback
Dashboards and work items are no longer publicly available, so some links may be invalid. We now provide similar information through other means. Learn more here.