Problems with HTML escaping and non-ASCII characters using the JSON OSLC API
1. Create a defect with the summary ">", then retrieve it. The result says that "dcterms:title" is "&gt;". Is this a bug? Shouldn't it be ">"?
2. Use the OSLC query API to search for this defect with 'oslc.where=dcterms:title=">"'. This returns no results. Changing it to 'oslc.where=dcterms:title="&gt;"' does return the defect. Is there a reason behind this inconsistency?
3. Doing the same with " " (non-breaking space, &nbsp;, \u00A0) still gives no results from the query API, even though I tried both the HTML-escaped and the Unicode-escaped form of the character. How can I work around that? Am I doing something wrong?
This is not an artificial problem. There are lots of defects with non-ASCII characters. Has anyone else encountered (and solved) this issue?
One answer
I use this online tool to figure out my JSON problems:
http://codebeautify.org/jsonvalidator
I typically have to convert the XML-type characters to their escaped values: &gt;, &lt;, &quot;, etc...
The online doc for Unicode escapes says:

Unicode escape sequences

Any character with a character code lower than 65536 can be escaped using the hexadecimal value of its character code, prefixed with \u. (As mentioned before, higher character codes are represented by a pair of surrogate characters.)

Unicode escapes are six characters long. They require exactly four characters following \u. If the hexadecimal character code is only one, two or three characters long, you'll need to pad it with leading zeroes.

The copyright symbol ('©') has character code 169, which gives A9 in hexadecimal notation, so you could write it as '\u00A9'. Similarly, '♥' could be written as '\u2665'.

The hexadecimal part of this kind of character escape is case-insensitive; in other words, '\u00a9' and '\u00A9' are equivalent.

You could define Unicode escape syntax using the following regular expression: \\u[a-fA-F0-9]{4}.