How to find high-number unicode or symbol chars

I have an existing DOORS module with some rich text entries that contain symbols such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-number symbols are not considered "standard UTF-8" by Texmaker's import function (and in any case probably won't be processed by XeLaTeX or other converters). I can't simply use the `UnicodeString` functions in DXL because those break the rest of the rich text, and apparently `charOf(decimal_number_code)` only works over the basic set of characters, i.e. below some numeric code value. For example, `charOf(8217)` should create a right curly single quote, but when I tried code along the lines of

    if (charOf(8217) == one_char)

I never get a match. I copied the curly quote from the DOORS module and verified with an online Unicode analyzer that it is definitely Unicode decimal value 8217.

So, what am I missing here? I just want to detect any symbol character, identify it correctly, and then replace it with, e.g., `\textquoteright` in the output stream.

My overall setup works for lower-numbered chars, since this works (`c` is a single character pulled from a string):

    thedeg = charOf(176)
    if (thedeg == c)
    {
        temp += "$\\degree$"
    }


witthoft_carl - Wed Sep 28 07:26:59 EDT 2016

Re: How to find high-number unicode or symbol chars
O.Wilkop - Wed Sep 28 08:27:17 EDT 2016

Hey, you are right: it seems intOf(char) and charOf(int) both apply something like modulo 256 and therefore cut off anything above that.

Have you tried:

int i=8217;
char c = addr_(i);
print c;

instead?

Re: How to find high-number unicode or symbol chars
Mathias Mamsch - Wed Sep 28 09:11:12 EDT 2016

The problem lies only in the intOf and charOf functions. Internally, char is a UTF-compatible type, so you can use it for concatenation and comparison.

See the attached file (UTF-8 encoded). Execute this file as an include to ensure that the encoding is not changed!

#include <c:/temp/utf_handling.dxl>

This is the code inside the file 

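pragma encoding, "utf-8"
string s = "ﮚﬞﮚﬞ"
// intOf cuts off the code of a high-number character
print "IntOf: " (intOf s[1]) "\n"
// casting through addr_ yields the full character code
int code = (addr_ s[1]) int;
print "Code: " code "\n";
// char values can still be concatenated into a string
char x = s[0];
string sNew = x x x "";
print "Concatenated: " sNew "\n";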

You can see that, using "addr_", you can both create a char from a full char code and read the full code back.
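
Applied to the original question, a minimal sketch of the replacement loop might look like the following. It is not part of the attached file: `texEscape` is a hypothetical helper name, the LaTeX macros chosen for codes 8216 and 8217 are only examples, and it assumes that `s[i]` and `length s` work per character on the exported text, as the indexing in the snippet above suggests.

```dxl
// Hypothetical helper (not from the attached file): copy a string and
// replace selected symbol characters with LaTeX macros.
string texEscape(string s) {
    // Build the comparison chars with addr_, because charOf cuts off
    // codes above 255 (see the posts above).
    int lsquoCode = 8216
    int rsquoCode = 8217
    char lsquo = addr_(lsquoCode)
    char rsquo = addr_(rsquoCode)
    char c
    int i
    Buffer out = create
    for (i = 0; i < length s; i++) {
        c = s[i]
        if (c == lsquo) {
            out += "\\textquoteleft{}"
        } else if (c == rsquo) {
            out += "\\textquoteright{}"
        } else {
            // every other character is copied unchanged
            out += c
        }
    }
    string result = stringOf out
    delete out
    return result
}
```

Further symbols could be handled by adding one branch per code. The text passed in should come from the module itself or from a file included with the UTF-8 pragma, so that the encoding is not changed.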

Regards, Mathias


Attachments

utf_handling.dxl

Re: How to find high-number unicode or symbol chars
witthoft_carl - Wed Sep 28 09:23:16 EDT 2016

Thanks, Oliver.  That does return the desired character.  If I have success running comparison tests, I'll accept your answer.

Re: How to find high-number unicode or symbol chars
Mathias Mamsch - Wed Sep 28 09:26:32 EDT 2016

By the way, you can get an automatic conversion of char to int by using an integer reference to a char. This saves calling "addr_" every time, which matters in string functions that are called very often and need to be fast:

string s = "ABCDEF"; 
char c = null; 

// an integer reference to c! It will always reflect the value of c as int. 
int &ref = addr_ (&c);

int i;
for (i = 0; i < length s; i++) {
    c = s[i]; 
    // 67 is the character code of 'C'
    if (ref == 67) print "c found at index " i "\n"
}
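
The same reference trick could also help with the "detect any symbol character" part of the question. The sketch below is an assumption-laden illustration: `reportSymbols` is a hypothetical name, and it assumes the integer reference reflects the full code for high-number characters too, which this thread only demonstrates for plain ASCII.

```dxl
// Hypothetical helper: print the code and position of every character
// outside 7-bit ASCII, so that a LaTeX replacement can be chosen for it.
void reportSymbols(string s) {
    char c = null
    // integer reference to c, as in the example above
    int &ref = addr_ (&c)
    int i
    for (i = 0; i < length s; i++) {
        c = s[i]
        if (ref > 127) print "code " ref " at index " i "\n"
    }
}
```

Running this once over the exported text would give a list of the codes that actually occur before any replacement rules are written.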

Regards, Mathias