What's the algorithm to convert from UTF-16 to character codes?
The Unicode Standard used to contain a short algorithm; now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

    typedef uint16_t UTF16;
    typedef uint32_t UTF32;

the first snippet calculates the high (or leading) surrogate from a character code C.

    const UTF16 HI_SURROGATE_START = 0xD800;

    UTF16 X = (UTF16) C;
    UTF32 U = (C >> 16) & ((1 << 5) - 1);
    UTF16 W = (UTF16) U - 1;
    UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | (X >> 10);

where X, U and W correspond to the labels used in Table 3-4 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.

    const UTF16 LO_SURROGATE_START = 0xDC00;

    UTF16 X = (UTF16) C;
    UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));

Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character code:

    UTF32 X = ((hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
    UTF32 W = (hi >> 6) & ((1 << 5) - 1);
    UTF32 U = W + 1;

    UTF32 C = (U << 16) | X;

These snippets assume the inputs are already in range: C must be a supplementary code point (U+10000..U+10FFFF), and hi and lo must be a valid high and low surrogate.
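For readers who want to try the snippets end to end, here is a minimal self-contained sketch that wraps them in a pair of functions and round-trips one code point. The function names to_surrogates and from_surrogates, the main driver, and the U+10437 test value are illustrative choices made for this example, not part of the standard or the FAQ text.

    /* Round-trip sketch built on the snippets above (C99). */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t UTF16;
    typedef uint32_t UTF32;

    static const UTF16 HI_SURROGATE_START = 0xD800;
    static const UTF16 LO_SURROGATE_START = 0xDC00;

    /* Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair. */
    static void to_surrogates(UTF32 C, UTF16 *hi, UTF16 *lo)
    {
        UTF16 X = (UTF16) C;                   /* low 16 bits of the code point */
        UTF32 U = (C >> 16) & ((1 << 5) - 1);  /* uuuuu: the 5 high-order bits  */
        UTF16 W = (UTF16) U - 1;               /* wwww = uuuuu - 1              */

        *hi = (UTF16) (HI_SURROGATE_START | (W << 6) | (X >> 10));
        *lo = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));
    }

    /* Recombine a surrogate pair into the original code point. */
    static UTF32 from_surrogates(UTF16 hi, UTF16 lo)
    {
        UTF32 X = ((UTF32) (hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
        UTF32 W = (hi >> 6) & ((1 << 5) - 1);
        UTF32 U = W + 1;

        return (U << 16) | X;
    }

    int main(void)
    {
        UTF32 C = 0x10437;          /* U+10437 DESERET SMALL LETTER YEE */
        UTF16 hi, lo;

        to_surrogates(C, &hi, &lo);
        printf("U+%04X -> %04X %04X -> U+%04X\n",
               (unsigned) C, (unsigned) hi, (unsigned) lo,
               (unsigned) from_surrogates(hi, lo));
        return 0;
    }

Compiling and running this should print "U+10437 -> D801 DC37 -> U+10437", matching the surrogate pair commonly given for U+10437. As with the snippets themselves, the functions do no validation; a real converter would reject code points outside the supplementary range and surrogate values outside D800..DBFF and DC00..DFFF.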