Is the POSIX ctype.h model sufficient for Unicode?
POSIX “ctype.h” knows but two cases (upper and lower), whereas Unicode knows three (upper, lower, and title). In POSIX, only the European Arabic digits 0 through 9 can pass “isdigit”, whereas Unicode has many sets of digits, all putatively equal. In POSIX “ctype.h”, that which is “alnum” but not “alpha” must be a “digit”, but Unicode is aware that not all numbers are digits, nor are all letters alphabetic. Unicode groks spacing and non-spacing marks, but POSIX comprehends them not.
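The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `posix_isdigit` is a hypothetical stand-in for “isdigit” as it behaves in the POSIX locale, the sample characters are chosen here and do not come from the FAQ, and Python's `unicodedata` module and `str` methods are just one implementation's view of the Unicode properties.

```python
import unicodedata


def posix_isdigit(ch: str) -> bool:
    """Hypothetical stand-in for isdigit() in the POSIX locale: only 0-9 pass."""
    return len(ch) == 1 and "0" <= ch <= "9"


# Sample characters picked for illustration (an assumption, not from the FAQ).
samples = {
    "ascii_digit": "7",               # U+0037 DIGIT SEVEN
    "arabic_indic_digit": "\u0663",   # U+0663 ARABIC-INDIC DIGIT THREE
    "roman_numeral": "\u2167",        # U+2167 ROMAN NUMERAL EIGHT
    "titlecase_letter": "\u01C5",     # U+01C5, a titlecase digraph letter
    "nonspacing_mark": "\u0308",      # U+0308 COMBINING DIAERESIS
    "spacing_mark": "\u0903",         # U+0903 DEVANAGARI SIGN VISARGA
}


def describe(ch: str) -> dict:
    """Collect the POSIX-style answer next to the Unicode-aware ones."""
    return {
        "category": unicodedata.category(ch),  # Unicode General_Category
        "posix_isdigit": posix_isdigit(ch),
        "unicode_decimal": ch.isdecimal(),
        "unicode_numeric": ch.isnumeric(),
        "unicode_alpha": ch.isalpha(),
    }


report = {name: describe(ch) for name, ch in samples.items()}

for name, info in report.items():
    print(name, info)
```

The Arabic-Indic digit is a decimal digit to Unicode but fails the POSIX-style test; the Roman numeral is numeric yet neither a digit nor alphabetic; the titlecase letter belongs to the third case; and the two marks fall into separate non-spacing and spacing categories that “ctype.h” has no class for.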
[JC]

Q: How should characters (particularly composite characters) be counted for the purposes of length, substrings, positions in a string, etc.?

A: In general, there are three different ways to count characters. Each is illustrated with the following sample string: “a” + umlaut + greek_alpha + U+E0000 (the last is a supplementary code point in plane 14, outside the BMP).

1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE,