Is the POSIX ctype.h model sufficient for Unicode?

April 26, 2017

2 Answers

POSIX “ctype.h” knows but two cases, upper and lower, whereas Unicode knows three: upper, lower, and title. In POSIX, only the European Arabic digits 0 through 9 can pass “isdigit”, whereas Unicode has many sets of digits, all putatively equal. In POSIX “ctype.h”, that which is “alnum” but not “alpha” must be a “digit”, but Unicode is aware that not all numbers are digits, nor are all letters alphabetic. Unicode groks spacing and non-spacing marks, but POSIX comprehends them not. [JC]
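
The contrast can be made concrete with a small sketch. It is written in Python rather than C because Python's standard `unicodedata` module exposes the Unicode general categories directly. The helper `posix_isdigit` is a hypothetical stand-in that only imitates what `isdigit` accepts (the ten digits 0 through 9); it is not a binding to the real C function, and the sample characters are chosen purely for illustration.

```python
import unicodedata

ASCII_DIGITS = "0123456789"


def posix_isdigit(ch):
    """Hypothetical stand-in for C isdigit(): only the digits 0-9 pass."""
    return len(ch) == 1 and ch in ASCII_DIGITS


def describe(ch):
    """Collect the properties that the ctype.h model cannot express."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "posix_isdigit": posix_isdigit(ch),
        # Two-letter general category, e.g. Nd, Nl, Lt, Mn, Mc.
        "category": unicodedata.category(ch),
        # Decimal digit value, or None if the character is not a decimal digit.
        "decimal": unicodedata.decimal(ch, None),
        # Numeric value, or None if the character has none.
        "numeric": unicodedata.numeric(ch, None),
    }


SAMPLES = [
    "7",       # ASCII digit
    "\u0667",  # ARABIC-INDIC DIGIT SEVEN: a digit, but not one isdigit accepts
    "\u2167",  # ROMAN NUMERAL EIGHT: a number that is not a digit
    "\u01C5",  # a titlecase letter, the "third case"
    "\u0308",  # COMBINING DIAERESIS: a non-spacing mark
    "\u0903",  # DEVANAGARI SIGN VISARGA: a spacing mark
]

if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", describe(ch))
```

Each sample falls into a category (other decimal digits, letter-like numbers, titlecase letters, spacing and non-spacing marks) for which the ctype.h classification functions have no slot.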

Q: How should characters (particularly composite characters) be counted for the purposes of length, substrings, positions in a string, etc.?

A: In general, there are 3 different ways to count characters. Each is illustrated with the following sample string: “a” + combining diaeresis (U+0308) + Greek alpha (U+03B1) + U+E0000 (the latter is a supplementary code point, beyond the Basic Multilingual Plane).

1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE,