How may character set negotiation be used to negotiate Unicode with UTF-8 encoding?

April 26, 2017
Tags: character encoding negotiate negotiation Unicode Used utf-8

1 Answer

Background: Those familiar with Unicode and UTF-8 can skip this background and proceed to the “response”. ISO 10646 defines the Universal Character Set (UCS) and assigns to each character an integer value, its “code”. Four bytes are used to represent codes, so the code range is potentially 0000 0000 to FFFF FFFF hex (in practice ISO 10646 reserved only 31 bits, 0000 0000 to 7FFF FFFF, and current editions stop at 0010 FFFF). The Unicode Standard contains all the same characters and code points as ISO 10646. The first 64K code points of the UCS are referred to as the “Basic Multilingual Plane” (BMP). (More precisely, the highest code point for a character in the BMP is FFFD hex; the last two code points of the BMP, FFFE and FFFF, are explicitly not characters.) Thus, although four bytes are allocated for code assignments, two bytes are sufficient to represent BMP characters. A number of encodings are defined for the UCS. The most basic are UCS-4 and UCS-2, which use four and two bytes per character respectively and store the character's code value without transformation. UCS-2 is applicable only