UTF-32

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding of Unicode code points that uses exactly 32 bits (four bytes) per code point. A number of leading bits must be zero, because there are far fewer than 2^32 Unicode code points; only 21 bits are actually needed. [1] All other Unicode transformation formats, in contrast, are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

The main advantage of UTF-32 is that the Unicode code points are directly indexed. Finding the Nth code point in a sequence of code points is a constant time operation. In contrast, a variable-length code requires sequential access to find the Nth code point in a sequence. This makes UTF-32 a simple replacement in code that uses integers that are incremented by one to examine each location in a string, as was commonly done for ASCII.
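
As a rough illustration of direct indexing, the sketch below reads the Nth code point from a UTF-32 buffer by computing a byte offset, without scanning the preceding data. The helper name nth_code_point and the choice of little-endian byte order are assumptions made for this example.

```python
import struct


def nth_code_point(buf: bytes, n: int) -> int:
    """Return the n-th code point (0-based) from UTF-32LE bytes."""
    # Every code point occupies exactly 4 bytes, so its offset is 4 * n.
    (value,) = struct.unpack_from("<I", buf, 4 * n)
    return value


# 'a', U+00E9, U+20AC, U+1F600; the "-le" codec writes no byte order mark.
text = "a\u00e9\u20ac\U0001F600"
buf = text.encode("utf-32-le")

print(len(buf))                     # 16
print(hex(nth_code_point(buf, 3)))  # 0x1f600
```

The lookup cost does not depend on n, which is the constant-time property described above.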

The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts and can typically be ignored for sizing estimates, which makes UTF-32 close to twice the size of UTF-16. It can be up to four times the size of UTF-8, depending on how many of the characters are in the ASCII subset.
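
The size ratios can be checked with a small experiment. The sketch below, using a hypothetical helper encoded_sizes, counts the bytes produced by Python's built-in codecs for an ASCII-only string and for a short string of BMP ideographs.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for three Unicode encodings."""
    # The "-le" codec names avoid a byte order mark inflating the counts.
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_text = "hello"
cjk_text = "\u65e5\u672c\u8a9e"  # three BMP ideographs

print(encoded_sizes(ascii_text))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(cjk_text))    # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```

For the ASCII string UTF-32 is four times the size of UTF-8, and for both strings it is twice the size of UTF-16.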

History

The original ISO 10646 standard defines a 32-bit encoding form called UCS-4, in which each code point in the Universal Character Set (UCS) is represented by a 31-bit value from 0 to 0x7FFFFFFF (the sign bit was unused and zero). In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding, explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32. [2] [1] Although the ISO standard had (as of 1998 in Unicode 2.1) "reserved for private use" 0xE00000 to 0xFFFFFF and 0x60000000 to 0x7FFFFFFF, [3] these areas were removed in later versions. Because the Principles and Procedures document of ISO/IEC JTC 1/SC 2 Working Group 2 states that all future assignments of code points will be constrained to the Unicode range, UTF-32 will be able to represent all UCS code points, and UTF-32 and UCS-4 are identical.
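
The restriction described above amounts to a simple range test on each 32-bit value. The following is a minimal sketch; the function name is_valid_utf32_unit is an assumption made for illustration.

```python
def is_valid_utf32_unit(value: int) -> bool:
    """True if value is a code point that well-formed UTF-32 may carry."""
    in_range = 0 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    return in_range and not is_surrogate


print(is_valid_utf32_unit(0x10FFFF))    # True
print(is_valid_utf32_unit(0x110000))    # False: above U+10FFFF
print(is_valid_utf32_unit(0xD800))      # False: surrogate
print(is_valid_utf32_unit(0x7FFFFFFF))  # False: allowed only by the old UCS-4 range
```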

Analysis

Though a fixed number of bytes per code point seems convenient, it is not as useful as it appears. It makes truncation easier but not significantly so compared to UTF-8 and UTF-16 (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).
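
To show how small the difference is, the sketch below truncates UTF-8 by stepping backwards over continuation bytes until the cut lands on a code point boundary. It assumes the input is valid UTF-8; the name truncate_utf8 is hypothetical.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # Continuation bytes look like 0b10xxxxxx. If the first dropped byte is
    # one of them, the cut is inside a sequence, so move it back.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


data = "a\u20acb".encode("utf-8")  # b'a' + 3-byte euro sign + b'b'

print(truncate_utf8(data, 4))  # b'a\xe2\x82\xac'
print(truncate_utf8(data, 3))  # b'a'
```

With valid input the loop runs at most three times. The UTF-32 equivalent is a plain slice at a multiple of four bytes.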

It is extremely rare[citation needed] for code to need the Nth code point without first examining code points 0 to N–1. For instance, XML parsing cannot do anything with a character without first looking at all preceding characters. [4] An integer index that is incremented by 1 for each character can therefore be replaced with an integer offset, measured in code units and incremented by the number of code units as each character is examined. This removes the perceived speed advantages[citation needed] of UTF-32.
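
The offset-based traversal can be sketched for UTF-8 as follows. This is an illustrative decoder that assumes valid input and performs no error checking; the generator name iter_code_points is an assumption.

```python
def iter_code_points(data: bytes):
    """Yield (byte_offset, code_point) pairs from valid UTF-8 input."""
    offset = 0
    while offset < len(data):
        lead = data[offset]
        # The lead byte determines the sequence length and its payload bits.
        if lead < 0x80:
            length, value = 1, lead
        elif lead < 0xE0:
            length, value = 2, lead & 0x1F
        elif lead < 0xF0:
            length, value = 3, lead & 0x0F
        else:
            length, value = 4, lead & 0x07
        # Each continuation byte contributes six more bits.
        for byte in data[offset + 1:offset + length]:
            value = (value << 6) | (byte & 0x3F)
        yield offset, value
        # Advance by the number of code units consumed, not by one.
        offset += length


data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
for offset, cp in iter_code_points(data):
    print(offset, hex(cp))
```

The loop visits each code point once, the same amount of work as stepping an index through a UTF-32 array.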

UTF-32 does not make calculating the displayed width of a string easier, since even with a "fixed width" font there may be more than one code point per character position (combining characters) or more than one character position per code point (for example, CJK ideographs, which are commonly rendered two columns wide). Editors that limit themselves to left-to-right languages and precomposed characters can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with UTF-16.[citation needed]
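
The combining-character case is easy to reproduce. In the sketch below, the same visible character takes one UTF-32 code unit when precomposed and two when decomposed. The helper names are assumptions, and counting code points with combining class zero is only a crude stand-in for real grapheme cluster segmentation.

```python
import unicodedata


def utf32_code_units(text: str) -> int:
    """Number of 32-bit code units needed to store text in UTF-32."""
    return len(text.encode("utf-32-le")) // 4


def count_non_combining(text: str) -> int:
    """Count code points whose canonical combining class is zero."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(utf32_code_units(precomposed), utf32_code_units(decomposed))  # 1 2
print(count_non_combining(decomposed))                              # 1
```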

Use

The main use of UTF-32 is in internal APIs where the data is single code points or glyphs, rather than strings of characters. For instance, in modern text rendering, it is common that the last step is to build a list of structures each containing coordinates (x, y), attributes, and a single UTF-32 code point identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.[citation needed]
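
One way such packing might look is sketched below. The layout (code point in the low 21 bits, an application-defined flags field in the high 11 bits) and the names pack_glyph_word and unpack_glyph_word are hypothetical and not taken from any particular renderer.

```python
CODE_POINT_BITS = 21
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1   # 0x1FFFFF
FLAG_BITS = 32 - CODE_POINT_BITS               # 11 spare bits


def pack_glyph_word(code_point: int, flags: int) -> int:
    """Combine a code point and 11 bits of extra data into one 32-bit word."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if not 0 <= flags < (1 << FLAG_BITS):
        raise ValueError("flags do not fit in 11 bits")
    return (flags << CODE_POINT_BITS) | code_point


def unpack_glyph_word(word: int) -> tuple:
    """Split a packed word back into (code_point, flags)."""
    return word & CODE_POINT_MASK, word >> CODE_POINT_BITS


word = pack_glyph_word(0x1F600, 0x005)
print(hex(word))                # 0xa1f600
print(unpack_glyph_word(word))  # (128512, 5)
```

A word packed this way is no longer valid UTF-32 once any flag bit is set, so it is suitable only for internal use.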

Use of UTF-32 strings on Windows (where wchar_t is 16 bits) is almost non-existent. On Unix systems, UTF-32 strings are sometimes, but rarely, used internally by applications, because the type wchar_t is defined as 32 bits. Python versions up to 3.2 can be compiled to use them instead of UTF-16; from version 3.3 onward all Unicode strings are stored in UTF-32 but with leading zero bytes optimized away "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)" to make all code points that size. [5] The Seed7 [6] and Lasso [citation needed] programming languages encode all strings with UTF-32, in the belief that direct indexing is important. The Julia programming language, in contrast, moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to having only UTF-8 strings (with all the other encodings considered legacy and moved out of the standard library to a package [7]) following the "UTF-8 Everywhere Manifesto". [8]

Variants

Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the WTF-8 variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to CESU-8. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.
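
Python's built-in codecs offer one concrete example of this lenient behaviour. The UTF-32 codec rejects surrogate code points by default but accepts them when the "surrogatepass" error handler is requested. The wrapper names below are assumptions made for illustration.

```python
def encode_lenient(text: str) -> bytes:
    """Encode to UTF-32LE, letting unpaired surrogates through."""
    return text.encode("utf-32-le", "surrogatepass")


def decode_lenient(data: bytes) -> str:
    """Decode UTF-32LE, accepting code units in the surrogate range."""
    return data.decode("utf-32-le", "surrogatepass")


lone_surrogate = "\ud800"  # an unpaired high surrogate

try:
    lone_surrogate.encode("utf-32-le")  # strict mode refuses it
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(strict_ok)                       # False
print(encode_lenient(lone_surrogate))  # b'\x00\xd8\x00\x00'
```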

References

  1. Mapping code points to Unicode encoding forms, § 1: UTF-32
  2. ISO/IEC 10646:2014 Clause 9.4: "Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-0000 DFFF are ill-formed". Clause 4.57: "[UCS codespace] consisting of the integers from 0 to 10 FFFF (hexadecimal)". Clause 4.58: "[UCS scalar value] any UCS code point except high-surrogate and low-surrogate code points".
  3. THE UNIVERSAL CHARACTER SET (UCS)
  4. "Web Application Development".
  5. Löwis, Martin. "PEP 393 -- Flexible String Representation". python.org. Python. Retrieved 26 October 2014.
  6. "The usage of UTF-32 has several advantages".
  7. JuliaStrings/LegacyStrings.jl: Legacy Unicode string types, JuliaStrings, 2019-05-17, retrieved 2019-10-15
  8. "UTF-8 Everywhere Manifesto".