Comparison of Unicode encodings

Last updated

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments (which can be assumed), and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

Contents

Compatibility issues

A UTF-8 file that contains only ASCII characters is identical to an ASCII file. Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the C printf function can print a UTF-8 string, as it only looks for the ASCII '%' character to define a formatting string, and prints all other bytes unchanged, thus non-ASCII characters will be output unchanged.

UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, the strings cannot be manipulated by normal null-terminated string handling for even simple operations such as copy. [lower-alpha 1]

Therefore, even on most UTF-16 systems such as Windows and Java, UTF-16 text files are not common; older 8-bit encodings such as ASCII or ISO-8859-1 are still used, forgoing Unicode support; or UTF-8 is used for Unicode. One rare counter-example is the "strings" file used by macOS (Mac OS X 10.3 Panther and later) applications for lookup of internationalized versions of messages which defaults to UTF-16, with "files encoded using UTF-8 ... not guaranteed to work." [1]

XML is, by convention, encoded as UTF-8, and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16. [2]

Efficiency

UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character. The first 128 Unicode code points, U+0000 to U+007F, used for the C0 Controls and Basic Latin characters and which correspond one-to-one to their ASCII-code equivalents, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.

The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna and N'Ko), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, i.e. the remainder of the characters in the Basic Multilingual Plane (BMP, plane 0, U+0000 to U+FFFF), which encompasses the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the supplementary planes (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.

Therefore, a file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, html markup, and embedded words and acronyms written with Latin letters. [3] UTF-32 is always longer unless there are no code points less than U+10000.

All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text (see "Seven-bit environments" below).

Processing time

Text with variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variable sized, since a search for a sequence of code units does not care about the divisions (it does require that the encoding be self-synchronizing, which both UTF-8 and UTF-16 are). A common misconception is that there is a need to "find the nth character" and that this requires a fixed-length encoding; however, in real use the number n is only derived from examining the n−1 characters, thus sequential access is needed anyway.[ citation needed ]

When character sequences in one endian order are loaded onto a machine with a different endian order, the characters need to be converted before they can be processed efficiently (or two processors are needed). Byte-based encodings such as UTF-8 do not have this problem. UTF-16BE and UTF-32BE are big-endian, UTF-16LE and UTF-32LE are little-endian.

Processing issues

For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but UTF-7 and GB 18030 do not.

Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to combining characters. Considering these incompatibilities and other quirks among different encoding schemes, handling unicode data with the same (or compatible) protocol throughout and across the interfaces (e.g. using an API/library, handling unicode characters in client/server model, etc.) can in general simplify the whole pipeline while eliminating a potential source of bugs at the same time.

UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width (referred as UCS-2). However, using UTF-16 makes characters outside the Basic Multilingual Plane a special case which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.

If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as an API. This is due to the oft-overlooked fact that the byte array used by UTF-8 can physically contain invalid sequences. For instance, it is impossible to fix an invalid UTF-8 filename using a UTF-16 API, as no possible UTF-16 string will translate to that invalid filename. The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment. An unfortunate but far more common workaround used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as CP-1252 and ignore the mojibake for any non-ASCII data.

For communication and storage

UTF-16 and UTF-32 do not have endianness defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from a byte-oriented storage. This may be achieved by using a byte-order mark at the start of the text or assuming big-endian (RFC 2781). UTF-8, UTF-16BE, UTF-32BE, UTF-16LE and UTF-32LE are standardised on a single byte order and do not have this problem.

If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize after a corrupt or missing byte at the start of the next code point; GB 18030 is unable to recover until the next ASCII non-number. UTF-16 can handle altered bytes, but not an odd number of missing bytes, which will garble all the following text (though it will produce uncommon and/or unassigned characters). [lower-alpha 2] If bits can be lost all of them will garble the following text, though UTF-8 can be resynchronized as incorrect byte boundaries will produce invalid UTF-8 in almost all text longer than a few bytes.

In detail

The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.

N.B. The tables below list numbers of bytes per code point, not per user visible "character" (or "grapheme cluster"). It can take multiple code points to describe a single grapheme cluster, so even in UTF-32, care must be taken when splitting or concatenating strings.

Eight-bit environments

Code range (hexadecimal) UTF-8 UTF-16 UTF-32 UTF-EBCDIC GB 18030
000000 – 00007F12411
000080 – 00009F22 for characters inherited from
GB 2312/GBK (e.g. most
Chinese characters) 4 for
everything else.
0000A0 – 0003FF2
000400 – 0007FF3
000800 – 003FFF3
004000 – 00FFFF4
010000 – 03FFFF444
040000 – 10FFFF5

Seven-bit environments

This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.

Code range (hexadecimal)UTF-7UTF-8 quoted-
printable
UTF-8 base64 UTF-16 q.-p.UTF-16 base64 GB 18030 q.-p.GB 18030 base64
ASCII
graphic characters
(except U+003D "=")
1 for "direct characters" (depends on the encoder setting for some code points), 2 for U+002B "+", otherwise same as for 000080 – 00FFFF11+1342+2311+13
00003D (equals sign)36 3
ASCII
control characters:
000000 – 00001F
and 00007F
1 or 3 depending on directness1 or 3 depending on directness
000080 – 0007FF5 for an isolated case inside a run of single byte characters. For runs 2+23 per character plus padding to make it a whole number of bytes plus two to start and finish the run62+232–6 depending on if the byte values need to be escaped 4–6 for characters inherited from GB2312/GBK (e.g.
most Chinese characters) 8 for everything else.
2+23 for characters inherited from GB2312/GBK (e.g.
most Chinese characters) 5+13 for everything else.
000800 – 00FFFF94
010000 – 10FFFF8 for isolated case, 5+13 per character plus padding to integer plus 2 for a run125+138–12 depending on if the low bytes of the surrogates need to be escaped.5+1385+13

Endianness does not affect sizes (UTF-16BE and UTF-32BE have the same size as UTF-16LE and UTF-32LE, respectively). The use of UTF-32 under quoted-printable is highly impractical, but if implemented, will result in 8–12 bytes per code point (about 10 bytes in average), namely for BMP, each code point will occupy exactly 6 bytes more than the same code in quoted-printable/UTF-16. Base64/UTF-32 gets 5+13 bytes for any code point.

An ASCII control character under quoted-printable or UTF-7 may be represented either directly or encoded (escaped). The need to escape a given control character depends on many circumstances, but newlines in text data are usually coded directly.

Compression schemes

BOCU-1 and SCSU are two ways to compress Unicode data. Their encoding relies on how frequently the text is used. Most runs of text use the same script; for example, Latin, Cyrillic, Greek and so on. This normal use allows many runs of text to compress down to about 1 byte per code point. These stateful encodings make it more difficult to randomly access text at any position of a string.

These two compression schemes are not as efficient as other compression schemes, like zip or bzip2. Those general-purpose compression schemes can compress longer runs of bytes to just a few bytes. The SCSU and BOCU-1 compression schemes will not compress more than the theoretical 25% of text encoded as UTF-8, UTF-16 or UTF-32. Other general-purpose compression schemes can easily compress to 10% of original text size. The general purpose schemes require more complicated algorithms and longer chunks of text for a good compression ratio.

Unicode Technical Note #14 contains a more detailed comparison of compression schemes.

Historical: UTF-5 and UTF-6

Proposals have been made for a UTF-5 and UTF-6 for the internationalization of domain names (IDN). The UTF-5 proposal used a base 32 encoding, where Punycode is (among other things, and not exactly) a base 36 encoding. The name UTF-5 for a code unit of 5 bits is explained by the equation 25 = 32. [4] The UTF-6 proposal added a running length encoding to UTF-5, here 6 simply stands for UTF-5 plus 1. [5] The IETF IDN WG later adopted the more efficient Punycode for this purpose. [6]

Not being seriously pursued

UTF-1 never gained serious acceptance. UTF-8 is much more frequently used.

The nonet encodings UTF-9 and UTF-18 are April Fools' Day RFC joke specifications, although UTF-9 is a functioning nonet Unicode transformation format, and UTF-18 is a functioning nonet encoding for all non-Private-Use code points in Unicode 12 and below, although not for Supplementary Private Use Areas or portions of Unicode 13 and later.

Notes

  1. ASCII software not using null characters to terminate strings would handle UTF-16 and UTF-32 encoded files correctly (such files, if containing only ASCII-subset characters, would appear as normal ASCII padded with null characters), but such software is not common.
  2. An even number of missing bytes in UTF-16, in contrast, will garble at most one character.

Related Research Articles

<span class="mw-page-title-main">Character encoding</span> Using numbers to represent text characters

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

<span class="mw-page-title-main">Plain text</span> Term for computer data consisting only of unformatted characters of readable material

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

<span class="mw-page-title-main">Unicode</span> Character encoding standard

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.

Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.

<span class="mw-page-title-main">UTF-16</span> Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as "UCS-2" (for 2-byte Universal Character Set), once it became clear that more than 216 (65,536) code points were needed, including most emoji and important CJK characters such as for personal and place names.

The byte-order mark (BOM) is a particular usage of the special Unicode character code, U+FEFFZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode code points, needing actually only 21 bits). UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.

In computer programming, Base64 is a group of binary-to-text encoding schemes that transforms binary data into a sequence of printable characters, limited to a set of 64 unique characters. More specifically, the source binary data is taken 6 bits at a time, then this group of 6 bits is mapped to one of 64 unique characters.

UTF-7 is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

In computing, JIS encoding refers to several Japanese Industrial Standards for encoding the Japanese language. Strictly speaking, the term means either:

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the range 128–255 to offsets within particular blocks of 128 characters. The initial conditions of the encoder mean that existing strings in ASCII and ISO-8859-1 that do not contain C0 control codes other than NULL TAB CR and LF can be treated as SCSU strings. Since most alphabets do reside in blocks of contiguous Unicode codepoints, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character, most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.

A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set for representation, usually in a computer. Most common variable-width encodings are multibyte encodings, which use varying numbers of bytes (octets) to encode different characters. (Some authors, notably in Microsoft documentation, use the term multibyte character set, which is a misnomer, because representation size is an attribute of the encoding, not of the character set.)

UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using one to five one-byte (8-bit) code units. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

Binary Ordered Compression for Unicode (BOCU) is a MIME compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order. BOCU-1 is specified in a Unicode Technical Note.

Many email clients now offer some support for Unicode. Some clients will automatically choose between a legacy encoding and Unicode depending on the mail's content, either automatically or when the user requests it.

Specials is a short Unicode block of characters allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 code points, five have been assigned since Unicode 3.0:

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when specific metadata, such as a HTTP Content-Type: header is either not available, or is assumed to be untrustworthy.

References

  1. "Apple Developer Connection: Internationalization Programming Topics: Strings Files".
  2. "Character Encoding in Entities". Extensible Markup Language (XML) 1.0 (Fifth Edition). World Wide Web Consortium. 2008.
  3. "UTF-8 Everywhere". utf8everywhere.org. Retrieved 28 August 2022.
  4. Seng, James, UTF-5, a transformation format of Unicode and ISO 10646, 28 January 2000
  5. Welter, Mark; Spolarich, Brian W. (16 November 2000). "UTF-6 - Yet Another ASCII-Compatible Encoding for ID". Ietf Datatracker. Archived from the original on 23 May 2016. Retrieved 9 April 2016.
  6. "Internationalized Domain Name (idn)". Internet Engineering Task Force. Retrieved 20 March 2023.