UTF-1

MIME / IANA: ISO-10646-UTF-1
Language(s): International
Current status: Obscure, of mainly historical interest
Classification: Unicode Transformation Format, extended ASCII, variable-width encoding
Extends: US-ASCII
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Succeeded by: UTF-8

UTF-1 is a method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses: for instance, Unix filenames cannot contain the byte value 0x2F used for the forward slash, yet UTF-1 can emit that byte inside a multi-byte sequence (see U+D7FF in the table below). UTF-1 is also slow to encode and decode because it relies on division and multiplication by a number that is not a power of 2. Owing to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
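
To make the substring problem concrete, here is a small Python illustration (the byte values are taken from the correspondence table in the Design section; the variable name is illustrative):

    # UTF-1 encoding of U+D7FF, from the table below: F7 2F C3
    encoded = bytes([0xF7, 0x2F, 0xC3])
    # A naive byte-level search for the ASCII slash finds a false positive
    # inside this multi-byte sequence; a Unix kernel would likewise see a
    # path separator where the decoded text contains none.
    assert b"/" in encoded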


Design

Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards compatible with ASCII. Every Unicode code point is represented by a single byte or by a sequence of two, three, or five bytes (there is no four-byte form). All ASCII code points are encoded as single bytes, as are the code points U+0080 through U+009F.

UTF-1 does not use the C0 and C1 control codes, the space character, or DEL in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. This design, with 66 protected byte values, was intended to remain compatible with ISO/IEC 2022.
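
As a quick check of this arithmetic, a minimal Python sketch (the variable name is illustrative) enumerates the protected set:

    # The 66 byte values UTF-1 never uses inside multi-byte sequences:
    # C0 controls and space (0x00-0x20), DEL and the C1 controls (0x7F-0x9F).
    protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
    assert len(protected) == 66           # 33 + 33 protected byte values
    assert 256 - len(protected) == 190    # 190 free values, hence "modulo 190"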

UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, plus a second bit to make it self-synchronizing, leaving "modulo 64" arithmetic (8 − 2 = 6; 2⁶ = 64). BOCU-1 protects only the minimal set required for MIME compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243). An encoder sketch following the table below illustrates the modulo-190 scheme.

Code point    UTF-8               UTF-1
U+007F        7F                  7F
U+0080        C2 80               80
U+009F        C2 9F               9F
U+00A0        C2 A0               A0 A0
U+00BF        C2 BF               A0 BF
U+00C0        C3 80               A0 C0
U+00FF        C3 BF               A0 FF
U+0100        C4 80               A1 21
U+015D        C5 9D               A1 7E
U+015E        C5 9E               A1 A0
U+01BD        C6 BD               A1 FF
U+01BE        C6 BE               A2 21
U+07FF        DF BF               AA 72
U+0800        E0 A0 80            AA 73
U+0FFF        E0 BF BF            B5 48
U+1000        E1 80 80            B5 49
U+4015        E4 80 95            F5 FF
U+4016        E4 80 96            F6 21 21
U+D7FF        ED 9F BF            F7 2F C3
U+E000        EE 80 80            F7 3A 79
U+F8FF        EF A3 BF            F7 5C 3C
U+FDD0        EF B7 90            F7 62 BA
U+FDEF        EF B7 AF            F7 62 D9
U+FEFF        EF BB BF            F7 64 4C
U+FFFD        EF BF BD            F7 65 AD
U+FFFE        EF BF BE            F7 65 AE
U+FFFF        EF BF BF            F7 65 AF
U+10000       F0 90 80 80         F7 65 B0
U+38E2D       F0 B8 B8 AD         FB FF FF
U+38E2E       F0 B8 B8 AE         FC 21 21 21 21
U+FFFFF       F3 BF BF BF         FC 21 37 B2 7A
U+100000      F4 80 80 80         FC 21 37 B2 7B
U+10FFFF      F4 8F BF BF         FC 21 39 6E 6C
U+7FFFFFFF    FD BF BF BF BF BF   FD BD 2B B9 40

Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
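
The encoding itself can be expressed compactly. Below is a minimal Python sketch, not a normative implementation: the function names are illustrative, while the lead-byte ranges, thresholds (0x4016, 0x38E2E), and digit offsets follow the published UTF-1 definition and can be checked against the table above.

    def utf1_trail(z):
        # Map a base-190 digit (0..189) onto the 190 unprotected byte values:
        # 0x21-0x7E (94 values) and 0xA0-0xFF (96 values).
        return z + 0x21 if z < 0x5E else z + 0x42

    def utf1_encode(cp):
        # Encode one code point (0 <= cp <= 0x7FFFFFFF) as UTF-1 bytes.
        if cp < 0xA0:                      # one byte: ASCII, C0, C1
            return bytes([cp])
        if cp < 0x100:                     # two bytes: A0 + raw byte
            return bytes([0xA0, cp])
        if cp < 0x4016:                    # two bytes: lead A1..F5 + one digit
            y = cp - 0x100
            return bytes([0xA1 + y // 190, utf1_trail(y % 190)])
        if cp < 0x38E2E:                   # three bytes: lead F6..FB + two digits
            y = cp - 0x4016
            return bytes([0xF6 + y // 190**2,
                          utf1_trail(y // 190 % 190),
                          utf1_trail(y % 190)])
        y = cp - 0x38E2E                   # five bytes: lead FC..FD + four digits
        return bytes([0xFC + y // 190**4,
                      utf1_trail(y // 190**3 % 190),
                      utf1_trail(y // 190**2 % 190),
                      utf1_trail(y // 190 % 190),
                      utf1_trail(y % 190)])

    # Spot checks against the table above:
    assert utf1_encode(0x00A0) == bytes([0xA0, 0xA0])
    assert utf1_encode(0xD7FF) == bytes([0xF7, 0x2F, 0xC3])
    assert utf1_encode(0x7FFFFFFF) == bytes([0xFD, 0xBD, 0x2B, 0xB9, 0x40])

The repeated division and remainder by 190 in this sketch is exactly the non-power-of-2 arithmetic that makes UTF-1 slower to process than UTF-8, whose corresponding steps reduce to bit shifts and masks.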


Related Research Articles


Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".

Extended Binary Coded Decimal Interchange Code is an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding six-bit binary-coded decimal code used with most of IBM's computer peripherals of the late 1950s and early 1960s. It is supported by various non-IBM platforms, such as Fujitsu-Siemens' BS2000/OSD, OS-IV, MSP, and MSP-EX, the SDS Sigma series, Unisys VS/9, Unisys MCP and ICL VME.

ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded.


Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149,813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.


UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding, now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2¹⁶ (65,536) code points were needed.

The byte order mark (BOM) is a particular usage of the special Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text, such as the byte order and the character encoding in use.

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points, using exactly 32 bits (four bytes) per code point (a number of leading bits must be zero, as there are far fewer than 2³² Unicode code points; only 21 bits are actually needed). UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

ISO/IEC 2022, Information technology: Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35, the ANSI standard ANSI X3.41, and the Japanese Industrial Standard JIS X 0202. Originating in 1971, it was most recently revised in 1994.

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSgml, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used in order to represent characters that are not directly encodable in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.


ArmSCII or ARMSCII is a set of obsolete single-byte character encodings for the Armenian alphabet defined by Armenian national standard 166–9. ArmSCII is an acronym for Armenian Standard Code for Information Interchange, similar to ASCII for the American standard. It has been superseded by the Unicode standard.

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set for representation, usually in a computer. Most common variable-width encodings are multibyte encodings, which use varying numbers of bytes (octets) to encode different characters. (Some authors, notably in Microsoft documentation, use the term multibyte character set, which is a misnomer, because representation size is an attribute of the encoding, not of the character set.)

UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using one to five one-byte (8-bit) code units. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

Binary Ordered Compression for Unicode (BOCU) is a MIME compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order. BOCU-1 is specified in a Unicode Technical Note.

The C0 and C1 control code (or control character) sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

The Universal Coded Character Set is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS), which is the basis of many character encodings, improving as characters from previously unrepresented writing systems are added.

The MARC-8 charset is a MARC standard used in MARC-21 library records. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form, and they are frequently used in library database systems. The character encoding now known as MARC-8 was introduced in 1968 as part of the MARC format. Originally based on the Latin alphabet, from 1979 to 1983 the JACKPHY initiative expanded the repertoire to include Japanese, Arabic, Chinese, and Hebrew characters, with the later addition of Cyrillic and Greek scripts. If a character is not representable in the MARC-8 encoding of a MARC-21 record, then UTF-8 must be used instead. UTF-8 supports many more characters than MARC-8, which is rarely used outside library data.
