UTF-7

Language(s)	International
Standard	RFC 2152
Classification	Unicode Transformation Format, ASCII armor, variable-width encoding, stateful encoding
Transforms / Encodes	ISO/IEC 10646 (Unicode)
Preceded by	HZ-GB-2312
Succeeded by	UTF-8 over 8BITMIME

Last updated November 27, 2024

UTF-7 (7-bit Unicode Transformation Format) is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

According to its RFC, UTF-7 is not a "Unicode Transformation Format", because the definition can only encode code points in the BMP (the first 65,536 Unicode code points, which excludes emoji and many other characters). However, a UTF-7 translator that converts to or from UTF-16 can (and probably does)^[citation needed] encode each surrogate half as though it were a 16-bit code point, and can thus encode all code points. It is unclear whether other UTF-7 software (such as translators to UTF-32 or UTF-8) supports this.

UTF-7 has never been an official standard of the Unicode Consortium. It is known to have security issues, which is why software has been changed to disable its use.^[1] It is prohibited in HTML 5.^[2]^[3]

Motivation

MIME, the modern standard for e-mail formats, forbids encoding of headers using byte values above the ASCII range. Although MIME allows encoding the message body in various character sets (broader than ASCII), the underlying transmission infrastructure (SMTP, the main E-mail transfer standard) is still not guaranteed to be 8-bit clean. Therefore, a non-trivial content transfer encoding has to be applied in case of doubt. Unfortunately, Base64 has a disadvantage of making even ASCII characters unreadable in non-MIME clients. On the other hand, UTF-8 combined with quoted-printable produces a very size-inefficient format requiring 6–9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP.
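The size figures for UTF-8 combined with quoted-printable can be checked with a short Python sketch. It uses the standard-library quopri module; the three sample characters are chosen here purely for illustration and do not come from the text above.

```python
import quopri

# Illustrative samples: 2-byte, 3-byte and 4-byte UTF-8 sequences respectively.
samples = {"£": "U+00A3", "†": "U+2020", "😀": "U+1F600 (outside the BMP)"}

for char, label in samples.items():
    utf8 = char.encode("utf-8")
    qp = quopri.encodestring(utf8)  # each non-ASCII byte becomes "=XX"
    print(label, len(utf8), "bytes in UTF-8 ->", len(qp), "bytes quoted-printable")

# Sizes after quoted-printable encoding, in the same order as `samples`.
qp_sizes = [len(quopri.encodestring(c.encode("utf-8"))) for c in samples]
print(qp_sizes)  # [6, 9, 12]
```

Every UTF-8 byte above the ASCII range triples in size, which is where the 6, 9 and 12 byte figures come from.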

Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without using an underlying MIME transfer encoding, but it must still be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set. Since encoded words force use of either quoted-printable or Base64, UTF-7 was designed not to use the = sign as an escape character, which prevents double escaping when it is combined with quoted-printable (or its variant, the RFC 2047/1522 "Q"-encoding of headers).

UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or Base64, the now defunct Internet Mail Consortium recommended against its use.^[4]

The 8BITMIME extension has since been introduced, which reduces the need to encode message bodies in a 7-bit format.

A modified form of UTF-7 (sometimes dubbed 'mUTF-7'^[5]) was used in the Internet Message Access Protocol (IMAP) e-mail retrieval protocol, version 4 rev 1, for "international" mailbox names.^[6] The following version, IMAP version 4 rev 2, uses UTF-8 instead.^[7]

Description

UTF-7 was first proposed as an experimental protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode. That RFC was made obsolete by RFC 2152, an informational RFC which never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. Nor is UTF-7 part of the Unicode Standard: The Unicode Standard 5.0 lists only UTF-8, UTF-16 and UTF-32. There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.

Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space (the characters \ and ~ being excluded due to being redefined in "variants of ASCII" such as JIS-Roman). Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
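The two character groups described above can be written out as sets. This is a minimal sketch; the names DIRECT, PRINTABLE and OPTIONAL_DIRECT are hypothetical and chosen only for this illustration.

```python
import string

# "Direct characters": 62 alphanumerics plus the 9 symbols ' ( ) , - . / : ?
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

# "Optional direct characters": every other printable character in
# U+0020..U+007E, excluding ~, \, + and space.
PRINTABLE = {chr(cp) for cp in range(0x20, 0x7F)}
OPTIONAL_DIRECT = PRINTABLE - DIRECT - set("~\\+ ")

print(len(DIRECT))                       # 71
print("".join(sorted(OPTIONAL_DIRECT)))  # !"#$%&*;<=>@[]^_`{|}
```

An encoder that wants maximum robustness would treat only DIRECT as safe and push everything in OPTIONAL_DIRECT into Base64 blocks.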

Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail. The plus sign (+) may be encoded as +-.

Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into two surrogates), and then in modified Base64. The start of these blocks of modified Base64-encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the Base64.
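The delimiter rules above can be sketched as a small decoder in Python. This is an illustrative sketch, not a reference implementation: the function name decode_utf7 is hypothetical, the input is assumed to be well-formed, and the standard base64 module stands in for modified Base64 by restoring the "=" padding that UTF-7 omits.

```python
import base64
import string

B64_CHARS = set(string.ascii_letters + string.digits + "+/")

def decode_utf7(data: str) -> str:
    """Decode well-formed UTF-7 text (sketch; no validation of bad input)."""
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch != "+":
            out.append(ch)  # directly represented character
            i += 1
            continue
        # Collect the run of modified-Base64 characters that follows '+'.
        j = i + 1
        while j < len(data) and data[j] in B64_CHARS:
            j += 1
        block = data[i + 1:j]
        if block:
            padded = block + "=" * (-len(block) % 4)  # restore '=' padding
            raw = base64.b64decode(padded)
            raw = raw[:len(raw) - len(raw) % 2]       # keep whole 16-bit units
            out.append(raw.decode("utf-16-be"))
        elif j < len(data) and data[j] == "-":
            out.append("+")                           # '+-' is a literal plus
        # A '-' directly after the block is consumed; anything else is kept.
        i = j + 1 if j < len(data) and data[j] == "-" else j
    return "".join(out)

print(decode_utf7("Hi Mom +Jjo-!"))  # Hi Mom ☺!
```

The terminating character is only swallowed when it is a hyphen-minus, which is why "+Jjo-!" yields the smiley followed by "!".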

Examples

"Hello, World!" is encoded as "Hello, World+ACE-"
"1 + 1 = 2" is encoded as "1 +- 1 +AD0- 2"
"£1" is encoded as "+AKM-1". The Unicode code point for the pound sign is U+00A3 which converts into modified Base64 as in the table below. There are two bits left over, which are padded to 0.

Hex digit	0	0	A	3	(padding)
Bit pattern	0000	0000	1010	0011	00

6-bit group	000000	001010	001100
Index	0	10	12
Base64-encoded	A	K	M
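The examples above can be cross-checked against the "utf-7" codec that ships with Python's standard library. Only decoding is shown here, since encoders are free to differ in which optional direct characters they escape, so an encoder's output need not match these strings byte for byte.

```python
# Encoded form -> expected text, taken from the examples above.
examples = {
    b"Hello, World+ACE-": "Hello, World!",
    b"1 +- 1 +AD0- 2": "1 + 1 = 2",
    b"+AKM-1": "£1",
}

for encoded, expected in examples.items():
    decoded = encoded.decode("utf-7")
    assert decoded == expected
    print(encoded, "->", decoded)
```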

Algorithm for encoding and decoding

Encoding

First, an encoder must decide which characters to represent directly in ASCII form, which + signs have to be escaped as +-, and which characters to place in blocks of Unicode characters. The expansion cost of UTF-7 can be high: for example, the character sequence U+10FFFF U+0077 U+10FFFF is 9 bytes in UTF-8, but 17 bytes in UTF-7. (At worst, treating every code point as a sequence in its own right produces the maximum expansion of 5x, e.g. when encoding @@ as +AEA-+AEA-.) Each Unicode sequence must be encoded using the following procedure, then surrounded by the appropriate delimiters.
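The byte counts quoted above follow from simple arithmetic, sketched here in Python. The helper utf7_block_len is hypothetical, and it assumes every block is closed with an explicit "-", as in the examples in the text.

```python
import math

def utf7_block_len(utf16_units: int) -> int:
    """Bytes used by one '+...-' block holding the given number of UTF-16 code units."""
    # '+' delimiter, then 16 bits per unit packed into 6-bit characters, then '-'.
    return 1 + math.ceil(16 * utf16_units / 6) + 1

text = "\U0010FFFFw\U0010FFFF"
utf8_len = len(text.encode("utf-8"))                  # 4 + 1 + 4
utf7_len = utf7_block_len(2) + 1 + utf7_block_len(2)  # block, 'w', block
print(utf8_len, utf7_len)                             # 9 17

# Worst case: each '@' in its own block, 5 bytes of output per input byte.
print(2 * utf7_block_len(1) / len("@@"))              # 5.0
```

Each U+10FFFF needs two surrogates (32 bits), which round up to six Base64 characters, so one block costs eight bytes.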

Using the £† (U+00A3 U+2020) character sequence as an example:

Express the characters' UTF-16 code units in binary:
- 0x00A3 → 0000 0000 1010 0011
- 0x2020 → 0010 0000 0010 0000
Concatenate the binary sequences:
0000 0000 1010 0011 and 0010 0000 0010 0000 → 0000 0000 1010 0011 0010 0000 0010 0000
Regroup the binary into groups of six bits, starting from the left:
0000 0000 1010 0011 0010 0000 0010 0000 → 000000 001010 001100 100000 001000 00
If the last group has fewer than six bits, add trailing zeros:
000000 001010 001100 100000 001000 00 → 000000 001010 001100 100000 001000 000000
Replace each group of six bits with a respective Base64 code:
000000 001010 001100 100000 001000 000000 → AKMgIA
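The encoding steps above can be sketched in Python by working on a string of bits. The names B64 and encode_block are hypothetical; the function returns only the Base64 body, leaving the "+" and "-" delimiters to the caller.

```python
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_block(text: str) -> str:
    """Encode text as the body of one UTF-7 Base64 block (no delimiters)."""
    # Express the UTF-16 code units in binary and concatenate them.
    raw = text.encode("utf-16-be")
    bits = "".join(f"{byte:08b}" for byte in raw)
    # Pad the last group with trailing zeros up to a multiple of six bits.
    bits += "0" * (-len(bits) % 6)
    # Replace each group of six bits with its Base64 character.
    return "".join(B64[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))

print("+" + encode_block("£†") + "-")  # +AKMgIA-
```

Because the input goes through UTF-16, a character outside the BMP is emitted as its two surrogate code units without any special handling.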

Decoding

First, the encoded data must be separated into plain ASCII text chunks (including + signs followed by a dash) and non-empty Unicode blocks, as described in the Description section. Once this is done, each Unicode block must be decoded with the following procedure (using the result of the encoding example above):

Express each Base64 code as the bit sequence it represents:
AKMgIA → 000000 001010 001100 100000 001000 000000
Regroup the binary into groups of sixteen bits, starting from the left:
000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000
If there is an incomplete group at the end containing only zeros, discard it (if the incomplete group contains any ones, the code is invalid):
0000000010100011 0010000000100000
Each group of 16 bits is a UTF-16 code unit and can be expressed in other forms:
0000 0000 1010 0011 ≡ 0x00A3 ≡ 163₁₀
0010 0000 0010 0000 ≡ 0x2020 ≡ 8224₁₀
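The decoding steps above can be sketched in Python in the same bit-string style. The names B64 and decode_block are hypothetical, and the input is assumed to be the body of a single block with the "+" and "-" delimiters already removed.

```python
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def decode_block(block: str) -> str:
    """Decode the body of one UTF-7 Base64 block (no delimiters)."""
    # Express each Base64 character as the six bits it represents.
    bits = "".join(f"{B64.index(ch):06b}" for ch in block)
    # Regroup into 16-bit units; whatever is left over is padding.
    usable = len(bits) - len(bits) % 16
    # The incomplete trailing group must contain only zeros.
    if "1" in bits[usable:]:
        raise ValueError("non-zero padding bits in UTF-7 block")
    # Interpret the 16-bit groups as big-endian UTF-16 code units.
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return raw.decode("utf-16-be")

print(decode_block("AKMgIA"))  # £†
```

A block such as "AKMgIB" is rejected, because its four leftover bits are 0001 rather than all zeros.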

Byte order mark

A byte order mark (BOM) is an optional special byte sequence at the very start of a stream or file that, without being data itself, indicates the encoding used for the data that follows; it can be used in the absence of metadata that denotes the encoding. For a given encoding scheme, the BOM is that scheme's representation of the Unicode code point U+FEFF.^[8]

While a BOM is typically a single, fixed byte sequence, in UTF-7 four variations may appear: the last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the following character, resulting in 4 possible bit patterns and therefore 4 different possible bytes in the 4th position. See the UTF-7 entry in the table of Unicode byte order marks.^[9]
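The four variants can be enumerated with a short Python sketch that appends each possible pair of following bits to the 16 bits of U+FEFF. The names B64 and variants are hypothetical.

```python
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

bom_bits = f"{0xFEFF:016b}"  # 16 bits of U+FEFF

variants = []
for extra in range(4):
    # The third Base64 character holds the last 4 BOM bits plus the
    # first 2 bits of whatever character follows.
    bits = bom_bits + f"{extra:02b}"
    body = "".join(B64[int(bits[i:i + 6], 2)] for i in range(0, 18, 6))
    variants.append("+" + body)

print(variants)  # ['+/v8', '+/v9', '+/v+', '+/v/']
```

A detector that looks for a UTF-7 BOM therefore has to accept any of these four 4-byte prefixes.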

Security

UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.

Older versions of Internet Explorer can be tricked into interpreting the page as UTF-7. This can be used for a cross-site scripting attack as the < and > marks can be encoded as +ADw- and +AD4- in UTF-7, which most validators let through as simple text.^[10]
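The bypass can be illustrated with a short Python sketch. The filter naive_is_safe is hypothetical and deliberately simplistic: it stands in for any validator that inspects the raw ASCII bytes without decoding them first. Decoding uses the standard-library "utf-7" codec.

```python
def naive_is_safe(data: bytes) -> bool:
    """Hypothetical ASCII-level filter: rejects raw angle brackets only."""
    return b"<" not in data and b">" not in data

# '<' and '>' hidden inside UTF-7 Base64 blocks.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

print(naive_is_safe(payload))  # True: the filter sees only plain ASCII text
decoded = payload.decode("utf-7")
print(decoded)                 # <script>alert(1)</script>
```

Running the check on the decoded text instead of the raw bytes would catch the markup, which is the "decode before validation" mitigation described above.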

UTF-7 is considered obsolete, at least for Microsoft software (.NET), with code paths previously supporting it intentionally broken (to prevent security issues) in .NET 5, in 2020.^[1]

References

[1] "Breaking change: UTF-7 code paths are obsolete". docs.microsoft.com. Retrieved 8 January 2021.
[2] "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
[3] "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
[4] "Using International Characters in Internet Mail". Internet Mail Consortium. 1 August 1998. Archived from the original on 7 September 2015.
[5] "Configuration Manual". Dovecot Documentation. 8 February 2023. Sec. "Mail Location Settings". Retrieved 28 February 2023. Store mailbox names on disk using UTF-8 instead of modified UTF-7 (mUTF-7).
[6] Crispin, Mark (March 2003). INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4rev1. Network Working Group. doi:10.17487/RFC3501. RFC 3501. Proposed Standard. Sec. 5.1.3 "Mailbox International Naming Convention". Obsoleted by RFC 9051. Updated by RFC 7817, 8437, 8474, 4551, 4469, 5182, 4466, 5032 and 5738. Obsoletes RFC 2060. In modified UTF-7, printable US-ASCII characters, except for "&", represent themselves…. The character "&" (0x26) is represented by the two-octet sequence "&-". All other characters… are represented in modified BASE64….
[7] Melnikov, Alexey; Leiba, Barry (August 2021). Internet Message Access Protocol (IMAP) - Version 4rev2. IETF. doi:10.17487/RFC9051. ISSN 2070-1721. RFC 9051. Proposed Standard. Sec. 5.1 "Mailbox Naming". Obsoletes RFC 3501. In IMAP4rev2, mailbox names are encoded in Net-Unicode (this differs from IMAP4rev1).
[8] "FAQ – UTF-8, UTF-16, UTF-32 & BOM".
[9] "Clarify guidance for use of a BOM as a UTF-8 encoding signature" (PDF). Retrieved 17 January 2024.
[10] "ArticleUtf7 - doctype-mirror - UTF-7: the case of the missing charset - Mirror of Google Doctype - Google Project Hosting". 14 October 2011. Retrieved 29 June 2012.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
