MARC-8

Last updated April 10, 2024

MARC-8 is the character encoding defined by the MARC standards and used in MARC 21 library records.^[1] The MARC formats are standards for representing and communicating bibliographic and related information in machine-readable form, and they are widely used in library database systems. The character encoding now known as MARC-8 was introduced in 1968 as part of the MARC format. It was originally based on the Latin alphabet; from 1979 to 1983 the JACKPHY initiative expanded the repertoire to include Japanese, Arabic, Chinese, and Hebrew characters (among others), and Cyrillic and Greek scripts were added later. If a MARC 21 record needs a character that MARC-8 cannot represent, the record must be encoded in UTF-8 instead. UTF-8 supports many more characters than MARC-8, which is rarely used outside library data.

Technical details

MARC-8 uses a variant of the ISO/IEC 2022 encoding scheme. It uses escape sequences to switch character sets, which is how it represents characters beyond the 7-bit ASCII range.

Bidirectional text is generally stored in the same logical order as in Unicode.

Combining characters and base characters are stored in a different order than in Unicode: in MARC-8 the combining characters precede the base character, as the examples below show. When a base character carries several combining characters, their order is not always the exact reverse of the order produced by Unicode normalization. The MARC 21 standard describes the MARC-8 to Unicode conversion issues in more detail.

Displayed Character	Unicode NFD	MARC-8
á	a ́	́ a
ậ	a ̣ ̂	̂ ̣ a
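
The reordering in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is assumed to have already been mapped to Unicode code points while still in MARC-8 order (combining marks before their base character), and the function name `marc8_order_to_unicode` is hypothetical. The final normalization step is what puts several marks on one base into Unicode's canonical order.

```python
import unicodedata


def marc8_order_to_unicode(chars: str) -> str:
    """Move combining marks from before their base character to after it.

    Assumes `chars` holds Unicode code points that are still in MARC-8
    order (marks first, then the base character).
    """
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # stray marks with no base are kept at the end
    # NFD applies canonical reordering to the marks on each base character.
    return unicodedata.normalize("NFD", "".join(out))


# The two rows of the table: acute + a, and circumflex + dot below + a.
print(marc8_order_to_unicode("\u0301a") == "a\u0301")
print(marc8_order_to_unicode("\u0302\u0323a") == "a\u0323\u0302")
```

Normalizing the result to NFC instead would produce the precomposed characters shown in the first column.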

Code structure

The ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. In MARC-8, character codes from the 7-bit ASCII graphic range (0x20–0x7F) are referred to as "G0" codes, while codes from the "high ASCII" range (0xA0–0xFF) are referred to as "G1" codes. Graphic character sets are designated and invoked by means of a multi-byte escape sequence consisting of the escape character, an Intermediate character sequence, and a Final character, in the form ESC I F.

The following table shows the intermediate bytes that follow the ESC byte (hexadecimal 1B), in hexadecimal and as the corresponding ASCII characters.

Intermediate Bytes^[2]
	G0 set				G1 set
	SBCS		MBCS		SBCS		MBCS
Normal ISO-2022	28	(	24	$	29	)	24 29	$)
Alternate ISO-2022 (additional 63+16 sets)	2C	,	24 2C	$,	2D	-	24 2D	$-

The following table shows the final bytes that follow the intermediate bytes, in hexadecimal and as the corresponding ASCII characters.

Final Bytes^[2]
Bytes	Characters	Name	Type	Comment
31	1	Chinese, Japanese, Korean (EACC)	MBCS
32	2	Basic Hebrew	SBCS
33	3	Basic Arabic	SBCS
34	4	Extended Arabic	SBCS
42	B	Basic Latin (ASCII)	SBCS
21 45	!E	Extended Latin (ANSEL)	SBCS	Technically, the 21 (hex) is a second byte of the intermediate segment of this escape sequence.
4E	N	Basic Cyrillic	SBCS
51	Q	Extended Cyrillic	SBCS
53	S	Basic Greek (ISO 5428)	SBCS
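
Read together, the two tables above describe how an escape sequence is assembled: ESC, then the intermediate bytes, then the final bytes. The following Python sketch parses one such sequence. It is a minimal illustration rather than a full decoder; the names `parse_designation`, `INTERMEDIATES`, and `FINALS` are hypothetical, and the lookup data is copied from the tables.

```python
ESC = 0x1B

# Intermediate bytes from the table: (bytes, target set, is multibyte).
# Longer sequences come first so that "$)" is not mistaken for "$".
INTERMEDIATES = [
    (b"$)", "G1", True),
    (b"$,", "G0", True),
    (b"$-", "G1", True),
    (b"$", "G0", True),
    (b"(", "G0", False),
    (b",", "G0", False),
    (b")", "G1", False),
    (b"-", "G1", False),
]

# Final bytes from the table.
FINALS = {
    b"1": "Chinese, Japanese, Korean (EACC)",
    b"2": "Basic Hebrew",
    b"3": "Basic Arabic",
    b"4": "Extended Arabic",
    b"B": "Basic Latin (ASCII)",
    b"!E": "Extended Latin (ANSEL)",
    b"N": "Basic Cyrillic",
    b"Q": "Extended Cyrillic",
    b"S": "Basic Greek (ISO 5428)",
}


def parse_designation(data: bytes, pos: int = 0):
    """Return (target, multibyte, set name, sequence length) for the
    escape sequence starting at data[pos]."""
    if data[pos] != ESC:
        raise ValueError("not an escape sequence")
    rest = data[pos + 1:]
    for inter, target, multibyte in INTERMEDIATES:
        if rest.startswith(inter):
            tail = rest[len(inter):]
            for final, name in FINALS.items():
                if tail.startswith(final):
                    return target, multibyte, name, 1 + len(inter) + len(final)
    raise ValueError("unrecognized escape sequence")


print(parse_designation(b"\x1b$1"))   # EACC designated as G0
print(parse_designation(b"\x1b)!E"))  # ANSEL designated as G1
```

The sketch does not check that a multibyte intermediate is paired with a multibyte set; a stricter parser would compare the flag against the Type column.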

EACC is the only multibyte character set in MARC-8; it encodes each CJK character as three bytes, each in the 7-bit ASCII range.

For example, encoding the CJK character U+4EBA (人) requires the following bytes:

 \x1B\x24\x31\x21\x30\x64

The sequence \x1B\x24\x31 switches to EACC (CJK), and \x21\x30\x64 is the EACC code that corresponds to U+4EBA.
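
The byte sequence above can be taken apart with a short Python sketch. The mapping table here is deliberately tiny: it contains only the single EACC code given in the example, and the names `EACC_TO_UNICODE` and `decode_eacc_run` are hypothetical. A real converter would load the full EACC code table.

```python
ESC_EACC = b"\x1b\x24\x31"  # ESC $ 1: designate EACC as the G0 set

# Only the one mapping from the example; everything else is unknown here.
EACC_TO_UNICODE = {0x213064: "\u4eba"}


def decode_eacc_run(payload: bytes) -> str:
    """Decode a run of EACC bytes, three bytes per character."""
    if len(payload) % 3:
        raise ValueError("EACC data must be a multiple of three bytes")
    chars = []
    for i in range(0, len(payload), 3):
        code = int.from_bytes(payload[i:i + 3], "big")
        # Unknown codes become U+FFFD REPLACEMENT CHARACTER.
        chars.append(EACC_TO_UNICODE.get(code, "\ufffd"))
    return "".join(chars)


data = b"\x1b\x24\x31\x21\x30\x64"
assert data.startswith(ESC_EACC)
print(decode_eacc_run(data[len(ESC_EACC):]))
```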

Custom set extension

In addition to the ISO/IEC 2022 character sets, MARC-8 defines the following custom sets. The designating byte directly follows the escape byte (hexadecimal 1B); there is no intermediate byte.

Final Bytes^[2]
Bytes	Characters	Name	Type	Comment
62	b	Subscript set	SBCS
67	g	Greek Symbol set	SBCS	The alpha, beta, and gamma characters normally do not round-trip through Unicode.
70	p	Superscript set	SBCS
73	s	Basic Latin (ASCII)	SBCS
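
Because these designations are only two bytes long, a scanner can split a byte string into runs per character set with very little state. The sketch below does that for the four custom sets only. The function name `split_runs` is hypothetical, the starting set is assumed to be Basic Latin, and the bytes inside each run are left uninterpreted.

```python
ESC = 0x1B

# Final bytes from the custom-set table.
CUSTOM_SETS = {
    0x62: "Subscript set",
    0x67: "Greek Symbol set",
    0x70: "Superscript set",
    0x73: "Basic Latin (ASCII)",
}


def split_runs(data: bytes, initial: str = "Basic Latin (ASCII)"):
    """Split data into (set name, raw bytes) runs at custom-set escapes.

    Assumption: the text starts in Basic Latin. Other escape sequences
    are not handled and simply stay in the current run.
    """
    runs = []
    current = initial
    buf = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and data[i + 1] in CUSTOM_SETS:
            if buf:
                runs.append((current, bytes(buf)))
                buf.clear()
            current = CUSTOM_SETS[data[i + 1]]
            i += 2
        else:
            buf.append(data[i])
            i += 1
    if buf:
        runs.append((current, bytes(buf)))
    return runs


# "H", then a switch to the subscript set, then back to Basic Latin.
for name, raw in split_runs(b"H\x1bb2\x1bsO"):
    print(name, raw)
```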

C0 control codes

MARC 21 uses GS (0x1D) as a record terminator, RS (0x1E) as a field terminator and US (0x1F) as a subfield delimiter.^[3]
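
As a rough illustration of how these three control codes delimit data, the following Python sketch splits a byte string on them. It ignores the leader and directory that a real MARC 21 record also carries, and it assumes that the first byte after each subfield delimiter is the subfield code; the function names are hypothetical.

```python
RECORD_TERMINATOR = b"\x1d"   # GS
FIELD_TERMINATOR = b"\x1e"    # RS
SUBFIELD_DELIMITER = b"\x1f"  # US


def split_records(stream: bytes):
    """Split a byte stream into records at each GS."""
    return [r for r in stream.split(RECORD_TERMINATOR) if r]


def split_fields(record: bytes):
    """Split record data into fields at each RS."""
    return [f for f in record.split(FIELD_TERMINATOR) if f]


def split_subfields(field: bytes):
    """Return (leading bytes, [(code, value), ...]) for one field.

    Assumption: the byte right after each US is the subfield code.
    """
    head, *rest = field.split(SUBFIELD_DELIMITER)
    return head, [(part[:1], part[1:]) for part in rest if part]


# Hypothetical field data, without leader or directory.
stream = b"10\x1faMARC-8\x1fbnotes\x1e00\x1faSecond\x1e\x1d"
for record in split_records(stream):
    for field in split_fields(record):
        print(split_subfields(field))
```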

C1 control codes

The following alternative C1 control code set is defined for bibliographic applications such as library systems. It is mostly concerned with string collation and with markup of bibliographic fields. Slightly different variants are defined in the German standard DIN 31626^[4] (published in 1978 and since withdrawn)^[5] and the ISO standard ISO 6630,^[6]^[7] the latter of which has also been adopted in Germany as DIN ISO 6630.^[8] Differences between the variants are noted in the table below. MARC-8 uses the coding of NSB and NSE from this set, and adds some additional format effectors in locations not used by the ISO version; however, MARC 21 uses this control set only in MARC-8 records, not in Unicode-format records.^[3]

If using the ISO/IEC 2022 extension mechanism, the DIN 31626 set is designated as the active C1 control character set with the sequence 0x1B 0x22 0x45 (ESC " E),^[4] and the ISO 6630 / DIN ISO 6630 set is designated with the sequence 0x1B 0x22 0x42 (ESC " B).^[6] The 1985 expansion of the ISO 6630 set can also be explicitly specified by using the sequence 0x1B 0x26 0x40 0x1B 0x22 0x42 (ESC & @ ESC " B).^[7]
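
In the table that follows, the "Esc+" column gives the 7-bit form of each C1 code: ESC followed by a byte in the range 0x40–0x5F stands for the 8-bit code 0x40 higher, as ISO/IEC 2022 defines for C1 controls. The sketch below shows the arithmetic; the function names are hypothetical.

```python
ESC = 0x1B


def c1_from_escape(final: str) -> int:
    """Return the 8-bit C1 code for the 7-bit form ESC + final."""
    code = ord(final)
    if not 0x40 <= code <= 0x5F:
        raise ValueError("final byte must be in 0x40-0x5F")
    return code + 0x40


def escape_from_c1(byte: int) -> bytes:
    """Return the 7-bit escape form of an 8-bit C1 code."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a C1 control code")
    return bytes([ESC, byte - 0x40])


print(hex(c1_from_escape("H")))  # NSB: 0x88
print(escape_from_c1(0x89))      # NSE: ESC I
```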

Esc+	Dec	Hex	Acro	Name	Description^[4]^[6]^[7]
G	135	87	CUS	Close-Up for Sorting	(DIN 31626, ISO 6630) Declares that two successive character sequences separated by a space or separator should be treated as one word for collation purposes.
H	136	88	NSB	Non-Sorting Characters Begin	(DIN 31626, ISO 6630, MARC 21) Marks the start of a sequence of characters to be ignored for collation purposes. MARC 21 uses this character in MARC-8 records, but uses 0x98 (SOS) in Unicode records for the same purpose.^[3]^[9]
I	137	89	NSE	Non-Sorting Characters End	(DIN 31626, ISO 6630, MARC 21) Marks the end of a sequence of characters to be ignored for collation purposes. MARC 21 uses this character in MARC-8 records, but uses 0x9C (ST) in Unicode records for the same purpose.^[3]^[9]
J	138	8A	FIL	Filler Character	(DIN 31626) Substitutes for a mandatory alphanumeric character in a field.
K	139	8B	TCI	Tag in Context Indicator	(DIN 31626) Within a bibliographic field, used to refer to data in another bibliographic field by its tag number.
K	139	8B	PLD	Partial Line Down	(ISO 6630) Not in the original edition of ISO 6630.^[6] In the 1985 edition of ISO 6630,^[7] used for Partial Line Down (see PLD).
L	140	8C	ICI	Identification Number in Context Indicator	(DIN 31626) Within a bibliographic field, used to refer to data in another bibliographic record by its ID number.
L	140	8C	PLU	Partial Line Up	(ISO 6630) Not in the original edition of ISO 6630.^[6] In the 1985 edition of ISO 6630,^[7] used for Partial Line Up (see PLU).
M	141	8D	OSC^[a]	Optional Syllabification^[b] Control	(DIN 31626) Marks a syllable boundary in a long word. See also soft hyphen.
M	141	8D	ZWJ	Joiner	(MARC 21) In MARC-8, used for the Zero-Width Joiner, while U+200D is used in Unicode-format MARC records.^[3]^[9]
N	142	8E	SS2	Single-Shift 2	(DIN 31626) Non-locking shift code, see SS2.
N	142	8E	ZWNJ	Non-Joiner	(MARC 21) In MARC-8, used for the Zero-Width Non-Joiner, while U+200C is used in Unicode-format MARC records.^[3]^[9]
O	143	8F	SS3	Single-Shift 3	(DIN 31626) Non-locking shift code, see SS3.
P	144	90	-	(reserved)
Q	145	91	EAB	Embedded Annotation Beginning	(DIN 31626, ISO 6630) Marks the start of a variable-length annotation which is embedded within a bibliographic field, as opposed to separated using content designation.
R	146	92	EAE	Embedded Annotation End	(DIN 31626, ISO 6630) Marks the end of a variable-length embedded annotation.
S	147	93	ISB	Item Specification Beginning	(DIN 31626) Marks the start of a string of specific information of some description, other than a keyword or a permutation string.
T	148	94	ISE	Item Specification End	(DIN 31626) Marks the end of a string of specific information.
U	149	95	SIB	Sorting Interpolation Beginning	(ISO 6630) Marks the beginning of a sequence of characters used for collation purposes only.
V	150	96	SIE	Sorting Interpolation End	(ISO 6630) Marks the end of a sequence of characters used for collation purposes only.
W	151	97	SSB	Secondary Sorting Value Beginning	(ISO 6630) Marks the start of a string with subordinate collation value.
X	152	98	SSE	Secondary Sorting Value End	(ISO 6630) Marks the end of a string with subordinate collation value.
Y	153	99	INC	Indicator for Non-Standard Character	(DIN 31626) Identifies a following non-standard character.
Z	154	9A	-	(reserved)
[	155	9B	-	(reserved)
\	156	9C	KWB	Keyword Beginning	(DIN 31626, ISO 6630) Marks the start of a keyword within a bibliographic field.
]	157	9D	KWE	Keyword End	(DIN 31626, ISO 6630) Marks the end of a keyword within a bibliographic field.
^	158	9E	PSB	Permutation String Beginning	(DIN 31626, ISO 6630) Marks the start of a string which is to be permuted to the front of the element when references or indices are generated. Terminated by PSE or by the end of the element.
_	159	9F	PSE	Permutation String End	(DIN 31626, ISO 6630) Marks the end of a string which is to be permuted to the front of the element.
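
To illustrate how the NSB and NSE codes affect collation, the following Python sketch drops everything between them when building a sort key from MARC-8 bytes. It is a minimal sketch that handles only this one pair of codes; the function name `strip_non_sorting` and the sample title are hypothetical.

```python
NSB = 0x88  # Non-Sorting Characters Begin
NSE = 0x89  # Non-Sorting Characters End


def strip_non_sorting(data: bytes) -> bytes:
    """Remove NSB ... NSE spans, and the two control codes themselves."""
    out = bytearray()
    skipping = False
    for byte in data:
        if byte == NSB:
            skipping = True
        elif byte == NSE:
            skipping = False
        elif not skipping:
            out.append(byte)
    return bytes(out)


# A leading article marked as non-sorting.
print(strip_non_sorting(b"\x88The \x89Hobbit"))
```

An NSB with no matching NSE suppresses the rest of the input in this sketch; whether that is the right behavior for a given system is a design choice the table does not settle.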

Notes

a. Not the same as the Operating System Command (OSC) in the ISO/IEC 6429 C1 code set.
b. Spelled "Syllabication" [sic] in the ISO-IR-040 document, along with "syllable" being spelled "syllabe" [sic] in the description. These are presumably typographical errors.

References

[1] "Character Sets: Introduction: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress)". Library of Congress.
[2] "Character Sets: MARC-8 Encoding Environment: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media (Library of Congress)". Library of Congress.
[3] "Control function codes". MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Library of Congress. 2007-12-04.
[4] DIN (1979-07-15). Additional Control Codes for Bibliographic Use according to German Standard DIN 31626 (PDF). ITSCJ/IPSJ. ISO-IR-40.
[5] "Information processing; bibliographic control characters". Beuth: publishing DIN. DIN 31626:1978-12.
[6] ISO/TC 46 (1983-06-01). Additional Control Codes for Bibliographic Use according to International Standard ISO 6630 (PDF). ITSCJ/IPSJ. ISO-IR-67.
[7] ISO/TC 46 (1986-02-01). Additional Control Codes for Bibliographic Use according to International Standard ISO 6630 (PDF). ITSCJ/IPSJ. ISO-IR-124.
[8] "DIN ISO 6630 December 1997". AFNOR Editions Online Store.
[9] "Code Table Extended Latin (ANSEL)". MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Library of Congress. 2007-12-05.

External links

MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media - The official MARC-8 standard as maintained by the US Library of Congress

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
