ISO-IR-165

CCITT Chinese set (ISO-IR 165)
MIME / IANA: iso-ir-165
Alias(es): CN-GB-ISOIR165 (EUC form) [1]
Language(s): Simplified Chinese, English, Russian; partial support: Greek, Japanese
Standard: ITU T.101, annex C
Definitions: ISO-IR 165
Extends: GB 2312
Encoding formats: ISO-2022-CN-EXT, Videotex Data Syntax 2
Succeeded by: GB 18030

The CCITT Chinese Primary Set [2] is a multi-byte graphic character set for Chinese communications, created for the International Telegraph and Telephone Consultative Committee (CCITT) in 1992. [3] It is defined in ITU T.101, annex C, which codifies Videotex Data Syntax 2. [2] It is registered in the ISO-IR registry for use with ISO/IEC 2022 as ISO-IR-165, [4] and can be encoded in the ISO-2022-CN-EXT code version. [1]
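As a rough illustration of what encoding in ISO-2022-CN-EXT involves, the sketch below wraps 7-bit ISO-IR-165 byte pairs in a designation sequence and shift codes. It assumes that `ESC $ ) E` designates ISO-IR-165 as the shift-out set, as RFC 1922 [1] is understood to specify; the function name `wrap_iso2022` is hypothetical, and the sketch ignores the per-line designation rules and the other sets a real encoder has to handle.

```python
ESC = b"\x1b"
SO = b"\x0e"  # shift out: switch to the designated multi-byte set
SI = b"\x0f"  # shift in: switch back to ASCII

# Assumption: ESC $ ) E designates ISO-IR-165 as the shift-out (G1) set.
DESIGNATE_ISO_IR_165 = ESC + b"$)E"


def wrap_iso2022(seven_bit_pairs):
    """Wrap 7-bit ISO-IR-165 byte pairs in designation and shift codes."""
    body = bytearray()
    for b1, b2 in seven_bit_pairs:
        # Each byte of a 94x94 set lies in the range 0x21..0x7E.
        if not (0x21 <= b1 <= 0x7E and 0x21 <= b2 <= 0x7E):
            raise ValueError(f"not a 7-bit 94x94 code: {b1:#04x} {b2:#04x}")
        body += bytes((b1, b2))
    return DESIGNATE_ISO_IR_165 + SO + bytes(body) + SI


# Row-cell 79-81 is 0x6F 0x71 in 7-bit form (0xEF 0xF1 in EUC form).
encoded = wrap_iso2022([(0x6F, 0x71)])
```

The same byte pair with 0x80 added to each byte gives the EUC form used by the CN-GB-ISOIR165 alias.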

It is an extended modification of GB/T 2312-80, and corresponds to the union of the Mainland Chinese GB standards GB 6345.1-86 and GB 8565.2-88, with some further modifications and extensions. A subset of the GB 6345.1 extensions is incorporated into GB 18030, while GB 8565.2 serves as the Mainland Chinese source reference for certain CJK Unified Ideographs.

GB 6345.1

GB 6345.1-86 (32 × 32 Dot Matrix Font Set of Chinese Ideographs for Information Interchange) includes both a corrigendum and an extension for GB 2312. [3] The corrigendum alters the following two characters:

Alterations made to existing GB 2312 characters by GB 6345.1

| Row-cell | EUC | GB 2312 (unamended) [5] | GB 6345.1 | Notes |
|---|---|---|---|---|
| 03-71 | 0xA3E7 | g (looptail g) | ɡ (opentail g) | [a] |
| 79-81 | 0xEFF1 | 鍾 | 锺 | [b] |

Notes:
  a. Corresponds to U+FF47 in Unicode; however, the amended reference glyph can also correspond to U+0261 ɡ. See below for how U+0261 is typically mapped to/from GB 6345.1, versus how it is mapped to/from ISO-IR-165. GB 18030 swaps this one back to the original [5] looped glyph. [6]
  b. The unamended reference glyph is a Traditional Chinese character corresponding to U+937E (鍾). The character in question is usually replaced with 钟 (U+949F, also the simplification of 鐘) in Simplified Chinese except in names of persons; the amended glyph is an alternate simplified form corresponding to U+953A (锺).

Deployed implementations incorporating GB 2312, such as Windows code page 936, generally follow these corrections in mapping 79-81 to U+953A. [7]
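The Row-cell and EUC columns used in these tables are related by fixed offsets: adding 0xA0 to the row and cell numbers gives the two EUC bytes, and adding 0x20 gives the 7-bit form used with ISO/IEC 2022. A minimal sketch of that arithmetic follows; the function names are illustrative and not taken from any library.

```python
def rowcell_to_euc(row, cell):
    """Convert a row-cell (kuten) pair such as 03-71 to its two EUC bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes((row + 0xA0, cell + 0xA0))


def rowcell_to_7bit(row, cell):
    """Convert a row-cell pair to the 7-bit form used with ISO/IEC 2022."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes((row + 0x20, cell + 0x20))


def euc_to_rowcell(data):
    """Convert two EUC bytes back to a (row, cell) pair."""
    b1, b2 = data
    if not (0xA1 <= b1 <= 0xFE and 0xA1 <= b2 <= 0xFE):
        raise ValueError("both bytes must be in 0xA1..0xFE")
    return b1 - 0xA0, b2 - 0xA0


print(rowcell_to_euc(3, 71).hex())    # a3e7
print(rowcell_to_euc(79, 81).hex())   # eff1
print(euc_to_rowcell(b"\xa8\xc0"))    # (8, 32)
```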

The extension adds half-width ISO 646-CN characters in row 10 (in addition to the existing full-width characters in row 3) and extends the set of 26 non-ASCII pinyin characters in row 8 with six additional such characters. GB/T 12345, the Traditional Chinese counterpart to GB 2312, also incorporates these GB 6345.1 extensions, and additionally includes 29 vertical presentation forms in row 6. [3] [8]

The later GB/T 6345.1-2010, published in 2011, officially adds half-width forms of the 32 pinyin characters of row 8 (including the six new additions) in row 11. [9] This addition is not featured in GB 18030. [6]

The six additional pinyin characters from GB 6345.1 and the vertical presentation forms from GB 12345 — but not the half-width forms — are included in the classic Mac OS encoding for Simplified Chinese (a modification of EUC-CN), [10] and also as two-byte codes in GB 18030. [6] The additional pinyin characters are as follows: [10]

Extensions made by GB 6345.1 to GB 2312 row 8

| Row-cell | EUC | Character [10] [6] | Notes |
|---|---|---|---|
| 08-27 | 0xA8BB | U+0251 ɑ | |
| 08-28 | 0xA8BC | U+1E3F ḿ | [a] |
| 08-29 | 0xA8BD | U+0144 ń | |
| 08-30 | 0xA8BE | U+0148 ň | |
| 08-31 | 0xA8BF | U+01F9 ǹ | [b] |
| 08-32 | 0xA8C0 | U+0261 (reference glyph: looptail g) | [c] |

Notes:
  a. Mapped to the Private Use Area U+E7C7 by Windows code page 936 [11] and the first (2000) edition of GB 18030; this was amended by the 2005 edition. [6]
  b. This composed character was added in Unicode 3.0. Prior to this, this character was mapped to its composition sequence (i.e. U+006E U+0300) by Apple. [10] This change predates the stabilisation of Unicode normalisation forms, which was introduced in Unicode 3.1. [12] It is mapped to U+E7C8 by Windows code page 936. [11]
  c. Matches the unamended reference glyph for 03-71 (see above) in being a looped g, in spite of being typically mapped to U+0261. Mappings used for ISO-IR-165 differ (see below). GB 18030 swaps 03-71 back to the looped g, and makes this one the open g. [6]

These extensions and modifications to GB 2312 were first introduced in GB 5007.1-85 in 1985.
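Because 08-28 and 08-31 have been mapped to Private Use Area code points or to a decomposed sequence by some implementations, text decoded through different tables may need to be reconciled. The sketch below shows one possible clean-up step: it remaps the two Private Use Area code points mentioned in the notes above and then applies NFC normalisation. The names `PUA_TO_STANDARD` and `normalise_pinyin` are hypothetical, and whether such remapping is appropriate depends on the application (Private Use Area code points may carry other meanings elsewhere).

```python
import unicodedata

# Private Use Area code points used by Windows code page 936 for
# 08-28 and 08-31, paired with the standard characters for those cells.
PUA_TO_STANDARD = {
    0xE7C7: 0x1E3F,  # 08-28: m with acute
    0xE7C8: 0x01F9,  # 08-31: n with grave
}


def normalise_pinyin(text):
    """Replace the two PUA code points, then compose sequences with NFC."""
    return unicodedata.normalize("NFC", text.translate(PUA_TO_STANDARD))


# The decomposed sequence U+006E U+0300 composes to U+01F9 under NFC.
assert normalise_pinyin("n\u0300") == "\u01f9"
assert normalise_pinyin("\ue7c7") == "\u1e3f"
```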

GB 8565.2

GB 8565.2-88 (Information Processing - Coded Character Sets for Text Communication - Part 2: Graphic Characters) defines an extension for GB 2312, adding 705 characters in rows 13–15 and 90–94, of which 69 (all in row 15) are non-hanzi. It includes the GB 2312 corrections from GB 6345.1, but not its extensions. [3]

The Unihan database references GB 8565.2 as the Mainland Chinese source of several hanzi included in Unicode. Its Unihan source abbreviation is G8. [2]

CCITT changes

ISO-IR-165 incorporates the GB 2312 extensions from both GB 6345.1-86 and GB 8565.2-88. [3] Additionally, it adds 161 further characters (including 139 hanzi, identified as “general Chinese characters and variants”). [3] [4] These CCITT hanzi extensions have on occasion been mistaken for standard GB 8565.2 characters, including in previous revisions of the Unihan database. [2] In total the set contains 8446 characters.

A number of patterned semigraphic characters are included in row 6. [4] This collides with the vertical presentation forms included in other extensions such as Mac OS Simplified Chinese [10] and GB 18030. [6]

The GB 6345.1 corrections to GB 2312 are applied, but two Unicode mappings are reversed compared to other encodings which include GB 2312 with GB 6345.1 extensions. The table below shows the mappings and their corresponding glyphs including GB 18030:

| Row-cell | EUC | GB 2312 (unamended) [5] | GB 6345.1 [9] | GB 6345.1 mapping [10] | ISO-IR-165 [4] | ISO-IR-165 mapping [13] | GB 18030 [6] | GB 18030 mapping [6] |
|---|---|---|---|---|---|---|---|---|
| 03-71 | 0xA3E7 | looptail g | ɡ (opentail) | U+FF47 | ɡ (opentail) | U+0261 | looptail g | U+FF47 |
| 08-32 | 0xA8C0 | (absent) | looptail g | U+0261 | looptail g | U+FF47 | ɡ (opentail) | U+0261 |
| 79-81 | 0xEFF1 | 鍾 | 锺 | U+953A | 锺 | U+953A | 锺 | U+953A |
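The mapping columns of this table can be restated as data, which makes the reversal easy to check programmatically. The following is only an illustrative sketch: the dictionary `MAPPINGS` and the helper names are hypothetical and simply transcribe the three rows above, not a complete conversion table.

```python
# Unicode code point assigned to each EUC code under three conventions,
# transcribed from the mapping columns of the table.
MAPPINGS = {
    "GB 6345.1": {0xA3E7: 0xFF47, 0xA8C0: 0x0261, 0xEFF1: 0x953A},
    "ISO-IR-165": {0xA3E7: 0x0261, 0xA8C0: 0xFF47, 0xEFF1: 0x953A},
    "GB 18030": {0xA3E7: 0xFF47, 0xA8C0: 0x0261, 0xEFF1: 0x953A},
}


def decode(variant, euc_code):
    """Return the character a given convention assigns to an EUC code."""
    return chr(MAPPINGS[variant][euc_code])


def differing_codes(a, b):
    """List the EUC codes whose Unicode mapping differs between a and b."""
    return sorted(
        code for code in MAPPINGS[a] if MAPPINGS[a][code] != MAPPINGS[b][code]
    )


# Only the two "g" codes differ between ISO-IR-165 and the others.
print([hex(c) for c in differing_codes("GB 6345.1", "ISO-IR-165")])
# ['0xa3e7', '0xa8c0']
```

Note that GB 6345.1 and GB 18030 agree on the mappings while assigning the opposite reference glyphs to the two codes, so a comparison of code points alone does not capture the glyph swap.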

References

  1. Zhu, HF.; Hu, DY.; Wang, ZG.; Kao, TC.; Chang, WCH.; Crispin, M. (1996). "Chinese Character Encoding for Internet Messages". Requests for Comments. IETF. doi:10.17487/rfc1922. RFC 1922.
  2. Chung, Jaemin (2018-01-24). "Pseudo-G8 characters" (PDF). ISO/IEC JTC 1/SC 2/WG 2/IRG N2276.
  3. Lunde, Ken (2009). CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing (2nd ed.). Sebastopol, CA: O'Reilly. pp. 94–111. ISBN 978-0-596-51447-1.
  4. CCITT (1992-07-13). Codes of the Chinese graphic character set for communication (PDF). ITSCJ/IPSJ. ISO-IR-165.
  5. China Association for Standardization. Coded Chinese Graphic Character Set for Information Interchange (PDF). ITSCJ/IPSJ. ISO-IR-58.
  6. Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.
  7. Steele, Shawn (2000). "cp936 to Unicode table". Microsoft, Unicode Consortium.
  8. Lunde, Ken (1998). Appendix F: GB/T 12345 (PDF). ISBN 9781565922242.
  9. Standardization Administration of China (SAC) (2011-01-10). GB/T 6345.1-2010 信息技术 汉字编码字符集(基本集) 32点阵字型 第1部分宋体 (in Chinese (China)). China.
  10. "Map (external version) from Mac OS Chinese Simplified encoding to Unicode 3.0 and later". Apple, Inc.
  11. Microsoft. "CODEPAGE 936: PRC GBK (XGB) - ANSI, OEM". Unicode Consortium.
  12. "Unicode Character Encoding Stability Policies". Unicode Consortium. 2017-06-23.
  13. Viswanadha, Raghuram (2000-08-30). "Unicode to ISO-IR-165 table". International Components for Unicode. IBM. (Note: codes are listed in the source in 7-bit form: add 0x80 to each byte for EUC form, or subtract 0x20 for kuten form)