GBK (character encoding)

Guójiā Biāozhǔn Kuòzhǎn (GBK)
[Figure: GBK encoding.svg. Layout of GBK (see below for a larger copy of this diagram)]
MIME / IANA: GBK
Alias(es): CP936, MS936, windows-936, csGBK
Language(s): Primarily Simplified Chinese; also supports Traditional Chinese, Japanese, English, Russian and (partially) Greek. Web browsers decode GBK as GB 18030, which supports all languages.
Standard: GBK 1.0
Classification: Extended ASCII [a], variable-width encoding, CJK encoding
Extends: EUC-CN
Preceded by: GB 2312
Succeeded by: GB 18030
  [a] Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

GBK is an extension of the GB 2312 character set for Simplified Chinese characters, used in the People's Republic of China. It includes all unified CJK characters found in GB 13000.1-93, i.e. ISO/IEC 10646:1993, or Unicode 1.1. Since its initial release in 1993, GBK has been extended by Microsoft in Code page 936/1386, which was then extended into GBK 1.0. GBK is also the IANA-registered internet name for the Microsoft mapping, [1] which differs from other implementations primarily by the single-byte euro sign at 0x80.


GB abbreviates Guójiā Biāozhǔn, which means national standard in Chinese, while K stands for Extension (扩展 kuòzhǎn). GBK not only extended the old standard GB 2312 with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB 2312 in 1981. With the arrival of GBK, certain names with characters formerly unrepresentable, like the 镕 (róng) character in former Chinese Premier Zhu Rongji's name, are now representable. [2]
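The 镕 example can be checked with a short sketch. It relies on Python's built-in `gb2312` and `gbk` codecs, which are assumed here to track the respective character sets closely enough for illustration; the helper name `can_encode` is hypothetical.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


rong = "镕"  # U+9555, the character in Zhu Rongji's name

print("gb2312:", can_encode(rong, "gb2312"))  # not part of GB 2312
print("gbk:   ", can_encode(rong, "gbk"))     # added by GBK
print("GBK bytes:", rong.encode("gbk").hex())
```

The GBK byte pair printed on the last line has a trail byte below A1, which places the character outside the EUC-CN range that GB 2312 uses.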

As of October 2022, GBK is the third-most popular encoding served from China and territories (after UTF-8 and the subset GB 2312), with 1.9% of web servers serving a page that declares GBK. [3] However, all major web browsers decode GB2312-marked documents as if they were marked GBK, except for Safari and Edge on the label GB_2312. [4] Together, GBK and GB 2312 encodings have a combined 5.5% presence in China and territories. [3] Globally, GBK accounts for less than 0.07% of all web pages and GBK+GB2312 for 0.2%. [5]

History

In 1993, the Unicode 1.1 standard was released, including 20,902 characters used in mainland China, Taiwan, Japan and Korea. Following this, China released GB 13000.1-93, the Guobiao standard equivalent of Unicode 1.1.

The GBK character set was defined in 1993 as an extension of GB 2312-80, while also including the characters of GB 13000.1-93 through the unused codepoints available in GB 2312. Hence GBK is backward compatible with GB 2312. GBK was defined in a normative annex to GB 13000.1-93. [6]

Microsoft implemented GBK in Windows 95 and Windows NT 3.51 as Code Page 936. While GBK was never an official standard, widespread usage of Windows 95 led to GBK becoming the de facto standard. While GBK included all the Chinese characters defined in Unicode 1.1 and GB 13000.1-93, these standards used different code tables. The primary reason for GBK's existence was simply to bridge the gap between GB 2312-80 and GB 13000.1-93.

In 1995, the China National Information Technology Standardization Technical Committee set down the Chinese Internal Code Extension Specification (Chinese: 汉字内码扩展规范 (GBK); pinyin: Hànzì Nèimǎ Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a slight extension of Codepage 936. The newly added 95 characters were not found in GB 13000.1-1993, and were provisionally assigned Unicode PUA code points. [7]:534

Microsoft later added the euro sign to Code page 936 and assigned the code 0x80 to it. This is not a valid code point in GBK 1.0.

In 2000, the GB 18030-2000 standard was released, superseding yet maintaining compatibility with GBK 1.0. It increased the number of definitions of Chinese characters and extended the number of possible characters through the implementation of four-byte character spaces. The subset of GB 18030 consisting of one-byte and two-byte characters is sometimes also referred to as GBK. Mapping to Unicode has been slightly changed, though, as some characters are now defined in Unicode. In the most up-to-date form of the standard, GB 18030-2005, only 24 [8] characters are still mapped to Unicode PUA (see GB 18030 § PUA).

In 2002, GBK was registered as an IANA charset; the registration uses the code page 936 mapping as well as the CP936/MS936 aliases, but refers to the GBK 1.0 specification. [1] W3C's technical recommendation published in 2015 [9] defines a GBK encoder as a GB 18030 encoder with a single-byte euro sign and without four-byte sequences (while W3C's GBK decoder specification has no such limitation and decodes as GB 18030, i.e. with the same range as all of Unicode).

Encoding

A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 characters and 33 control codes in this range.

A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–A0 except 7F for some areas and A1–FE for others.
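The byte structure described above can be sketched as a small splitter. This is an illustrative structural check only, not a decoder: it accepts any trail byte in 40–FE except 7F and does not verify that a pair is actually assigned a character. The function name `split_gbk` is hypothetical.

```python
from typing import List


def split_gbk(data: bytes) -> List[bytes]:
    """Split a byte string into 1-byte and 2-byte GBK units (structure only)."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # Single byte, same meaning as in ASCII.
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= lead <= 0xFE:
            # Lead byte of a 2-byte character; a trail byte must follow.
            if i + 1 >= len(data):
                raise ValueError(f"truncated 2-byte sequence at offset {i}")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0xFE) or trail == 0x7F:
                raise ValueError(f"invalid trail byte 0x{trail:02X} at offset {i + 1}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # 0x80 and 0xFF are never lead bytes in GBK 1.0.
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
    return units


sample = "GBK编码".encode("gbk")
print([unit.hex() for unit in split_gbk(sample)])
```

Because trail bytes may fall in the ASCII range (40–7E), a splitter has to walk from the start of the data; it cannot resynchronise reliably from an arbitrary offset.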

More specifically, the following ranges of bytes are defined:

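GBK Encoding Ranges

| Range | Byte 1 | Byte 2 | Code points | Characters: GB 18030 | Characters: GBK 1.0 | Characters: Codepage 936 | Characters: GB 2312 |
|---|---|---|---|---|---|---|---|
| Level GBK/1 | A1–A9 | A1–FE | 846 | 718 [7]:8–10 | 717 | 715 | 682 |
| Level GBK/2 | B0–F7 | A1–FE | 6,768 | 6,763 | 6,763 | 6,763 | 6,763 |
| Level GBK/3 | 81–A0 | 40–FE except 7F | 6,080 | 6,080 | 6,080 | 6,080 | |
| Level GBK/4 | AA–FE | 40–A0 except 7F | 8,160 | 8,160 | 8,160 | 8,080 | |
| Level GBK/5 | A8–A9 | 40–A0 except 7F | 192 | 166 | 166 | 153 | |
| user-defined 1 [7] | AA–AF | A1–FE | 564 | | | | |
| user-defined 2 | F8–FE | A1–FE | 658 | | | | |
| user-defined 3 | A1–A7 | 40–A0 except 7F | 672 | | | | |
| total: | | | 23,940 | 21,887 | 21,886 | 21,791 | 7,445 |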
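As a rough cross-check of the ranges above, the following sketch encodes each region as a pair of byte ranges, classifies a 2-byte code, and recomputes the code point counts. The names `REGIONS`, `region_of` and `code_points` are hypothetical; the byte ranges are taken from the table.

```python
# (name, lead low, lead high, trail low, trail high); 0x7F is never a trail byte.
REGIONS = [
    ("GBK/1", 0xA1, 0xA9, 0xA1, 0xFE),
    ("GBK/2", 0xB0, 0xF7, 0xA1, 0xFE),
    ("GBK/3", 0x81, 0xA0, 0x40, 0xFE),
    ("GBK/4", 0xAA, 0xFE, 0x40, 0xA0),
    ("GBK/5", 0xA8, 0xA9, 0x40, 0xA0),
    ("user-defined 1", 0xAA, 0xAF, 0xA1, 0xFE),
    ("user-defined 2", 0xF8, 0xFE, 0xA1, 0xFE),
    ("user-defined 3", 0xA1, 0xA7, 0x40, 0xA0),
]


def region_of(lead: int, trail: int):
    """Return the region name for a 2-byte code, or None if it is invalid."""
    if trail == 0x7F:
        return None
    for name, l_lo, l_hi, t_lo, t_hi in REGIONS:
        if l_lo <= lead <= l_hi and t_lo <= trail <= t_hi:
            return name
    return None


def code_points(l_lo: int, l_hi: int, t_lo: int, t_hi: int) -> int:
    """Count the byte pairs in a region, skipping trail byte 0x7F."""
    trails = sum(1 for t in range(t_lo, t_hi + 1) if t != 0x7F)
    return (l_hi - l_lo + 1) * trails


for name, *bounds in REGIONS:
    print(f"{name:15s} {code_points(*bounds):6,d}")
print("total", sum(code_points(*bounds) for _, *bounds in REGIONS))
```

The per-region counts and their sum should agree with the "Code points" column; the character counts in the other columns depend on the actual assignments and cannot be derived from the ranges alone.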

Layout diagram

In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints, red are for user-defined characters. The uncolored areas are invalid byte combinations.

[Figure: GBK encoding.svg]

Relationship to other encodings

The areas indicated in the previous section as GBK/1 and GBK/2, taken by themselves, are simply GB 2312-80 in its usual encoding, GBK/1 being the non-hanzi region and GBK/2 the hanzi region. GB 2312, or more properly the EUC-CN encoding thereof, takes a pair of bytes from the range A1–FE, like any 94² ISO-2022 character set loaded into GR. This corresponds to the lower-right quarter of the illustration above. However, GB 2312 does not assign any code points to the rows located at AA–AF and F8–FE, even though it had staked out the territory. GBK added extensions to these rows. You can see that the two gaps were filled in with user-defined areas.
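The EUC-CN byte pairs can be illustrated with a minimal sketch. It assumes the usual 94×94 row/cell numbering of GB 2312 (rows and cells counted from 1), which is not spelled out in the text above, and the usual GR offset of A0 added to each number; the function name `euc_cn_bytes` is hypothetical.

```python
def euc_cn_bytes(row: int, cell: int) -> bytes:
    """Map a GB 2312 row/cell pair (each 1..94) to its EUC-CN byte pair."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # Loading a 94x94 set into GR adds 0xA0 to each number, giving A1..FE.
    return bytes([0xA0 + row, 0xA0 + cell])


# Row 16, cell 1 is the first hanzi of GB 2312 (啊).
print(euc_cn_bytes(16, 1).hex())
print(euc_cn_bytes(16, 1) == "啊".encode("gb2312"))
```

Under this numbering the unassigned rows 10–15 and 88–94 correspond to lead bytes AA–AF and F8–FE.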

More significantly, GBK extended the range of the bytes. Having two-byte characters in the ISO-2022 GR range gives a limit of 94²=8,836 possibilities. Abandoning the ISO-2022 model of strict regions for graphics and control characters, but retaining the feature of low bytes being 1-byte characters and pairs of high bytes denoting a character, you could potentially have 128²=16,384 positions. GBK takes part of that, extending the range from A1–FE (94 choices for each byte) to 81–FE (126 choices) for the first byte and 40–FE (191 choices) for the second byte, for a total of 24,066 positions.
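The arithmetic in this paragraph can be reproduced directly. The sketch below also subtracts the excluded trail byte 7F, which is an extra step not stated in the paragraph; the variable names are illustrative.

```python
# ISO-2022 GR model: both bytes in A1..FE.
gr_choices = 0xFE - 0xA1 + 1           # 94
gr_positions = gr_choices ** 2         # 8,836

# Both bytes anywhere in the high half (80..FF).
high_positions = 128 ** 2              # 16,384

# GBK: lead byte 81..FE, trail byte 40..FE.
lead_choices = 0xFE - 0x81 + 1         # 126
trail_choices = 0xFE - 0x40 + 1        # 191
gbk_positions = lead_choices * trail_choices

# Trail byte 7F is never used, which removes one column of 126 codes.
gbk_usable = lead_choices * (trail_choices - 1)

print(gr_positions, high_positions, gbk_positions, gbk_usable)
```

The last figure, 23,940, matches the total number of code points across all GBK levels and user-defined areas.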

Microsoft's Code Page 936 is generally thought of as being GBK. [1] However, the 95 PUA characters added in GBK 1.0 are not included in Code Page 936. Code Page 936 also has a single-byte euro sign at 0x80 which GBK 1.0 doesn't have. [10]

GBK's successor, GB 18030-2000, uses the remaining range available to the second byte (30–39) to further expand the number of possibilities while retaining GBK as a subset.
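A minimal sketch of how the 30–39 second byte distinguishes four-byte GB 18030 sequences from two-byte GBK ones, using Python's built-in `gb18030` and `gbk` codecs. The full four-byte shape checked here (bytes 1 and 3 in 81–FE, bytes 2 and 4 in 30–39) is an assumption beyond what the text states, and the function name `is_four_byte_gb18030` is hypothetical.

```python
def is_four_byte_gb18030(seq: bytes) -> bool:
    """Check whether `seq` has the shape of one four-byte GB 18030 sequence."""
    return (
        len(seq) == 4
        and 0x81 <= seq[0] <= 0xFE
        and 0x30 <= seq[1] <= 0x39   # a digit can never be a GBK trail byte
        and 0x81 <= seq[2] <= 0xFE
        and 0x30 <= seq[3] <= 0x39
    )


for ch in ("镕", "😀"):
    encoded = ch.encode("gb18030")
    print(ch, encoded.hex(), is_four_byte_gb18030(encoded))

# A character that GBK already covers keeps the same two bytes in GB 18030.
print("镕".encode("gb18030") == "镕".encode("gbk"))
```

Since GBK trail bytes start at 40, a second byte in 30–39 unambiguously signals a four-byte sequence, which is how the extension avoids colliding with existing two-byte codes.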


References

  1. "Character Sets". Retrieved 3 October 2016.
  2. "Code Page 936 - PRC GBK (XGB)". Microsoft. Archived from the original on 2002-10-01. Conversion map between Codepage 936 and Unicode. GB 18030 or GBK must be selected manually in the browser to view it correctly.
  3. "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2022-10-25.
  4. "Encoding: Summarized test results". www.w3.org. Retrieved 2019-11-15.
  5. "Historical trends in the usage statistics of character encodings for websites, October 2022". w3techs.com. Retrieved 2022-10-25.
  6. "18.2: Ideographic Description Characters" (PDF). The Unicode Standard. Version 15.0.0. 2022. p. 763. The Ideographic Description characters are found in GBK—an extension to GB 2312-80 that added all 20,902 Unicode Version 1.1 ideographs not already in GB 2312-80. GBK is defined as a normative annex of GB 13000.1-93.
  7. Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.
  8. GB 18030-2005 Standard, pp. 9, 79.
  9. "Encoding Standard # gbk-encoder". W3C. Retrieved 2016-10-02.
  10. Scherer, Markus (4 January 2002). "Re: Fun with GBK & GB2312". Unicode Mail List Archive. Retrieved 4 March 2020.
