Code page 936 (Microsoft Windows)

Windows code page 936
MIME / IANA: GBK
Language(s): Mainly used for Simplified Chinese, but also supports Traditional Chinese, Japanese, English, Russian and (partially) Greek.
Classification: GBK variant, Extended ASCII [lower-alpha 1], variable-width encoding, CJK encoding
Extends: EUC-CN
Based on: GBK (GB 13000.1-93 annex)
Succeeded by: Code page 54936 (GB 18030)
  1. Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

Windows code page 936 (abbreviated MS936, Windows-936 or, ambiguously, CP936) [1] is Microsoft's legacy (pre-Unicode) character encoding for representing Simplified Chinese text on computers. It is one of the four Windows double-byte character sets (DBCSs) for East Asian languages, alongside code pages 932 (Japanese), 949 (Korean) and 950 (Traditional Chinese). It is a variant of the Mainland Chinese Guójiā Biāozhǔn Kuòzhǎn (GBK) encoding, and roughly corresponds to IBM code page 1386 (CP1386 or IBM-1386).
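As a rough illustration of the double-byte layout, the following sketch encodes a short string with Python's built-in codec, which accepts the label "cp936" as an alias of its GBK codec. The sample characters and variable names are chosen here for illustration only; the second character shows that an ASCII-range byte can occur as a trail byte.

```python
# ASCII characters occupy one byte; Chinese characters occupy two.
text = "A中"
encoded = text.encode("cp936")   # "cp936" is an alias of Python's GBK codec
print(encoded)                   # b'A\xd6\xd0'

# A character outside GB 2312: its trail byte 0x40 is also ASCII "@".
rare = "丂".encode("cp936")
lead_byte, trail_byte = rare[0], rare[1]
print(hex(lead_byte), hex(trail_byte))

# Round trip back to text.
decoded = encoded.decode("cp936")
```

Because trail bytes may fall in the ASCII range, byte-oriented searches for characters such as "@" can match inside a two-byte character, so text in this encoding is best decoded before it is searched or split.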

History

Originally, Windows-936 covered GB 2312 (in its EUC-CN form), but it was expanded to cover most of GBK with the release of Windows 95. The euro sign (€), which is not defined in GBK, is encoded as 0x80 in Windows-936 and IBM-1386. Conversely, 95 characters defined in GBK 1.0 were initially not encoded in Windows-936. This was partly resolved in later versions of Windows: as of Windows 7, all GBK characters not in the Unicode BMP Private Use Area can be displayed using code page 936, but encoding the 95 characters was still not supported as of 2014.
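The 0x80 euro sign can be sketched as a thin layer over a plain GBK decoder. The helper below is hypothetical (the name decode_windows936 is not from any library) and assumes the usual GBK layout of lead bytes 0x81 to 0xFE followed by one trail byte. It walks the input byte by byte, because 0x80 is also a valid trail byte and a blanket byte replacement would corrupt two-byte characters.

```python
EURO_BYTE = 0x80  # single-byte euro sign in Windows-936, as described above


def decode_windows936(data: bytes) -> str:
    """Hypothetical helper: decode bytes as GBK, mapping a lone 0x80 to the euro sign."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif b == EURO_BYTE:
            # 0x80 in lead position is the euro sign, not a lead byte.
            out.append("\u20ac")
            i += 1
        else:
            # Assumed lead byte: hand the two-byte pair to Python's GBK codec,
            # which raises UnicodeDecodeError for invalid or truncated pairs.
            out.append(data[i:i + 2].decode("gbk"))
            i += 2
    return "".join(out)


price = decode_windows936(b"5\x80")
```

The 95 GBK 1.0 characters mentioned above are not handled by this sketch; it only illustrates the euro sign difference.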

Windows code page 936 was superseded by code page 54936 (GB 18030), but as of 2014 it was still in widespread use. The Windows console uses code page 936 as the default code page on Simplified Chinese installations, even though part of GB 18030 was made mandatory for all software products sold in China. In 2002, the IANA Internet name GBK was registered with Windows-936's mapping, [2] [3] making it the de facto GBK definition on the Internet.

Terminology

Windows code page 936 corresponds roughly to IBM code page 1386, and is a different encoding from the obsolete IBM code page 936.

The name "code page 936" is ambiguous. IBM's code page 936, [4] an obsolete IBM 5550 encoding, is also a Simplified Chinese encoding, but it uses a different encoding method for GB 2312 (Shift GB) and so is entirely incompatible with Windows code page 936. This contrasts with IBM code page 932, which is, to a first approximation, [lower-alpha 1] a subset of Windows code page 932. However, International Components for Unicode does not include an IBM-936 codec, and uses the Windows code page for the cp936 label. [1] IBM's code page for GBK coverage is code page 1386, which is defined as a combination of the single-byte Code page 1114 and the double-byte Code page 1385. [5]

The concepts of "Windows-936", "GBK", "GB2312" and "EUC-CN" are sometimes conflated in various software products. EUC-CN is registered with the IANA as GB2312, although it is only one specific encoding format of GB 2312: a variable-width, 8-bit, stateless one. GB 2312 also has other, less widely used encoding formats, such as HZ-GB-2312, ISO-2022-CN or the aforementioned Shift GB.

GBK is a superset of EUC-CN (although not itself an EUC code) and superseded GB 2312 long ago. Microsoft software also continued to assign the GB2312 encoding label to code page 936 even after extending it to implement GBK rather than EUC-CN. As a result, when most modern Windows-based software products offer "GB 2312" as a character encoding option, they mean partial support for GBK via Windows-936, rather than EUC-CN or other encoding formats of GB 2312. This can be observed in products such as Microsoft Internet Explorer and Notepad++.
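The practical effect of this label conflation can be sketched in Python, whose "gb2312" codec is a strict EUC-CN codec while "gbk" covers the larger set. The label set and the function decode_with_label below are hypothetical, shown only to illustrate treating a "GB2312" label as GBK in the way the software described above does.

```python
# Illustrative set of labels to treat as GBK; not an exhaustive or official list.
GBK_LABELS = {"gb2312", "gbk", "cp936", "ms936", "windows-936"}


def decode_with_label(data: bytes, label: str) -> str:
    """Hypothetical helper: decode data, widening GB 2312-style labels to GBK."""
    normalized = label.strip().lower()
    codec = "gbk" if normalized in GBK_LABELS else normalized
    return data.decode(codec)


# A character that exists in GBK but not in GB 2312.
rare_bytes = "丂".encode("gbk")

# Strict EUC-CN decoding rejects the bytes...
try:
    rare_bytes.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...while the widened label handles them.
widened = decode_with_label(rare_bytes, "GB2312")
```

Because GBK is a superset of EUC-CN, widening the label in this way does not change how valid EUC-CN text is decoded.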

References

  1. "windows-936-2000 (alias cp936)". ICU Demonstration - Converter Explorer. International Components for Unicode.
  2. "Character Sets". Retrieved 3 October 2016.
  3. Application of IANA Charset Registration for GBK
  4. "Coded character set identifiers - CCSID 936". IBM Globalization. IBM. Archived from the original on 2014-12-01.
  5. "Coded character set identifiers - CCSID 1386". IBM. Archived from the original on 2014-11-29.
