Shift JIS

MIME / IANA: Shift_JIS
Alias(es): MS_Kanji, [1] PCK [2] [3]
Language(s): Primarily Japanese, but also supporting English, Russian, Bulgarian, Greek
Standard: JIS X 0208:1997 Appendix 1
Classification: Extended ISO 646, [lower-alpha 1] variable-width encoding, CJK encoding
Extends: JIS X 0201 8-bit format
Transforms / Encodes: JIS X 0208
Succeeded by: Shift_JIS-2004 (JIS), Windows-31J (web)

Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) [2] [3] is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation [lower-alpha 2] in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.

Shift JIS is based on character sets defined within JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double-byte characters).

As of April 2024, 0.3% of surveyed web pages used Shift JIS (actually decoded as its superset Windows-31J encoding), a decline from 1.3% in July 2014. [4] Shift JIS is the second-most declared character encoding for Japanese websites, used by 5.2% of sites in the .jp domain, while UTF-8 is used by 94.8% of Japanese websites. [5] [6]

Structure

Shift JIS is an extension of the single-byte encoding JIS X 0201:1997, which uses unassigned code points in JIS X 0201 to encode the double-byte JIS X 0208:1997 character set. The lead bytes for the double-byte characters are "shifted" around the 63 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF.

The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively (these deviations from ASCII align with JIS X 0201). The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201.

For double-byte characters, the first byte is always in the range 0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are unassigned in JIS X 0201). Each lead byte covers two rows of JIS X 0208: if the first byte of the underlying JIS X 0208 sequence is odd, the second Shift JIS byte must be in the range 0x40 to 0x9E (but cannot be 0x7F); if it is even, the second byte must be in the range 0x9F to 0xFC.
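The byte ranges described above can be sketched as a small classifier. This is a minimal illustration for standard Shift JIS only; the function names and the short category labels are hypothetical and chosen for this example.

```python
def classify_first_byte(b: int) -> str:
    """Classify a byte that starts a character in standard Shift JIS."""
    if 0x00 <= b <= 0x7F:
        return "roman"       # single-byte JIS X 0201 Roman (ASCII-like)
    if 0xA1 <= b <= 0xDF:
        return "katakana"    # single-byte half-width katakana
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead"        # first byte of a double-byte character
    return "unused"          # 0x80, 0xA0, 0xF0-0xFF


def is_valid_trail_byte(b: int) -> bool:
    """True if b may appear as the second byte of a double-byte character."""
    return 0x40 <= b <= 0xFC and b != 0x7F


for value in (0x41, 0xB1, 0x82, 0xA0):
    print(hex(value), classify_first_byte(value))
```

Note that the trail-byte range overlaps both the ASCII-like range and the lead-byte range, so a byte cannot be classified without knowing whether it starts a character.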

Shift JIS only guarantees that the first byte of a two-byte character has its high bit set (0x80–0xFF); the second byte can be either high or low. Because byte values 0x40–0x7E can appear as second bytes, and the same values encode ASCII characters, reliable Shift JIS detection is difficult. String searches are also difficult: the same byte value can be either a first or a second byte, so a naive search can match the second byte of one character together with the first byte of the next, which does not correspond to any character in the text. String-searching algorithms must be tailored to Shift JIS.
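The search problem can be illustrated with a short sketch. It assumes Python's built-in shift_jis codec and the standard lead-byte ranges; iter_chars and find_char_aligned are hypothetical helper names, not part of any library. The needle is assumed to consist of whole characters.

```python
def iter_chars(data: bytes):
    """Yield (offset, char_bytes) for each character in a Shift JIS byte string."""
    i = 0
    while i < len(data):
        b = data[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF
        width = 2 if is_lead and i + 1 < len(data) else 1
        yield i, data[i:i + width]
        i += width


def find_char_aligned(haystack: bytes, needle: bytes) -> int:
    """Return the first offset where needle matches on a character boundary, else -1."""
    for offset, _ in iter_chars(haystack):
        if haystack.startswith(needle, offset):
            return offset
    return -1


text = "アA".encode("shift_jis")         # bytes 83 41 41
print(text.find(b"A"))                   # 1: false hit on the trail byte of ア
print(find_char_aligned(text, b"A"))     # 2: the real letter A
```

Walking the string from the start is the simplest way to stay on character boundaries; it costs a linear scan because a boundary cannot be recovered from an arbitrary byte in isolation.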

Compatibility

Shift JIS is fully backwards compatible with the JIS X 0201 single-byte encoding, meaning that any valid JIS X 0201 string is also a valid Shift JIS string.

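Double-byte characters in JIS X 0208 need to be transformed in order to be encoded in Shift JIS. For a double-byte JIS X 0208 sequence $j_1 j_2$, [lower-alpha 3] the transformation to the corresponding Shift JIS bytes $s_1 s_2$ is:

$$s_1 = \begin{cases} \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 112 & \text{if } 33 \le j_1 \le 94 \\ \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 176 & \text{if } 95 \le j_1 \le 126 \end{cases}$$

$$s_2 = \begin{cases} j_2 + 31 + \left\lfloor \frac{j_2}{96} \right\rfloor & \text{if } j_1 \text{ is odd} \\ j_2 + 126 & \text{if } j_1 \text{ is even} \end{cases}$$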
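As a minimal sketch, the transformation can be written as follows, using the commonly published form of the mapping: the lead byte is floor((j1 + 1) / 2) plus 112 (for j1 up to 94) or 176 (for j1 from 95), and the trail byte is j2 + 31 + floor(j2 / 96) when j1 is odd or j2 + 126 when j1 is even. The function name jis_to_sjis is hypothetical, and the comparison relies on Python's built-in shift_jis codec.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Map a JIS X 0208 byte pair (each 0x21-0x7E) to its Shift JIS bytes."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("j1 and j2 must each be in the range 0x21 to 0x7E")
    # Two JIS rows share one lead byte; the offset skips 0xA0-0xDF.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2 == 1:
        # Odd rows use 0x40-0x9E; j2 // 96 skips the byte 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even rows use 0x9F-0xFC.
        s2 = j2 + 126
    return bytes([s1, s2])


# JIS X 0208 0x2422 is HIRAGANA LETTER A.
print(jis_to_sjis(0x24, 0x22).hex())      # 82a0
print("あ".encode("shift_jis").hex())      # 82a0
```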

The competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a cleaner and more direct conversion to and from JIS X 0208 code points, as all high-bit-set bytes are parts of a double-byte character and all bytes in the ASCII range represent single-byte characters.
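For comparison, the EUC-JP conversion for JIS X 0208 characters amounts to setting the high bit of each byte. The sketch below assumes Python's built-in euc_jp and shift_jis codecs; the helper names are hypothetical.

```python
def jis_to_euc_jp(j1: int, j2: int) -> bytes:
    """Set the high bit of each JIS X 0208 byte to get the EUC-JP bytes."""
    return bytes([j1 | 0x80, j2 | 0x80])


def euc_jp_to_jis(e1: int, e2: int) -> tuple:
    """Clear the high bit of each EUC-JP byte to recover the JIS X 0208 pair."""
    return e1 & 0x7F, e2 & 0x7F


# JIS X 0208 0x2422 is HIRAGANA LETTER A.
print(jis_to_euc_jp(0x24, 0x22).hex())     # a4a2
print("あ".encode("euc_jp").hex())          # a4a2
print("あ".encode("shift_jis").hex())       # 82a0
```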

Usage

HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, and when the charset tag is at the top of the document itself, since the bytes that delimit HTML tags and fields (<, >, /, ", &, ;) are encoded as in ASCII and do not appear in two-byte sequences.

Shift JIS can be used in string literals in programming languages such as C, but two things must be taken into consideration. Firstly, the escape character 0x5C, normally the backslash, is the half-width yen sign (¥) in Shift JIS. A programmer who is aware of this can write printf("ハローワールド¥n"); (where ハローワールド is "Hello, world" and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, a 0x5C byte that appears as the second byte of a two-byte character is interpreted as the start of an escape sequence, which corrupts the string unless it is followed by another 0x5C.
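To see which characters are affected by the second problem, one can list the characters whose Shift JIS encoding ends in 0x5C. This is a minimal sketch that assumes Python's built-in shift_jis codec; trail_backslash_chars is a hypothetical helper name.

```python
def trail_backslash_chars(text: str, encoding: str = "shift_jis") -> list:
    """Return the characters of text whose two-byte encoding ends in 0x5C."""
    risky = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) == 2 and encoded[1] == 0x5C:
            risky.append(ch)
    return risky


print("ソ".encode("shift_jis").hex())          # 835c
print(trail_backslash_chars("ソフト表示"))      # ['ソ', '表']
```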

Multiple versions

[Figure: Euler diagram comparing repertoires of JIS X 0208, JIS X 0212, JIS X 0213, Windows-31J, the Microsoft standard repertoire and Unicode]
[Figure: Relationship between Shift_JIS variants on the PC and related encodings, including intersections and other subsets. Names given are descriptive.]

Many different versions of Shift JIS exist. There are two areas for expansion:

Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters here; these are really extensions to JIS X 0208 rather than to Shift JIS itself.

Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters (as either single-byte or double-byte characters).

Windows-932 / Windows-31J

The most popular extension is Windows code page 932 (a CCSID also used for IBM's extension to Shift JIS), which is registered with the IANA as "Windows-31J", [1] separately from Shift JIS. This was popularized by Microsoft, although Microsoft itself does not recognize the Windows-31J name and instead calls that variation "shift_jis". [7] [8] IBM's code page 943 includes the same double-byte codes as Microsoft's code page 932, while IBM's code page 932 includes fewer extensions (excluding those which Microsoft incorporates from NEC), and retains the character order from the 1978 edition of JIS X 0208, rather than implementing the character variant swaps from the 1983 standard. [9]

Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the backslash), and 0x7E to U+007E TILDE, following US-ASCII. [10] However, most localised fonts on Windows display U+005C as a Yen sign for JIS X 0201 compatibility. [11] [12] It includes several extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", [1] in addition to setting some encoding space aside for end user definition. [13]
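The effect of the NEC row 13 extension can be observed with a short sketch. It assumes Python's built-in codecs, where cp932 implements Microsoft's code page and shift_jis is limited to the JIS X 0201 and JIS X 0208 repertoire; the helper name encodable is hypothetical.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


circled_one = "\u2460"   # CIRCLED DIGIT ONE, an NEC special character in row 13

print(circled_one.encode("cp932").hex())     # 8740
print(encodable(circled_one, "cp932"))       # True
print(encodable(circled_one, "shift_jis"))   # False
```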

Windows codepage 932 is the version used in the W3C/WHATWG encoding standard used by HTML5, which includes the "formerly proprietary extensions from IBM and NEC" from Windows-31J in its table for JIS X 0208, [14] and also treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content". [15]

MacJapanese

The version of Shift-JIS originating from the classic Mac OS (known as x-mac-japanese, Code page 10001 [7] or MacJapanese) assigned the tilde to 0x7E (following US-ASCII, not JIS X 0201 which assigns the overline here), but the Yen sign to 0x5C (as in JIS X 0201 and standard Shift JIS). It also extended JIS X 0201 by assigning the backslash to 0x80 (corresponding to 0x5C in US-ASCII), the non-breaking space to 0xA0, the copyright sign to 0xFD, the trademark symbol to 0xFE and the half-width horizontal ellipsis to 0xFF. It also added extended double byte characters, including 53 vertical presentation forms in the Shift_JIS range 0xEB41–0xED96, at 84 JIS rows down from their canonical forms, and 260 special characters in the Shift_JIS range 0x8540–0x886D. [16] This variant was introduced in KanjiTalk version 7. [17]

However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a "PostScript" variant of MacJapanese, which included additional vertical presentation forms and a different set of extended special characters, based on the NEC special characters, some of which were only available in the printer versions of the fonts. [16] Older versions of Maru Gothic and Hon Mincho from System 7.1 encoded vertical presentation forms at 10 (not 84) JIS rows down from their canonical forms, and did not include the special character extensions; this was subsequently changed. [16] [18] The typical variant used with KanjiTalk version 6 placed the vertical presentation forms 10 rows down, and also used the NEC extension layout for row 13. [19]

Shift_JISx0213 and Shift_JIS-2004

Alias(es): Shift_JISx0213
Language(s): Japanese, Ainu, English, Russian
Standard: JIS X 0213
Extends: Shift_JIS (1997), JIS X 0201 (8-bit)
Transforms / Encodes: JIS X 0213
Preceded by: Shift_JIS (1997)

The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS. [20]

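In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses the following method of mapping codepoints. [21]

$$s_1 = \begin{cases} \left\lfloor \frac{k + 257}{2} \right\rfloor & \text{if } m = 1 \text{ and } 1 \le k \le 62 \\ \left\lfloor \frac{k + 385}{2} \right\rfloor & \text{if } m = 1 \text{ and } 63 \le k \le 94 \\ \left\lfloor \frac{k + 479}{2} \right\rfloor - \left\lfloor \frac{k}{8} \right\rfloor \times 3 & \text{if } m = 2 \text{ and } k \in \{1, 3, 4, 5, 8, 12, 13, 14, 15\} \\ \left\lfloor \frac{k + 411}{2} \right\rfloor & \text{if } m = 2 \text{ and } 78 \le k \le 94 \end{cases}$$

$$s_2 = \begin{cases} t + 63 & \text{if } k \text{ is odd and } 1 \le t \le 63 \\ t + 64 & \text{if } k \text{ is odd and } 64 \le t \le 94 \\ t + 158 & \text{if } k \text{ is even} \end{cases}$$

In the above, $s_1 s_2$ is a two-byte Shift_JIS-2004 sequence, $m$ is the plane (面, men, surface) number (1 or 2), $k$ is the row (区, ku, ward) number (1–94) and $t$ is the cell (点, ten, point) number (1–94). The ku and ten numbers are equivalent to $j_1 - 32$ and $j_2 - 32$ respectively, where $j_1 j_2$ is a two-byte JIS sequence referencing a given plane.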
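The plane, row and cell mapping can be sketched as below. The sketch assumes the commonly published form of the Shift_JIS-2004 mapping in decimal: plane 1 rows map to lead byte floor((k + 257) / 2) up to row 62 and floor((k + 385) / 2) from row 63; the plane 2 rows 1, 3, 4, 5, 8 and 12 to 15 map to floor((k + 479) / 2) - 3 * floor(k / 8), and plane 2 rows 78 to 94 map to floor((k + 411) / 2). The function and constant names are hypothetical.

```python
# Plane 2 rows allocated below row 78 (assumption: the JIS X 0213 plane 2 layout).
PLANE2_LOW_ROWS = {1, 3, 4, 5, 8, 12, 13, 14, 15}


def men_ku_ten_to_sjis2004(m: int, k: int, t: int) -> bytes:
    """Map plane m, row k, cell t to a two-byte Shift_JIS-2004 sequence."""
    if not (1 <= k <= 94 and 1 <= t <= 94):
        raise ValueError("row and cell must each be in the range 1 to 94")
    if m == 1:
        s1 = (k + 257) // 2 if k <= 62 else (k + 385) // 2
    elif m == 2 and k in PLANE2_LOW_ROWS:
        s1 = (k + 479) // 2 - (k // 8) * 3
    elif m == 2 and 78 <= k <= 94:
        s1 = (k + 411) // 2
    else:
        raise ValueError("row is not allocated on this plane")
    if k % 2 == 1:
        s2 = t + 63 if t <= 63 else t + 64   # skips the byte 0x7F
    else:
        s2 = t + 158
    return bytes([s1, s2])


print(men_ku_ten_to_sjis2004(1, 4, 2).hex())    # 82a0, HIRAGANA LETTER A
print(men_ku_ten_to_sjis2004(2, 1, 1).hex())    # f040, first cell of plane 2
```

For plane 1 rows that JIS X 0208 also uses, the result agrees with standard Shift JIS, which matches the statement that Shift_JIS-2004 is a superset of it.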

The same set of characters can be represented by EUC-JIS-2004, the EUC-JP based counterpart.

Some of the additions collide with popular Shift JIS extensions, including Windows codepage 932 which is used in web standards (see above). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏...) [22] to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈...). [23] In addition, some of the characters map to Unicode characters beyond the BMP.

Other variants

The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese mobile phone operators for pictographs for use in e-mail. [24] KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4. [25]

Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion when the extensions are used.

One further variant is needed to embed Shift JIS in source code strings of C and similar programming languages. Because 0x5C is the beginning of an escape sequence, this variant doubles the byte 0x5C when it appears as the second byte of a two-byte character, but not when it appears as a single "¥" (ASCII: "\") character. The best way of handling this is a special editor which encodes Shift JIS this way.
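The doubling rule can be sketched as a small byte-level transform. This is an illustration only, assuming the standard lead-byte ranges 0x81–0x9F and 0xE0–0xEF; escape_for_c_literal is a hypothetical name.

```python
def escape_for_c_literal(sjis: bytes) -> bytes:
    """Double 0x5C when it is the trail byte of a two-byte character."""
    out = bytearray()
    i = 0
    while i < len(sjis):
        b = sjis[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF
        if is_lead and i + 1 < len(sjis):
            trail = sjis[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                out.append(0x5C)   # the compiler collapses the pair back to one 0x5C
            i += 2
        else:
            out.append(b)          # a single-byte 0x5C is a real escape, left alone
            i += 1
    return bytes(out)


# ソ (83 5C) followed by the escape sequence 5C 6E
print(escape_for_c_literal(b"\x83\x5c\x5cn").hex())   # 835c5c5c6e
```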

Shift JIS byte map

As defined in JIS X 0208:1997

The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997).

First byte
0x00–0x1F, 0x7F: Non printable ASCII character
0x20–0x5B, 0x5D–0x7D: Unaltered ASCII character
0x5C (¥), 0x7E (‾): Modified ASCII character
0xA1–0xDF: Single-byte half-width katakana
0x81–0x9F, 0xE0–0xEF: First byte of a double-byte JIS X 0208 character
0x80, 0xA0, 0xF0–0xFF: Unused as first byte of a JIS X 0208 character

Second byte
0x40–0x7E, 0x80–0x9E: Second byte of a double-byte JIS X 0208 character whose first half of the JIS sequence was odd
0x9F–0xFC: Second byte of a double-byte JIS X 0208 character whose first half of the JIS sequence was even
0x00–0x3F, 0x7F, 0xFD–0xFF: Unused as second byte of a JIS X 0208 character

With vendor or JIS X 0213 extensions

Some of the bytes which are not used for single-byte codes or initial bytes in JIS X 0208:1997 are used by certain extensions, resulting in the layout detailed in the chart below.

First byte
0x00–0x1F, 0x7F: Non printable ASCII character
0x20–0x5B, 0x5D–0x7D: Unaltered ASCII character
0x5C (¥), 0x7E (‾): Modified ASCII character
0xA1–0xDF: Single-byte half-width katakana
0x81–0x84, 0x88–0x9F, 0xE0–0xEA: First byte of a double-byte character, used by JIS X 0208 (and by extensions such as JIS X 0213 plane 1)
0x85–0x87, 0xEB–0xEF: First byte of a double-byte character, unallocated in JIS X 0208 but used by JIS X 0213 plane 1 or by vendor extensions
0xF0–0xFC: First byte of a double-byte character beyond JIS X 0208, used for JIS X 0213 plane 2 or for unrelated extensions
0x80, 0xA0, 0xFD–0xFF: Not used as first byte, used by some single byte extensions

Second byte
0x40–0x7E, 0x80–0x9E: Second byte of a double-byte character whose first half of the JIS sequence was odd
0x9F–0xFC: Second byte of a double-byte character whose first half of the JIS sequence was even
0x00–0x3F, 0x7F, 0xFD–0xFF: Unused as second byte of a double-byte character

Footnotes

  1. Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.
  2. The ASCII Corporation should not be confused with the ASCII encoding used elsewhere in this article.
  3. In JIS X 0208, j1 and j2 are each in the range 33 (0x21) to 126 (0x7E) inclusive (i.e., 7-bit character values excluding the control characters 0–31 (0x00–0x1F) and 127 (0x7F), and space).

References

  1. "Character Sets". IANA.
  2. "convutf8.c". OpenSolaris. Line 305. 2008-11-12.
  3. "Additional Japanese iconv Modules". What's New in the Solaris 9 9/04 Operating Environment. Oracle Corporation.
  4. "Historical trends in the usage of character encodings for websites, April 2024". w3techs.com. Retrieved 2024-04-09.
  5. "Distribution of Character Encodings among websites that use .jp". w3techs.com. Retrieved 2024-04-09.
  6. "Distribution of Character Encodings among websites that use Japanese". w3techs.com. Retrieved 2024-04-09.
  7. "Encoding.WindowsCodePage Property – .NET Framework (current version)". MSDN. Microsoft.
  8. "Code Page Identifiers". Windows Dev Center. Microsoft. 7 January 2021.
  9. "IBM-943 and IBM-932". IBM Knowledge Center. IBM.
  10. "CP932.TXT". Unicode Consortium.
  11. "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03.
  12. Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".
  13. Kaplan, Michael S (2007-05-26). "The PUA outside of Unicode". Sorting it all out.
  14. "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
  15. "4.2. Names and labels". Encoding Standard. WHATWG.
  16. "JAPANESE.TXT: Map (external version) from Mac OS Japanese encoding to Unicode 2.1 and later". Apple Computer, Inc.; Unicode Consortium.
  17. Lunde, Ken (2019-03-21). "A Brief History of Japan's Era Name Ligatures". CJK Type Blog. Adobe Inc.
  18. "Encoding Variants for MacJapanese". Apple Developer Documentation. Apple.
  19. Lunde, Ken (2008). "Appendix E: Vendor Character Set Standards" (PDF). CJKV Information Processing. O'Reilly Media. ISBN 9780596514471.
  20. "JIS X 0213 Code Mapping Tables". x0213.org.
  21. "JIS X 0213の代表的な符号化方式 § Shift_JIS-2004" (in Japanese). Hexadecimal numbers in the source have been converted to decimal for display.
  22. Japanese Industrial Standards Committee (2004-04-13). Japanese Graphic Character Set for Information Interchange, Plane 1 (PDF). ITSCJ/IPSJ. ISO-IR-233.
  23. "Index jis0208 visualization". Encoding Standard. WHATWG.
  24. "Original Emoji from DoCoMo". FileFormat.info.
  25. "Original Emoji from KDDI". FileFormat.info.