ARIB STD B24 character set

ARIB STD-B24 encoding
Standard: ARIB STD-B24 Volume 1
Classification: ISO 2022 profile/extension
Transforms / Encodes: ARIB STD-B24 Kanji, Kana and mosaic sets, JIS X 0201

ARIB STD-B24 Kanji set
Weather symbols: a few of the extended symbols included.
Language(s): Japanese, English, Russian
Partial support: Greek, Chinese
Standard: ARIB STD-B24 Volume 1
Classification: ISO-2022-structured CJK DBCS
Extends: JIS X 0208
Encoding formats:
  • ARIB STD-B24 encoding (ISO 2022 based)
  • Shift JIS (ARIB variant) [1]

Volume 1 of the Association of Radio Industries and Businesses (ARIB) STD-B24 standard for Broadcast Markup Language [2] specifies, among other details, a character encoding for use in Japanese-language broadcasting. It was introduced on 1999-10-26. [2] As of 2016-07-06, the latest revision is version 6.3.

It includes a number of ARIB extended characters (ARIB外字, ARIB gaiji) not found in the base standards (JIS X 0208 and JIS X 0201). It was the source standard for many symbol characters which were added to Unicode, including portions of the Miscellaneous Symbols, Enclosed Alphanumeric Supplement and Enclosed Ideographic Supplement blocks. [3] Its contributions partially overlap the Unicode emoji, but were added a year earlier, in Unicode 5.2. [4]

Fascicle 1 of the ARIB STD-B62 standard, published in 2014, defines Unicode mappings for a selection of the B24 extended characters (excluding, for example, those duplicated by JIS X 0213), as well as a few extended Kanji. [5] It also includes a mapping of utilised characters outside the Basic Multilingual Plane to the BMP's private use area.
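To illustrate what "outside the Basic Multilingual Plane" means in practice, the following minimal Python sketch checks the code points of a few characters that appear in the code charts of this article. The helper name `is_outside_bmp` is hypothetical and exists only for illustration; the actual private use area assignments are defined by ARIB STD-B62 and are not reproduced here.

```python
def is_outside_bmp(ch: str) -> bool:
    """Return True if the single character lies above U+FFFF."""
    return ord(ch) > 0xFFFF


# Sample characters taken from the code charts in this article.
samples = ["🅿", "🈐", "½", "²"]

for ch in samples:
    plane = "supplementary plane" if is_outside_bmp(ch) else "BMP"
    print(f"U+{ord(ch):04X} {plane}")
```

Only characters for which the check returns True would need a private use area substitute on a system limited to the BMP.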

Sets and codes

The ARIB STD B24 standard defines multiple character sets and a method of switching between them. These include a Kanji set (an extension of JIS X 0208), an Alphanumeric set, a Hiragana set, Katakana sets of two distinct layouts and four mosaic sets. [6] The sets are selected using ISO 2022 mechanisms for 94-sets, using the following codes (proportional sets use the same layout as the corresponding non-proportional ones): [7]

Set | Type | Code (column/line) | Code (hexadecimal) | Code (ASCII character) | Comments
Kanji | 2-byte | 4/2 | 42 | B | The escape code B used for the ARIB Kanji set [7] is used for the 1983 version of JIS C 6226 (JIS X 0208, of which the ARIB Kanji set is an extension) in ISO-2022-JP. [8] [9]
Alphanumeric | 1-byte | 4/10 | 4A | J | JIS_C6220-ro (ISO646-JP, JIS X 0201 Roman set). Similar to ASCII, with two assignments differing. Escape code J matches usage in ISO-2022-JP. [9]
Proportional alphanumeric | 1-byte | 3/6 | 36 | 6 |
Hiragana | 1-byte | 3/0 | 30 | 0 | Hiragana themselves follow the same layout as row 4 of JIS X 0208, but without a lead byte. Also adds several additional assignments for punctuation.
Proportional Hiragana | 1-byte | 3/7 | 37 | 7 |
Katakana | 1-byte | 3/1 | 31 | 1 | Katakana themselves follow the same layout as row 5 of JIS X 0208, but without a lead byte. Also adds several additional assignments for punctuation.
Proportional Katakana | 1-byte | 3/8 | 38 | 8 |
JIS X 0201 Katakana | 1-byte | 4/9 | 49 | I | JIS_C6220-jp (JIS X 0201 Kana set). Escape code matches usage in ISO-2022-JP-3.
Mosaic A | 1-byte | 3/2 | 32 | 2 | Pseudographics
Mosaic B | 1-byte | 3/3 | 33 | 3 |
Mosaic C | 1-byte | 3/4 | 34 | 4 | Non-spacing pseudographics
Mosaic D | 1-byte | 3/5 | 35 | 5 |
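The column/line notation in the table corresponds to a byte value of column × 16 + line (so 4/10 is 0x4A). The Python sketch below computes these final bytes and builds a designation escape sequence. It assumes the generic ISO 2022 forms for 94-sets (ESC 2/8 to 2/11 F for single-byte sets designated to G0 through G3, and ESC 2/4 F or ESC 2/4 2/9 to 2/11 F for the two-byte set); the exact sequences permitted by B24 should be checked against the standard itself. The names `FINAL_BYTES`, `final_byte` and `designate` are hypothetical.

```python
ESC = 0x1B

# Final bytes from the table, in column/line notation.
FINAL_BYTES = {
    "Kanji": (4, 2),
    "Alphanumeric": (4, 10),
    "Proportional alphanumeric": (3, 6),
    "Hiragana": (3, 0),
    "Proportional Hiragana": (3, 7),
    "Katakana": (3, 1),
    "Proportional Katakana": (3, 8),
    "JIS X 0201 Katakana": (4, 9),
    "Mosaic A": (3, 2),
    "Mosaic B": (3, 3),
    "Mosaic C": (3, 4),
    "Mosaic D": (3, 5),
}

# The Kanji set is the only 2-byte set listed in the table.
TWO_BYTE_SETS = {"Kanji"}


def final_byte(column: int, line: int) -> int:
    """Convert column/line notation (e.g. 4/10) to a byte value (0x4A)."""
    return column * 16 + line


def designate(set_name: str, g: int = 0) -> bytes:
    """Build an escape sequence designating a set to G0..G3.

    Assumption: generic ISO 2022 designation forms for 94-sets.
    """
    if g not in (0, 1, 2, 3):
        raise ValueError("g must be 0, 1, 2 or 3")
    f = final_byte(*FINAL_BYTES[set_name])
    g_byte = 0x28 + g  # 2/8 .. 2/11 select G0 .. G3
    if set_name in TWO_BYTE_SETS:
        if g == 0:
            # Short form ESC 2/4 F, as used for ESC $ B in ISO-2022-JP.
            return bytes([ESC, 0x24, f])
        return bytes([ESC, 0x24, g_byte, f])
    return bytes([ESC, g_byte, f])


print(designate("Kanji").hex(" "))
print(designate("Alphanumeric").hex(" "))
```

Under these assumptions, the Kanji and Alphanumeric designations to G0 come out as the same ESC $ B and ESC ( J sequences that ISO-2022-JP uses, which agrees with the comments in the table.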

Code charts

Kanji (double-byte) set

This is a double-byte character set extending JIS X 0208.

Lead byte

The encoding bytes correspond to the row or cell number plus 0x20, or 32 in decimal (see below). Hence, the code set starting with 0x21 has a row number of 1, and its cell 1 has a continuation byte of 0x21 (or 33), and so forth. Most of the code corresponds to JIS X 0208.
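The offset described above can be written out as a short Python sketch. The function names `kuten_to_bytes` and `bytes_to_kuten` are hypothetical; the only rule applied is the one stated in the text, that each byte equals the row or cell number plus 0x20.

```python
def kuten_to_bytes(row: int, cell: int) -> bytes:
    """Convert a row/cell pair (each 1..94) to its two encoding bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in the range 1..94")
    return bytes([row + 0x20, cell + 0x20])


def bytes_to_kuten(pair: bytes) -> tuple[int, int]:
    """Convert two encoding bytes (each 0x21..0x7E) back to row/cell."""
    lead, trail = pair
    if not (0x21 <= lead <= 0x7E and 0x21 <= trail <= 0x7E):
        raise ValueError("both bytes must be in the range 0x21..0x7E")
    return lead - 0x20, trail - 0x20


# Row 1, cell 1 is encoded as 0x21 0x21; row 90 uses lead byte 0x7A.
print(kuten_to_bytes(1, 1).hex(" "))
print(kuten_to_bytes(90, 45).hex(" "))
```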

ARIB STD-B24 Kanji (double-byte) set (lead bytes)
0123456789ABCDEF
2x SP 1-_ 2-_ 3-_ 4-_ 5-_ 6-_ 7-_ 8-_ 9-_ 10-_ 11-_ 12-_ 13-_ 14-_ 15-_
3x 16-_ 17-_ 18-_ 19-_ 20-_ 21-_ 22-_ 23-_ 24-_ 25-_ 26-_ 27-_ 28-_ 29-_ 30-_ 31-_
4x 32-_ 33-_ 34-_ 35-_ 36-_ 37-_ 38-_ 39-_ 40-_ 41-_ 42-_ 43-_ 44-_ 45-_ 46-_ 47-_
5x 48-_ 49-_ 50-_ 51-_ 52-_ 53-_ 54-_ 55-_ 56-_ 57-_ 58-_ 59-_ 60-_ 61-_ 62-_ 63-_
6x 64-_ 65-_ 66-_ 67-_ 68-_ 69-_ 70-_ 71-_ 72-_ 73-_ 74-_ 75-_ 76-_ 77-_ 78-_ 79-_
7x 80-_ 81-_ 82-_ 83-_ 84-_ 85-_ 86-_ 87-_ 88-_ 89-_ 90-_ 91-_ 92-_ 93-_ 94-_ DEL
  Unused lead byte
  Lead byte
  Differences from JIS X 0208

Character sets 0x21-0x74 (row numbers 1-84: punctuation, alphabets, numbers, Kana, Kanji)

Character set 0x7A (row number 90, traffic symbols)

Characters 90-45 through 90-63 and 90-66 through 90-84 (shown below shaded) are listed in the B24 standard only in table 7-10 (the list of extension characters), and are also the only characters in rows 90 through 91 which are not transport-related symbols; this is noted in the B24 standard in an endnote to table 7-10. [10] The remainder of the extensions are listed in both table 7-4 (the double-byte code chart) and table 7-10. [10]

ARIB STD-B24 Kanji (double-byte) set (prefixed with 0x7A) [5] [11]
0123456789ABCDEF
2x
3x 🅿 🆊
4x
5x 🅊 🅌 🄿 🅆 🅋 🈐 🈑 🈒 🈓 🅂 🈔 🈕 🈖 🅍 🄱 🄽
6x 🈗 🈘 🈙 🈚 🈛 🈜 🈝 🈞 🈟 🈠 🈡 🈢 🈣
7x 🈤 🈥 🅎 🈀
  Additions from table 7-10 not in table 7-4.

Character set 0x7B (row number 91, map symbols)

Characters from ARIB STD-B24 which were not retained in ARIB STD-B62 are shown shaded.

ARIB STD-B24 Kanji (double-byte) set (prefixed with 0x7B) [5] [11] [12]
0123456789ABCDEF
2x [lower-alpha 1]
3x 🅗
4x 🅟 🆋 🆍 🆌 🅹 🅻
5x 🅼
6x
7x
  Not in ARIB STD-B62

Character set 0x7C (row number 92, units, enclosed forms, list markers, arrows)

Characters from ARIB STD-B24 which were not retained in ARIB STD-B62 are shown shaded.

ARIB STD-B24 Kanji (double-byte) set (prefixed with 0x7C) [5] [11] [12]
0123456789ABCDEF
2x
3x 🄀 [lower-alpha 2] [lower-alpha 2] [lower-alpha 2] [lower-alpha 2] [lower-alpha 2] [lower-alpha 2]
4x 🄁 🄂 🄃 🄄 🄅 🄆 🄇 🄈 🄉 🄊
5x ² ³ 🄭 (vn) [lower-alpha 3] (ob) [lower-alpha 3] (cb) [lower-alpha 3] (ce [lower-alpha 3] mb) [lower-alpha 3] (hp) [lower-alpha 3] (br) [lower-alpha 3] (p) [lower-alpha 3]
6x (s) [lower-alpha 3] (ms) [lower-alpha 3] (t) [lower-alpha 3] (bs) [lower-alpha 3] (b) [lower-alpha 3] (tb) [lower-alpha 3] (tp) [lower-alpha 3] (ds) [lower-alpha 3] (ag) [lower-alpha 3] (eg) [lower-alpha 3] (vo) [lower-alpha 3] (fl) [lower-alpha 3] (ke [lower-alpha 3] y) [lower-alpha 3] (sa [lower-alpha 3] x) [lower-alpha 3]
7x (sy [lower-alpha 3] n) [lower-alpha 3] (or [lower-alpha 3] g) [lower-alpha 3] (pe [lower-alpha 3] r) [lower-alpha 3] 🄬 🄫 🆐 🈦
  Not in ARIB STD-B62

Character set 0x7D (row number 93, game and weather symbols, fractions, units, enclosed forms)

Characters from ARIB STD-B24 which were not retained in ARIB STD-B62 are shown shaded.

ARIB STD-B24 Kanji (double-byte) set (prefixed with 0x7D) [5] [11] [12]
0123456789ABCDEF
2x
3x 🉀 🉁 🉂 🉃 🉄 🉅 🉆 🉇 🉈 🄪 🈧 🈨 🈩 🈔 🈪
4x 🈫 🈬 🈭 🈮 🈯 🈰 🈱
5x ½ ¼ ¾
6x
7x
  Not in ARIB STD-B62

Character set 0x7E (row number 94, list markers)

Characters from ARIB STD-B24 which were not retained in ARIB STD-B62 are shown shaded.

ARIB STD-B24 Kanji (double-byte) set (prefixed with 0x7E) [5] [11] [12]
0123456789ABCDEF
2x
3x
4x 🄐 🄑 🄒 🄓 🄔 🄕 🄖 🄗 🄘 🄙 🄚 🄛 🄜 🄝 🄞
5x 🄟 🄠 🄡 🄢 🄣 🄤 🄥 🄦 🄧 🄨 🄩
6x
7x
  Not in ARIB STD-B62

Single-byte sets

Alphanumeric set

ARIB STD-B24 Alphanumeric set [14]
0123456789ABCDEF
2x ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ ¥ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | }
  Differences from US-ASCII
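As a rough sketch of how bytes in this set could be decoded, the Python example below treats the set as ASCII with two substitutions. It assumes the set follows the JIS X 0201 Roman set exactly: the yen sign at 0x5C appears in the chart, while the overline at 0x7E is taken from JIS X 0201 rather than from the chart as reproduced here. The names `OVERRIDES` and `decode_alphanumeric` are hypothetical.

```python
# Positions assumed to differ from US-ASCII (JIS X 0201 Roman set).
OVERRIDES = {
    0x5C: "\u00A5",  # yen sign instead of backslash
    0x7E: "\u203E",  # overline instead of tilde (assumption from JIS X 0201)
}


def decode_alphanumeric(data: bytes) -> str:
    """Decode bytes 0x21..0x7E of the alphanumeric 94-set to a string."""
    out = []
    for b in data:
        if not 0x21 <= b <= 0x7E:
            raise ValueError(f"byte 0x{b:02X} is outside the 94-set range")
        out.append(OVERRIDES.get(b, chr(b)))
    return "".join(out)


print(decode_alphanumeric(b"\x5c100"))
```

The space (0x20) and DEL (0x7F) positions are rejected here because they lie outside the 94 graphic positions of the set.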

Hiragana set

ARIB STD-B24 Hiragana set [15]
0123456789ABCDEF
2x
3x
4x
5x
6x
7x
  Character allocations not following row 4 of JIS X 0208

Katakana set

ARIB STD-B24 Katakana set [16]
0123456789ABCDEF
2x
3x
4x
5x
6x
7x
  Character allocations not following row 5 of JIS X 0208

JIS X 0201 Katakana set

ARIB STD-B24 JIS X 0201 Katakana set [17]
0123456789ABCDEF
2x
3x ソ
4x
5x
6x
7x

Mosaic sets

Shift_JIS variant

In addition to the modified ISO 2022 encoding, the B24 standard also specifies a Shift JIS encoding following JIS X 0208:1997, but with the addition of the extended characters in the kanji set. [1]

First byte
0123456789ABCDEF
0
1
2 ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ ¥ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | }
8
9
A
Bソ
C
D
E
F
Second byte
0123456789ABCDEF
0
1
2
3
4
5
6
7
8
9
A
B
C
D
E
F
 
Non printable ASCII character
Unaltered ASCII character
Modified ASCII character
Single-byte half-width katakana
First byte of a double-byte character, used by JIS X 0208
First byte of an ARIB extended character
Not used as first byte, unallocated space in JIS X 0208
Not used as first byte
Second byte of a double-byte character whose first half of the JIS sequence was odd
Second byte of a double-byte character whose first half of the JIS sequence was even
Unused as second byte of a double-byte character
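The odd/even distinction in the legend comes from the way Shift JIS packs two JIS rows into one lead byte. The Python sketch below applies the usual JIS X 0208 to Shift JIS byte transformation. Whether the ARIB variant places the extended rows (90 to 94) at exactly the bytes this arithmetic yields is an assumption here and should be checked against B24 part 2, section 7.3. The function name `jis_to_sjis` is hypothetical.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert a 7-bit JIS byte pair (0x21..0x7E each) to Shift JIS bytes."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("both bytes must be in the range 0x21..0x7E")
    # Two JIS rows share one Shift JIS lead byte; the lead byte range
    # jumps from 0x9F to 0xE0 to skip the half-width katakana area.
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        # Odd first byte: second byte lands in 0x40..0x9E, skipping 0x7F.
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
    else:
        # Even first byte: second byte lands in 0x9F..0xFC.
        s2 = j2 + 0x7E
    return bytes([s1, s2])


print(jis_to_sjis(0x21, 0x21).hex(" "))
print(jis_to_sjis(0x30, 0x21).hex(" "))
```

For the JIS X 0208 portion the result can be compared with Python's built-in `shift_jis` codec; the extended rows are not covered by that codec, so they cannot be verified the same way.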

Footnotes

  1. Glossed as "temple" (i.e. Buddhist temple) in B24 table 7-10 (the list of extension characters).
  2. Small form (70% size per code chart / table 7-10) of a kanji character. Shown here simulated. Private Use Area code points shown are those used by the Nishiki-teki font. [13]
  3. Musical abbreviation (or half thereof) not present in Unicode, simulated here with multiple characters. Private Use Area code points shown are those used by the Nishiki-teki font.

References

  1. ARIB (2008), p. 105, part 2, section 7.3
  2. ARIB (2008)
  3. Suignard, Michel (2008-03-11). "ISO/IEC JTC1/SC2/WG2 N 3397: Japanese TV Symbols" (PDF).
  4. "Unicode 5.2 Emoji List". Emojipedia.
  5. ARIB (2014), pp. 33–50, part 2, Table 5-2
  6. ARIB (2008), pp. 48–52
  7. ARIB (2008), p. 39, part 2, Table 7-3
  8. Japanese National Committee on ISO/TC97/SC2 (1984-07-01). Japanese Graphic Character Set for Information Interchange (PDF). ITSCJ/IPSJ. ISO-IR-87.
  9. RFC 1468 (IETF)
  10. ARIB (2008), p. 72
  11. ARIB (2008), pp. 54–72, part 2, Table 7-10
  12. ARIB (2008), pp. 46–47, part 2, Table 7-4
  13. "Nishiki-teki Version 3.82b (2021-07-23) - 6,416 characters in the Private Use Areas" (PDF).
  14. ARIB (2008), p. 48, part 2, Table 7-5
  15. ARIB (2008), p. 50, part 2, Table 7-7
  16. ARIB (2008), p. 49, part 2, Table 7-6
  17. ARIB (2008), p. 52, part 2, Table 7-9
