Japanese language and computers

A Japanese kana keyboard

Adapting computers to the Japanese language raises many issues, some unique to Japanese and others common to languages that have a very large number of characters. The number of characters needed to write English is quite small, so each English character can be encoded in a single byte (2⁸ = 256 possible values). Japanese has far more than 256 characters and therefore cannot be encoded using a single byte; it is instead encoded using two or more bytes, in a so-called "double-byte" or "multi-byte" encoding. The problems that arise relate to transliteration and romanization, character encoding, and input of Japanese text.
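As a rough illustration of the byte counts involved, the following Python sketch encodes a few characters with codecs from Python's standard library. The helper name `encoded_length` and the choice of sample characters are assumptions made for this example only.

```python
# Sample characters: a Latin letter, a hiragana, and a kanji.
SAMPLES = ["A", "あ", "漢"]
ENCODINGS = ["ascii", "shift_jis", "euc_jp", "utf-8"]


def encoded_length(ch, encoding):
    """Return the number of bytes needed for ch, or None if the
    encoding cannot represent it at all."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


for ch in SAMPLES:
    sizes = {enc: encoded_length(ch, enc) for enc in ENCODINGS}
    # Print the code point rather than the character itself.
    print(f"U+{ord(ch):04X}", sizes)
```

The Latin letter fits in one byte everywhere, while the kana and the kanji cannot be represented in ASCII and need two or three bytes in the other encodings.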

Character encodings

There are several standard methods to encode Japanese characters for use on a computer, including JIS, Shift-JIS, EUC, and Unicode. While mapping the set of kana is a simple matter, kanji has proven more difficult. Despite efforts, none of the encoding schemes became the de facto standard, and multiple encoding standards were still in use in the 2000s. By 2017, the share of UTF-8 traffic on the Internet had grown to over 90% worldwide, and Shift-JIS and EUC together accounted for only 1.2%. Even so, a few popular websites, including 2channel and kakaku.com, were still using Shift-JIS. [1]

Until the 2000s, most Japanese e-mail was in ISO-2022-JP ("JIS encoding"), web pages were in Shift-JIS, and mobile phones in Japan usually used some form of Extended Unix Code. [2] If a program fails to determine the encoding scheme employed, the result can be mojibake (文字化け, "misconverted garbled/garbage characters", literally "transformed characters") and thus unreadable text.
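Mojibake is easy to reproduce. The sketch below, which uses only Python's built-in codecs, encodes a word as Shift-JIS and then decodes the same bytes under a wrong assumption (EUC-JP). The variable names are illustrative.

```python
text = "文字化け"

# Bytes as a Shift-JIS web page or file would store them.
raw = text.encode("shift_jis")

# Decoding with the correct codec round-trips cleanly.
right = raw.decode("shift_jis")

# Decoding with the wrong codec does not fail outright here because
# errors="replace" substitutes U+FFFD for undecodable bytes; the
# result is garbled text rather than the original word.
wrong = raw.decode("euc_jp", errors="replace")

print(len(raw), right == text, wrong == text)
```

Without `errors="replace"`, Python would raise `UnicodeDecodeError` instead; many older programs had no such check and simply displayed the garbage.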

A kanji ROM card installed in a PC-98. It stored about 3,000 glyphs for fast display and also had RAM for storing gaiji.
Some embedded devices still use half-width kana.

The first encoding to become widely used was JIS X 0201, a single-byte encoding that covers only standard 7-bit ASCII characters plus half-width katakana extensions. It was widely used in systems that had neither the processing power nor the storage to handle kanji (including old embedded equipment such as cash registers), because kana-kanji conversion required a complicated process and kanji output required much memory and high resolution. Such systems therefore supported only katakana, not kanji. Some embedded displays still have this limitation.

The development of kanji encodings was the beginning of the split. Shift JIS supports kanji and was developed to be completely backward compatible with JIS X 0201, and is therefore found in much embedded electronic equipment. However, Shift JIS has the unfortunate property that it often breaks any parser (software that reads the coded text) that is not specifically designed to handle it.
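The backward compatibility can be seen in how a Shift JIS byte stream is split into characters. The following is a minimal sketch, assuming the commonly documented lead-byte ranges 0x81–0x9F and 0xE0–0xFC; the function name `split_shift_jis` is hypothetical and the code does no validation of trail bytes.

```python
def split_shift_jis(data: bytes):
    """Split Shift JIS bytes into one bytes object per character.

    Bytes in the lead-byte ranges start a two-byte character; every
    other byte (ASCII or half-width katakana, as in JIS X 0201) stands
    alone.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            units.append(data[i:i + 2])  # two-byte character (e.g. kanji)
            i += 2
        else:
            units.append(data[i:i + 1])  # single-byte character
            i += 1
    return units


# "A", half-width katakana KA (U+FF76), and a kanji.
sample = "A\uff76漢".encode("shift_jis")
units = split_shift_jis(sample)
print([u.hex() for u in units])
```

The ASCII letter and the half-width katakana keep their one-byte JIS X 0201 values, and only the kanji takes two bytes.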

For example, some Shift-JIS characters include a backslash (0x5C "\") in the second byte, which is used as an escape character in many programming languages. The phrase 構わない, for instance, is encoded as:

8d5c 82ed 82c8 82a2

A parser lacking support for Shift JIS will recognize 0x5C 0x82 as an invalid escape sequence and remove the backslash. [3] The remaining bytes no longer spell the phrase and are displayed as mojibake:

8d 82ed 82c8 82a2

This can happen, for example, in the C programming language when text strings contain Shift-JIS. It does not happen in HTML, since ASCII 0x00–0x3F (which includes ", %, & and some other commonly used escape characters and string separators) does not appear as the second byte in Shift-JIS, and the backslash is not an escape character there. It can, however, happen in JavaScript, which can be embedded in HTML pages.
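The failure can be simulated in a few lines of Python. The function `drop_invalid_escapes` below is a hypothetical, deliberately naive byte-oriented parser written for this illustration (it is not how any particular compiler behaves); the set of accepted escape letters is likewise an assumption. The input bytes are the eight-byte Shift-JIS example shown above.

```python
# Shift-JIS bytes of the example; the second byte is 0x5C (backslash).
phrase = bytes.fromhex("8d5c82ed82c882a2")

# Escape letters this toy parser accepts after a backslash (assumed).
KNOWN_ESCAPES = b"nrt0\\\"'"


def drop_invalid_escapes(data: bytes) -> bytes:
    """Treat every 0x5C as an escape introducer, unaware of Shift JIS."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x5C and i + 1 < len(data):
            if data[i + 1] in KNOWN_ESCAPES:
                out += data[i:i + 2]  # keep a valid escape untouched
                i += 2
            else:
                i += 1  # invalid escape: silently drop the backslash
            continue
        out.append(byte)
        i += 1
    return bytes(out)


damaged = drop_invalid_escapes(phrase)
print(phrase.hex(), "->", damaged.hex())
```

Decoding `damaged` as Shift-JIS no longer yields the original phrase, because the lead byte 0x8D now pairs with the following 0x82 and every later character boundary shifts.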

EUC, on the other hand, is handled much better by parsers that have been written for 7-bit ASCII (and thus EUC encodings are used on UNIX, where much of the file-handling code was historically written only for English encodings). But EUC is not backward compatible with JIS X 0201, the first main Japanese encoding. Further complications arise because the original Internet e-mail standards support only 7-bit transfer protocols. Thus RFC 1468 ("ISO-2022-JP", often simply called JIS encoding) was developed for sending and receiving e-mail.
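The difference in byte ranges can be checked directly with Python's built-in `euc_jp` and `iso2022_jp` codecs. This is a small sketch; the sample phrase and variable names are chosen for illustration.

```python
text = "構わない"

euc = text.encode("euc_jp")
jis = text.encode("iso2022_jp")

# In EUC-JP every byte of a kanji or kana has the high bit set, so an
# ASCII-oriented parser never mistakes part of a character for "\".
euc_is_high = all(b >= 0x80 for b in euc)

# ISO-2022-JP stays within 7 bits; it switches character sets with
# escape sequences (ESC $ B for JIS X 0208, ESC ( B back to ASCII).
jis_is_7bit = all(b < 0x80 for b in jis)

print(euc.hex(), euc_is_high)
print(jis_is_7bit, jis[:3], jis[-3:])
```

Because ISO-2022-JP is stateful, a parser has to track the current character set, which is one reason it was used mainly for e-mail transport rather than as an internal format.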

Gaiji are used in the closed captions of Japanese TV broadcasts.

In character set standards such as JIS, not all required characters are included, so gaiji (外字 "external characters") are sometimes used to supplement the character set. Gaiji may come in the form of external font packs, where normal characters have been replaced with new characters, or the new characters have been added to unused character positions. However, gaiji are not practical in Internet environments since the font set must be transferred with text to use the gaiji. As a result, such characters are written with similar or simpler characters in place, or the text may need to be encoded using a larger character set (such as Unicode) that supports the required character. [4]

Unicode was intended to solve all encoding problems over all languages. The UTF-8 encoding used to encode Unicode in web pages does not have the disadvantages that Shift-JIS has. Unicode is supported by international software, and it eliminates the need for gaiji. There are still controversies, however. For Japanese, the kanji characters have been unified with Chinese; that is, a character considered to be the same in both Japanese and Chinese is given a single number, even if the appearance is actually somewhat different, with the precise appearance left to the use of a locale-appropriate font. This process, called Han unification, has caused controversy. [citation needed] The previous encodings in Japan, Taiwan Area, Mainland China and Korea each handled only one language, whereas Unicode is meant to handle all of them. The handling of kanji/Chinese characters has, however, been designed by a committee composed of representatives from all four countries/areas. [citation needed]
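One concrete sense in which UTF-8 avoids the Shift-JIS problem is that no byte of a multi-byte UTF-8 sequence falls in the ASCII range. The sketch below checks this for a sample phrase; `ascii_range_bytes` is a hypothetical helper written for this example.

```python
def ascii_range_bytes(text, encoding):
    """Collect bytes below 0x80 that occur inside the encoded form of
    non-ASCII characters (these are the bytes that can confuse an
    ASCII-oriented parser)."""
    hits = set()
    for ch in text:
        if ord(ch) < 0x80:
            continue  # genuine ASCII characters are not a problem
        hits.update(b for b in ch.encode(encoding) if b < 0x80)
    return hits


phrase = "構わない"
sjis_hits = ascii_range_bytes(phrase, "shift_jis")
utf8_hits = ascii_range_bytes(phrase, "utf-8")
print(sorted(hex(b) for b in sjis_hits), sorted(hex(b) for b in utf8_hits))
```

For this phrase Shift-JIS exposes the byte 0x5C (backslash), while UTF-8 exposes none.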

Text input

Written Japanese uses several different scripts: kanji (Chinese characters), two sets of kana (phonetic syllabaries) and roman letters. While kana and roman letters can be typed directly into a computer, entering kanji is a more complicated process, as there are far more kanji than there are keys on most keyboards. To input kanji on modern computers, the reading of the kanji is usually entered first; an input method editor (IME), also sometimes known as a front-end processor, then shows a list of candidate kanji that are a phonetic match and allows the user to choose the correct one. More advanced IMEs work not by word but by phrase, thus increasing the likelihood of getting the desired characters as the first option presented. Kanji readings can be entered either via romanization (rōmaji nyūryoku, ローマ字入力) or via direct kana input (kana nyūryoku, かな入力). Romaji input is more common on PCs and other full-size keyboards (although direct input is also widely supported), whereas direct kana input is typically used on mobile phones and similar devices: each of the 10 digits (1–9,0) corresponds to one of the 10 columns in the gojūon table of kana, and multiple presses select the row.
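The keypad scheme described above can be sketched as a simple lookup. The digit-to-column assignment in `KEYPAD` follows the layout commonly found on Japanese phone keypads and is an assumption here; the sketch ignores small kana, voicing marks and punctuation, and the function name `multitap` is hypothetical.

```python
# One gojūon column per digit (assumed common phone layout).
KEYPAD = {
    "1": "あいうえお", "2": "かきくけこ", "3": "さしすせそ",
    "4": "たちつてと", "5": "なにぬねの", "6": "はひふへほ",
    "7": "まみむめも", "8": "やゆよ", "9": "らりるれろ",
    "0": "わをん",
}


def multitap(groups):
    """Convert groups of repeated key presses into kana.

    Each group is a string of one digit repeated n times; the digit
    picks the column and n picks the position within it, wrapping
    around after the last kana.
    """
    kana = []
    for group in groups:
        digit = group[0]
        if any(ch != digit for ch in group):
            raise ValueError(f"mixed digits in group {group!r}")
        column = KEYPAD[digit]
        kana.append(column[(len(group) - 1) % len(column)])
    return "".join(kana)


word = multitap(["5555", "22222"])  # four presses of 5, five of 2
print(len(word))
```

A real IME would pass the resulting kana reading on to the kana-kanji conversion step, which then offers candidate kanji.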

There are two main systems for the romanization of Japanese, known as Kunrei-shiki and Hepburn; in practice, "keyboard romaji" (also known as wāpuro rōmaji or "word processor romaji") generally allows a loose combination of both. IME implementations may even handle keys for letters unused in any romanization scheme, such as L, converting them to the most appropriate equivalent. With kana input, each key on the keyboard directly corresponds to one kana. The JIS keyboard system is the national standard, but there are alternatives, like the thumb-shift keyboard, commonly used among professional typists.
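The "loose combination" of romanization systems can be illustrated with a table that maps both spellings of a syllable to the same kana. This is a minimal sketch covering only a handful of syllables; the table `ROMAJI` and the function `romaji_to_kana` are written for this example and do not reproduce any actual IME.

```python
# Both Kunrei-shiki (si, ti, tu, hu) and Hepburn (shi, chi, tsu, fu)
# spellings map to the same kana.
ROMAJI = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "si": "し", "shi": "し", "su": "す", "se": "せ", "so": "そ",
    "ta": "た", "ti": "ち", "chi": "ち", "tu": "つ", "tsu": "つ",
    "te": "て", "to": "と",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "ha": "は", "hi": "ひ", "hu": "ふ", "fu": "ふ", "he": "へ", "ho": "ほ",
}


def romaji_to_kana(text):
    """Greedy longest-match conversion of lowercase romaji to hiragana."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in ROMAJI:
                out.append(ROMAJI[chunk])
                i += size
                break
        else:
            # A lone "n" (before a consonant or at the end) is the
            # moraic nasal; anything else is passed through unchanged.
            out.append("ん" if text[i] == "n" else text[i])
            i += 1
    return "".join(out)


same = romaji_to_kana("sushi") == romaji_to_kana("susi")
print(same)
```

Typing either spelling therefore produces the same reading, which is then handed to kana-kanji conversion.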

Direction of text

LibreOffice Writer supports a downward (vertical) text option.

Japanese can be written in two directions. Yokogaki style writes left-to-right, top-to-bottom, as with English. Tategaki style writes first top-to-bottom, and then moves right-to-left.
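The tategaki reading order can be made concrete by laying a string out on a character grid. The following sketch is only an illustration of the ordering (it ignores punctuation rotation, line-breaking rules and mixed-width text); the function name `tategaki` and the fixed column height are assumptions.

```python
IDEOGRAPHIC_SPACE = "\u3000"  # full-width space used as padding


def tategaki(text, height):
    """Return the rows of a grid in which text runs top-to-bottom in
    columns of `height` characters, with columns ordered right-to-left."""
    columns = [text[i:i + height] for i in range(0, len(text), height)]
    rows = []
    for r in range(height):
        # The first column must appear rightmost, so iterate reversed.
        row = "".join(
            col[r] if r < len(col) else IDEOGRAPHIC_SPACE
            for col in reversed(columns)
        )
        rows.append(row)
    return rows


layout = tategaki("あいうえおか", 3)
print(len(layout), [len(row) for row in layout])
```

Joining `layout` with newlines and printing it in a monospaced CJK font shows the first three kana running down the right-hand column and the next three down the column to its left.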

To compete with Ichitaro, Microsoft released several updates to early Japanese versions of Microsoft Word that added support for downward text, such as the Word 5.0 Power Up Kit and Word 98. [5] [6]

QuarkXPress was the most popular DTP software in Japan in the 1990s, even though it had a long development cycle. However, because it lacked support for downward text, it was surpassed by Adobe InDesign, which gained strong support for downward text through several updates. [7] [8]

At present, [when?] handling of downward text is incomplete. For example, HTML itself has no support for tategaki, and Japanese users must use HTML tables to simulate it. However, CSS level 3 includes a property "writing-mode" which can render tategaki when given the value "vertical-rl" (i.e. top to bottom, right to left). Word processors and DTP software have more complete support for it.

References

  1. "【やじうまWatch】 ウェブサイトにおける文字コードの割合、UTF-8が90%超え。Shift_JISやEUC-JPは? - INTERNET Watch". INTERNET Watch. 2017-10-17. Retrieved 2019-05-11.
  2. "文字コードについて". ASH Corporation. 2002. Retrieved 2019-05-14.
  3. "Shift_JIS文字を含むソースコードをgccでコンパイル後、警告メッセージが表示される". Novell. 2006-02-10. Retrieved 2019-05-14.
  4. 兵ちゃん (2016-02-18). "住基ネット統一文字コードによる外字の統一について". Archived from the original on 2020-08-02. Retrieved 2019-05-14.
  5. "ASCII EXPRESS : マイクロソフトが「Access」と「Word 5.0 Power Up Kit」を発売". ASCII. 18 (1). 1994.
  6. "Microsoft Office 97 Powered by Word 98 製品情報". Microsoft. 2001-08-01. Archived from the original on 2001-08-01. Retrieved 2019-05-14.
  7. エディット-U. "DTPって何よ(4) [編集って何よ]". Retrieved 2019-05-14.
  8. "アンチQuarkユーザーが気になるQuarkXPress 8の機能トップ10(3) 縦書きの組版が面倒だったけどどうなのよ?". MyNavi News. 2008-07-04. Retrieved 2019-05-14.