Halfwidth and fullwidth forms

Last updated
A command prompt (cmd.exe) with Korean localisation, showing halfwidth and fullwidth characters Command Prompt on Windows XP (Korean).png
A command prompt (cmd.exe) with Korean localisation, showing halfwidth and fullwidth characters

In CJK (Chinese, Japanese, and Korean) computing, graphic characters are traditionally classed into fullwidth [lower-alpha 1] and halfwidth [lower-alpha 2] characters. Unlike monospaced fonts, a halfwidth character occupies half the width of a fullwidth character, hence the name.

Contents

Halfwidth and Fullwidth Forms is also the name of a Unicode block U+FF00FFEF, provided so that older encodings containing both halfwidth and fullwidth characters can have lossless translation to and from Unicode.

Rationale

Characters which appear in both JIS X 0201 (single byte) and JIS X 0208 / JIS X 0213 (double byte) have both a halfwidth and a fullwidth form in Shift JIS. Alternative names of JIS X 0213.svg
Characters which appear in both JIS X 0201 (single byte) and JIS X 0208 / JIS X 0213 (double byte) have both a halfwidth and a fullwidth form in Shift JIS.

In the days of text mode computing, Western characters were normally laid out in a grid on the screen, often 80 columns by 24 or 25 lines. Each character was displayed as a small dot matrix, often about 8 pixels wide, and a SBCS (single-byte character set) was generally used to encode characters of Western languages.

For aesthetic reasons and readability, it is preferable for Chinese characters to be approximately square-shaped, therefore twice as wide as these fixed-width SBCS characters. As these were typically encoded in a DBCS (double-byte character set), this also meant that their width on screen in a duospaced font was proportional to their byte length. Some terminals and editing programs could not deal with double-byte characters starting at odd columns, only even ones (some could not even put double-byte and single-byte characters in the same line). So the DBCS sets generally included Roman characters and digits also, for use alongside the CJK characters in the same line.

On the other hand, early Japanese computing used a single-byte code page called JIS X 0201 for katakana. These would be rendered at the same width as the other single-byte characters, making them half-width kana characters rather than normally proportioned kana. Although the JIS X 0201 standard itself did not specify half-width display for katakana, this became the visually distinguishing feature in Shift JIS between the single-byte JIS X 0201 and double-byte JIS X 0208 katakana. Some IBM code pages used a similar treatment for Korean jamo, [1] based on the N-byte Hangul code and its EBCDIC translation.

In Unicode

For compatibility with existing character sets that contained both half- and fullwidth versions of the same character, Unicode allocated a single block at U+FF00FFEF containing the necessary "alternative width" characters. This includes a fullwidth version of all the ASCII characters and some non-ASCII punctuation such as the Yen sign, halfwidth versions of katakana and hangul, and halfwidth versions of some other symbols such as circles. Only characters needed for lossless round trip to existing character sets were allocated, rather than (for instance) making a fullwidth version of every Latin accented character.

Unicode assigns every code point an "East Asian width" property. This may be: [2]

Unicode character properties based on width
AbbreviationNameDescription
WWideNaturally wide character, e.g. Hiragana.
NaNarrowNaturally narrow character, e.g. ISO Basic Latin alphabet.
FFullwidthWide variant with compatibility normalisation to naturally narrow character, e.g. fullwidth Latin script.
HHalfwidthNarrow variant with compatibility normalisation to naturally wide character, e.g. half-width kana. Includes U+20A9 () as an exception.
AAmbiguousCharacters included in East Asian DBCS codes but also in European SBCS codes, e.g. Greek alphabet. Duospaced behaviour can consequently vary.
NNeutralCharacters which do not appear in East Asian DBCS codes, e.g. Devanagari.

Terminal emulators can use this property to decide whether a character should consume one or two "columns" when figuring out tabs and cursor position.

In OpenType

OpenType has the fwid, halt, hwid, and vhal feature tags to be used to reproduce fullwidth or halfwidth form of a character. CSS provides control over these features using font-variant-east-asian and font-feature-settings properties. [3]

See also

Notes

  1. In Taiwan and Hong Kong: 全形; in CJK: 全角.
  2. In Taiwan and Hong Kong: 半形; in CJK: 半角.

Related Research Articles

<span class="mw-page-title-main">Unicode</span> Character encoding standard

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.

Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.

<span class="mw-page-title-main">CJK characters</span> Logographs in shared East Asian written tradition

In internationalization, CJK characters is a collective term for the Chinese, Japanese, and Korean languages, all of which include Chinese characters and derivatives in their writing systems, sometimes paired with other scripts. Collectively, the CJK characters often include Hànzì in Chinese, Kanji and Kana in Japanese, and Hanja and Hangul in Korean. Vietnamese can be included, making the abbreviation CJKV, as Vietnamese historically used Chinese characters known as chữ Hán and chữ Nôm in Vietnamese.

The yen and yuan sign (¥) is a currency sign used for the Japanese yen and the Chinese yuan currencies when writing in Latin scripts. This character resembles a capital letter Y with a single or double horizontal stroke. The symbol is usually placed before the value it represents, for example: ¥50, or JP¥50 and CN¥50 when disambiguation is needed. When writing in Japanese and Chinese, the Japanese kanji and Chinese character is written following the amount, for example 50円 in Japan, and 50元 or 50圆 in China.

A double-byte character set (DBCS) is a character encoding in which either all characters are encoded in two bytes, or merely every graphic character not representable by an accompanying single-byte character set (SBCS) is encoded in two bytes. A DBCS supports national languages that contain many unique characters or symbols. Examples of such languages include Japanese and Chinese. Korean Hangul does not contain as many characters, but KS X 1001 supports both Hangul and Hanja, and uses two bytes per character.

In computing, JIS encoding refers to several Japanese Industrial Standards for encoding the Japanese language. Strictly speaking, the term means either:

Shift JIS is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese (characters).

GB/T 2312-1980 is a key official character set of the People's Republic of China, used for Simplified Chinese characters. GB2312 is the registered internet name for EUC-CN, which is its usual encoded form. GB refers to the Guobiao standards (国家标准), whereas the T suffix denotes a non-mandatory standard.

<span class="mw-page-title-main">Chinese Character Code for Information Interchange</span> Character encoding standard

The Chinese Character Code for Information Interchange or CCCII is a character set developed by the Chinese Character Analysis Group in Taiwan. It was first published in 1980, and significantly expanded in 1982 and 1987.

In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE represents a blank space punctuation character in text, used as a word divider in Western scripts.

<span class="mw-page-title-main">Unified Hangul Code</span> Windows character encoding for Korean

Unified Hangul Code (UHC), or Extended Wansung, also known under Microsoft Windows as Code Page 949, is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code to include all 11172 non-partial Hangul syllables present in Johab. This corresponds to the pre-composed syllables available in Unicode 2.0 and later.

<span class="mw-page-title-main">JIS X 0201</span> Japanese single byte character encoding

JIS X 0201, a Japanese Industrial Standard developed in 1969, was the first Japanese electronic character set to become widely used. The character set was initially known as JIS C 6220 before the JIS category reform. Its two forms were a 7-bit encoding or an 8-bit encoding, although the 8-bit form was dominant until Unicode replaced it. The full name of this standard is 7-bit and 8-bit coded character sets for information interchange (7ビット及び8ビットの情報交換用符号化文字集合).

Half-width kana are katakana characters displayed compressed at half their normal width, instead of the usual square (1:1) aspect ratio. For example, the usual (full-width) form of the katakana ka is カ while the half-width form is カ. Half-width hiragana is not included in Unicode, although it is usable on Web or in e-books via CSS's font-feature-settings: "hwid" 1 with Adobe-Japan1-6 based OpenType fonts. Half-width kanji is not usable on modern computers, but is used in some receipt printers, electric bulletin board and old computers.

New Gulim (새굴림/SaeGulRim) is a sans-serif type Unicode font designed especially for the Korean-language script, designed by HanYang System Co., Limited. It is an expanded version of Hanyang Gulrim.

KPS 9566 is a North Korean standard specifying a character encoding for the Chosŏn'gŭl (Hangul) writing system used for the Korean language. The edition of 1997 specified an ISO 2022-compliant 94×94 two-byte coded character set. Subsequent editions have added additional encoded characters outside of the 94×94 plane, in a manner comparable to UHC or GBK.

KS X 1001, "Code for Information Interchange ", formerly called KS C 5601, is a South Korean coded character set standard to represent hangul and hanja characters on a computer.

Halfwidth and Fullwidth Forms is the name of a Unicode block U+FF00–FFEF, provided so that older encodings containing both halfwidth and fullwidth characters can have lossless translation to/from Unicode. It is the second-to-last block of the Basic Multilingual Plane, followed only by the short Specials block at U+FFF0–FFFF. Its block name in Unicode 1.0 was Halfwidth and Fullwidth Variants.

Several mutually incompatible versions of the Extended Binary Coded Decimal Interchange Code (EBCDIC) have been used to represent the Japanese language on computers, including variants defined by Hitachi, Fujitsu, IBM and others. Some are variable-width encodings, employing locking shift codes to switch between single-byte and double-byte modes. Unlike other EBCDIC locales, the lowercase basic Latin letters are often not preserved in their usual locations.

References

  1. "ICU Demonstration - Converter Explorer". demo.icu-project.org. Retrieved 7 May 2018.
  2. Lunde, Ken (2019-01-25). "Unicode® Standard Annex #11: East Asian Width". Unicode Consortium.
  3. "Syntax for OpenType features in CSS". Adobe . Retrieved 2023-09-20.