Halfwidth and fullwidth forms

A command prompt (cmd.exe) with Korean localisation, showing halfwidth and fullwidth characters

In CJK (Chinese, Japanese, and Korean) computing, graphic characters are traditionally classed into fullwidth [lower-alpha 1] and halfwidth [lower-alpha 2] characters. Unlike in a monospaced font, where every character has the same width, a halfwidth character occupies half the width of a fullwidth character, hence the name.


Halfwidth and Fullwidth Forms is also the name of a Unicode block, U+FF00–FFEF, provided so that older encodings containing both halfwidth and fullwidth characters can have lossless translation to and from Unicode.

Rationale

Characters which appear in both JIS X 0201 (single byte) and JIS X 0208 / JIS X 0213 (double byte) have both a halfwidth and a fullwidth form in Shift JIS.

In the days of text mode computing, Western characters were normally laid out in a grid on the screen, often 80 columns by 24 or 25 lines. Each character was displayed as a small dot matrix, often about 8 pixels wide, and a single-byte character set (SBCS) was generally used to encode characters of Western languages.

For aesthetic reasons and readability, it is preferable for Chinese characters to be approximately square-shaped, and therefore twice as wide as these fixed-width SBCS characters. As these were typically encoded in a double-byte character set (DBCS), this also meant that their width on screen in a duospaced font was proportional to their byte length. Some terminals and editing programs could not handle double-byte characters starting at odd columns, only at even ones (some could not even mix double-byte and single-byte characters on the same line). The DBCS sets therefore generally also included Roman characters and digits, for use alongside the CJK characters on the same line.

On the other hand, early Japanese computing used a single-byte code page called JIS X 0201 for katakana. These would be rendered at the same width as the other single-byte characters, making them half-width kana characters rather than normally proportioned kana. Although the JIS X 0201 standard itself did not specify half-width display for katakana, this became the visually distinguishing feature in Shift JIS between the single-byte JIS X 0201 and double-byte JIS X 0208 katakana. Some IBM code pages used a similar treatment for Korean jamo, [1] based on the N-byte Hangul code and its EBCDIC translation.
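The relationship between byte length and display width described above can be checked with a short Python sketch. It assumes that Python's built-in shift_jis codec is an acceptable stand-in for the Shift JIS variants discussed here; the helper name sjis_byte_length is hypothetical and only used for illustration.

```python
def sjis_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in Shift JIS.

    Assumption: Python's built-in "shift_jis" codec is used; other
    Shift JIS variants (e.g. cp932) may differ for some characters.
    """
    return len(ch.encode("shift_jis"))


samples = {
    "Latin A (single-byte set)": "A",
    "fullwidth Latin A (JIS X 0208)": "\uff21",
    "halfwidth katakana ka (JIS X 0201)": "\uff76",
    "fullwidth katakana ka (JIS X 0208)": "\u30ab",
}

for label, ch in samples.items():
    print(f"{label}: {sjis_byte_length(ch)} byte(s)")
```

In a duospaced terminal font, the one-byte characters would typically take one column and the two-byte characters two columns.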

In Unicode

For compatibility with existing character sets that contained both half- and fullwidth versions of the same character, Unicode allocated a single block at U+FF00–FFEF containing the necessary "alternative width" characters. This includes a fullwidth version of all the ASCII characters and some non-ASCII punctuation such as the yen sign, halfwidth versions of katakana and hangul, and halfwidth versions of some other symbols such as circles. Only characters needed for lossless round trip to existing character sets were allocated, rather than (for instance) making a fullwidth version of every Latin accented character.
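As an illustration of how these compatibility characters relate to their ordinary counterparts, the following Python sketch converts between the two. It relies on two facts: the fullwidth forms of U+0021 to U+007E sit at U+FF01 to U+FF5E, a fixed offset of 0xFEE0, and NFKC compatibility normalisation maps the alternative-width characters back to their ordinary forms. The function names to_fullwidth and fold_width are hypothetical. Note that NFKC also folds many other compatibility characters, so it is broader than a pure width conversion.

```python
import unicodedata

# Offset between U+0021..U+007E and their fullwidth forms U+FF01..U+FF5E.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text: str) -> str:
    """Map ASCII graphic characters (except space) to their fullwidth forms."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            out.append(chr(cp + FULLWIDTH_OFFSET))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


def fold_width(text: str) -> str:
    """Fold alternative-width characters to ordinary forms using NFKC.

    Caveat: NFKC also rewrites other compatibility characters
    (ligatures, superscripts, and so on), not only width variants.
    """
    return unicodedata.normalize("NFKC", text)


print(to_fullwidth("ABC123"))
print(fold_width("\uff21\uff22\uff23"))  # fullwidth ABC
print(fold_width("\uff76"))              # halfwidth katakana ka
```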

Unicode assigns every code point an "East Asian width" property. This may be: [2]

Unicode character properties based on width

Abbreviation  Name       Description
W             Wide       Naturally wide character, e.g. Hiragana.
Na            Narrow     Naturally narrow character, e.g. ISO Basic Latin alphabet.
F             Fullwidth  Wide variant with compatibility normalisation to naturally narrow character, e.g. fullwidth Latin script.
H             Halfwidth  Narrow variant with compatibility normalisation to naturally wide character, e.g. half-width kana. Includes U+20A9 (₩) as an exception.
A             Ambiguous  Characters included in East Asian DBCS codes but also in European SBCS codes, e.g. Greek alphabet. Duospaced behaviour can consequently vary.
N             Neutral    Characters which do not appear in East Asian DBCS codes, e.g. Devanagari.

Terminal emulators can use this property to decide whether a character should consume one or two "columns" when computing tab stops and cursor position.
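A minimal sketch of such a column calculation in Python follows, using the standard library function unicodedata.east_asian_width. The helper name display_columns is hypothetical, and the policy is an assumption: W and F count as two columns, everything else (including Ambiguous) as one, and combining marks as zero. Real terminal emulators may treat Ambiguous characters and control characters differently.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate how many terminal columns a string occupies.

    Assumed policy: Wide (W) and Fullwidth (F) characters take two
    columns; all other classes, including Ambiguous (A), take one.
    Combining marks are counted as zero columns.
    """
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the preceding character
        width_class = unicodedata.east_asian_width(ch)
        columns += 2 if width_class in ("W", "F") else 1
    return columns


for sample in ["abc", "\u65e5\u672c\u8a9e", "\uff76\uff85", "\uff21\uff22"]:
    print(repr(sample), display_columns(sample))
```

Counting Ambiguous characters as one column matches a non-East-Asian context; a terminal configured for a CJK locale might choose two instead.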

In OpenType

OpenType provides the fwid, halt, hwid, and vhal feature tags for selecting the fullwidth or halfwidth form of a character. CSS provides control over these features through the font-variant-east-asian and font-feature-settings properties. [3]

Notes

  1. In Taiwan and Hong Kong: 全形; in CJK: 全角.
  2. In Taiwan and Hong Kong: 半形; in CJK: 半角.

References

  1. "ICU Demonstration - Converter Explorer". demo.icu-project.org. Retrieved 7 May 2018.
  2. Lunde, Ken (2019-01-25). "Unicode® Standard Annex #11: East Asian Width". Unicode Consortium.
  3. "Syntax for OpenType features in CSS". Adobe. Retrieved 2023-09-20.