Chinese character description languages


Chinese character description languages are proposed languages for describing Chinese (or CJK) characters in terms of their components, their basic and complex strokes, the stroke order, and the position of each element within an empty background square. They are designed to supply the structural information that a bitmap description lacks. This enriched information can be used to identify variants of characters that are unified into one code point by Unicode and ISO/IEC 10646, and to represent rare characters that do not yet have a standardized encoding in Unicode or ISO/IEC 10646. Many aim to work for both Kaishu and Song styles, and to expose each character's internal structure, which makes look-up easier by indexing a character's make-up and cross-referencing similar characters.


CDL

Character Description Language (CDL) is an XML-based declarative language co-created by Tom Bishop and Richard Cook for the Wenlin Institute. It defines characters by the arrangement of components, which are not required to reflect the semantic or etymological history of the character. Each component must fit into its allotted portion of the whole character's square. A set of fewer than 50 strokes allows one to construct approximately 1,000 components, which may in turn describe tens of thousands of characters. [1]

Ideographic Description Sequences

Chapter 12 of the Unicode specification [2] defines the "Ideographic Description Sequences" (IDS) syntax, which describes characters in featural terms, as arrangements of components that have their own code points. Sixteen special characters in the range U+2FF0 to U+2FFF act as prefix operators that combine other characters or sequences to form larger characters.

Ideographic Description Characters in Unicode
Character   Unicode Character Number   Full Unicode Name
⿰          U+2FF0                     Ideographic description character left to right
⿱          U+2FF1                     Ideographic description character above to below
⿲          U+2FF2                     Ideographic description character left to middle and right
⿳          U+2FF3                     Ideographic description character above to middle and below
⿴          U+2FF4                     Ideographic description character full surround
⿵          U+2FF5                     Ideographic description character surround from above
⿶          U+2FF6                     Ideographic description character surround from below
⿷          U+2FF7                     Ideographic description character surround from left
⿼          U+2FFC                     Ideographic description character surround from right
⿸          U+2FF8                     Ideographic description character surround from upper left
⿹          U+2FF9                     Ideographic description character surround from upper right
⿺          U+2FFA                     Ideographic description character surround from lower left
⿽          U+2FFD                     Ideographic description character surround from lower right
⿻          U+2FFB                     Ideographic description character overlaid
⿾          U+2FFE                     Ideographic description character horizontal reflection
⿿          U+2FFF                     Ideographic description character rotation
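
Because the operators are written in prefix position, a sequence can be read into a tree with a short recursive routine. The following Python sketch illustrates the idea; it is not a reference implementation. The operand counts are assumptions inferred from the operator names in the table (U+2FF2 and U+2FF3 take three operands, U+2FFE and U+2FFF take one, the rest take two) and should be checked against the current Unicode core specification. The names `Node`, `arity`, and `parse_ids` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed operand counts for the operators in U+2FF0..U+2FFF.
TERNARY = {"\u2FF2", "\u2FF3"}   # left/middle/right, above/middle/below
UNARY = {"\u2FFE", "\u2FFF"}     # horizontal reflection, rotation
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFE)} - TERNARY


@dataclass
class Node:
    """An operator applied to its operands (hypothetical tree type)."""
    op: str
    children: List["Tree"]


Tree = Union[str, Node]


def arity(ch: str) -> int:
    """Return how many operands ch takes; 0 means ch is a plain component."""
    if ch in TERNARY:
        return 3
    if ch in BINARY:
        return 2
    if ch in UNARY:
        return 1
    return 0


def _parse(ids: str, pos: int) -> Tuple[Tree, int]:
    if pos >= len(ids):
        raise ValueError("sequence ends before all operands are supplied")
    ch = ids[pos]
    pos += 1
    count = arity(ch)
    if count == 0:
        return ch, pos
    children: List[Tree] = []
    for _ in range(count):
        child, pos = _parse(ids, pos)
        children.append(child)
    return Node(ch, children), pos


def parse_ids(ids: str) -> Tree:
    """Parse one complete sequence; reject leftover characters."""
    tree, pos = _parse(ids, 0)
    if pos != len(ids):
        raise ValueError("unexpected characters after a complete sequence")
    return tree


if __name__ == "__main__":
    print(parse_ids("⿱艹⿰日月"))
```

Python strings are indexed by code point, so components outside the Basic Multilingual Plane are handled as single characters without extra work.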

Two related characters are encoded in other Unicode blocks. U+31EF ㇯ IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION is located in the CJK Strokes block. U+303E 〾 IDEOGRAPHIC VARIATION INDICATOR is not officially an ideographic description character, but is sometimes used in ideographic description sequences.

Other related Ideographic Description Characters in Unicode
Character   Unicode Character Number   Block                         Full Unicode Name
〾          U+303E                     CJK Symbols and Punctuation   Ideographic variation indicator
㇯          U+31EF                     CJK Strokes                   Ideographic description character subtraction

These sequences are useful for describing to the reader a character that is not directly printable, either because it is absent from a given font or because it is absent from the Unicode standard altogether. For example, the Sawndip character encoded in CJK Unified Ideographs Extension F as U+2DA21 𭨡 can be described as ⿰書史. Another use is dictionary lookup, where a sequence serves as a rough input method for queries.
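
The dictionary-lookup use can be sketched in Python as a search for every character whose description contains a given component, directly or through nested components. The table `SAMPLE_IDS` is a small illustrative assumption, not an authoritative decomposition database, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Operators in U+2FF0..U+2FFF are layout marks, not components.
IDC = {chr(cp) for cp in range(0x2FF0, 0x3000)}

# Illustrative data only: a real lookup tool would load a full IDS database.
SAMPLE_IDS: Dict[str, str] = {
    "明": "⿰日月",
    "萌": "⿱艹明",
    "林": "⿰木木",
    "森": "⿱木林",
    "好": "⿰女子",
}


def components(char: str, table: Dict[str, str],
               _seen: Optional[Set[str]] = None) -> Set[str]:
    """Collect every component of char, expanding nested descriptions."""
    seen = set() if _seen is None else _seen
    found: Set[str] = set()
    for part in table.get(char, ""):
        # Skip operators, self-references, and parts already expanded.
        if part in IDC or part == char or part in seen:
            continue
        seen.add(part)
        found.add(part)
        found |= components(part, table, seen)
    return found


def find_containing(component: str, table: Dict[str, str]) -> List[str]:
    """Return the characters in table that contain component, sorted."""
    return sorted(ch for ch in table if component in components(ch, table))


if __name__ == "__main__":
    print(find_containing("木", SAMPLE_IDS))
    print(find_containing("日", SAMPLE_IDS))
```

Expanding nested descriptions lets a query for 日 also match 萌, whose own description mentions only 艹 and 明.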

These sequences can be rendered either by displaying the individual characters separately or by parsing the Ideographic Description Sequence and drawing the ideograph it describes. [3] They do not, by themselves, provide unambiguous rendering for all characters. For instance, the sequence ⿱十一 represents both 土 ('earth'), in which the upper horizontal stroke is shorter than the lower one, and 士 ('scholar'), in which it is longer.

Unicode's specification for these sequences is based on the characters and syntax of the earlier GBK standard. Additional symbols were later encoded to fill in the missing combinations.

The IDSgrep free software package by Matthew Skala [4] [5] extends Unicode's IDS syntax to include additional features for dictionary lookup; it is capable of converting KanjiVG's database to its own extended IDS format, or of searching EIDS files generated by the related Tsukurimashou font family.


Notes

  1. Bishop & Cook (2013-12-31), pp. 2, 9.
  2. "Ideographic Description Characters" (PDF). The Unicode Standard, Version 6.0. Mountain View, CA: The Unicode Consortium. February 2011. pp. 409–412. Archived (PDF) from the original on 2024-01-18.
  3. "The Unicode® Standard – Version 12.0 – Core Specification" (PDF). Unicode Consortium. March 2019. p. 26. Archived (PDF) from the original on 2023-06-02.
  4. "IDSgrep". Tsukurimashou Project. 2024-01-31. Archived from the original on 2024-02-07.
  5. Skala, Matthew (2015). "A Structural Query System for Han Characters" (PDF). International Journal of Asian Language Processing. 23 (2): 127–159. arXiv:1404.5585. Archived from the original (PDF) on 2016-03-04. Retrieved 2016-01-13.

