Chinese computational linguistics

Chinese computational linguistics is a subset of computational linguistics; it is the scientific study and information processing of the Chinese language by means of computers. The purpose is to obtain a better understanding of how the language works and to bring more convenience to language applications. The term Chinese computational linguistics is often employed interchangeably with Chinese information processing, though the former may sound more theoretical while the latter more technical. [1]

Rather than introducing computational linguistics in a general sense, this article focuses on the issues that are specific to computer processing of Chinese, as compared with other languages. The contents include Chinese character information processing, word segmentation, proper noun recognition, natural language understanding and generation, corpus linguistics, and machine translation. [1]

Chinese character information processing

Chinese character information technology (IT) is the technology of computer processing of Chinese characters. While the English writing system uses a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the Xinhua Dictionary. [2] In the Unicode multilingual character set of 149,813 characters, 98,682 (about two-thirds) are Chinese characters. [3] This makes computer processing of Chinese characters more demanding than that of any other writing system.

Chinese character input

Computer input of Chinese characters is more complicated than input for languages with simpler character systems. For example, the English language is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Theoretically, Chinese characters could be input in a similar way, but this approach is impractical for most applications because of the number of characters: it would require a massive keyboard with thousands of keys, and the user would find it difficult and time-consuming to locate individual characters on it. [4] An alternative method is to use the English keyboard layout and encode each Chinese character as a sequence of English letters; this is the predominant method of Chinese character input today.

Sound-based encoding is normally based on an existing Latin character scheme for Chinese phonetics, such as the Pinyin Scheme for Mandarin Chinese or Putonghua, and the Jyutping Scheme for the Cantonese dialect. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua Pinyin input code of 香港 (Hong Kong) is "xianggang" or "xiang1gang3", and the Cantonese Jyutping code is "hoenggong" or "hoeng1gong2", all of which can be easily input via an English keyboard.
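The relation between toned and non-toned input codes can be sketched in a few lines of Python. This is only an illustration using the two example codes above; the function names `split_syllables` and `strip_tones` are hypothetical, and splitting relies on the tone digits being present.

```python
import re

def split_syllables(toned_code: str) -> list[str]:
    """Split a toned input code into syllables, using the tone digit as delimiter."""
    return re.findall(r"[a-z]+[1-6]", toned_code)

def strip_tones(toned_code: str) -> str:
    """Drop the optional tone digits to obtain the non-toned input code."""
    return re.sub(r"[1-6]", "", toned_code)

print(split_syllables("xiang1gang3"))  # ['xiang1', 'gang3']
print(strip_tones("xiang1gang3"))      # xianggang
print(strip_tones("hoeng1gong2"))      # hoenggong
```

Without tone digits the syllable boundaries are no longer marked in the code itself, so a real input method has to recover them from its syllable inventory.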

A Chinese character can alternatively be input by form-based encoding. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. There are a few hundred basic components, [5] far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the designer of an input method obtains a letter string that can be used as an input code on the English keyboard. The designer can also define a rule for selecting representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 (border) is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted. Popular form-based encoding methods include Wubi (五笔) in the Mainland and Cangjie (仓颉) in Taiwan and Hong Kong. [6]
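As a minimal sketch of the component-to-letter step, the following Python snippet rebuilds the "NGMWM" code from the component sequence given above. The table covers only the four components in that example, and the function name `form_code` is hypothetical; a real Cangjie implementation also needs the full key table and the rules for decomposing characters and omitting components.

```python
# Letter assignments for the components in the example only.
COMPONENT_TO_KEY = {"弓": "N", "土": "G", "一": "M", "田": "W"}

def form_code(components: str) -> str:
    """Map a sequence of components, in writing order, to an input code."""
    return "".join(COMPONENT_TO_KEY[c] for c in components)

print(form_code("弓土一田一"))  # NGMWM
```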

The most important feature of intelligent input is the application of contextual constraints for candidate character selection. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", the result is "大学教授 / 大學教授" (university professor); when the user types "daxuepiaopiao", the computer suggests "大雪飘飘 / 大雪飄飄" (heavy snow flying). Though the non-toned Pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words. [7]
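The idea of context-based selection can be illustrated with a toy model. The sketch below is not how Microsoft Pinyin works internally; the candidate table, the co-occurrence scores, and the function name `convert` are all hypothetical, and the input is assumed to be already split into word-level codes.

```python
from itertools import product

# Hypothetical candidate table: non-toned Pinyin code -> possible words.
CANDIDATES = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}

# Hypothetical co-occurrence scores for adjacent word pairs.
PAIR_SCORE = {("大学", "教授"): 0.9, ("大雪", "飘飘"): 0.8}

def convert(codes: list[str]) -> str:
    """Choose the candidate sequence whose adjacent pairs score highest."""
    best = max(
        product(*(CANDIDATES[c] for c in codes)),
        key=lambda words: sum(PAIR_SCORE.get(p, 0.0) for p in zip(words, words[1:])),
    )
    return "".join(best)

print(convert(["daxue", "jiaoshou"]))  # 大学教授
print(convert(["daxue", "piaopiao"]))  # 大雪飘飘
```

Enumerating every combination is only workable for very short inputs; it is used here to keep the example small.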

Chinese character encoding for information interchange

Inside the computer each character is represented by an internal code. When a character is sent between two machines, it is in information interchange code. Nowadays, information interchange codes, such as ASCII and Unicode, are often directly employed as internal codes.

The first GB Chinese character encoding standard is GB2312, which was released by the PRC in 1980. It includes 6,763 Chinese characters, with 3,755 frequently-used ones sorted by Pinyin, and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters. Traditional characters which have been simplified are not covered. The code of a character is represented by a two-byte hexadecimal number, for instance, the GB codes of 香港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and the WWW, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released. [8] The latest version of GB encoding is GB18030, which supports both simplified and traditional Chinese characters, and is consistent with the Unicode character set. [9]
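The two-byte codes quoted above can be checked with Python's built-in `gb2312` codec, as in the sketch below. The helper name `gb2312_code` is hypothetical. The traditional character 龍 is used to illustrate a character outside GB2312 that GB18030 does cover.

```python
def gb2312_code(ch: str):
    """Return the GB2312 code of a character as hex, or None if not covered."""
    try:
        return ch.encode("gb2312").hex().upper()
    except UnicodeEncodeError:
        return None

for ch in "香港":
    print(ch, gb2312_code(ch))   # 香 CFE3, then 港 B8DB

# The traditional form 龍 is outside GB2312 but can be encoded in GB18030.
print(gb2312_code("龍"))                      # None
print("龍".encode("gb18030").hex().upper())
```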

The Big5 encoding standard was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with none of the simplified characters of the Mainland. Each character is encoded with a two-byte hexadecimal code, for example, 香 (ADBB), 港 (B4E4), 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which include some simplified characters and Hong Kong Cantonese characters. [10]
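The Big5 codes quoted above can be reproduced with Python's built-in `big5` codec. This is a small sketch; the helper name `big5_code` is hypothetical, and the result for the simplified character 龙 depends on the codec's mapping table, so no output is claimed for it.

```python
def big5_code(ch: str):
    """Return the Big5 code of a character as hex, or None if not covered."""
    try:
        return ch.encode("big5").hex().upper()
    except UnicodeEncodeError:
        return None

for ch in "香港龍":
    print(ch, big5_code(ch))   # 香 ADBB, 港 B4E4, 龍 C073

# Simplified 龙 is not part of the original Big5 set described above.
print(big5_code("龙"))
```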

The full version of the Unicode standard represents a character with a 4-byte digital code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is a 2-byte kernel version of Unicode with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5. In Unicode 15.1, there is a multilingual character set of 149,813 characters, among which 98,682 (about two-thirds) are Chinese characters sorted by Kangxi radicals. Even very rarely used characters are available. For example: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5). [11] [12]
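The code points listed above can be inspected directly in Python, where `ord` returns the Unicode code point of a character. The sketch below also reports whether each character lies in the BMP and how many bytes it takes in UTF-8 and in the fixed-width 4-byte form (UTF-32). The helper name `describe` is hypothetical.

```python
def describe(ch: str):
    """Return (hex code point, plane, UTF-8 length, UTF-32 length) for one character."""
    cp = ord(ch)
    plane = "BMP" if cp <= 0xFFFF else "supplementary"
    return f"{cp:04X}", plane, len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))

for ch in "HK香港龍龙龖龘" + chr(0x2A6A5):
    print(ch, *describe(ch))
```

For instance, 香 is reported as 9999 in the BMP with 3 bytes in UTF-8, while the character at 2A6A5 lies outside the BMP and needs 4 bytes in UTF-8.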

Unicode is becoming more and more popular. As of December 2023, UTF-8 (a Unicode encoding) was reported to be used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, putting an end to the confusion between codes. [13]

Chinese character output

Like English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families. [14]

Fonts appear in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called zihao, 字号) invented by an American for Chinese printing in 1859. [15]

Word segmentation

It is straightforward to recognize words in English text because they are separated by spaces. However, Chinese words are not separated by any boundary markers. Hence, word segmentation is the first step for text analysis of Chinese. For example,

中文信息学报 (Chinese original text)
中文 信息 学报 (word-segmented text)
Chinese information journal (word-by-word English translation)
Journal of Chinese Information Processing (English name)

Chinese word segmentation on a computer is carried out by matching characters in the Chinese text against a lexicon (a list of Chinese words), either forward from the beginning of the sentence or backward from the end. There are two kinds of segmentation ambiguities: the intersection type (交集型歧义字段) and the polynomial type (多义型歧义字段). [16]

Typically an intersection ambiguity is in the format of

ABC, where A, AB, BC and C are all words in the lexicon.

It is possible to divide the original character string into word AB followed by C, or A followed by BC. For example ‘美国会’ may mean ‘美 国会’ (the US Parliament) or ‘美国 会’ (the US can/will).
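Lexicon matching in both directions, and the way an intersection ambiguity shows up as a disagreement between them, can be sketched as follows. The greedy longest-match strategy and the four-word lexicon are assumptions chosen for the 美国会 example; the function names are hypothetical.

```python
# Toy lexicon for the ABC pattern: A = 美, AB = 美国, BC = 国会, C = 会.
LEXICON = {"美", "美国", "国会", "会"}
MAX_LEN = max(len(w) for w in LEXICON)

def forward_max_match(text: str) -> list[str]:
    """Scan from the start, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in LEXICON or size == 1:  # unknown single characters pass through
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text: str) -> list[str]:
    """Scan from the end, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = text[j - size:j]
            if piece in LEXICON or size == 1:
                words.append(piece)
                j -= size
                break
    return words[::-1]

print(forward_max_match("美国会"))   # ['美国', '会']
print(backward_max_match("美国会"))  # ['美', '国会']
```

When the two directions disagree, as here, the string is a candidate ambiguity that has to be resolved by other means.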

The most common form of polynomial segmentation ambiguity is AB, where A, B, and AB are all words. That means the character string can be regarded as one single word or be divided into two. For example, string ‘可以’ in the following sentences:

(1) 你 可以 坐下。
    you can sit down.
    You can sit down.

(2) 你 可 以 他们 为 样板。
    you can take them as example.
    You can take them as an example.

Word segmentation ambiguities can be resolved with contextual information, using linguistic rules and the probabilities of word co-occurrence derived from Chinese corpora. Matches on longer words are usually more reliable. The correctness rate of automatic word segmentation has reached 95%. [17] However, 100% correctness cannot be guaranteed in the foreseeable future, because that would require a complete understanding of the text. An alternative solution is to encourage people to write in a word-segmented way, as in English. [18] But that does not mean computer word segmentation will no longer be needed, because even in English, word segmentation is required for speech analysis.

Proper noun recognition

A proper noun is the name of a person, a place, an institution, etc. and is written in English with the initial letter of each word capitalized, for example, ‘Mr. John Nealon’, ‘America’ and ‘Cambridge University’. However, Chinese proper nouns are usually not marked in any style. [19]

Recognition of the names of people and places in Chinese text can be supported by a list of names. However, such a list can never be complete, considering the huge number of places and people all over the world, and the fact that names constantly appear, change and disappear. There are also names that are identical to common nouns. For example, there is a town named 民众 (Minzhong) in southern China, whose name is also a common noun meaning ‘people’. Therefore, recognition of the names of people and places has to make use of their distinguishing features in internal composition and external context. Corpora with annotated proper nouns can also serve as a useful reference. [19]

A person’s name not found in the dictionary can be recognized with a list of surnames and titles, for example ‘张大方先生’ and ‘李经理’, where 张 (Zhang) and 李 (Li) are Chinese surnames, and 先生 (Mr.) and 经理 (Manager) are titles. In 张大方说, 张大方 can be recognized as a person’s name by the rule that a Chinese given name normally follows the surname and consists of one or two characters, together with the fact that people can speak (说).
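The surname-and-context rule can be sketched with a regular expression. The lists below contain only the surnames, titles and verb mentioned above, and the function name `find_person_names` is hypothetical; a practical recognizer would need far larger lists and statistical evidence.

```python
import re

SURNAMES = ["张", "李"]                 # sample surnames from the text
RIGHT_CONTEXT = ["先生", "经理", "说"]   # titles, plus a verb that takes a human subject

# Surname + given name of up to 2 characters, followed by a title or context word.
# Zero given-name characters are allowed so that 李经理 yields the surname alone.
PERSON = re.compile(
    "((?:" + "|".join(SURNAMES) + ")[\u4e00-\u9fff]{0,2})"
    "(?=" + "|".join(RIGHT_CONTEXT) + ")"
)

def find_person_names(text: str) -> list[str]:
    """Return name candidates that are followed by a known context word."""
    return PERSON.findall(text)

print(find_person_names("张大方先生和李经理"))  # ['张大方', '李']
print(find_person_names("张大方说"))            # ['张大方']
```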

Place names also have characteristics useful for computer recognition. For example, in ‘在广东省中山市民众镇’, the component words 省 (province), 市 (city) and 镇 (town) are end markers of place names, while 在 (in, at, on) is a preposition that frequently appears in front of a location.
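A minimal sketch of this marker-based approach is given below. It uses only the preposition and the three end markers from the example; the function name `extract_places` is hypothetical, and such a naive pattern would misfire on ordinary words that happen to contain a marker character.

```python
import re

END_MARKERS = "省市镇"   # province, city, town
UNIT = "[\u4e00-\u9fff]+?[" + END_MARKERS + "]"

# A location phrase: the preposition 在 followed by one or more marked units.
LOCATION = re.compile("在((?:" + UNIT + ")+)")

def extract_places(text: str) -> list[str]:
    """Split the phrase after 在 into place names ending in a marker."""
    match = LOCATION.search(text)
    return re.findall(UNIT, match.group(1)) if match else []

print(extract_places("在广东省中山市民众镇"))  # ['广东省', '中山市', '民众镇']
```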

The correctness rate of computer recognition has reached around 90% for persons’ names and 95% for place names. [17]


    References

    Citations

    1. Zhang 2016, p. 420.
    2. Language Institute 2020.
    3. "Unicode Statistics". www.unicode.org. Retrieved 2023-12-08.
    4. Su 2014, p. 218.
    5. National Language Commission 1997.
    6. Zhang 2016, p. 422.
    7. Su 2014, p. 222.
    8. Su 2014, pp. 213–215.
    9. Lunde, Ken (4 August 2022). "The GB 18030-2022 Standard". Medium. Retrieved 7 August 2022.
    10. "[chinese mac] Character Sets". chinesemac.org. Retrieved 2023-11-24.
    11. "Unicode Statistics".
    12. Unicode Consortium 2023.
    13. "Usage Statistics and Market Share of UTF-8 for Websites, December 2023". w3techs.com. Retrieved 2023-12-08.
    14. Li 2013, p. 62.
    15. Zhang 2006.
    16. Liu 2000, pp. 58–61.
    17. Xu 2006.
    18. Zhang 1998.
    19. Zhang 2016, p. 427.

    Works cited

    • Fromkin, Victoria; Rodman, Robert (1993). An Introduction to Language (5th ed.). New York: HBJ. ISBN 0-03-075379-1.
    • Language Institute, Chinese Academy of Social Sciences (2020). 新华字典 (Xinhua Dictionary ) (in Chinese) (12th ed.). Beijing: Commercial Press. ISBN   978-7-100-17093-2.
    • Li, Dasui 李大遂 (2013). 简明实用汉字学[Concise and Practical Chinese Characters] (in Chinese) (3rd ed.). Beijing: Peking University Press. ISBN   978-7-301-21958-4.
    • Liu, Kaiying 刘开瑛 (2000). 中文文本自动分词和标注 [Automatic word segmentation and annotation of Chinese text] (in Chinese). Beijing: Commercial Press. ISBN 7-100-03068-4.
    • National Language Commission (1997). Chinese Character Component Standard of GB13000.1 Character Set for Information Processing (PDF). Beijing: National Language Commission of China.
    • Su, Peicheng 苏培成 (2014). 现代汉字学纲要[Essentials of Modern Chinese Characters] (in Chinese) (3rd ed.). Beijing: 商务印书馆 (The Commercial Press, Shangwu). ISBN   978-7-100-10440-1.
    • Unicode Consortium (2023). Unicode Standard, Version 15.1.0. Mountain View, CA: Unicode Consortium.
    • Xu, Jialu (and Fu Yonghe) (2006). 中文信息处理现代汉语词汇研究[Morphological Studies in Modern Chinese Information Processing] (in Chinese). Guangzhou: 广东教育出版社 (Guangdong Education Press).
    • Zhang, Xiaoheng (1998). "也谈汉语书面语的分词问题 -- 分词连写十大好处 ('Written Chinese Word Segmentation Revisited: Ten Advantages of Word-segmented Writing')". Journal of Chinese Information Processing. 12 (3): 57–63.
    • Zhang, Xiaoheng (2006). "The Number, Point and Metric Systems of Font Size (字形的"号制""点制"与"米制")". Computer Engineering and Applications (计算机工程与应用). 42 (10): 175–177, 215.
    • Zhang, Xiaoheng (2016). "Computational Linguistics". The Routledge Encyclopedia of the Chinese Language. Oxford: Routledge. pp. 420–437. ISBN 978-0-415-53970-8.