LIVAC Synchronous Corpus

LIVAC
Developer(s) Chilin (HK) Ltd.
Initial release July 1995
Stable release V3.1 / Feb 2024
Operating system Cross-platform
Available in English, Traditional and Simplified Chinese
Type Corpus
Website www.livac.org

LIVAC is an uncommon language corpus that has been dynamically maintained since 1995. Unlike other existing corpora, LIVAC has adopted a rigorous and regular "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Beijing, Hong Kong, Macau, Taipei, Singapore, Shanghai, Guangzhou, and Shenzhen. [1] The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Taiwan Strait news, as well as news on finance, sports and entertainment. [2] By 2023, more than 3 billion characters of news media texts had been filtered, of which 700 million characters had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has also accumulated a large amount of accurate and meaningful statistical data on the Chinese language and on its diverse speech communities in the Pan-Chinese context, and the results show considerable and important long-standing as well as evolving variations. [3] [4]

The "Windows" approach is the most innovative feature of LIVAC and has enabled Pan-Chinese media texts to be quantitatively analyzed according to various attributes such as location, time and subject domain. This has made possible various types of comparative studies and applications in information technology, as well as the development of related innovative applications. [5] [6] Moreover, LIVAC has allowed longitudinal developments to be taken into account, facilitating Key Word in Context (KWIC) search and comprehensive study of target words and their underlying concepts as well as linguistic structures over the past 25 years, based on the above-mentioned variables of location, time and subject. Results from the extensive and cumulative data analysis contained in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names, new words, and bi-weekly and annual rosters of media figures. Related applications have included the establishment of verb and adjective databases, the formulation of sentiment indices and related opinion mining to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed the Pan-Chinese Newsmaker Rosters), [7] [8] [9] [10] [11] and the compilation of new word databases (LIVAC Annual Pan-Chinese New Word Rosters). [12] [13] [14] [15] [16] On this basis, the analysis of the emergence, diffusion and transformation of new words, and the publication of dictionaries of neologisms have been made possible. [17] [18]
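As an illustration of KWIC search restricted by location, time and subject domain, the following is a minimal sketch in Python. It is not LIVAC's implementation: the `Doc` record, the `kwic` function, its parameters, and the sample sentences are all hypothetical and exist only to show the idea of filtering by metadata before collecting character-level contexts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record carrying the three attributes named above."""
    location: str
    year: int
    domain: str
    text: str


def kwic(docs, keyword, width=5, location=None, year=None, domain=None):
    """Return (left, keyword, right) contexts from documents matching the filters.

    A filter set to None is ignored. Contexts are measured in characters.
    """
    hits = []
    for d in docs:
        if location is not None and d.location != location:
            continue
        if year is not None and d.year != year:
            continue
        if domain is not None and d.domain != domain:
            continue
        start = d.text.find(keyword)
        while start != -1:
            end = start + len(keyword)
            left = d.text[max(0, start - width):start]
            right = d.text[end:end + width]
            hits.append((left, keyword, right))
            start = d.text.find(keyword, end)
    return hits


# Toy data, invented for this sketch only.
SAMPLE_DOCS = [
    Doc("Hong Kong", 2020, "finance", "港股今日上升，恒生指數收市報升。"),
    Doc("Beijing", 2020, "finance", "股市今日上升，投資者情緒樂觀。"),
    Doc("Hong Kong", 2021, "sports", "港隊今日上升至世界排名第十。"),
]

if __name__ == "__main__":
    for left, kw, right in kwic(SAMPLE_DOCS, "上升", width=2, location="Hong Kong"):
        print(f"{left} [{kw}] {right}")
```

Filtering before matching keeps the comparison variables (place, period, domain) explicit, so the same keyword can be examined across speech communities or years by changing one argument.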

A recent focus is on the relative balance between disyllabic words and growing trisyllabic words in the Chinese language, [19] the comparative study of light verbs in three Chinese speech communities, [20] and language use as a reflection of epochal change in China. [21] A new LIVAC version 3.1 was launched in February 2024.

Corpus data processing

  1. Accessing media texts, manual input, etc.
  2. Text unification including conversion from simplified to traditional Chinese characters, stored as Big5 and Unicode versions
  3. Automatic word segmentation
  4. Automatic alignment of parallel texts
  5. Manual verification, part-of-speech tagging
  6. Extraction of words and addition to regional sub-corpora
  7. Combination of regional sub-corpora to update the LIVAC corpus, and master lexical database
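Steps 2 and 3 above can be sketched as follows. The source does not say which conversion table or segmentation algorithm LIVAC uses, so this sketch substitutes a toy character map and forward maximum matching, a common dictionary-based baseline. The names `S2T_TOY`, `unify`, `segment_fmm`, and the small lexicon are hypothetical.

```python
# Toy simplified-to-traditional map (assumption: real conversion needs a full
# table and handling of one-to-many mappings, which this sketch ignores).
S2T_TOY = str.maketrans({"学": "學", "语": "語"})


def unify(text):
    """Convert the simplified characters covered by the toy map to traditional."""
    return text.translate(S2T_TOY)


def segment_fmm(text, lexicon, max_len=4):
    """Segment text by forward maximum matching against a lexicon.

    At each position the longest lexicon entry (up to max_len characters) is
    taken; a character with no match is emitted as a single-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical lexicon for illustration only.
TOY_LEXICON = {"香港", "大學", "香港大學", "語言", "語言學", "研究"}

if __name__ == "__main__":
    unified = unify("香港大学语言学研究")
    print(segment_fmm(unified, TOY_LEXICON))
```

Greedy matching of this kind is fast but can mis-segment ambiguous strings, which is one reason a manual verification step, as in step 5 above, is useful.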

Labeling for data curation

  1. Categories used include general terms and proper names, such as: general names, surnames, semi titles; geographical, organizations and commercial entities, etc.; time, prepositions, locations, etc.; stack-words; loanwords; case-word; numerals, etc.
  2. Construction of databases of proper names, place names, and specific terms, etc.
  3. Generation of rosters: "new word rosters", "celebrity or media personality rosters", "place name rosters", compound words and matched words
  4. Other part-of-speech tagging for sub-databases, such as common nouns, numerals, numeral classifiers, different types of verbs and adjectives, pronouns, adverbs, prepositions, conjunctions, particles marking mood, onomatopoeia, interjections, etc.
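One simple way to think about a "new word roster" is as the set of words that occur in the current time window but are absent from the lexicon accumulated so far. The sketch below illustrates that idea only. The selection criteria actually used for the LIVAC rosters are not described here, and the function name, the `min_count` threshold, and the sample tokens are assumptions.

```python
from collections import Counter


def new_word_candidates(window_tokens, known_lexicon, min_count=2):
    """Rank words from the current window that are not in the known lexicon.

    Returns (word, count) pairs sorted by descending count, then by word,
    keeping only words seen at least min_count times.
    """
    counts = Counter(window_tokens)
    candidates = [
        (word, count)
        for word, count in counts.items()
        if word not in known_lexicon and count >= min_count
    ]
    return sorted(candidates, key=lambda wc: (-wc[1], wc[0]))


if __name__ == "__main__":
    # Toy segmented tokens and lexicon, invented for this sketch.
    tokens = ["疫情", "口罩", "疫情", "封城", "封城", "封城", "經濟"]
    known = {"經濟", "口罩"}
    print(new_word_candidates(tokens, known))
```

In practice, candidates produced this way would still need human review, since unseen strings can also be segmentation errors or proper names.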

Applications

  1. Compilation of Pan-Chinese dictionaries or local dictionaries
  2. Information technology research, such as predictive Chinese text input for mobile phones, automatic speech to text conversion, opinion mining
  3. Comparative studies on linguistic and cultural developments in the Pan-Chinese regions, especially in a critical period of history in modern China
  4. Language teaching and learning research, and speech-to-text conversion
  5. Customized service on linguistic research and lexical search for international corporations and government agencies


The above applications are provided by the following functions:

  • Word Segmentation Search
  • Phrase Search
  • Example Sentence Selection
  • Multi-word Comparison
  • Word Cloud
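
A function such as Multi-word Comparison can be pictured as a table of relative frequencies per regional sub-corpus. The following is a minimal sketch under that assumption. The function `compare_words`, the normalization per 1,000 tokens, and the toy token lists are hypothetical and do not describe the actual LIVAC interface.

```python
from collections import Counter


def compare_words(subcorpora, words, per=1000):
    """Return {region: {word: frequency per `per` tokens}} for the given words.

    subcorpora maps a region name to a list of already segmented tokens.
    """
    table = {}
    for region, tokens in subcorpora.items():
        counts = Counter(tokens)
        total = len(tokens)
        table[region] = {
            word: (counts[word] * per / total if total else 0.0)
            for word in words
        }
    return table


if __name__ == "__main__":
    # Toy sub-corpora, invented for this sketch.
    toy = {
        "Hong Kong": ["的士", "巴士", "的士", "地鐵"],
        "Beijing": ["出租車", "公交車", "地鐵", "出租車", "出租車"],
    }
    for region, row in compare_words(toy, ["的士", "出租車"]).items():
        print(region, row)
```

Normalizing by sub-corpus size matters because the regional sub-corpora need not be equally large, so raw counts alone would not be comparable.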


References

  1. Tsou, Benjamin; Lai, Tom; Chan, Samuel; and Wang, William S.-Y. (Eds). (1998). Quantitative and Computational Studies on the Chinese Language 《漢語計量與計算研究》. Language Information Sciences Research Centre, City University Press.
  2. Tsou, B. K., Kwong, O. Y. (Eds). (2015). Linguistic Corpus and Corpus Linguistics in the Chinese Context (Journal of Chinese Linguistics Monograph Series Number 25), Hong Kong: Chinese University Press.
  3. Tsou, Benjamin. (2004). "Chinese Language Processing at the Dawn of the 21st Century", in C R Huang and W Lenders (eds) Language and Linguistics Monograph Series B: Frontiers in Linguistics I, pp.189–207. Institute of Linguistics, Academia Sinica.
  4. Tsou, B. K. (2017). Loanwords in Mandarin Through Other Chinese Dialects. In R. Sybesma, W. Behr, Y. Gu, Z. Handel, C.-T. Huang & J. Myers (Eds.), The Encyclopaedia of Chinese Language and Linguistics (Vol. 2, pp. 641-647). Leiden; Boston: BRILL.
  5. Tsou, Benjamin, and Kwong, Olivia. (2015). LIVAC as a Monitoring Corpus for Tracking Trends beyond Linguistics. In Tsou, Benjamin, and Kwong, Olivia (eds.), Linguistic Corpus and Corpus Linguistics in the Chinese Context (Journal of Chinese Linguistics Monograph Series No. 25). Hong Kong: The Chinese University Press, pp. 447-471.
  6. Tsou, Benjamin. (2016). Skipantism Revisited: Along with Neologisms and Terminological Truncation. In Chin, Chi-on Andy and Kwok, Bit-chee and Tsou, Benjamin K., (eds.), Commemorative Essays for Professor Yuen-Ren Chao: Father of Modern Chinese Linguistics. Taiwan: Crane Publishing. pp. 343-357.
  7. CityU releases 2015 LIVAC Pan-Chinese Media Personality Roster, City University of Hong Kong, Hong Kong, 28 December 2015.
  8. CityU releases 2016 LIVAC Pan-Chinese Media Personality Roster, City University of Hong Kong, Hong Kong, 02 January 2017.
  9. CityU releases 2019 LIVAC Pan-Chinese Media Personality Roster, City University of Hong Kong, Hong Kong, 07 January 2019.
  10. "Pan-Chinese top newsmakers of 2020". City University of Hong Kong. 13 January 2021. Retrieved 2021-01-18.
  11. "A Big Database Approach to 2 Decades of LIVAC Pan-Chinese Newsmaker Rosters". Chilin.hk. 20 January 2023. Retrieved 2023-01-20.
  12. CityU releases 2014 Pan-Chinese New Word Rosters, City University of Hong Kong, Hong Kong, 12 February 2015.
  13. CityU releases 2015 LIVAC Pan-Chinese New Word Rosters, City University of Hong Kong, Hong Kong, 04 February 2016.
  14. CityU releases 2019 LIVAC Pan-Chinese New Word Rosters, City University of Hong Kong, Hong Kong, 09 January 2019.
  15. "New Chinese Buzz words for 2020 released by LIVAC Pan-Chinese linguistic database". City University of Hong Kong. 18 January 2021. Retrieved 2021-01-18.
  16. "New Chinese Buzz words for 2021 released by CityU". City University of Hong Kong. Retrieved 2023-01-20.
  17. 鄒嘉彥、游汝杰(編)(2007),《21世紀華語新詞語詞典》(簡體字版),上海,復旦大學出版社。
  18. 鄒嘉彥、游汝杰(編)(2010),《全球華語新詞語詞典》,北京,商務印書館。
  19. 鄒嘉彥(2019), "泛華語地區多音節詞的近20年發展:從LIVAC大數據庫探討 (Developments of polysyllabic words in Pan-Chinese in the recent decades: Investigation based on LIVAC Big Database)",《漢語歷史詞彙語法國際學術研討會(International Conference of Historical Investigations into Chinese words and Grammar)》,北京大學。
  20. Tsou, Benjamin K., and Ka-Fai Yip. "A corpus-based comparative study of light verbs in three Chinese speech communities." Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. 2020.
  21. Tsou, B. K. (2022). Some Salient as well as Divergent and Convergent Linguistic Developments in Chinese - A Big Data and Trans-Millennial Approach. The 28th Annual Conference of the International Association of Chinese Linguistics [Keynote Speech], Hong Kong.