Computational linguistics

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

Origins

The field has overlapped with artificial intelligence since the 1950s, when efforts began in the United States to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English. [1] Since rule-based approaches were able to make arithmetic (systematic) calculations much faster and more accurately than humans, it was expected that lexicon, morphology, syntax and semantics could be learned using explicit rules as well. After the failure of rule-based approaches, David Hays [2] coined the term "computational linguistics" to distinguish the field from AI, and co-founded both the Association for Computational Linguistics (ACL) and the International Committee on Computational Linguistics (ICCL) in the 1970s and 1980s. What started as an effort to translate between languages evolved into the much wider field of natural language processing. [3] [4]

Annotated corpora

Careful study of the English language required an annotated text corpus. The Penn Treebank [5] was one of the most widely used such corpora. It consisted of IBM computer manuals, transcribed telephone conversations, and other texts, together containing over 4.5 million words of American English, annotated using both part-of-speech tagging and syntactic bracketing. [6]
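
To illustrate what part-of-speech tagging combined with syntactic bracketing looks like, here is a minimal Python sketch that reads one bracketed parse and recovers the tagged words. The example sentence is made up rather than taken from the corpus, the tag names (DT, NN, VBD) follow the style commonly associated with the Penn Treebank, and the helper names parse_bracketed and tagged_words are hypothetical.

```python
def parse_bracketed(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]  # syntactic category or part-of-speech tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_words(tree):
    """Return (word, tag) pairs for the leaves, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Made-up example sentence, not taken from the corpus.
example = "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))"
tree = parse_bracketed(example)
print(tagged_words(tree))
# [('the', 'DT'), ('dog', 'NN'), ('chased', 'VBD'), ('a', 'DT'), ('cat', 'NN')]
```

The nested tuples keep the bracketing (which words form a noun phrase or verb phrase), while the flattened pairs keep only the part-of-speech layer.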

An analysis of Japanese sentence corpora found that sentence lengths follow a log-normal pattern. [7]
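
A length distribution is log-normal when the logarithm of the length is normally distributed. The Python sketch below shows one simple way to estimate the two parameters of such a distribution from a list of sentence lengths. The lengths are invented for illustration and are not data from the cited study, and the function names are hypothetical.

```python
import math
import statistics


def lognormal_parameters(lengths):
    """Estimate mu and sigma as the mean and sample standard deviation of ln(length)."""
    logs = [math.log(n) for n in lengths]
    return statistics.mean(logs), statistics.stdev(logs)


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution with parameters mu and sigma at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


# Invented sentence lengths (in characters), for illustration only.
sentence_lengths = [12, 18, 25, 9, 40, 31, 22, 15, 27, 60]
mu, sigma = lognormal_parameters(sentence_lengths)
median_length = math.exp(mu)  # the median of a log-normal distribution is exp(mu)
print(f"mu={mu:.3f} sigma={sigma:.3f} median={median_length:.1f}")
```

Comparing a histogram of observed lengths against lognormal_pdf with the fitted parameters is one informal way to check how well the log-normal pattern holds for a given corpus.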

Modeling language acquisition

During language acquisition, children are largely exposed only to positive evidence: [8] they receive evidence of which forms are correct, but no evidence of which forms are incorrect. [9] This was a limitation for the models of the time, because the deep learning models available today did not exist in the late 1980s. [10]

It has been shown that languages can be learned with a combination of simple input presented incrementally as the child develops better memory and a longer attention span, which would explain the long period of language acquisition in human infants and children. [11]

Robots have been used to test linguistic theories. [12] So that the robots could learn as children might, models were built on an affordance model in which mappings between actions, perceptions, and effects were created and linked to spoken words. Crucially, these robots were able to acquire functioning word-to-meaning mappings without needing grammatical structure.
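
The idea of acquiring word-to-meaning mappings without grammar can be sketched with plain co-occurrence counting. This is not the model used in the cited robot experiments; it is a much simpler association score swapped in for illustration, and the episodes, concept labels, and the function name learn_word_meanings are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_word_meanings(episodes):
    """Map each word to the concept it is most strongly associated with.

    episodes: list of (words, concepts) pairs. Word order and grammar are
    ignored; only co-occurrence within an episode is counted.
    """
    word_counts = Counter()
    concept_counts = Counter()
    pair_counts = defaultdict(Counter)
    for words, concepts in episodes:
        concept_counts.update(concepts)
        for word in words:
            word_counts[word] += 1
            pair_counts[word].update(concepts)

    meanings = {}
    for word, counts in pair_counts.items():
        def association(concept):
            # Squared co-occurrence, normalised by how common each side is.
            return counts[concept] ** 2 / (word_counts[word] * concept_counts[concept])

        best = max(counts, key=association)
        meanings[word] = (best, association(best))
    return meanings


# Hypothetical episodes: words heard while an action and object are observed.
episodes = [
    (["robot", "grasps", "ball"], ["grasp", "ball"]),
    (["robot", "taps", "ball"], ["tap", "ball"]),
    (["robot", "grasps", "box"], ["grasp", "box"]),
    (["robot", "taps", "box"], ["tap", "box"]),
]
meanings = learn_word_meanings(episodes)
for word, (concept, score) in meanings.items():
    print(word, "->", concept, score)
```

In this toy data, "grasps", "taps", "ball" and "box" each reach a score of 1.0 for the intended concept, while "robot", which is heard in every episode, scores only 0.5 for every concept and so carries no specific meaning.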

Using the Price equation and Pólya urn dynamics, researchers have created a system which not only predicts future linguistic evolution but also gives insight into the evolutionary history of modern-day languages. [13]
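
As a rough illustration of the Pólya urn half of that approach, the Python sketch below simulates a basic urn in which each drawn ball is returned together with one extra ball of the same kind, so that variants already in frequent use become more likely to be used again. It does not reproduce the cited system or its use of the Price equation; the variant names, the step count, and the function name polya_urn are assumptions made for the example.

```python
import random


def polya_urn(initial_counts, steps, rng):
    """Simulate a basic Pólya urn and return the final count of each variant."""
    counts = dict(initial_counts)
    for _ in range(steps):
        variants = list(counts)
        weights = [counts[v] for v in variants]
        # Draw one variant with probability proportional to its current count.
        drawn = rng.choices(variants, weights=weights, k=1)[0]
        # Return the ball and add one more of the same kind (reinforcement).
        counts[drawn] += 1
    return counts


rng = random.Random(0)  # fixed seed so the run is repeatable
final = polya_urn({"variant_a": 1, "variant_b": 1}, steps=1000, rng=rng)
total = sum(final.values())
shares = {variant: count / total for variant, count in final.items()}
print(shares)
```

Running the simulation with different seeds gives different long-run shares, which is the characteristic behaviour of this kind of urn: early chance events are amplified rather than averaged out.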

Chomsky's theories

Chomsky's theories have influenced computational linguistics, particularly work on how infants learn complex grammatical structures, such as those that can be described by grammars in Chomsky normal form. [14] Attempts have been made to determine how an infant learns a "non-normal grammar", in the sense of Chomsky normal form. [9] Research in this area combines structural approaches with computational models to analyze large linguistic corpora such as the Penn Treebank, helping to uncover patterns in language acquisition. [15]
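
For readers unfamiliar with the term, a context-free grammar is in Chomsky normal form when every rule rewrites a nonterminal either as exactly two nonterminals or as a single terminal (with an optional rule letting the start symbol produce the empty string). The Python sketch below checks that property for a small grammar. The grammar, the rule representation as (left-hand side, right-hand side) pairs, and the function name are illustrative assumptions.

```python
def is_chomsky_normal_form(rules, start):
    """Check whether a context-free grammar is in Chomsky normal form.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    A symbol counts as a nonterminal if it appears on some left-hand side.
    """
    nonterminals = {lhs for lhs, _ in rules}
    for lhs, rhs in rules:
        if len(rhs) == 1 and rhs[0] not in nonterminals:
            continue  # A -> a (a single terminal)
        if len(rhs) == 2 and all(s in nonterminals and s != start for s in rhs):
            continue  # A -> B C (two nonterminals, neither the start symbol)
        if lhs == start and len(rhs) == 0:
            continue  # S -> empty string
        return False
    return True


# Toy grammar in Chomsky normal form.
cnf_rules = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("N", ("cat",)),
    ("V", ("chased",)),
]

# Toy grammar that is not in Chomsky normal form: one rule has three symbols.
non_cnf_rules = [
    ("S", ("NP", "V", "NP")),
    ("NP", ("dog",)),
    ("V", ("chased",)),
]

print(is_chomsky_normal_form(cnf_rules, "S"))      # True
print(is_chomsky_normal_form(non_cnf_rules, "S"))  # False
```

Any context-free grammar can be converted into an equivalent one in this form, which is why the restriction is convenient for parsing algorithms without limiting which context-free languages can be described.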

References

  1. John Hutchins: Retrospect and prospect in computer-based translation. Proceedings of MT Summit VII, 1999, pp. 30–44. Archived 2008-04-14 at the Wayback Machine.
  2. "Deceased members". ICCL members. Archived from the original on 17 May 2017. Retrieved 15 November 2017.
  3. Natural Language Processing by Liz Liddy, Eduard Hovy, Jimmy Lin, John Prager, Dragomir Radev, Lucy Vanderwende, Ralph Weischedel
  4. Arnold B. Barach: Translating Machine 1975: And the Changes To Come.
  5. Marcus, M. & Marcinkiewicz, M. (1993). "Building a large annotated corpus of English: The Penn Treebank" (PDF). Computational Linguistics. 19 (2): 313–330. Archived (PDF) from the original on 2022-10-09.
  6. Taylor, Ann (2003). "1". Treebanks. Springer Netherlands. pp. 5–22.
  7. Furuhashi, S. & Hayakawa, Y. (2012). "Lognormality of the Distribution of Japanese Sentence Lengths". Journal of the Physical Society of Japan. 81 (3): 034004. Bibcode:2012JPSJ...81c4004F. doi:10.1143/JPSJ.81.034004.
  8. Bowerman, M. (1988). The "no negative evidence" problem: How do children avoid constructing an overly general grammar. Explaining language universals.
  9. Braine, M.D.S. (1971). On two types of models of the internalization of grammars. In D.I. Slobin (Ed.), The ontogenesis of grammar: A theoretical perspective. New York: Academic Press.
  10. Powers, D.M.W. & Turk, C.C.R. (1989). Machine Learning of Natural Language. Springer-Verlag. ISBN 978-0-387-19557-5.
  11. Elman, Jeffrey L. (1993). "Learning and development in neural networks: The importance of starting small". Cognition. 48 (1): 71–99. CiteSeerX 10.1.1.135.4937. doi:10.1016/0010-0277(93)90058-4. PMID 8403835. S2CID 2105042.
  12. Salvi, G.; Montesano, L.; Bernardino, A.; Santos-Victor, J. (2012). "Language bootstrapping: learning word meanings from the perception-action association". IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics. 42 (3): 660–71. arXiv:1711.09714. doi:10.1109/TSMCB.2011.2172420. PMID 22106152. S2CID 977486.
  13. Gong, T.; Shuai, L.; Tamariz, M. & Jäger, G. (2012). E. Scalas (ed.). "Studying Language Change Using Price Equation and Pólya-urn Dynamics". PLOS ONE. 7 (3): e33171. Bibcode:2012PLoSO...733171G. doi:10.1371/journal.pone.0033171. PMC 3299756. PMID 22427981.
  14. Yogita, Bansal (2016). "Insight to Computational Linguistics" (PDF). International Journal 4.10. p. 94. Retrieved September 22, 2024.
  15. Yogita, Bansal (2016). "Insight to Computational Linguistics" (PDF). International Journal 4.10. p. 94. Retrieved September 22, 2024.

Further reading