Native-language identification

Native-language identification (NLI) is the task of determining an author's native language (L1) based only on their writings in a second language (L2). [1] NLI works by identifying language-usage patterns that are common to specific L1 groups and then applying this knowledge to predict the native language of previously unseen texts. The task is motivated in part by applications in second-language acquisition, language teaching and forensic linguistics, among others.

Overview

NLI works under the assumption that an author's L1 disposes them towards particular language-production patterns in their L2. This relates to cross-linguistic influence (CLI), a key topic in the field of second-language acquisition (SLA) that analyzes transfer effects from the L1 onto later-learned languages.

Using large-scale English data, NLI methods achieve over 80% accuracy in predicting the native language of texts written by authors from 11 different L1 backgrounds. [2] This compares with a random-guess baseline of about 9% (1 in 11).

Applications

Pedagogy and language transfer

The identification of L1-specific features has been used to study language transfer effects in second-language acquisition. [3] This is useful for developing pedagogical material, teaching methods and L1-specific instruction, and for generating learner feedback tailored to a learner's native language.

Forensic linguistics

NLI can also be applied in forensic linguistics as a form of authorship profiling, inferring attributes of an author such as their linguistic background. This is particularly useful when a text, e.g. an anonymous letter, is the key piece of evidence in an investigation, and clues about the writer's native language can help investigators identify its source. The task has already attracted interest and funding from intelligence agencies. [4]

Methodology

Natural language processing methods are used to extract and identify language-usage patterns common to speakers of an L1 group. This is done using language-learner data, usually from a learner corpus. Next, machine learning is applied to train classifiers, such as support vector machines, to predict the L1 of unseen texts. [5] A range of ensemble-based systems has also been applied to the task and shown to improve performance over single-classifier systems. [6] [7]
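A minimal sketch of such a pipeline is shown below, using scikit-learn to train a linear support vector machine on simple lexical features. The four inline texts and their L1 labels are invented placeholders, not data from any actual learner corpus; a real system would train on thousands of essays with known author L1s.

```python
# Minimal NLI pipeline sketch: lexical features + linear SVM (scikit-learn).
# The texts and L1 labels below are invented placeholders; real systems
# train on a learner corpus of L2 texts with known author L1s.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "I am agree with the argument presented in the lecture.",
    "He is knowing the answer since a long time.",
    "The informations in the reading passage are very usefull.",
    "I enjoy very much to read books in the evening.",
]
train_l1s = ["Spanish", "Hindi", "French", "Spanish"]  # invented labels

# TF-IDF-weighted word unigrams and bigrams feed a linear SVM, a common
# classifier choice in the NLI literature.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_texts, train_l1s)

print(model.predict(["She said me that the exam was very difficult."]))
```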

Various linguistic feature types have been applied to this task. These include syntactic features such as constituent parses, grammatical dependencies and part-of-speech tags. Surface-level lexical features such as character, word and lemma n-grams have also been found to be quite useful. Character n-grams [8] [9] appear to be the single most effective feature type for the task.
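To make the character n-gram representation concrete, the short sketch below extracts overlapping character trigrams from an invented learner-style sentence. Such sub-word features pick up on misspellings, affix choices and function-word sequences without any linguistic analysis; in a full system the n-gram counts would be weighted and fed to a classifier such as the SVM sketched above.

```python
# Sketch of character n-gram extraction, the strongest single feature
# type reported for NLI. The sample sentence is an invented example.
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "the informations are usefull"
print(char_ngrams(sample, 3)[:8])
# ['the', 'he ', 'e i', ' in', 'inf', 'nfo', 'for', 'orm']
```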

2013 shared task

The Building Educational Applications (BEA) workshop at NAACL 2013 hosted the inaugural NLI shared task. [10] The competition drew submissions from 29 teams across the globe, 24 of which also published a paper describing their systems and approaches.

See also

Outline of linguistics
Natural language processing
Word-sense disambiguation
Cognitive linguistics
Language transfer
Association for Computational Linguistics
Second-language acquisition
Interlanguage
Contrastive analysis
Generative second-language acquisition
Critical period hypothesis
Language identification
Michael Sharwood Smith
Grammaticality
Error (linguistics)
Textual entailment
Deep linguistic processing
Outline of natural-language processing
Word embedding
Scott Crossley

References

  1. Wong, Sze-Meng Jojo, and Mark Dras. "Exploiting parse structures for native language identification". Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
  2. Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. "A Report on the 2017 Native Language Identification Shared Task". In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 62–75, Copenhagen, Denmark. Association for Computational Linguistics.
  3. Malmasi, Shervin, and Mark Dras. "Language Transfer Hypotheses with Linear SVM Weights." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
  4. Ria Perkins. 2014. "Linguistic identifiers of L1 Persian speakers writing in English: NLID for authorship analysis". Ph.D. thesis, Aston University.
  5. Tetreault, Joel, et al. "Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification". In Proceedings of the International Conference on Computational Linguistics (COLING), 2012.
  6. Malmasi, Shervin, Sze-Meng Jojo Wong, and Mark Dras. "NLI Shared Task 2013: MQ submission". Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. 2013.
  7. Habic, Vuk, Alexander Semenov, and Eduardo Pasiliao. "Multitask deep learning for native language identification". Knowledge-Based Systems, 2020.
  8. Radu Tudor Ionescu, Marius Popescu and Aoife Cahill. "String Kernels for Native Language Identification: Insights from Behind the Curtains", Computational Linguistics, 2016
  9. Radu Tudor Ionescu and Marius Popescu. "Can string kernels pass the test of time in Native Language Identification?". In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), 2017.
  10. Tetreault, Joel, et al. "A Report on the First Native Language Identification Shared Task". In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 2013.