Phrase chunking

Phrase chunking is a phase of natural language processing that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases (abbreviated NP, VP, and PP, respectively). Typically, each subconstituent, or chunk, is denoted by brackets. [1] For example, the sentence "He reckons the current account deficit will narrow to only £1.8 billion in September." can be chunked as "[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only £1.8 billion] [PP in] [NP September]."
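As a concrete illustration, a minimal rule-based chunker can be sketched with NLTK's RegexpParser (the toolkit itself is described further below); the grammar here is a toy example, not a complete chunk grammar:

    import nltk

    # Toy chunk grammar over Penn Treebank POS tags; each rule names a
    # chunk type and the tag pattern that forms it.
    grammar = r"""
      NP: {<DT>?<JJ>*<NN.*>+}   # optional determiner, adjectives, noun(s)
      PP: {<IN><NP>}            # preposition followed by an NP chunk
      VP: {<VB.*><NP|PP>*}      # verb followed by NP and/or PP chunks
    """
    chunker = nltk.RegexpParser(grammar)

    # Input is a POS-tagged sentence; output is a bracketed chunk tree,
    # roughly: (S (NP the little dog) (VP barked (PP at (NP the cat))))
    sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
                ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    print(chunker.parse(sentence))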

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of linguistics and computer science. It is primarily concerned with processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Waveform Audio File Format is an audio file format standard, developed by IBM and Microsoft, for storing an audio bitstream on personal computers. It is the main format used on Microsoft Windows systems for uncompressed audio. The usual bitstream encoding is the linear pulse-code modulation (LPCM) format.
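As a minimal illustration of LPCM in a WAV container, Python's standard wave module can write such a stream directly (the tone, sample rate, and file name below are arbitrary choices for the sketch):

    import math
    import struct
    import wave

    RATE = 44100  # samples per second

    # Write one second of a 440 Hz sine tone as mono 16-bit linear PCM.
    with wave.open("tone.wav", "wb") as f:
        f.setnchannels(1)     # mono
        f.setsampwidth(2)     # 2 bytes per sample = 16-bit LPCM
        f.setframerate(RATE)
        f.writeframes(b"".join(
            struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * n / RATE)))
            for n in range(RATE)))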

In linguistics and natural language processing, a corpus or text corpus is a dataset of language resources, either born-digital or digitized from older material, and either annotated or unannotated.


In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.
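Pointwise mutual information (PMI) is one common statistic for "co-occurring more often than would be expected by chance"; a minimal sketch using NLTK's collocation utilities and its bundled Genesis sample corpus (an arbitrary choice of text):

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    nltk.download("genesis", quiet=True)
    words = nltk.corpus.genesis.words("english-web.txt")

    # Rank adjacent word pairs by PMI, ignoring very rare pairs whose
    # PMI scores are statistically unreliable.
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(3)
    print(finder.nbest(BigramAssocMeasures().pmi, 10))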

The Ho-Chunk language, also known as Winnebago, is the language of the Ho-Chunk people of the Ho-Chunk Nation of Wisconsin and Winnebago Tribe of Nebraska. The language is part of the Siouan language family and is closely related to other Chiwere Siouan dialects, including those of the Iowa, Missouria, and Otoe.

An interlanguage is an idiolect which has been developed by a learner of a second language (L2) which preserves some features of their first language (L1) and can overgeneralize some L2 writing and speaking rules. These two characteristics give an interlanguage its unique linguistic organization. It is idiosyncratically based on the learner's experiences with L2. An interlanguage can fossilize, or cease developing, in any of its developmental stages. It is claimed that several factors shape interlanguage rules, including L1 transfer, previous learning strategies, strategies of L2 acquisition, L2 communication strategies, and the overgeneralization of L2 language patterns.

Shallow parsing is an analysis of a sentence that first identifies the constituent parts of the sentence and then links them to higher-order units that have discrete grammatical meanings. While the most elementary chunking algorithms simply link constituents on the basis of simple search patterns, approaches that use machine learning techniques can take contextual information into account and thus compose chunks that better reflect the semantic relations between the basic constituents. That is, these more advanced methods get around the problem that combinations of elementary constituents can have different higher-level meanings depending on the context of the sentence.
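A minimal sketch of such a learned chunker, in the style of the NLTK book: a bigram tagger learns to map part-of-speech tag sequences to IOB chunk tags from the CoNLL-2000 shared-task data, so the chunk tag chosen for a word depends on the tags around it:

    import nltk

    nltk.download("conll2000", quiet=True)
    from nltk.corpus import conll2000

    class BigramChunker(nltk.ChunkParserI):
        """Assigns IOB chunk tags to POS-tag sequences using bigram statistics."""
        def __init__(self, train_sents):
            train_data = [[(pos, iob) for _, pos, iob in nltk.chunk.tree2conlltags(s)]
                          for s in train_sents]
            self.tagger = nltk.BigramTagger(
                train_data,
                backoff=nltk.UnigramTagger(train_data,
                                           backoff=nltk.DefaultTagger("O")))

        def parse(self, tagged_sent):
            pos_tags = [pos for _, pos in tagged_sent]
            iob_tags = [iob for _, iob in self.tagger.tag(pos_tags)]
            conlltags = [(w, pos, iob)
                         for (w, pos), iob in zip(tagged_sent, iob_tags)]
            return nltk.chunk.conlltags2tree(conlltags)

    train = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
    test = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
    chunker = BigramChunker(train)
    print(chunker.parse(test[0].leaves()))   # chunk a held-out sentence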

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook.
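A brief tour of a few of these functions; note that the downloaded resource names can vary slightly across NLTK versions:

    import nltk

    # Models for the tokenizer and the POS tagger (names as of NLTK 3.x).
    for resource in ("punkt", "averaged_perceptron_tagger"):
        nltk.download(resource, quiet=True)

    text = "NLTK was developed at the University of Pennsylvania."
    tokens = nltk.word_tokenize(text)                       # tokenization
    print(nltk.pos_tag(tokens))                             # POS tagging
    print([nltk.PorterStemmer().stem(t) for t in tokens])   # stemming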

Robin Tolmach Lakoff is a professor emerita of linguistics at the University of California, Berkeley. Her 1975 book Language and Woman's Place is often credited with making language and gender a major debate in linguistics and other disciplines.


In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
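For illustration, a single Penn-Treebank-style bracketing can be parsed into an NLTK Tree, showing the kind of structure a treebank records for every sentence (the sentence here is a made-up example):

    import nltk

    # One syntactically annotated sentence in bracketed notation.
    t = nltk.Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
    t.pretty_print()      # draws the tree as ASCII art
    print(t.leaves())     # ['the', 'dog', 'barked']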

In lexicography, a lexical item is a single word, a part of a word, or a chain of words (catena) that forms the basic elements of a language's lexicon (≈ vocabulary). Examples are cat, traffic light, take care of, by the way, and it's raining cats and dogs. Lexical items can be generally understood to convey a single meaning, much as a lexeme, but are not limited to single words. Lexical items are like semes in that they are "natural units" translating between languages, or in learning a new language. In this last sense, it is sometimes said that language consists of grammaticalized lexis, and not lexicalized grammar. The entire store of lexical items in a language is called its lexis.

Terminology extraction is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
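A naive sketch of the idea: tag and chunk the text, then treat frequent noun-phrase chunks as candidate terms (the two-sentence corpus and the chunk pattern below are stand-ins for a real domain corpus and grammar):

    import nltk
    from collections import Counter

    for resource in ("punkt", "averaged_perceptron_tagger"):
        nltk.download(resource, quiet=True)

    corpus = ("Terminology extraction finds domain terms. "
              "Candidate terms are often noun phrases, and noun phrases "
              "are found by chunking.")

    # Candidate terms: optional adjectives followed by one or more nouns.
    grammar = nltk.RegexpParser("TERM: {<JJ>*<NN.*>+}")

    candidates = Counter()
    for sent in nltk.sent_tokenize(corpus):
        tree = grammar.parse(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees(lambda t: t.label() == "TERM"):
            candidates[" ".join(w.lower() for w, _ in subtree.leaves())] += 1

    print(candidates.most_common(5))   # most frequent candidates first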

Linguistic categories include lexical, syntactic, and grammatical categories.


Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.

MontyLingua is a popular natural language processing toolkit. It is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for both the Python and Java programming languages. It is enriched with common-sense knowledge about the everyday world from Open Mind Common Sense. From English sentences it extracts subject/verb/object tuples; adjectives, noun phrases, and verb phrases; and people's names, places, events, dates and times, and other semantic information. It does not require training. It was written by Hugo Liu at MIT in 2003. Because it is enriched with common-sense knowledge, it can avoid many mistakes.

In natural language processing, semantic role labeling is the process of assigning labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.

The lexical approach is a method of teaching foreign languages described by Michael Lewis in the early 1990s. The basic concept on which this approach rests is the idea that an important part of learning a language consists of being able to understand and produce lexical phrases as chunks. Students taught in this way learn to perceive patterns of language (grammar) and to have meaningful set uses of words at their disposal. In 2000, Norbert Schmitt, an American linguist and Professor of Applied Linguistics at the University of Nottingham in the United Kingdom, contributed to a learning theory supporting the lexical approach. He stated that "the mind stores and processes these [lexical] chunks as individual wholes." The brain's short-term capacity is much more limited than its long-term capacity, so it is much more efficient to pull up a lexical chunk as a single piece of information than to pull up each word as a separate piece.

The following outline is provided as an overview of and topical guide to natural-language processing.

Manually Annotated Sub-Corpus (MASC) is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.

Usage-based linguistics is an approach within a broader functional/cognitive framework that emerged in the late 1980s and assumes a profound relation between linguistic structure and usage. It challenges the dominant focus of 20th-century linguistics on treating language as an isolated system removed from its use in human interaction and human cognition. Rather, usage-based models posit that linguistic information is expressed via context-sensitive mental processing and mental representations, which can succinctly account for the complexity of actual language use at all levels. Broadly speaking, a usage-based model of language accounts for language acquisition and processing, synchronic and diachronic patterns, and both low-level and high-level structure in language by looking at actual language use.

References

  1. Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning (CoNLL), pages 127–132, Morristown, NJ, USA. Association for Computational Linguistics.