Natural Language Toolkit

Natural Language Toolkit
Original author(s): Steven Bird, Edward Loper, Ewan Klein
Developer(s): Team NLTK
Initial release: 2001 [1]
Stable release: 3.8.1 / 2 January 2023 [2]
Written in: Python
Type: Natural language processing
License: Apache 2.0 [3]
Website: www.nltk.org

[Figure: parse tree generated with NLTK, illustrating noun phrase chunking]

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities. [4] It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. [5] NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, [6] plus a cookbook. [7]
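
A minimal sketch of that pipeline in Python, assuming NLTK is installed and the models used below have been fetched with nltk.download (the sample sentence is our own illustration, not from the toolkit's documentation):

    import nltk
    from nltk.stem import PorterStemmer

    # One-time downloads of the models the calls below rely on
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "NLTK supports tokenization, stemming, tagging, and parsing."

    tokens = nltk.word_tokenize(text)   # split the text into word tokens
    tagged = nltk.pos_tag(tokens)       # attach a part-of-speech tag to each token
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]  # e.g. "supports" -> "support"

    print(tagged)
    print(stems)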

NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. [8] NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. It is used in courses at 32 universities in the US and in 25 countries.

Library highlights

  - Discourse representation [9]
  - Lexical analysis: word and text tokenizers
  - n-gram and collocations (illustrated in the sketch below)
  - Part-of-speech tagging
  - Tree model and text chunker
  - Named-entity recognition
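
A hedged sketch of the n-gram and collocation support listed above, using the Genesis sample corpus that ships with NLTK's data:

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    nltk.download("genesis")  # small sample corpus distributed with NLTK's data

    words = nltk.corpus.genesis.words("english-web.txt")
    finder = BigramCollocationFinder.from_words(words)

    # The ten word pairs that co-occur most strongly, by likelihood ratio
    measures = BigramAssocMeasures()
    print(finder.nbest(measures.likelihood_ratio, 10))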


Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

<span class="mw-page-title-main">Natural language processing</span> Field of linguistics and computer science

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.
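
NLTK bundles many such corpora; a brief sketch of a corpus-driven frequency count over the Brown corpus sample (assuming the corpus data has been downloaded):

    import nltk
    from nltk import FreqDist
    from nltk.corpus import brown

    nltk.download("brown")

    # Word-frequency profile of the news portion of the Brown corpus
    fdist = FreqDist(w.lower() for w in brown.words(categories="news"))
    print(fdist.most_common(10))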

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video and documents, can also be seen as information extraction.
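
NLTK's named-entity chunker covers one common information extraction step; a short sketch (the sentence is an invented example):

    import nltk

    # Models required by the tokenizer, tagger, and named-entity chunker
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg)

    sentence = "Steven Bird and Edward Loper developed NLTK at the University of Pennsylvania."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(tree)  # subtrees labelled PERSON, ORGANIZATION, GPE, and so on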

Shallow parsing is an analysis of a sentence which first identifies constituent parts of sentences and then links them to higher order units that have discrete grammatical meanings. While the most elementary chunking algorithms simply link constituent parts on the basis of elementary search patterns, approaches that use machine learning techniques can take contextual information into account and thus compose chunks in such a way that they better reflect the semantic relations between the basic constituents. That is, these more advanced methods get around the problem that combinations of elementary constituents can have different higher level meanings depending on the context of the sentence.
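
This is the kind of noun phrase chunking shown in the figure above. A minimal sketch using NLTK's regular-expression chunker, with a toy grammar adapted from the NLTK book:

    import nltk

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    sentence = "The little yellow dog barked at the cat."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # An NP chunk: an optional determiner, any number of adjectives, then a noun
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    print(chunker.parse(tagged))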

<span class="mw-page-title-main">Treebank</span>

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
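
NLTK distributes a small sample of the Penn Treebank; a sketch of reading one of its parsed sentences (assuming the 'treebank' data package has been downloaded):

    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank")  # sample of the Penn Treebank included in NLTK's data

    # The first parsed sentence of the first Wall Street Journal file
    tree = treebank.parsed_sents("wsj_0001.mrg")[0]
    tree.pretty_print()  # render the syntax tree as ASCII art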

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.
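
NLTK addresses this with the unsupervised Punkt tokenizer; a sketch on abbreviation-laden input (the expected split is our annotation and may vary by model version):

    import nltk

    nltk.download("punkt")  # pretrained Punkt sentence tokenizer models

    text = "Dr. Smith earned 4.5 million dollars at Acme Inc. last year. She now teaches NLP."
    for sent in nltk.sent_tokenize(text):
        print(sent)
    # Expected: the periods in "Dr.", "4.5", and "Inc." are not treated as
    # sentence boundaries, so the text splits into two sentences.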

<span class="mw-page-title-main">General Architecture for Text Engineering</span>

General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.

MontyLingua is a popular natural language processing toolkit. It is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for both the Python and Java programming languages. It is enriched with common sense knowledge about the everyday world from Open Mind Common Sense. From English sentences, it extracts subject/verb/object tuples, adjectives, noun phrases and verb phrases, as well as people's names, places, events, dates and times, and other semantic information. It does not require training. It was written by Hugo Liu at MIT in 2003. Because it is enriched with common sense knowledge, it can avoid many mistakes.

<span class="mw-page-title-main">Apache OpenNLP</span> Open-source natural language processing library

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as language detection, tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution. These tasks are usually required to build more advanced text processing services.

<span class="mw-page-title-main">Apache cTAKES</span> Natural language processing system

Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context, and negated/not negated.

An outline of natural language processing provides an overview of and topical guide to the field.

In natural language processing (NLP), a text graph is a graph representation of a text item. It is typically created as a preprocessing step to support NLP tasks such as text condensation, term disambiguation, (topic-based) text summarization, relation extraction and textual entailment.
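
A toy illustration of the idea, using the third-party networkx package; the construction rule here (linking adjacent tokens) is our own simplification:

    import networkx as nx

    # Build a tiny text graph: nodes are tokens, edges link adjacent tokens
    tokens = "nltk is a toolkit for natural language processing".split()
    graph = nx.Graph()
    graph.add_edges_from(zip(tokens, tokens[1:]))

    print(sorted(graph.nodes()))
    print(graph.number_of_edges())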

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.
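
A minimal usage sketch, assuming spaCy and its small English model (en_core_web_sm) are installed:

    import spacy

    # The model is installed separately: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("NLTK and spaCy are both natural language processing libraries for Python.")
    print([(token.text, token.pos_) for token in doc])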

Ruslan Mitkov is a professor at the University of Wolverhampton and a researcher in natural language processing and computational linguistics. He completed his PhD at the Technical University of Dresden under the supervision of Nikolaus Joachim Lehmann. He has published more than 240 refereed papers and is best known for his contributions to anaphora resolution and for his seminal work on the computer-aided generation of multiple-choice tests.

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library.

Bidirectional Encoder Representations from Transformers (BERT) is a family of masked-language models introduced in 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

<span class="mw-page-title-main">CorCenCC</span> Welsh corpus

CorCenCC or the National Corpus of Contemporary Welsh is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication, and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

References

  1. "Project site on SourceForge". 9 July 2001.
  2. "Release 3.8.1". 2 January 2023. Retrieved 27 April 2023.
  3. "NLTK License". NLTK Project. Retrieved 14 February 2015.
  4. "NLTK Courses". Google Docs. Retrieved 15 June 2016.
  5. "Preface". www.nltk.org. Retrieved 15 June 2016.
  6. Bird, Steven; Klein, Ewan; Loper, Edward (2009). Natural Language Processing with Python. O'Reilly Media Inc. ISBN 978-0-596-51649-9.
  7. Perkins, Jacob (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing. ISBN 978-1849513609.
  8. Bird, Steven; Klein, Ewan; Loper, Edward; Baldridge, Jason (2008). "Multidisciplinary instruction with the Natural Language Toolkit" (PDF). Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, ACL. Archived from the original (PDF) on 2 September 2011.
  9. "NLTK :: Sample usage for drt". www.nltk.org. Retrieved 14 July 2022.