Spark NLP

Original author(s) John Snow Labs
Initial release October 2017 [1]
Stable release 5.2.3 / January 2024
Repository github.com/JohnSnowLabs/spark-nlp
Written in Python, Scala
Operating system Linux, Windows, macOS
Type Natural language processing
License Apache License 2.0
Website sparknlp.org

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. [2] [3] [4] The library is built on top of Apache Spark and its Spark ML library. [5]

Its purpose is to provide an API for natural language processing pipelines that implement recent academic research results as production-grade, scalable, and trainable software. The library offers pre-trained neural network models, pipelines, and embeddings, as well as support for training custom models. [5]

Features

The library is designed around the concept of a pipeline, an ordered set of text annotators. [6] Out-of-the-box annotators include a tokenizer, normalizer, stemmer, lemmatizer, regular expression matcher, TextMatcher, chunker, DateMatcher, SentenceDetector, DeepSentenceDetector, POS tagger, ViveknSentimentDetector, sentiment analysis, named entity recognition, conditional random field annotator, deep learning annotator, spell checking and correction, dependency parser, typed dependency parser, document classification, and language detection. [7]
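The idea of a pipeline as an ordered set of annotators can be sketched in plain Python. This is a toy illustration of the concept only, not Spark NLP's API: the `Pipeline` class, the `tokenizer` and `normalizer` functions, and the dictionary-of-columns layout are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy model of the pipeline concept only; none of these names are Spark NLP APIs.
Annotations = Dict[str, List[str]]
Annotator = Callable[[Annotations], Annotations]


def tokenizer(doc: Annotations) -> Annotations:
    """Split the raw text on whitespace and store the result under 'token'."""
    text = doc["text"][0]
    return {**doc, "token": text.split()}


def normalizer(doc: Annotations) -> Annotations:
    """Lowercase each token, keep only alphanumeric characters, drop empties."""
    cleaned = [
        "".join(ch for ch in tok.lower() if ch.isalnum()) for tok in doc["token"]
    ]
    return {**doc, "normalized": [tok for tok in cleaned if tok]}


@dataclass
class Pipeline:
    """Runs annotators in order; each stage may read what earlier stages wrote."""

    stages: List[Annotator] = field(default_factory=list)

    def annotate(self, text: str) -> Annotations:
        doc: Annotations = {"text": [text]}
        for stage in self.stages:
            doc = stage(doc)
        return doc


pipeline = Pipeline(stages=[tokenizer, normalizer])
result = pipeline.annotate("Spark NLP builds on Apache Spark!")
print(result["normalized"])
```

Order matters in this sketch: `normalizer` reads the `token` entry that `tokenizer` writes, so swapping the two stages fails with a `KeyError`.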

The Models Hub is a platform for sharing open-source as well as licensed pre-trained models and pipelines. It includes pre-trained pipelines with tokenization, lemmatization, part-of-speech tagging, and named entity recognition for more than thirteen languages; word embeddings including GloVe, ELMo, BERT, ALBERT, XLNet, Small BERT, and ELECTRA; and sentence embeddings including Universal Sentence Encoder (USE) [8] and Language-agnostic BERT Sentence Embedding (LaBSE). [9] It also includes resources and pre-trained models for more than two hundred languages. The Spark NLP base code includes tokenizers for East Asian languages such as Chinese, Japanese, and Korean; support for right-to-left languages such as Urdu, Farsi, Arabic, and Hebrew; pre-trained multilingual word and sentence embeddings such as LaUSE; and a translation annotator.

Usage in healthcare

Spark NLP for Healthcare is a commercial extension of Spark NLP for clinical and biomedical text mining. [10] It provides healthcare-specific annotators, pipelines, models, and embeddings for clinical entity recognition, clinical entity linking, entity normalization, assertion status detection, de-identification, relation extraction, and spell checking and correction.

The library offers access to several clinical and biomedical transformers and embeddings: JSL-BERT-Clinical, BioBERT, ClinicalBERT, [11] GloVe-Med, and GloVe-ICD-O. It also includes over 50 pre-trained healthcare models that can recognize entity types such as clinical, drugs, risk factors, anatomy, demographics, and sensitive data.

Spark OCR

Spark OCR is another commercial extension of Spark NLP for optical character recognition (OCR) from images, scanned PDF documents, and DICOM files. [7] It is a software library built on top of Apache Spark. It provides several image pre-processing features for improving text recognition results, such as adaptive thresholding and denoising, skew detection and correction, adaptive scaling, layout analysis and region detection, image cropping, and removal of background objects.

Due to the tight coupling between Spark OCR and Spark NLP, users can combine NLP and OCR pipelines for tasks such as extracting text from images, extracting data from tables, recognizing and highlighting named entities in PDF documents or masking sensitive text in order to de-identify images. [12]
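The masking step in such a combined pipeline can be illustrated with a small, self-contained Python sketch. It assumes the text has already been extracted by an OCR stage and uses two hypothetical regular expressions in place of a trained entity recognizer; the names `PATTERNS`, `find_entities`, and `mask` are invented for this example and are not part of Spark OCR or Spark NLP.

```python
import re
from typing import List, Tuple

# Hypothetical patterns standing in for a trained entity recognizer.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def find_entities(text: str) -> List[Span]:
    """Return character spans of sensitive entities, sorted by start offset."""
    spans = [
        (match.start(), match.end(), label)
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans)


def mask(text: str, spans: List[Span]) -> str:
    """Replace each span with its label; assumes spans do not overlap."""
    pieces = []
    cursor = 0
    for start, end, label in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Text as it might come out of an OCR stage (made-up sample data).
ocr_text = "Patient seen on 2020-03-14, callback 555-123-4567."
masked = mask(ocr_text, find_entities(ocr_text))
print(masked)  # Patient seen on <DATE>, callback <PHONE>.
```

Keeping character offsets alongside each label is what would let a real system map a detected entity back to a region of the source image or PDF, though that mapping is outside the scope of this sketch.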

Spark OCR supports several output formats: PDF, image, or DICOM files with annotated or masked entities; digital text for downstream processing in Spark NLP or other libraries; and structured data (JSON and CSV), either as files or as Spark data frames.

Users can also distribute the OCR jobs across multiple nodes in a Spark cluster.

License and availability

Spark NLP is licensed under the Apache 2.0 license. The source code is publicly available on GitHub, along with documentation and a tutorial. Prebuilt versions of Spark NLP are available from PyPI and the Anaconda repository for Python development, from Maven Central for Java and Scala development, and from Spark Packages for Spark development.

Award

In March 2019, Spark NLP received the Open Source Award for its contributions to natural language processing in Python, Java, and Scala. [13]

References

  1. Talby, David (19 October 2017). "Introducing the Natural Language Processing Library for Apache Spark". databricks.com. databricks. Retrieved 29 March 2019.
  2. Ellafi, Saif Addin (2018-02-28). "Comparing production-grade NLP libraries: Running Spark-NLP and spaCy pipelines". O'Reilly Media. Retrieved 2019-03-29.
  3. Ellafi, Saif Addin (2018-02-28). "Comparing production-grade NLP libraries: Accuracy, performance, and scalability". O'Reilly Media. Retrieved 2019-03-29.
  4. Ewbank, Kay. "Spark Gets NLP Library". www.i-programmer.info.
  5. Thomas, Alex (July 2020). Natural Language Processing with Spark NLP: Learning to Understand Text at Scale (First ed.). United States of America: O'Reilly Media. ISBN 978-1492047766.
  6. Talby, David (2017-10-19). "Introducing the Natural Language Processing Library for Apache Spark - The Databricks Blog". Databricks. Retrieved 2019-08-27.
  7. Jha, Bineet Kumar; G, Sivasankari G.; R, Venugopal K. (May 2, 2021). "Sentiment Analysis for E-Commerce Products Using Natural Language Processing". Annals of the Romanian Society for Cell Biology: 166–175 via www.annalsofrscb.ro.
  8. Cer, Daniel; Yang, Yinfei; Kong, Sheng-yi; Hua, Nan; Limtiaco, Nicole; John, Rhomni St; Constant, Noah; Guajardo-Cespedes, Mario; Yuan, Steve; Tar, Chris; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray (12 April 2018). "Universal Sentence Encoder". arXiv:1803.11175 [cs.CL].
  9. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (3 July 2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
  10. Team, Editorial (2018-09-04). "The Use of NLP to Extract Unstructured Medical Data From Text". insideBIGDATA. Retrieved 2019-08-27.
  11. Alsentzer, Emily; Murphy, John; Boag, William; Weng, Wei-Hung; Jindi, Di; Naumann, Tristan; McDermott, Matthew (June 2019). "Publicly Available Clinical BERT Embeddings". Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics: 72–78. arXiv:1904.03323. doi:10.18653/v1/W19-1909. S2CID 102352093.
  12. "A Unified CV, OCR & NLP Model Pipeline for Document Understanding at DocuSign". NLP Summit. Retrieved 18 September 2020.
  13. "Civis Analytics, Okera, Sigma Computing and Spark NLP Named Winners of Strata Data Awards".
