List of text mining software

Last updated March 13, 2024

Text mining software is available from many commercial vendors and open-source projects.

Commercial

Angoss – Angoss Text Analytics provides entity and theme extraction, topic categorization, sentiment analysis and document summarization capabilities via the embedded
AUTINDEX – commercial text mining software package based on linguistic analysis, by IAI (Institute for Applied Information Sciences), Saarbrücken.
DigitalMR – social media listening & text+image analytics tool for market research.
FICO Score – provider of analytics.
General Sentiment – social intelligence platform that uses natural language processing to discover affinities between fans of brands and fans of traditional television shows in social media. Its stand-alone text analytics capture a social knowledge base of billions of topics, with data stored back to 2004.
IBM LanguageWare – the IBM suite for text analytics (tools and Runtime).
IBM SPSS – provider of Modeler Premium (previously called IBM SPSS Modeler and IBM SPSS Text Analytics), which contains advanced NLP-based text analysis capabilities (multi-lingual sentiment, event and fact extraction), that can be used in conjunction with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey responses using NLP-based capabilities for further analysis or reporting.
Inxight – provider of text analytics, search, and unstructured-data visualization technologies. (Inxight was bought by Business Objects, which was in turn bought by SAP AG in 2008.)
Language Computer Corporation – text extraction and analysis tools, available in multiple languages.
Lexalytics – provider of a text analytics engine, Salience Engine, used in social media monitoring, voice of customer, survey analysis, and other applications. The software can merge the output of unstructured, text-based analysis with structured data to provide additional predictive variables for predictive models and association analysis.
Linguamatics – provider of natural language processing (NLP) based enterprise text mining and text analytics software, I2E, for high-value knowledge discovery and decision support.
Mathematica – provides built-in tools for text alignment, pattern matching, clustering and semantic analysis. See Wolfram Language, the programming language of Mathematica.
MATLAB – offers Text Analytics Toolbox for importing text data and converting it to numeric form for use in machine learning, deep learning, sentiment analysis and classification tasks.^[1]
Medallia – offers one system of record for survey, social, text, written and online feedback.
NetOwl – suite of multilingual text and entity analytics products, including entity extraction, link and event extraction, sentiment analysis, geotagging, name translation, name matching, and identity resolution, among others.
PolyAnalyst – text analytics environment.
PoolParty Semantic Suite – graph-based text mining platform.
RapidMiner with its Text Processing Extension – data and text mining software.
SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management.
Sketch Engine – corpus manager and analysis software that supports creating text corpora from uploaded texts, from the Web, or from a particular website, including part-of-speech tagging and lemmatization.^[2]
Sysomos – provider of a social media analytics software platform, including text analytics and sentiment analysis of online consumer conversations.
WordStat – content analysis and text mining add-on module of QDA Miner for analyzing large amounts of text data.

Open source

Carrot2 – text and search results clustering framework.
GATE – General Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering.
Gensim – large-scale topic modelling and extraction of semantic information from unstructured text (Python).
KH Coder – software for quantitative content analysis or text mining.
The KNIME Text Processing extension.
Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
OpenNLP – natural language processing.
Orange with its text mining add-on.
The PLOS Text Mining Collection.^[3]
The programming language R provides a framework for text mining applications in the package tm.^[4] The Natural Language Processing task view contains tm and other text mining library packages.^[5]
spaCy – open-source natural language processing library for Python.
Stanbol – an open source text mining engine targeted at semantic content management.
Voyant Tools – a web-based text analysis environment, created as a scholarly project.

Related Research Articles

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.
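The three stages named above (structuring the input, deriving patterns, interpreting the output) can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the sample documents, the stop-word list, and the function names `structure`, `derive_patterns`, and `top_terms` are all assumptions made for the example, and document frequency stands in for the far richer pattern learning that real tools use.

```python
import re
from collections import Counter

# Hypothetical sample documents; any short texts would do.
DOCS = [
    "The battery life is great and the screen is great too.",
    "Terrible battery, but the screen is sharp.",
    "Great screen, great price.",
]

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "and", "too", "but", "a"}


def structure(text):
    """Stage 1: structure the input by lowercasing, tokenizing, and dropping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def derive_patterns(docs):
    """Stage 2: derive a simple pattern, here the number of documents containing each term."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(structure(doc)))
    return doc_freq


def top_terms(docs, n=2):
    """Stage 3: interpret the output by ranking terms that recur across documents."""
    freq = derive_patterns(docs)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


print(top_terms(DOCS))  # [('screen', 3), ('battery', 2)]
```

Counting each term once per document, rather than once per occurrence, keeps a single repetitive document from dominating the ranking.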

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.
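As a rough illustration of turning unstructured text into a structured record, the sketch below pulls ISO-style dates and email addresses out of a sentence. It uses plain regular expressions rather than NLP, so it shows only the simplest form of the idea; the patterns, the `extract` function, and the sample note are assumptions made for this example.

```python
import re

# Illustrative patterns only; real IE systems typically rely on NLP models.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract(text):
    """Return a structured record of the dates and email addresses found in text."""
    return {
        "dates": DATE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


note = "Meeting moved to 2024-03-13; contact alice@example.org for details."
print(extract(note))
# {'dates': ['2024-03-13'], 'emails': ['alice@example.org']}
```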

UIMA, short for Unstructured Information Management Architecture, is an OASIS standard for content analytics, originally developed at IBM. It provides a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and integration with search technologies.

Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated in documents.

BasisTech is a software company specializing in applying artificial intelligence techniques to understanding documents and unstructured data written in different languages. It has headquarters in Somerville, Massachusetts with a subsidiary office in Tokyo. Its legal name is BasisTech LLC.

General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.

LanguageWare is a natural language processing (NLP) technology developed by IBM, which allows applications to process natural language text. It comprises a set of Java libraries which provide a range of NLP functions: language identification, text segmentation/tokenization, normalization, entity and relationship extraction, and semantic analysis and disambiguation. The analysis engine uses a finite-state machine approach at multiple levels, which aids its performance while maintaining a reasonably small footprint.

IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming.

The Ubiquitous Knowledge Processing Lab is a research lab at the Department of Computer Science at the Technische Universität Darmstadt. It was founded in 2006 by Iryna Gurevych.

Lexalytics, Inc. provides sentiment and intent analysis to an array of companies using SaaS and cloud-based technology. Salience 6, the engine behind Lexalytics, was built as an on-premises, multilingual text analysis engine. It is leased to other companies, which use it to power filtering and reputation management programs. In July 2015 Lexalytics acquired Semantria to serve as a cloud option for its technology. In September 2021 Lexalytics was acquired by CX company InMoment.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context, and negated/not negated.
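The attributes listed above suggest a simple record shape for each named entity. The sketch below is a hypothetical Python data structure that mirrors those attributes; it is not the cTAKES API (cTAKES itself is a separate system), and the class name, field names, helper function, sample note, and placeholder ontology codes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedEntity:
    """Hypothetical record mirroring the attributes listed above (not the cTAKES API)."""
    begin: int          # start offset of the text span
    end: int            # end offset of the text span (exclusive)
    entity_type: str    # e.g. "drug", "disease/disorder", "sign/symptom"
    ontology_code: str  # placeholder for an ontology mapping code
    context: str        # text in which the entity was found
    negated: bool       # True when the mention is negated


def annotate(note, term, entity_type, ontology_code, negated):
    """Build a NamedEntity for the first occurrence of term in note."""
    begin = note.index(term)
    return NamedEntity(begin, begin + len(term), entity_type, ontology_code, note, negated)


note = "Patient denies chest pain. Started aspirin."
entities = [
    annotate(note, "chest pain", "sign/symptom", "PLACEHOLDER-1", negated=True),
    annotate(note, "aspirin", "drug", "PLACEHOLDER-2", negated=False),
]

# Keep only the mentions that are affirmed rather than negated.
affirmed = [note[e.begin:e.end] for e in entities if not e.negated]
print(affirmed)  # ['aspirin']
```

Storing offsets rather than copied strings lets downstream code recover the exact span from the original note, and the negation flag is what separates "denies chest pain" from an actual finding.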

NetOwl is a suite of multilingual text and identity analytics products that analyze big data in the form of text data – reports, web, social media, etc. – as well as structured entity data about people, organizations, places, and things.

MeaningCloud is a Software as a Service product that enables users to embed text analytics and semantic processing in any application or system. It was previously branded as Textalytics.

Linguamatics, headquartered in Cambridge, England, with offices in the United States and UK, is a provider of text mining systems through software licensing and services, primarily for pharmaceutical and healthcare applications. Founded in 2001, the company was purchased by IQVIA in January 2019.

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library.

References

[1] "Text Analytics Toolbox". mathworks.com. Retrieved 2019-07-10.
[2] "Text analysis with Sketch Engine". Sketch Engine. LEXICAL COMPUTING CZ s.r.o. 14 December 2017. Retrieved 17 January 2018.
[3] "Table of Contents: Text Mining". PLOS Collections. doi: 10.1371/issue.pcol.v01.i14 (inactive 31 January 2024). Archived from the original on 2013-07-04. Retrieved 2014-02-20.
[4] "Introduction to the tm Package: Text Mining in R" (PDF).
[5] Wild, Fridolin (February 20, 2020). "CRAN Task View: Natural Language Processing". CRAN.R Project.

