Basis Technology

BasisTech
Company type: Private
Industry: Information technology; information access; digital forensics; transliteration
Founded: 1995
Headquarters: Somerville, Massachusetts, United States
Area served: Americas, Europe, Asia
Key people:
  Carl Hoffman (CEO, co-founder)
  Steven Cohen (EVP/COO, co-founder)
  Brian Carrier (CTO and GM, Cyber Forensics)
  Simson Garfinkel (Chief Scientist)
  Junichi Hasegawa (VP Asia)
Products: KonaSearch, Cyber Triage, Autopsy, The Sleuth Kit
Subsidiaries: BasisTech GK
Websites: http://www.basistech.com, http://www.konasearch.com, http://www.autopsy.com, http://www.cybertriage.com

BasisTech is a software company specializing in applying artificial intelligence techniques to understanding documents and unstructured data written in different languages. It is headquartered in Somerville, Massachusetts, with a subsidiary office in Tokyo. Its legal name is BasisTech LLC.

The company was founded in 1995 by graduates of the Massachusetts Institute of Technology to use artificial intelligence techniques for natural language processing, helping computer systems understand written human language. Its software focuses on analyzing freeform text so that applications can better understand the meaning of the words; for example, it can identify tokens, parts of speech, and lemmas. [1]
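
A rough sense of what such annotations look like, sketched here with the open-source spaCy library rather than BasisTech's own software:

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The analysts were reviewing the seized documents.")
    for token in doc:
        print(token.text, token.pos_, token.lemma_)
    # e.g. "were" -> AUX with lemma "be"; "reviewing" -> VERB with lemma "review"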

The software also performs entity extraction, that is, finding words that refer to people, places, and organizations in text, for uses such as due diligence, intelligence, and metadata tagging. [2]
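
As a general illustration of entity extraction (again using spaCy, not BasisTech's extractor):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Carl Hoffman co-founded Basis Technology in Somerville, Massachusetts.")
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. PERSON, ORG, GPE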

The company is best known for its Rosette product, which uses natural language processing techniques to improve information retrieval, text mining, search engines, and other applications. The tool enables search engines to search in multiple languages [3] and to match identities and dates. [4] Rosette was sold to Babel Street in 2022. [5]

BasisTech software is also used by forensic analysts to search files for words, tokens, phrases, or numbers that may be important to investigators, [6] and the company provides software (Cyber Triage) that helps organizations respond to cyberattacks. [7]

Rosette

Rosette is available as a cloud deployment (public or on-premises) or as a Java SDK. [8] It provides a variety of natural language processing tools for unstructured text: language identification, base linguistics, entity extraction, name matching, name translation, sentiment analysis, semantic similarity, relationship extraction, topic extraction, categorization, and Arabic chat translation. [9] It can be integrated into applications to enhance financial compliance onboarding, [10] communication surveillance compliance, [11] social media monitoring, [12] cyber threat intelligence, [13] and customer feedback analysis. [14]
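
A hypothetical client call in the style of such a cloud NLP service is sketched below; the URL, endpoint, auth header, and response shape are illustrative assumptions, not the documented Rosette API.

    import requests

    API_KEY = "your-api-key"   # placeholder credential
    resp = requests.post(
        "https://nlp.example.com/rest/v1/entities",   # assumed endpoint, not the real Rosette URL
        headers={"X-API-Key": API_KEY},               # assumed auth header
        json={"content": "President Emmanuel Macron spoke in Paris."},
    )
    resp.raise_for_status()
    for entity in resp.json().get("entities", []):    # assumed response shape
        print(entity)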

The Rosette Linguistics Platform packages these functions as individual modules.

Rosette is used both by United States government offices to support translation and by major Internet infrastructure firms such as search engines. [19] [20]

Digital forensics

BasisTech develops open-source digital forensics tools, The Sleuth Kit and Autopsy, to help identify and extract clues from data storage devices such as hard disks and flash cards, as well as from devices such as smartphones and iPods. The open-source licensing model allows them to serve as the foundation for larger projects, such as a Hadoop-based tool for massively parallel forensic analysis of very large data collections.
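
A minimal sketch of programmatic use of The Sleuth Kit through pytsk3, its Python bindings; the image name and file-system offset are placeholders:

    import pytsk3

    img = pytsk3.Img_Info("evidence.dd")       # raw disk image; placeholder name
    fs = pytsk3.FS_Info(img, offset=0)         # byte offset of the file system
    for entry in fs.open_dir(path="/"):
        name = entry.info.name.name.decode("utf-8", "replace")
        meta = entry.info.meta                 # may be None for some entries
        size = meta.size if meta else 0
        print(name, size)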

The digital forensics tool set is used to analyze file systems, new media types, new file types, and file system metadata. The tools can search for particular patterns in files, allowing investigators to target significant files or usage profiles.
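
In the same spirit, a simple keyword and regex scan over a directory of recovered files might look like the following sketch; the patterns and paths are illustrative:

    import re
    from pathlib import Path

    patterns = [re.compile(rb"\b\d{3}-\d{2}-\d{4}\b"),         # SSN-like numbers
                re.compile(rb"wire transfer", re.IGNORECASE)]  # a phrase of interest

    for path in Path("recovered_files").rglob("*"):   # placeholder directory
        if not path.is_file():
            continue
        data = path.read_bytes()
        for pat in patterns:
            if pat.search(data):
                print(f"{path}: matched {pat.pattern!r}")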

KonaSearch

In June 2019, BasisTech acquired KonaSearch, [21] a startup specializing in search for Salesforce.com and other office database repositories whose technology can automate the search step of business workflows. [22]

Related Research Articles

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation, and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed using rule-based, statistical, or neural approaches from machine learning and deep learning.

A semantic network, or frame network, is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map. Typical standardized semantic networks are expressed as semantic triples.
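
A toy semantic network can be built as a labeled directed graph of (subject, relation, object) triples, for instance with the networkx library:

    import networkx as nx

    g = nx.DiGraph()
    triples = [("canary", "is_a", "bird"),
               ("bird", "is_a", "animal"),
               ("bird", "has_part", "wings")]
    for subj, rel, obj in triples:
        g.add_edge(subj, obj, relation=rel)   # vertices are concepts, edges are relations

    for u, v, data in g.edges(data=True):
        print(f"{u} --{data['relation']}--> {v}")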

Idiolect is an individual's unique use of language, including speech. This unique usage encompasses vocabulary, grammar, and pronunciation. This differs from a dialect, a common set of linguistic characteristics shared among a group of people.

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005), there are three perspectives of text mining: information extraction, data mining, and knowledge discovery in databases (KDD). Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.
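
As a minimal example of one such task, the sketch below vectorizes a few documents with TF-IDF and clusters them using scikit-learn:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["stock prices fell sharply today",
            "the market rallied on earnings news",
            "new vaccine shows promising trial results",
            "clinical study reports strong efficacy"]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # two clusters, e.g. finance vs. medical documents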

Computer forensics is a branch of digital forensic science pertaining to evidence found in computers and digital storage media. The goal of computer forensics is to examine digital media in a forensically sound manner with the aim of identifying, preserving, recovering, analyzing, and presenting facts and opinions about the digital information.

In-Q-Tel (IQT), formerly Peleus and In-Q-It, is an American not-for-profit venture capital firm based in Arlington, Virginia. It invests in companies to keep the Central Intelligence Agency, and other intelligence agencies, equipped with the latest in information technology in support of United States intelligence capability. The name "In-Q-Tel" is an intentional reference to Q, the fictional inventor who supplies technology to James Bond.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.
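
A minimal example using the Hugging Face pipeline API, which loads a pretrained transformer of the kind mentioned above; which model is loaded by default is an assumption of this sketch:

    from transformers import pipeline

    # Downloads a default pretrained model on first run.
    classifier = pipeline("sentiment-analysis")
    print(classifier("The support team resolved my issue quickly."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.999...}]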

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.
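
The core structure behind fast full-text retrieval is the inverted index, which maps each term to the documents containing it; a toy version in Python:

    from collections import defaultdict

    docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick brown dogs"}

    index = defaultdict(set)                 # term -> set of document ids
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    print(sorted(index["quick"]))                    # [1, 3]
    print(sorted(index["the"] & index["quick"]))     # AND query: [1]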

General Architecture for Text Engineering (GATE) is a Java suite of natural language processing (NLP) tools for many tasks, including information extraction in many languages. It is now used worldwide by a wide community of scientists, companies, teachers and students. It was originally developed at the University of Sheffield beginning in 1995.

Enterprise search is software technology for searching data sources internal to a company, typically intranet and database content. The search is generally offered only to users internal to the company. Enterprise search can be contrasted with web search, which applies search technology to documents on the open web, and desktop search, which applies search technology to the content on a single computer.

Mobile device forensics is a branch of digital forensics relating to recovery of digital evidence or data from a mobile device under forensically sound conditions. The phrase mobile device usually refers to mobile phones; however, it can also relate to any digital device that has both internal memory and communication ability, including PDA devices, GPS devices and tablet computers.

General Sentiment, Inc. was a Long Island-based social media and news media analytics company.

Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context, and negated/not negated.

Elasticsearch is a search engine based on Apache Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Official clients are available in Java, .NET (C#), PHP, Python, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.
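
A minimal sketch of indexing and querying over that HTTP interface, assuming a local node without security enabled on the default port 9200:

    import requests

    base = "http://localhost:9200"   # assumes a local, security-disabled node
    requests.put(f"{base}/articles/_doc/1",
                 json={"title": "Basis Technology", "body": "multilingual text analytics"})
    requests.post(f"{base}/articles/_refresh")   # make the new document searchable
    hits = requests.get(f"{base}/articles/_search",
                        json={"query": {"match": {"body": "multilingual"}}}).json()
    print(hits["hits"]["total"])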

NetOwl is a suite of multilingual text and identity analytics products that analyze big data in the form of text data – reports, web, social media, etc. – as well as structured entity data about people, organizations, places, and things.

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.

Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types and, in addition to providing a Java library, has server and command-line editions suitable for use from other programming languages.
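
A short sketch using the tika-python client, which drives a Tika server behind the scenes; the file name is a placeholder:

    from tika import parser   # pip install tika; starts a local Tika server on first use

    parsed = parser.from_file("report.pdf")          # placeholder file name
    print(parsed["metadata"].get("Content-Type"))    # detected MIME type
    print((parsed["content"] or "")[:200])           # first 200 characters of extracted text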

Code stylometry is the application of stylometry to computer code to attribute authorship to anonymous binary or source code. It often involves breaking down and examining the distinctive patterns and characteristics of the programming code and then comparing them to computer code whose authorship is known. Unlike software forensics, code stylometry attributes authorship for purposes other than intellectual property infringement, including plagiarism detection, copyright investigation, and authorship verification.
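
A crude illustration of the kind of surface features such systems might start from; real systems use much richer lexical, syntactic, and layout features:

    import re

    def style_features(source: str) -> dict:
        """Toy stylometric features for a piece of source code."""
        lines = source.splitlines()
        idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
        n = max(len(lines), 1)
        return {
            "tab_indent_ratio": sum(l.startswith("\t") for l in lines) / n,
            "comment_density": sum("#" in l for l in lines) / n,
            "avg_identifier_len": sum(map(len, idents)) / max(len(idents), 1),
        }

    print(style_features("def add(a, b):\n    # sum two values\n    return a + b\n"))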

References

  1. "Base Linguistics".
  2. "Entity Extractor - Entity Recognition".
  3. "Elasticsearch Plugins - Elasticsearch Enrichment".
  4. "Elasticsearch Plugins - Elasticsearch Enrichment".
  5. "Babel Street Closes Highly Successful 2022 with Rosette Acquisition". www.businesswire.com. 2023-01-10. Retrieved 2024-04-11.
  6. "Custom Solutions for Digital Forensics".
  7. "About".
  8. "Base Linguistics".
  9. "Rosette Text Analytics".
  10. "Uphold".
  11. "Société Générale".
  12. "Sensika".
  13. "A Game-Changing Threat Intelligence Platform".
  14. "Understand, Measure, and Act on Consumer Feedback".
  15. Erard, Michael (March 1, 2004). "Translation in the Era of Terror". Technology Review.
  16. Boyd, Clark (January 14, 2004). "Language tools for fight on terror". BBC News.
  17. Weiss, Todd R. (March 10, 2003). "Language analysis software aids U.S. Web search for terrorist activity". Computerworld.
  18. Profile in Boston Business Journal
  19. Hollmer, Mark (March 21, 2003). "Basis Technology turns its focus to government security". Boston Business Journal.
  20. Baker, Loren (November 30, 2004). "MSN Search Engine Uses Basis Technology for Natural Language Processing". Search Engine Journal.
  21. "Basis Technology Brings Deep Search to Salesforce".
  22. "About Us".