RetrievalWare

Developer(s): Fast Search & Transfer, Convera, Excalibur Technologies, ConQuest Software, Microsoft
Stable release: 8.2 / October 13, 2006
Written in: C, C++, Java
Operating system: Cross-platform
Type: Search and index

RetrievalWare is an enterprise search engine emphasizing natural language processing and semantic networks. It was commercially available from 1992 to 2007 and is especially known for its use by government intelligence agencies. [1]


History

RetrievalWare was initially created by Paul Nelson, [2] Kenneth Clark, [3] and Edwin Addison [4] at ConQuest Software. Development began in 1989, but the software was not widely commercially available until 1992. Early funding was provided by Rome Laboratory via a Small Business Innovation Research grant. [5] John McGrath joined the company in 1993 as VP of Sales and Marketing, and the company quickly grew revenue from U.S. federal contracts, publishers, and enterprise customers requiring advanced text retrieval accuracy and performance.

On July 6, 1995, ConQuest Software merged with Excalibur Technologies, a NASDAQ-listed company, [6] and the product was rebranded as RetrievalWare. On December 21, 2000, Excalibur Technologies was combined with Intel Corporation's Interactive Media Services division to form the Convera Corporation. [7] Finally, on April 9, 2007, the RetrievalWare software and business were purchased by Fast Search & Transfer, at which point the product was officially retired. [8] Microsoft Corporation continues to maintain the product for its existing customer base.

Annual revenues for RetrievalWare peaked in 2001 at around US$40 million. [9]

Use of natural language techniques

RetrievalWare is a relevancy-ranking text search system with processing enhancements drawn from the fields of natural language processing (NLP) and semantic networks. Its NLP algorithms include dictionary-based stemming (also known as lemmatisation) and dictionary-based phrase identification. RetrievalWare uses semantic networks to expand the query words entered by the user to related terms, with term weights determined by the distance from the user's original terms. In addition to automatic expansion, a feedback mode was available in which users could choose the intended meaning of a word before expansion was performed. The first semantic networks were built using WordNet.
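The distance-weighted expansion described above can be sketched as a breadth-first walk over a semantic network, with each hop discounting the weight of the terms it reaches. The network data, the decay factor, and the function names here are illustrative assumptions, not RetrievalWare's actual implementation:

```python
from collections import deque

# Toy semantic network: each word links to related words (hypothetical data;
# RetrievalWare's first networks were derived from WordNet).
SEMANTIC_NET = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "vehicle"],
    "vehicle": ["car", "truck"],
    "truck": ["vehicle"],
}

def expand_query(term, max_distance=2, decay=0.5):
    """Breadth-first expansion: weight = decay ** distance from the user's term."""
    weights = {term: 1.0}
    queue = deque([(term, 0)])
    while queue:
        word, dist = queue.popleft()
        if dist == max_distance:
            continue  # do not expand beyond the distance limit
        for neighbor in SEMANTIC_NET.get(word, []):
            if neighbor not in weights:  # keep the weight of the shortest path
                weights[neighbor] = decay ** (dist + 1)
                queue.append((neighbor, dist + 1))
    return weights

print(expand_query("car"))
# {'car': 1.0, 'automobile': 0.5, 'vehicle': 0.5, 'truck': 0.25}
```

The expanded terms and their weights would then feed into ordinary relevancy ranking, so that documents matching "automobile" still score for a query on "car", just less strongly than exact matches.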

In addition, RetrievalWare implemented a form of n-gram search (branded as APRP, Adaptive Pattern Recognition Processing [10]) designed to search over documents containing OCR errors. Query terms are divided into sets of 2-grams, which are used to locate similar terms in the inverted index. The resulting matches are weighted by similarity measures and then used to search for documents.
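A minimal sketch of this kind of 2-gram matching follows, using the Dice coefficient over character-bigram sets as the similarity measure. The vocabulary, threshold, and choice of Dice are assumptions for illustration; the source does not specify which similarity measure APRP used:

```python
def bigrams(term):
    """Split a term into its set of character 2-grams."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    """Dice coefficient between two bigram sets (0.0 = disjoint, 1.0 = identical)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical inverted-index vocabulary, including an OCR-corrupted form
# of "retrieval" where the final 'l' was read as the digit '1'.
VOCAB = ["retrieval", "retrieva1", "reversal", "semantic"]

def fuzzy_match(query_term, vocab, threshold=0.5):
    """Return (term, score) pairs above the threshold, best match first."""
    q = bigrams(query_term)
    scored = ((t, dice(q, bigrams(t))) for t in vocab)
    return sorted(((t, s) for t, s in scored if s >= threshold),
                  key=lambda x: -x[1])

print(fuzzy_match("retrieval", VOCAB))
# [('retrieval', 1.0), ('retrieva1', 0.875)]
```

The OCR-damaged form "retrieva1" shares 7 of 8 bigrams with the query and so is retrieved alongside the exact match, which is the effect the technique is designed for.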

All of these features were available no later than 1993, [11] and ConQuest Software claimed that it was the first commercial text-search system to implement these techniques. [12]

Other notable features

Other notable features of RetrievalWare include distributed search servers, [11] synchronizers for indexing external content management systems and relational databases, [13] a heterogeneous security model, [13] document categorization, [13] real-time document-query matching (profiling), [11] multi-lingual searches (queries containing terms from multiple languages searching for documents containing terms from multiple languages), and cross-lingual searches (queries in one language searching for documents in a different language). [14]

Participation in TREC

RetrievalWare participated in the Text REtrieval Conference in 1992 (TREC-1), 1993 (TREC-2), and 1995 (TREC-4). [15]

In TREC-1 [16] and TREC-4, [17] RetrievalWare's runs for manually entered queries produced the best 11-point average results among all participating search engines in the ad hoc category, in which engines are given a single opportunity to process previously unseen queries against an existing database.
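The 11-point average used in these TREC comparisons is the interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0. A minimal sketch for a single query, assuming a simple representation of a run as the 1-based ranks of its relevant hits:

```python
def eleven_point_average(relevant_ranks, num_relevant):
    """11-point interpolated average precision for one ranked result list.

    relevant_ranks: 1-based ranks at which relevant documents appeared.
    num_relevant:   total number of relevant documents for the query.
    """
    # (recall, precision) at each relevant document retrieved.
    points = [(i / num_relevant, i / rank)
              for i, rank in enumerate(sorted(relevant_ranks), start=1)]
    # Interpolated precision at recall level r is the maximum precision
    # observed at any recall >= r (zero if no relevant document reaches r).
    total = 0.0
    for level in (r / 10 for r in range(11)):
        candidates = [p for rec, p in points if rec >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11

# Relevant documents found at ranks 1, 3, and 5, out of 3 relevant in total.
print(eleven_point_average([1, 3, 5], 3))  # ~0.764
```

Averaging this value over all topics in a TREC run gives the single figure used to compare systems within a category.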


References

  1. Vise, David A. (2004-12-03). "Agencies Find What They're Looking For". The Washington Post. Retrieved 2010-05-22.[ dead link ]
  2. "Paul Nelson, Innovation Lead, Content Analytics at Accenture Analytics" . Retrieved 1 December 2020.
  3. "Arden & Ken". comcast.net. 23 July 2011. Archived from the original on 2011-07-23.
  4. "Ed Addison, Serial Entrepreneur, Venture Capitalist, Business Executive, Professor".
  5. FY 1991 SBIR SOLICITATION - PHASE I AWARD ABSTRACTS - AIR FORCE PROJECTS - VOLUME III (PDF), 1992-07-06, pp. 70–71, archived from the original (PDF) on June 4, 2011 - Note that "Synchronetics" was the original name for ConQuest Software Incorporated.
  6. "Excalibur Technologies to merge with ConQuest Software; text and multimedia information retrieval leaders join forces to expand products, channels and markets" (Press release). Business Wire. 1995-07-06.
  7. "Intel and Excalibur Form Convera Corporation". Silicon Valley / San Jose Business Journal. 2000-12-21.
  8. "FAST Acquires Convera's RetrievalWare Business". Information Today, Inc. 2007-04-09. While FAST will continue to support the RetrievalWare platform, it will not continue development on it or add new features. RetrievalWare customers will be offered an upgrade path to FAST’s own offering.
  9. Convera Corp · 10-K · For 1/1/01, 2001-01-01 - Indicates that Convera products accounted for 85% of the total revenue of $51.5 million.
  10. Excalibur Announces Excalibur RetrievalWare 6.5 Featuring RetrievalWare FileRoom - Contains a description of APRP
  11. Site Report for the Text REtrieval Conference by ConQuest Software Inc. (TREC-2) - Included in the complete TREC-2 proceedings.
  12. "Homework Helper debuts on Prodigy using ConQuest search engine" (Press release). Business Wire. 1995-02-09. ConQuest is the only search engine which uses dictionaries, thesauri and other lexical resources to build in a semantic knowledgebase of over 440,000 word meanings, and 1.6 million word relationships.
  13. "Excalibur RetrievalWare: more than information retrieval". KMWorld. 1999-10-01.
  14. "Multimedia search, retrieval, categorization". KMWorld. 2002-03-25.
  15. Flank, Sharon (1998). "A Layered Approach to NLP-Based Information Retrieval". Proceedings of the 36th annual meeting on Association for Computational Linguistics -. Vol. 1. dl.acm.org. p. 397. doi:10.3115/980845.980913 . Retrieved 1 December 2020.
  16. Site Report for the Text REtrieval Conference by ConQuest Software Inc. (TREC-1) - Included in the complete TREC-1 proceedings.
  17. The Excalibur TREC-4 System, Preparations, and Results - Archived 2010-11-27 at the Wayback Machine; included in the complete TREC-4 proceedings.