AUTINDEX


AUTINDEX is a commercial text mining software package based on linguistic analysis. [1] [2] [3]


AUTINDEX, which grew out of research in information extraction, [4] [5] is a product of the Institute of Applied Information Sciences (IAI), a non-profit institute that has been researching and developing language technology since its foundation in 1985. IAI is affiliated with Saarland University in Saarbrücken, Germany.

AUTINDEX is the result of a number of research projects funded by the EU (project BINDEX), [6] by the Deutsche Forschungsgemeinschaft and by the German Ministry for Economy. The projects funded by the latter include LinSearch [7] and WISSMER; [8] see also the WISSMER project page on the IAI website. [9]

The basic functionality of AUTINDEX is the extraction of keywords from a document to represent the semantics of that document. [10] Ideally, the system is integrated with a thesaurus that defines the standardised terms to be used for keyword assignment.

AUTINDEX is used in library applications (e.g. integrated in dandelon.com), in high-quality expert information systems, [11] and in document management and content management environments.
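
The idea of thesaurus-controlled keyword assignment can be illustrated with a minimal sketch. This is not AUTINDEX's actual method or API, which the sources cited here do not specify in detail; the thesaurus entries, the function name `assign_keywords` and the simple frequency ranking are all assumptions made for illustration. The sketch maps surface variants found in a text to standardised descriptors and returns the most frequent ones.

```python
import re
from collections import Counter

# Hypothetical thesaurus (illustration only): surface variant -> standardised term.
THESAURUS = {
    "electric car": "electric vehicle",
    "electric cars": "electric vehicle",
    "electric vehicle": "electric vehicle",
    "electric vehicles": "electric vehicle",
    "battery": "battery",
    "batteries": "battery",
    "charging station": "charging infrastructure",
    "charging stations": "charging infrastructure",
}


def assign_keywords(text, thesaurus=THESAURUS, top_n=3):
    """Return up to top_n standardised terms, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the two-word variant first so that longer matches win.
        if i + 1 < len(tokens) and f"{tokens[i]} {tokens[i + 1]}" in thesaurus:
            counts[thesaurus[f"{tokens[i]} {tokens[i + 1]}"]] += 1
            i += 2
        elif tokens[i] in thesaurus:
            counts[thesaurus[tokens[i]]] += 1
            i += 1
        else:
            # Words not covered by the thesaurus are ignored.
            i += 1
    return [term for term, _ in counts.most_common(top_n)]


sample = (
    "Electric cars need batteries. A battery is charged at charging stations; "
    "electric vehicles with a larger battery charge longer."
)
print(assign_keywords(sample))
```

A linguistically based system would replace the plain variant lookup with morphological and syntactic analysis; the sketch only shows how a thesaurus constrains the assigned keywords to a controlled vocabulary.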

AUTINDEX comes with a number of additional software components: an integration with Apache Solr / Lucene that provides a complete information retrieval environment; a classification and categorisation system, based on machine learning software, that assigns domains to a document; [12] and a system for searching with semantically similar terms that are collected in so-called tag clouds. [13]



References

  1. Bärbel Ripplinger, 2001: Das Indexierungssystem AUTINDEX. In: GLDV Tagung, Giessen.
  2. Paul Schmidt, Mahmoud Gindiyeh & Gintare Grigonyte, 2009: Language Technology for Information Systems. In: Proceedings of KDIR - The International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Madeira, Portugal, 6–8 October 2009.
  3. Paul Schmidt & Mahmoud Gindiyeh, 2009: Language Technology for Multilingual Information and Document Management. In: Proceedings of ASLIB, London, 19–20 November.
  4. Paul Schmidt, Thomas Bähr, Jens Biesterfeld, Thomas Risse, Kerstin Denecke & Claudiu Firan, 2008: LINSearch. Aufbereitung von Fachwissen für die gezielte Informationsversorgung. In: Proceedings of Knowtech, Frankfurt.
  5. Ursula Deriu, Jörn Lehmann & Paul Schmidt, 2009: Erstellung einer Technik-Ontologie auf der Basis ausgefeilter Sprachtechnologie. In: Proceedings of Knowtech, Frankfurt.
  6. Dieter Maas, Rita Nuebel, Catherine Pease & Paul Schmidt, 2002: Bilingual Indexing for Information Retrieval with AUTINDEX. In: LREC 2002.
  7. Project LinSearch, p. 32.
  8. Project Wissmer.
  9. Wissmer project on the IAI site.
  10. Paul Schmidt, Mahmoud Gindiyeh & Gintare Grigonyte, 2009: Language Technology for Information Systems. In: Proceedings of KDIR - The International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Madeira, Portugal, 6–8 October 2009, pp. 259–262.
  11. WTI information system.
  12. Mahmoud Gindiyeh, 2013: Anwendung wahrscheinlichkeitstheoretischer Methoden in der linguistischen Informationsverarbeitung. Logos Verlag, Berlin.
  13. Electro mobility information system.
