Zettair

Last updated

Zettair is a compact text search engine for indexing and search of HTML (or TREC) collections. It is an open source software developed by a group of researchers at RMIT University.

The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity, and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

Open-source software software licensed to ensure source code usage rights

Open-source software (OSS) is a type of computer software in which source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Open-source software may be developed in a collaborative public manner. According to scientists who have studied it, open-source software is a prominent example of open collaboration. The term is often written without a hyphen as "open source software".

RMIT University Australian university

RMIT University is an Australian public research university located in Melbourne, Victoria.

Contents

Its primary feature is the ability to handle large document collections (100 GB and more). It has a single executable, which performs both on-the-fly indexing and searching, with a command-line interface.

The gigabyte is a multiple of the unit byte for digital information. The prefix giga means 109 in the International System of Units (SI). Therefore, one gigabyte is 1000000000bytes. The unit symbol for the gigabyte is GB.

Executable file that can be run by a computer

In computing, executable code or an executable file or executable program, sometimes simply referred to as an executable, causes a computer "to perform indicated tasks according to encoded instructions," as opposed to a data file that must be parsed by a program to be meaningful.

Command-line interface type of computer interface based on entering text commands and viewing text output

A command-line interface or command language interpreter (CLI), also known as command-line user interface, console user interface and character user interface (CUI), is a means of interacting with a computer program where the user issues commands to the program in the form of successive lines of text. A program which handles the interface is called a command language interpreter or shell (computing).

It is licensed under the terms of the GPL license.

See also

Bibliography

Related Research Articles

Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.

Wide Area Information Server (WAIS) is a client–server text searching system that uses the ANSI Standard Z39.50 Information Retrieval Service Definition and Protocol Specifications for Library Applications" (Z39.50:1988) to search index databases on remote computers. It was developed in the late 1980s as a project of Thinking Machines, Apple Computer, Dow Jones, and KPMG Peat Marwick.

Apache Nutch open source web crawler software

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Apache Lucene is a free and open-source information retrieval software library, originally written completely in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.

Isearch is open-source text retrieval software first developed in 1994 by Nassib Nassar as part of the Isite Z39.50 information framework. The project started at the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) of the North Carolina supercomputing center MCNC and funded by the National Science Foundation to follow in the track of WAIS and develop prototype systems for distributed information networks encompassing Internet applications, library catalogs and other information resources.

The Lemur Project is a collaboration between the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and the Language Technologies Institute at Carnegie Mellon University. It develops the Lemur Toolkit, an open-source software framework for building language modeling and information retrieval software, and the Indri search engine. This toolkit is used for developing search engines, text analysis tools, browser toolbars, and data resources in the area of IR.

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents. Query expansion involves techniques such as:

Geographic information retrieval (GIR) or geographical information retrieval is the augmentation of information retrieval with geographic information. GIR aims at solving textual queries that include a geographic dimension, such as "What wars were fought in Greece?" or "restaurants in Beirut". It is common in GIR to separate the text indexing and analysis from the geographic indexing. Semantic similarity and word-sense disambiguation are important components of GIR. To identify place names, GIR often relies on gazetteers.

Terrier IR Platform is a modular open source software for the rapid development of large-scale Information Retrieval applications. Terrier was developed by members of the Information Retrieval Research Group, Department of Computing Science, at the University of Glasgow.

Adversarial information retrieval is a topic in information retrieval related to strategies for working with a data source where some portion of it has been manipulated maliciously. Tasks can include gathering, indexing, filtering, retrieving and ranking information from such a data source. Adversarial IR includes the study of methods to detect, isolate, and defeat such manipulation.

Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

The Cranfield experiments were computer information retrieval experiments conducted by Cyril W. Cleverdon at the College of Aeronautics at Cranfield in the 1960s, to evaluate the efficiency of indexing systems.

RetrievalWare is an enterprise search engine emphasizing natural language processing and semantic networks which was commercially available from 1992 to 2007 and is especially known for its use by government intelligence agencies.

LGTE

Lucene Geographic and Temporal (LGTE) is an information retrieval tool developed at Technical University of Lisbon which can be used as a search engine or as evaluation system for information retrieval techniques for research purposes. The first implementation powered by LGTE was the search engine of DIGMAP, a project co-funded by the community programme eContentplus between 2006 and 2008, which was aimed to provide services available on the web over old digitized maps from a group of partners over Europe including several National Libraries.

Learning to rank

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The ranking model's purpose is to rank, i.e. produce a permutation of items in new, unseen lists in a way which is "similar" to rankings in the training data in some sense.

Nassib Nassar is an American computer scientist and classical pianist.