Lemur Project

Last updated January 06, 2023

The Lemur Project is a collaboration between the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and the Language Technologies Institute at Carnegie Mellon University. The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. The project is best known for its Indri and Galago search engines, the ClueWeb09 and ClueWeb12 datasets, and the RankLib learning-to-rank library. The software and datasets are used widely in scientific and research applications, as well as in some commercial applications.

The Lemur Project's software development philosophy emphasizes state-of-the-art accuracy, flexibility, and efficiency. For example, the Indri search engine provides accurate search for large text collections 'out of the box', and data is stored in an accessible manner to support development of new retrieval strategies. Software from the Lemur Project is distributed under open-source licenses that provide flexibility to scientists and software developers.

Lemur is written in C, C++, and Java, and is distributed with its source files and build instructions. The source code can be modified to develop new libraries. It runs on several operating systems, including Linux and Windows.

Features

Lemur supports the following features:

Indexing:
- English, Chinese, and Arabic text
- Word stemming
- Stop words
- Tokenization
- Passage and incremental indexing
Retrieval:
- Ad hoc retrieval (TF-IDF and InQuery)
- Passage and cross-lingual retrieval
- Language modeling
  - Query model updating
  - Two-stage smoothing
- Relevance feedback
- Structured query language
- Wildcard term matching
Distributed IR:
- Query-based sampling
- Database-based ranking (CORI)
- Results merging
Document clustering
Summarization
Simple text processing
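
As an illustration of the language modeling features listed above, the following is a minimal sketch of query-likelihood scoring with two-stage smoothing, written from the published formulation (Dirichlet prior smoothing followed by interpolation with a background model). It is not Lemur's implementation. The function name `two_stage_score`, the parameter defaults, the whitespace tokenization, the toy documents, and the use of the collection model as the background model are all assumptions made for the example.

```python
import math
from collections import Counter


def two_stage_score(query, doc, collection, mu=1000.0, lam=0.1):
    """Log query likelihood of `doc` under two-stage smoothing.

    Stage 1: Dirichlet prior smoothing with pseudo-count `mu`.
    Stage 2: linear interpolation (weight `lam`) with a background model.
    Assumption: the collection model doubles as the background model.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if p_coll == 0.0:
            continue  # term never seen in the collection: ignore it
        dirichlet = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log((1.0 - lam) * dirichlet + lam * p_coll)
    return score


# Toy collection (hypothetical data), tokenized by whitespace.
docs = {
    "d1": "indri is a search engine from the lemur project".split(),
    "d2": "galago is a java search framework".split(),
}
collection = [term for tokens in docs.values() for term in tokens]
query = "indri search".split()

scores = {
    name: two_stage_score(query, tokens, collection, mu=10.0)
    for name, tokens in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A small `mu` is used here only because the toy documents are very short; realistic collections typically call for much larger values.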

Components

The Lemur Project has the following components:

- Indri search engine in C++
- Galago search engine research framework in Java
- RankLib learning-to-rank library
- Sifaka data mining application
- ClueWeb09 and ClueWeb12 datasets
- Query Log Toolbar

Latest Version

Lemur Project components are updated twice a year, in June and December. The latest versions are Indri 5.17, Galago 3.18, RankLib 2.14, and Sifaka 1.8.

Indri Search Engine

The Indri search engine is an open-source component of the Lemur Project. Researchers can index data or structured documents using simple command-line instructions and search them with Indri's query language. Indri is flexible enough to be adapted to a range of applications, and it can be distributed across a cluster of nodes for high performance. It handles large collections and parses data formats such as HTML and XML.

The Indri API is available for several programming and scripting languages, including C++, Java, C#, and PHP.

Features of Indri Search Engine

- Can make use of multiple document representations
- Explicit term weighting
- Robust query language
- Formally well-grounded
- Highly effective
- Can be efficiently implemented
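
To make explicit term weighting and the structured query language more concrete, the sketch below assembles a query string from small Python helpers. The operators `#combine`, `#weight`, and `#N` (ordered window) are used as they are commonly described for the Indri query language and should be checked against the official reference. The helper names `combine`, `ordered_window`, and `weight` are hypothetical and not part of any Indri API; the code only builds a string and does not call Indri.

```python
def combine(*parts):
    """Wrap terms or sub-queries in an Indri #combine operator."""
    return "#combine(" + " ".join(parts) + ")"


def ordered_window(n, *terms):
    """Ordered window: terms must appear in order, at most n positions apart."""
    return f"#{n}(" + " ".join(terms) + ")"


def weight(pairs):
    """Build a #weight operator from (weight, sub-query) pairs."""
    body = " ".join(f"{w} {q}" for w, q in pairs)
    return f"#weight({body})"


# Weight the bag-of-words match at 0.8 and the exact phrase match at 0.2.
query = weight([
    (0.8, combine("information", "retrieval")),
    (0.2, ordered_window(1, "information", "retrieval")),
])
print(query)
# prints: #weight(0.8 #combine(information retrieval) 0.2 #1(information retrieval))
```

The weights 0.8 and 0.2 are arbitrary illustrative values; in practice they would be tuned for the collection.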

External links

The Lemur Project website


