Isearch

Isearch is open-source text retrieval software first developed in 1994 by Nassib Nassar as part of the Isite Z39.50 information framework. The project started at the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) of the North Carolina supercomputing center MCNC and was funded by the National Science Foundation to follow in the track of WAIS and develop prototype systems for distributed information networks encompassing Internet applications, library catalogs and other information resources.

The main features of Isearch include full text and field searching, relevance ranking, Boolean queries, and support for many document types such as HTML, mail folders, list digests, MEDLINE, BibTeX, SGML/XML, FGDC Metadata, NASA DIF, ANZLIC metadata, ISO 19115 metadata and many other resource types and document formats.

It was the first search engine designed from the ground up to support SGML and Z39.50 search and retrieval. Its innovations included the "document type" model, which is simply an object-oriented method of associating each document with a class of functions that provides a standard interface for accessing the document. It was one of the first engines (if not the first) to support XML.
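The document type idea can be illustrated with a minimal Python sketch. This is not Isearch's actual code or API: the class names `DocType`, `MailDocType` and `PlainTextDocType`, the `REGISTRY` mapping and the `get_field` helper are all hypothetical, chosen only to show how each document can be tied to a class that exposes one standard interface.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocType(ABC):
    """Hypothetical base class: the standard interface every document type offers."""

    @abstractmethod
    def fields(self, raw: str) -> dict[str, str]:
        """Split a raw document into named, searchable fields."""

    def present(self, raw: str, field: str) -> str:
        """Return one field for display, or an empty string if it is absent."""
        return self.fields(raw).get(field, "")


class MailDocType(DocType):
    """Treats 'Name: value' header lines as fields and the rest as the body."""

    def fields(self, raw: str) -> dict[str, str]:
        head, _, body = raw.partition("\n\n")
        result = {"body": body.strip()}
        for line in head.splitlines():
            name, sep, value = line.partition(":")
            if sep:  # skip header lines that contain no colon
                result[name.strip().lower()] = value.strip()
        return result


class PlainTextDocType(DocType):
    """Fallback type: the whole document is a single 'body' field."""

    def fields(self, raw: str) -> dict[str, str]:
        return {"body": raw.strip()}


# Hypothetical registry associating a document type name with its handler class.
REGISTRY: dict[str, type[DocType]] = {
    "mail": MailDocType,
    "text": PlainTextDocType,
}


def get_field(doctype_name: str, raw: str, field: str) -> str:
    """Look up the handler for a document and ask it for one field."""
    handler = REGISTRY[doctype_name]()
    return handler.present(raw, field)
```

In this sketch the caller never needs to know how a format is parsed. Supporting a new format means adding one subclass and one registry entry, which is the kind of extensibility the document type model is described as providing.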

The Isearch search and indexing algorithms were based on Gaston Gonnet's seminal work on PAT arrays and trees for text retrieval, ideas that were developed for the New Oxford English Dictionary Project at the University of Waterloo and that provided the seeds for Tim Bray's PAT SGML engine, which formed the basis of Open Text. One limiting factor of the Isearch design, however, was that it was not well suited to the extremely large data sets that became common in the mid to late 1990s. In many cases Isearch was adapted or modified to use different algorithms, but it usually retained the document type model and the architectural relationship with Isite.
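For readers unfamiliar with PAT arrays, the following Python sketch shows the general technique: word-start offsets are sorted by the text that follows them, so any phrase can be located by binary search. It is a toy illustration under simplifying assumptions, not the Isearch implementation. The function names `build_pat_array` and `find` are hypothetical, and sorting whole suffixes in memory as done here would not scale to large collections.

```python
import re


def build_pat_array(text: str) -> list[int]:
    """Return word-start offsets, sorted by the suffix beginning at each offset."""
    starts = [m.start() for m in re.finditer(r"\w+", text)]
    return sorted(starts, key=lambda i: text[i:])


def find(text: str, pat_array: list[int], query: str) -> list[int]:
    """Return the offsets of all indexed positions where query occurs."""
    # Binary search for the first suffix whose prefix is >= query.
    lo, hi = 0, len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = pat_array[mid]
        if text[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    # Matching suffixes are contiguous in sorted order; collect them.
    hits = []
    while lo < len(pat_array) and text.startswith(query, pat_array[lo]):
        hits.append(pat_array[lo])
        lo += 1
    return sorted(hits)
```

For example, with `text = "to be or not to be"` and `pa = build_pat_array(text)`, calling `find(text, pa, "to be")` returns the offsets of both occurrences of the phrase. Because the index stores only offsets into the original text, phrase and prefix searches need no separate word list.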

Isearch was widely adopted and used in hundreds of public search sites, including many high-profile projects such as the U.S. Patent and Trademark Office (USPTO) patent search, the Federal Geographic Data Clearinghouse (FGDC), the NASA Global Change Master Directory, the NASA EOS Guide System, the NASA Catalog Interoperability Project, the astronomical pre-print service based at the Space Telescope Science Institute, the PCT Electronic Gazette at the World Intellectual Property Organization (WIPO), Linsearch (a search engine for Open Source Software designed by Miles Efron), the SAGE Project of the Special Collections Department at Emory University, Eco Companion Australasia (an environmental geospatial resources catalog), Australian National Genomic Information Service (ANGIS), the Open Directory Project and numerous governmental portals in the context of the Government Information Locator Service (GILS) GPO mandate (ended in 2005?).

From 1994 to 1998 most of the development was centered at the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) in North Carolina (engine core) and BSn in Germany (doctypes). By 1998 many of the open-source Isearch core developers had refocused development on several spin-offs. In 1998 it became part of the Advanced Search Facility reference software platform funded by the U.S. Department of Commerce.

A/WWW Enterprises now maintains the open source version for public usage, supported by paying government clients such as the U.S. Patent and Trademark Office, NASA, and the FGDC, who have provided support to enhance the functionality and reliability of the software. The software suite is considered a reference implementation of catalog service software.

As of 2010, the open source version of Isearch is still used on 250+ nodes of FGDC, and by ANZLIC in Australia and selected Geospatial OneStop contributors to facilitate harvesting by GOS, including NOAA, Census Bureau and the Tenn. Field Office of the US Fish and Wildlife Service, among others.

Related Research Articles

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Markup language: Modern system for annotating a document

In computer text processing, a markup language is a system for annotating a document in a way that is visually distinguishable from the content the user typically sees. It is used only for formatting the text, so when the document is rendered for display, the markup language doesn't appear. The idea and terminology evolved from the "marking up" of paper manuscripts, which was traditionally done with a red pen or blue pencil on authors' manuscripts. Such "markup" typically includes both content corrections and typographic instructions, such as to make a heading larger or boldface.

Standard Generalized Markup Language: Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

Wide Area Information Server (WAIS) is a client–server text searching system that uses the ANSI Standard Z39.50 "Information Retrieval Service Definition and Protocol Specifications for Library Applications" (Z39.50:1988) to search index databases on remote computers. It was developed in 1990 as a project of Thinking Machines, Apple Computer, Dow Jones, and KPMG Peat Marwick.

CiteSeerX is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search. CiteSeer-like engines and archives usually only harvest documents from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.

In text retrieval, full-text search, sometimes referred to as free-text search, refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one or more methods for importing content to manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

A metadata registry is a central location in an organization where metadata definitions are stored and maintained in a controlled method.

The following outline is provided as an overview of and topical guide to library science:

Pennsylvania Spatial Data Access (PASDA) is Pennsylvania's official public access geospatial information clearinghouse. PASDA serves as Pennsylvania's node on the National Spatial Data Infrastructure (NSDI). PASDA is a cooperative effort of the Pennsylvania Geospatial Technologies Office of the Office of Information Technology and the Pennsylvania State University Institutes of Energy and the Environment (PSIEE).

A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, patents, books, etc. In contrast to library catalogue entries, a large proportion of the bibliographic records in bibliographic databases describe articles, conference papers, etc., rather than complete monographs, and they generally contain very rich subject descriptions in the form of keywords, subject classification terms, or abstracts.

Geospatial metadata is a type of metadata applicable to geographic data and information. Such objects may be stored in a geographic information system (GIS) or may simply be documents, data-sets, images or other objects, services, or related items that exist in some other native environment but whose features may be appropriate to describe in a (geographic) metadata catalog.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing, or simply indexing.

BRS/Search is a full-text database and information retrieval system. BRS/Search uses a fully inverted indexing system to store, locate, and retrieve unstructured data. It was the search engine that in 1977 powered Bibliographic Retrieval Services (BRS) commercial operations with 20 databases; it has changed ownership several times during its development and is currently sold as Livelink ECM Discovery Server by Open Text Corporation.

BASE (search engine)

BASE is a multi-disciplinary search engine for scholarly internet resources, created by Bielefeld University Library in Bielefeld, Germany. It is based on free and open-source software such as Apache Solr and VuFind. It harvests OAI metadata from institutional repositories and other academic digital libraries that implement the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and then normalizes and indexes the data for searching. In addition to OAI metadata, the library indexes selected web sites and local data collections, all of which can be searched via a single search interface.

A web browser is a software application for retrieving, presenting and traversing information resources on the World Wide Web. It further provides for the capture or input of information which may be returned to the presenting system, then stored or processed as necessary. The method of accessing a particular page or content is achieved by entering its address, known as a Uniform Resource Identifier or URI. This may be a web page, image, video, or other piece of content. Hyperlinks present in resources enable users easily to navigate their browsers to related resources. A web browser can also be defined as an application software or program designed to enable users to access, retrieve and view documents and other resources on the Internet.

Metadata: Data about data

Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including:

A digital library, also called an online library, an internet library, a digital repository, or a digital collection, is an online database of digital objects that can include text, still images, audio, video, digital documents, or other digital media formats, or a library accessible through the internet. Objects can consist of digitized content like print or photographs, as well as originally produced digital content like word processor files or social media posts. In addition to storing content, digital libraries provide means for organizing, searching, and retrieving the content contained in the collection.

The Earth Observing System (EOS) Clearinghouse, or ECHO, refers to a system that was used by the National Aeronautics and Space Administration (NASA) to spatially, temporally and otherwise index the petabytes of data that NASA's Earth Science projects collect. It does not hold the data itself, but serves as a search engine that other applications can access via a web-service-based interface. While ECHO was set up to support both data and services, as of mid-2008 data was well represented and services had yet to be focused on.

The Clearinghouse for Networked Information Discovery and Retrieval or CNIDR was an organization funded by the U.S. National Science Foundation from 1993 to 1997 and based at the Microelectronics Center of North Carolina (MCNC) in Research Triangle Park. CNIDR was active in the research and development of open source software and open standards, centered on information discovery and retrieval, in the emerging Internet.
