ChemXSeer

The ChemXSeer project, funded by the National Science Foundation, is a public integrated digital library, database, and search engine for scientific papers in chemistry. [1] It is being developed by a multidisciplinary team of researchers at the Pennsylvania State University. ChemXSeer was conceived by Dr. Prasenjit Mitra, Dr. Lee Giles, and Dr. Karl Mueller as a way to integrate the chemical scientific literature with experimental, analytical, and simulation data from different types of experimental systems. The goal of the project is to create an intelligent search engine and database that gives a diverse community of users who need chemical information access to relevant data. It is hosted on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University.

Features

To give users access to relevant data, ChemXSeer provides features that are not available in traditional search engines or digital libraries.

  1. Chemical Entity Search: A tool that identifies chemical formulae and chemical names in documents, extracts them, and disambiguates them from general terms. The disambiguated terms are then used to perform searches.
  2. TableSeer: In scholarly articles, tables are used to present, list, summarize, and structure important data. TableSeer automatically identifies tables in digital documents, extracts the table metadata as well as the cell contents, and stores them so that users can either query the table content or search for tables across a large set of documents.
  3. Dataset search: ChemXSeer provides tools to incorporate datasets from different experimental sources. The system can handle results in multiple formats, such as XML, Microsoft Excel, Gaussian, and CHARMM; create databases that allow direct queries over the data; create metadata, using an annotation tool, so that users can search over the datasets; and create links among datasets and between datasets and documents.
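
The chemical entity search described in the first item can be illustrated with a toy example. The source does not describe ChemXSeer's actual algorithm, so the sketch below is only an assumption about what "disambiguating formulae from general terms" might look like in its simplest form: a word counts as a formula if it is built entirely from known element symbols and contains either a digit or at least two symbols. The element subset, the heuristic, and the function names `parse_formula`, `looks_like_formula`, and `extract_formulae` are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny subset of element symbols (an assumption,
# not a list taken from ChemXSeer).
ELEMENTS = {"H", "He", "C", "N", "O", "Na", "Cl", "S", "Fe", "Ca", "In", "As"}

# One element symbol followed by an optional count, e.g. "H2" or "Cl".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(word):
    """Return [(symbol, count), ...] if word is made only of known symbols, else None."""
    pos = 0
    parts = []
    while pos < len(word):
        match = TOKEN.match(word, pos)
        if match is None or match.group(1) not in ELEMENTS:
            return None
        parts.append((match.group(1), int(match.group(2) or 1)))
        pos = match.end()
    return parts or None


def looks_like_formula(word):
    """Toy disambiguation: a lone symbol without a count ("He", "In") is treated as a plain word."""
    parts = parse_formula(word)
    if parts is None:
        return False
    return len(parts) >= 2 or any(ch.isdigit() for ch in word)


def extract_formulae(text):
    """Return the words in text that the heuristic tags as chemical formulae."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w for w in words if looks_like_formula(w)]


if __name__ == "__main__":
    sample = "He dissolved NaCl and CaCO3 in H2O. In water, Fe2O3 did not dissolve."
    print(extract_formulae(sample))
```

In this sketch the pronoun "He" and the preposition "In" are rejected even though both are valid element symbols, while "NaCl" and "H2O" are kept. A rule this simple still mislabels capitalized words that happen to spell out several symbols, which is why a real system would need context from the surrounding text.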

In addition to these tools, ChemXSeer will integrate the advances made by its sister project CiteSeerX to provide:

References

  1. Penn State (26 July 2007). "Researchers Create Search Engine To Hunt Molecules Online". ScienceDaily. Retrieved 25 November 2011.