Heterogeneous database system

Last updated

A heterogeneous database system is an automated (or semi-automated) system for the integration of heterogeneous, disparate database management systems to present a user with a single, unified query interface.

Contents

Heterogeneous database systems (HDBs) are computational models and software implementations that provide heterogeneous database integration. [1] [2]

Problems of heterogeneous database integration

This article does not contain details of distributed database management systems (sometimes known as federated database systems).

Technical heterogeneity

Different file formats, access protocols, query languages etc. Often called syntactic heterogeneity from the point of view of data.

Data model heterogeneity

Different ways of representing and storing the same data. Table decompositions may vary, column names (data labels) may be different (but have the same semantics), data encoding schemes may vary (i.e., should a measurement scale be explicitly included in a field or should it be implied elsewhere). Also referred as schematic heterogeneity.

Semantic heterogeneity

Data across constituent databases may be related but different. Perhaps a database system must be able to integrate genomic and proteomic data. They are related—a gene may have several protein products—but the data are different (nucleotide sequences and amino acid sequences, or hydrophilic or -phobic amino acid sequence and positively or negatively charged amino acids). There may be many ways of looking at semantically similar, but distinct, datasets.

The system may also be required to present "new" knowledge to the user. Relationships may be inferred between data according to rules specified in domain ontologies.

See also

Related Research Articles

Bioinformatics Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, chemistry, physics, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Database Organized collection of data

In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues including supporting concurrent access and fault tolerance.

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Information science is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information. Practitioners within and outside the field study the application and the usage of knowledge in organizations in addition to the interaction between people, organizations, and any existing information systems with the aim of creating, replacing, improving, or understanding information systems.

Biological database

Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

Enterprise information integration (EII) is the ability to support an unified view of data and information for an entire organization. In a data virtualization application of EII, a process of information integration, using data abstraction to provide a unified interface for viewing all the data within an organization, and a single set of structures and naming conventions to represent this data; the goal of EII is to get a large set of heterogeneous data sources to appear to a user or system as a single, homogeneous data source.

A federated database system (FDBS) is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.

Rat Genome Database

The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.

Federated search retrieves information from a variety of sources via a search application built on top of one or more search engines. A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user. Federated search can be used to integrate disparate information resources within a single large organization ("enterprise") or for the entire web.

Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information, documents of all sorts, contacts, search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.

Ontology alignment, or ontology matching, is the process of determining correspondences between concepts in ontologies. A set of correspondences is also called an alignment. The phrase takes on a slightly different meaning, in computer science, cognitive science or philosophy.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information.

Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.

The terms schema matching and mapping are often used interchangeably for a database process. For this article, we differentiate the two as follows: Schema matching is the process of identifying that two objects are semantically related while mapping refers to the transformations between the objects. For example, in the two schemas DB1.Student and DB2.Grad-Student ; possible matches would be: DB1.Student ≈ DB2.Grad-Student; DB1.SSN = DB2.ID etc. and possible transformations or mappings would be: DB1.Marks to DB2.Grades.

Amit Sheth is a computer scientist at University of South Carolina in Columbia, South Carolina. He is the founding Director of the Artificial Intelligence Institute, and a Professor of Computer Science and Engineering. From 2007 to June 2019, he was the Lexis Nexis Ohio Eminent Scholar, director of the Ohio Center of Excellence in Knowledge-enabled Computing, and a Professor of Computer Science at Wright State University. Sheth's work has been cited by over 48,800 publications. He has an h-index of 106, which puts him among the top 100 computer scientists with the highest h-index. Prior to founding the Kno.e.sis Center, he served as the director of the Large Scale Distributed Information Systems Lab at the University of Georgia in Athens, Georgia.

Semantic matching is a technique used in computer science to identify information which is semantically related.

NeuroLex Dynamic lexicon of neuroscience concepts

NeuroLex is a dynamic lexicon of neuroscience concepts. It is a structured as a semantic wiki, using Semantic MediaWiki. NeuroLex is supported by the Neuroscience Information Framework project.

Metadatabase is a database model for (1) metadata management, (2) global query of independent databases, and (3) distributed data processing. The word metadatabase is an addition to the dictionary. Originally, metadata was only a common term referring simply to "data about data", such as tags, keywords, and markup headers. However, in this technology, the concept of metadata is extended to also include such data and knowledge representation as information models, application logic, and analytic models. In the case of analytic models, it is also referred to as a Modelbase.

The following is provided as an overview of and topical guide to databases:

Schema-agnostic databases or vocabulary-independent databases aim at supporting users to be abstracted from the representation of the data, supporting the automatic semantic matching between queries and databases. Schema-agnosticism is the property of a database of mapping a query issued with the user terminology and structure, automatically mapping it to the dataset vocabulary.

References

  1. Sujansky, Walter (August 2001). "Heterogeneous Database Integration in Biomedicine". Journal of Biomedical Informatics. 34 (4): 285–298. doi: 10.1006/jbin.2001.1024 . PMID   11977810.
  2. Sheth, Amit P.; James A. Larson (September 1990). "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases" (PDF). ACM Computing Surveys. 22 (3): 183–236. CiteSeerX   10.1.1.381.9176 . doi:10.1145/96602.96604.