Federated database system

Last updated June 09, 2024

A federated database system (FDBS) is a type of meta- database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.

Through data abstraction, federated database systems can provide a uniform user interface, enabling users and clients to store and retrieve data from multiple noncontiguous databases with a single query—even if the constituent databases are heterogeneous. To this end, a federated database system must be able to decompose the query into subqueries for submission to the relevant constituent DBMSs, after which the system must composite the result sets of the subqueries. Because various database management systems employ different query languages, federated database systems can apply wrappers to the subqueries to translate them into the appropriate query languages.

Definition

McLeod and Heimbigner^[1] were among the first to define a federated database system in the mid-1980s.

A FDBS is one which "define[s] the architecture and interconnect[s] databases that minimize central authority yet support partial sharing and coordination among database systems".^[1] This description might not accurately reflect the McLeod/Heimbigner^[1] definition of a federated database. Rather, this description fits what McLeod/Heimbigner called a composite database. McLeod/Heimbigner's federated database is a collection of autonomous components that make their data available to other members of the federation through the publication of an export schema and access operations; there is no unified, central schema that encompasses the information available from the members of the federation.

Among other surveys,^[2] practitioners define a Federated Database as a collection of cooperating component systems which are autonomous and are possibly heterogeneous.

The three important components of an FDBS are autonomy, heterogeneity and distribution.^[2] Another dimension which has also been considered is the Networking Environment Computer Network, e.g., many DBSs over a LAN or many DBSs over a WAN update related functions of participating DBSs (e.g., no updates, nonatomic transitions, atomic updates).

FDBS architecture

A DBMS can be classified as either centralized or distributed. A centralized system manages a single database while distributed manages multiple databases. A component DBS in a DBMS may be centralized or distributed. A multiple DBS (MDBS) can be classified into two types depending on the autonomy of the component DBS as federated and non federated. A nonfederated database system is an integration of component DBMS that are not autonomous. A federated database system consists of component DBS that are autonomous yet participate in a federation to allow partial and controlled sharing of their data.

Federated architectures differ based on levels of integration with the component database systems and the extent of services offered by the federation. A FDBS can be categorized as loosely or tightly coupled systems.

Loosely Coupled require component databases to construct their own federated schema. A user will typically access other component database systems by using a multidatabase language but this removes any levels of location transparency, forcing the user to have direct knowledge of the federated schema. A user imports the data they require from other component databases and integrates it with their own to form a federated schema.
Tightly coupled system consists of component systems that use independent processes to construct and publicize an integrated federated schema.

Multiple DBS of which FDBS are a specific type can be characterized along three dimensions: Distribution, Heterogeneity and Autonomy. Another characterization could be based on the dimension of networking, for example single databases or multiple databases in a LAN or WAN.

Distribution

Distribution of data in an FDBS is due to the existence of a multiple DBS before an FDBS is built. Data can be distributed among multiple databases which could be stored in a single computer or multiple computers. These computers could be geographically located in different places but interconnected by a network. The benefits of data distribution help in increased availability and reliability as well as improved access times.

Heterogeneity

Heterogeneities in databases arise due to factors such as differences in structures, semantics of data, the constraints supported or query language. Differences in structure occur when two data models provide different primitives such as object oriented (OO) models that support specialization and inheritance and relational models that do not. Differences due to constraints occur when two models support two different constraints. For example, the set type in CODASYL schema may be partially modeled as a referential integrity constraint in a relationship schema. CODASYL supports insertion and retention that are not captured by referential integrity alone. The query language supported by one DBMS can also contribute to heterogeneity between other component DBMSs. For example, differences in query languages with the same data models or different versions of query languages could contribute to heterogeneity.

Semantic heterogeneities arise when there is a disagreement about meaning, interpretation or intended use of data. At the schema and data level, classification of possible heterogeneities include:

Naming conflicts e.g. databases using different names to represent the same concept.
Domain conflicts or data representation conflicts e.g. databases using different values to represent same concept.
Precision conflicts e.g. databases using same data values from domains of different cardinalities for same data.
Metadata conflicts e.g. same concepts are represented at schema level and instance level.
Data conflicts e.g. missing attributes
Schema conflicts e.g. table versus table conflict which includes naming conflicts, data conflicts etc.

In creating a federated schema, one has to resolve such heterogeneities before integrating the component DB schemas.

Schema matching, schema mapping

Dealing with incompatible data types or query syntax is not the only obstacle to a concrete implementation of an FDBS. In systems that are not planned top-down, a generic problem lies in matching semantically equivalent, but differently named parts from different schemas (=data models) (tables, attributes). A pairwise mapping between n attributes would result in $n(n-1) \over 2$ mapping rules (given equivalence mappings) - a number that quickly gets too large for practical purposes. A common way out is to provide a global schema that comprises the relevant parts of all member schemas and provide mappings in the form of database views. Two principal approaches depend on the direction of the mapping:

Global as View (GaV): the global schema is defined in terms of the underlying schemas
Local as View (LaV): the local schemas are defined in terms of the global schema

Both are examples of data integration, called the schema matching problem.

Autonomy

Fundamental to the difference between an MDBS and an FDBS is the concept of autonomy. It is important to understand the aspects of autonomy for component databases and how they can be addressed when a component DBS participates in an FDBS. There are four kinds of autonomies addressed:

Design Autonomy which refers to ability to choose its design irrespective of data, query language or conceptualization, functionality of the system implementation.

Heterogeneities in an FDBS are primarily due to design autonomy.

Communication autonomy refers to the general operation of the DBMS to communicate with other DBMS or not.
Execution autonomy allows a component DBMS to control the operations requested by local and external operations.
Association autonomy gives a power to component DBS to disassociate itself from a federation which means FDBS can operate independently of any single DBS.

The ANSI/X3/SPARC Study Group outlined a three level data description architecture, the components of which are the conceptual schema, internal schema and external schema of databases. The three level architecture is however inadequate to describing the architectures of an FDBS. It was therefore extended to support the three dimensions of the FDBS namely Distribution, Autonomy and Heterogeneity. The five level schema architecture is explained below.

Concurrency control

The Heterogeneity and Autonomy requirements pose special challenges concerning concurrency control in an FDBS, which is crucial for the correct execution of its concurrent transactions (see also Global concurrency control). Achieving global serializability, the major correctness criterion, under these requirements has been characterized as very difficult and unsolved.^[2]

Five level schema architecture for FDBSs

The five level schema architecture includes the following:

Local Schema is basically the conceptual model of a component database expressed in a native data model.^[3]
Component schema is the subset of the local schema that the owner organisation is willing to share with other users of the FDBS and it is translated into a common data model.^[3]
Export Schema represents a subset of a component schema that is available to a particular federation.^[3] It may include access control information regarding its use by a specific federation user. The export schema helps in managing flow of control of data.
Federated Schema is an integration of multiple export schemas. It includes information on data distribution that is generated when integrating export schemas.^[3]
External schema is extracted from a federated schema, and is defined for the users/applications of a particular federation.^[3]

While accurately representing the state of the art in data integration, the Five Level Schema Architecture above does suffer from a major drawback, namely IT imposed look and feel. Modern data users demand control over how data is presented; their needs are somewhat in conflict with such bottom-up approaches to data integration.

Related Research Articles

In computing, a data warehouse, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating reports. This is beneficial for companies as it enables them to interrogate and draw insights from their data and make decisions.

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

Structured Query Language (SQL) is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling structured data, i.e., data incorporating relations among entities and variables.

An object–relational database (ORD), or object–relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, just as with pure relational systems, it supports extension of the data model with custom data types and methods.

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form.

A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". Oracle defines it as a collection of tables with metadata. The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

Data modeling in software engineering is the process of creating a data model for an information system by applying certain formal techniques. It may be applied as part of broader Model-driven engineering (MDE) concept.

Enterprise information integration (EII) is the ability to support a unified view of data and information for an entire organization. In a data virtualization application of EII, a process of information integration, using data abstraction to provide a unified interface for viewing all the data within an organization, and a single set of structures and naming conventions to represent this data; the goal of EII is to get a large set of heterogeneous data sources to appear to a user or system as a single, homogeneous data source.

A heterogeneous database system is an automated system for the integration of heterogeneous, disparate database management systems to present a user with a single, unified query interface.

Federated search retrieves information from a variety of sources via a search application built on top of one or more search engines. A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user. Federated search can be used to integrate disparate information resources within a single large organization ("enterprise") or for the entire web.

Uniface is a low-code development and deployment platform for enterprise applications that can run in a large range of runtime environments, including mobile, mainframe, web, Service-oriented architecture (SOA), Windows, Java EE, and .NET. Uniface is used to create mission-critical applications.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume, complexity and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information.

SAP IQ is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.

Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.

The terms schema matching and mapping are often used interchangeably for a database process. For this article, we differentiate the two as follows: schema matching is the process of identifying that two objects are semantically related while mapping refers to the transformations between the objects. For example, in the two schemas DB1.Student and DB2.Grad-Student ; possible matches would be: DB1.Student ≈ DB2.Grad-Student; DB1.SSN = DB2.ID etc. and possible transformations or mappings would be: DB1.Marks to DB2.Grades.

Amit Sheth is a computer scientist at University of South Carolina in Columbia, South Carolina. He is the founding Director of the Artificial Intelligence Institute, and a Professor of Computer Science and Engineering. From 2007 to June 2019, he was the Lexis Nexis Ohio Eminent Scholar, director of the Ohio Center of Excellence in Knowledge-enabled Computing, and a Professor of Computer Science at Wright State University. Sheth's work has been cited by over 48,800 publications. He has an h-index of 117, which puts him among the top 100 computer scientists with the highest h-index. Prior to founding the Kno.e.sis Center, he served as the director of the Large Scale Distributed Information Systems Lab at the University of Georgia in Athens, Georgia.

The three-schema approach, or three-schema concept, in software engineering is an approach to building information systems and systems information management that originated in the 1970s. It proposes three different views in systems development, with conceptual modelling being considered the key to achieving data integration.

Federated architecture (FA) is a pattern in enterprise architecture that allows interoperability and information sharing between semi-autonomous de-centrally organized lines of business (LOBs), information technology systems and applications.

The following is provided as an overview of and topical guide to databases:

A distributional–relational database, or word-vector database, is a database management system (DBMS) that uses distributional word-vector representations to enrich the semantics of structured data.

References

1 2 3 "McLeod and Heimbigner (1985). "A Federated Architecture for information management". ACM Transactions on Information Systems, Volume 3, Issue 3. pp. 253–278.
1 2 3 "Sheth and Larson (1990). "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases". ACM Computing Surveys, Vol. 22, No.3. pp. 183–236.
1 2 3 4 5 Masood, Nayyer; Eaglestone, Barry (December 2003). "Component and Federation Concept Models in a Federated Database System" (PDF). Malaysian Journal of Computer Science. 16 (2): 47–57. Archived from the original (PDF) on 2016-03-07. Retrieved 2016-03-03.

External links

DB2 and Federated Databases
Issues of where to perform the join aka "pushdown" and other performance characteristics
Worked example federating Oracle, Informix, DB2, and Excel
Freitas, André, Edward Curry, João Gabriel Oliveira, and Sean O’Riain. 2012. “Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends.” IEEE Internet Computing 16 (1): 24–33.
IBM Gaian Database: A dynamic Distributed Federated Database
Federated system and methods and mechanisms of implementing and using such a system

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[reftwo-1] 1 2 3 "McLeod and Heimbigner (1985). "A Federated Architecture for information management". ACM Transactions on Information Systems, Volume 3, Issue 3. pp. 253–278.

[refone-2] 1 2 3 "Sheth and Larson (1990). "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases". ACM Computing Surveys, Vol. 22, No.3. pp. 183–236.

[:0-3] 1 2 3 4 5 Masood, Nayyer; Eaglestone, Barry (December 2003). "Component and Federation Concept Models in a Federated Database System" (PDF). Malaysian Journal of Computer Science. 16 (2): 47–57. Archived from the original (PDF) on 2016-03-07. Retrieved 2016-03-03.

[1]

[2]

[3]

v t e Database management systems
Types	Object-oriented comparison Relational list comparison Key–value Column-oriented list Document-oriented Wide-column store Graph NoSQL NewSQL In-memory list Multi-model comparison Cloud Blockchain-based database
Concepts	Database ACID Armstrong's axioms Codd's 12 rules CAP theorem CRUD Null Candidate key Foreign key Superkey Surrogate key Unique key
Objects	Relation table column row View Transaction Transaction log Trigger Index Stored procedure Cursor Partition
Components	Concurrency control Data dictionary JDBC XQJ ODBC Query language Query optimizer Query rewriting system Query plan
Functions	Administration Query optimization Replication Sharding
Related topics	Database models Database normalization Database storage Distributed database Federated database system Referential integrity Relational algebra Relational calculus Relational model Object–relational database Transaction processing
Category Outline WikiProject