Dataspaces

Last updated May 13, 2024

Dataspaces are an abstraction in data management that aim to overcome some of the problems encountered in data integration system. The aim is to reduce the effort required to set up a data integration system by relying on existing matching and mapping generation techniques, and to improve the system in "pay-as-you-go" fashion as it is used.^[1]^[2] Labor-intensive aspects of data integration are postponed until they are absolutely needed.^[3]

Traditionally, data integration and data exchange systems have aimed to offer many of the purported services of dataspace systems. Dataspaces can be viewed as a next step in the evolution of data integration architectures, but are distinct from current data integration systems in the following way. Data integration systems require semantic integration before any services can be provided. Hence, although there is not a single schema to which all the data conforms and the data resides in a multitude of host systems, the data integration system knows the precise relationships between the terms used in each schema. As a result, significant up-front effort is required in order to set up a data integration system.^[4]

Dataspaces shift the emphasis to a data co-existence approach providing base functionality over all data sources, regardless of how integrated they are. For example, a DataSpace Support Platform (DSSP) can provide keyword search over all of its data sources, similar to that provided by existing desktop search systems. When more sophisticated operations are required, such as relational-style queries, data mining, or monitoring over certain sources, then additional effort can be applied to more closely integrate those sources in an incremental fashion. Similarly, in terms of traditional database guarantees, initially a dataspace system can only provide weaker guarantees of consistency and durability. As stronger guarantees are desired, more effort can be put into making agreements among the various owners of data sources, and opening up certain interfaces (e.g., for commit protocols).^[5]^[6]

Based on the research by Sören Auer, Boris Otto, Jan Cirullies. in "Industrial Data Space: Digital Sovereignty Over Data" (2016), the concept of data spaces has further evolved to an industrial data space. This development connects various data sources under a set of defined rules to ensure privacy, security, and digital sovereignty. These rules offer a mechanism for users to control their data and dictate who has access to it, promoting data sovereignty. They also facilitate the ethical usage of data, making data spaces increasingly important in an age where data privacy concerns are paramount.^[7]

Moreover, the design of data spaces has seen significant evolution as described in "Designing Data Spaces" (2022) publication. According to this research, designing data spaces involves a user-centric approach where the focus is on meeting end user needs. Different users have unique requirements and use data differently, hence the user-centric design is proposed. The user-centric design approach ensures that data spaces offer meaningful data interactions and facilitate semantic interoperability amongst diverse systems and sources. ^[8]

Related Research Articles

<span class="mw-page-title-main">Semantic Web</span> Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

The database schema is the structure of a database described in a formal language supported typically by a relational database management system (RDBMS). The term "schema" refers to the organization of data as a blueprint of how the database is constructed. The formal definition of a database schema is a set of formulas (sentences) called integrity constraints imposed on a database. These integrity constraints ensure compatibility between parts of the schema. All constraints are expressible in the same language. A database can be considered a structure in realization of the database language. The states of a created conceptual schema are transformed into an explicit mapping, the database schema. This describes how real-world entities are modeled in the database.

Enterprise information integration (EII) is the ability to support a unified view of data and information for an entire organization. In a data virtualization application of EII, a process of information integration, using data abstraction to provide a unified interface for viewing all the data within an organization, and a single set of structures and naming conventions to represent this data; the goal of EII is to get a large set of heterogeneous data sources to appear to a user or system as a single, homogeneous data source.

A federated database system (FDBS) is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.

MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing, data mining, geographic information system (GIS), Resource Description Framework (RDF), text retrieval and sequence alignment processing.

Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information, documents of all sorts, contacts, search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.

Ontology alignment, or ontology matching, is the process of determining correspondences between concepts in ontologies. A set of correspondences is also called an alignment. The phrase takes on a slightly different meaning, in computer science, cognitive science or philosophy.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume, complexity and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information.

A semantic web service, like conventional web services, is the server end of a client–server system for machine-to-machine interaction via the World Wide Web. Semantic services are a component of the semantic web because they use markup which makes data machine-readable in a detailed and sophisticated way.

Oracle Spatial and Graph, formerly Oracle Spatial, is a free option component of the Oracle Database. The spatial features in Oracle Spatial and Graph aid users in managing geographic and location-data in a native type within an Oracle database, potentially supporting a wide range of applications — from automated mapping, facilities management, and geographic information systems (AM/FM/GIS), to wireless location services and location-enabled e-business. The graph features in Oracle Spatial and Graph include Oracle Network Data Model (NDM) graphs used in traditional network applications in major transportation, telcos, utilities and energy organizations and RDF semantic graphs used in social networks and social interactions and in linking disparate data sets to address requirements from the research, health sciences, finance, media and intelligence communities.

Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.

The terms schema matching and mapping are often used interchangeably for a database process. For this article, we differentiate the two as follows: schema matching is the process of identifying that two objects are semantically related while mapping refers to the transformations between the objects. For example, in the two schemas DB1.Student and DB2.Grad-Student ; possible matches would be: DB1.Student ≈ DB2.Grad-Student; DB1.SSN = DB2.ID etc. and possible transformations or mappings would be: DB1.Marks to DB2.Grades.

Amit Sheth is a computer scientist at University of South Carolina in Columbia, South Carolina. He is the founding Director of the Artificial Intelligence Institute, and a Professor of Computer Science and Engineering. From 2007 to June 2019, he was the Lexis Nexis Ohio Eminent Scholar, director of the Ohio Center of Excellence in Knowledge-enabled Computing, and a Professor of Computer Science at Wright State University. Sheth's work has been cited by over 48,800 publications. He has an h-index of 117, which puts him among the top 100 computer scientists with the highest h-index. Prior to founding the Kno.e.sis Center, he served as the director of the Large Scale Distributed Information Systems Lab at the University of Georgia in Athens, Georgia.

Semantic matching is a technique used in computer science to identify information which is semantically related.

The BioCatalogue is a curated catalogue of Life Science Web Services. The BioCatalogue was launched in June 2009 at the Intelligent Systems for Molecular Biology Conference. The project is a collaboration between the myGrid project at the University of Manchester led by Carole Goble and the European Bioinformatics Institute led by Rodrigo Lopez. It is funded by the Biotechnology and Biological Sciences Research Council.

Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.

Norman William Paton is a Professor in the Department of Computer Science at the University of Manchester in the UK where he co-leads the Information Management Group (IMG) with Carole Goble.

Semantic heterogeneity is when database schema or datasets for the same domain are developed by independent parties, resulting in differences in meaning and interpretation of data values. Beyond structured data, the problem of semantic heterogeneity is compounded due to the flexibility of semi-structured data and various tagging methods applied to documents or unstructured data. Semantic heterogeneity is one of the more important sources of differences in heterogeneous datasets.

Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results or to answer more fuzzy and wide open questions through pattern matching and digital reasoning.

Laura M. Haas is an American computer scientist noted for her research in database systems and information integration. She is best known for creating systems and tools for the integration of heterogeneous data from diverse sources, including federated technology that virtualizes access to data, and mapping technology that enables non-programmers to specify how data should be integrated.

References

↑ Belhajjame, K.; Paton, N. W.; Embury, S. M.; Fernandes, A. A. A.; Hedeler, C. (2013). "Incrementally improving dataspaces based on user feedback". Information Systems. 38 (5): 656. CiteSeerX 10.1.1.303.1957 . doi:10.1016/j.is.2013.01.006.
↑ Belhajjame, K.; Paton, N. W.; Embury, S. M.; Fernandes, A. A. A.; Hedeler, C. (2010). "Feedback-based annotation, selection and refinement of schema mappings for dataspaces". Proceedings of the 13th International Conference on Extending Database Technology - EDBT '10. p. 573. CiteSeerX 10.1.1.298.3519 . doi:10.1145/1739041.1739110. ISBN 9781605589459.
↑ Dong, X.; Halevy, A. (2007). "Indexing dataspaces". Proceedings of the 2007 ACM SIGMOD international conference on Management of data - SIGMOD '07. p. 43. doi:10.1145/1247480.1247487. ISBN 9781595936868. S2CID 1184444.
↑ Howe, B.; Maier, D.; Rayner, N.; Rucker, J. (2008). "Quarrying dataspaces: Schemaless profiling of unfamiliar information sources". 2008 IEEE 24th International Conference on Data Engineering Workshop. p. 270. doi:10.1109/ICDEW.2008.4498331. ISBN 978-1-4244-2161-9. S2CID 14039616.
↑ Sarma, A. D.; Dong, X. (L.; Halevy, A. Y. (2009). "Data Modeling in Dataspace Support Platforms". Conceptual Modeling: Foundations and Applications. Lecture Notes in Computer Science. Vol. 5600. pp. 122–138. doi:10.1007/978-3-642-02463-4_8. ISBN 978-3-642-02462-7.
↑ Franklin, M.; Halevy, A.; Maier, D. (2005). "From databases to dataspaces". ACM SIGMOD Record. 34 (4): 27. doi:10.1145/1107499.1107502. S2CID 14092111.
↑ Otto, Boris; Auer, Sören; Cirullies, Jan (February 2016). "Industrial Data Space: Digital Souvereignity Over Data" (PDF). ResearchGate.
↑ Otto, Boris; ten Hompel, Michael; Wrobel, Stefan (2022). "Designing Data Spaces - The Ecosystem Approach to Competitive Advantage" (PDF). Springer. ISBN 978-3-030-93974-8.

Dataspaces

Contents

See also

Related Research Articles

References

Further reading