Dataspace

A dataspace is an abstraction in data management that aims to overcome some of the problems encountered in a data integration system. A dataspace is defined as a set of participants (data sources) and the relations between them, for example that dataset A is a duplicate of dataset B. [1] It can contain all data sources of an organization regardless of their format, physical location, or data model. [1] The dataspace then provides a unified interface to query data regardless of format, sometimes in a "best-effort" fashion, and ways to further integrate the data when necessary. [1] This differs markedly from a traditional relational database, which requires that all data be in the same format. [1] The aim of the concept is to reduce the effort required to set up a data integration system by relying on existing matching and mapping generation techniques, and to improve the system in "pay-as-you-go" fashion as it is used. [2] [3] Labor-intensive aspects of data integration are postponed until they are absolutely needed. [4]
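The definition above, a set of participants plus the relations between them, can be illustrated with a small data structure. The following Python sketch is only an illustration: the class names `Participant`, `Relationship`, and `Dataspace`, their fields, and the relation label `duplicate_of` are hypothetical and are not taken from any dataspace system or from the cited sources.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Participant:
    """A data source taking part in the dataspace (hypothetical model)."""
    name: str
    data_format: str  # e.g. "csv" or "relational"; any format is allowed
    location: str     # where the source lives; not interpreted here


@dataclass(frozen=True)
class Relationship:
    """A directed, labelled relation between two participants."""
    source: str
    target: str
    kind: str         # e.g. "duplicate_of"


@dataclass
class Dataspace:
    participants: Dict[str, Participant] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_participant(self, participant: Participant) -> None:
        self.participants[participant.name] = participant

    def relate(self, source: str, target: str, kind: str) -> None:
        # Both ends must already be participants of the dataspace.
        for name in (source, target):
            if name not in self.participants:
                raise KeyError(f"unknown participant: {name}")
        self.relationships.append(Relationship(source, target, kind))

    def related_to(self, name: str, kind: str) -> List[str]:
        """Names of participants that `name` points to via `kind`."""
        return [r.target for r in self.relationships
                if r.source == name and r.kind == kind]


space = Dataspace()
space.add_participant(Participant("A", "csv", "file server"))
space.add_participant(Participant("B", "relational", "warehouse"))
space.relate("A", "B", "duplicate_of")
print(space.related_to("A", "duplicate_of"))  # ['B']
```

In this sketch the participants keep their own formats and locations; the dataspace records only that they exist and how they relate, which is the minimum the definition requires.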

Traditionally, data integration and data exchange systems have aimed to offer many of the purported services of dataspace systems. Dataspaces can be viewed as a next step in the evolution of data integration architectures, but they differ from current data integration systems in one key respect: data integration systems require semantic integration before any services can be provided. Hence, although there is no single schema to which all the data conforms and the data resides in a multitude of host systems, the data integration system knows the precise relationships between the terms used in each schema. As a result, significant up-front effort is required to set up a data integration system. [5]

Dataspaces shift the emphasis to a data co-existence approach providing base functionality over all data sources, regardless of how integrated they are. For example, a DataSpace Support Platform (DSSP) can provide keyword search over all of its data sources, similar to that provided by existing desktop search systems. When more sophisticated operations are required, such as relational-style queries, data mining, or monitoring over certain sources, then additional effort can be applied to more closely integrate those sources in an incremental fashion. Similarly, in terms of traditional database guarantees, initially a dataspace system can only provide weaker guarantees of consistency and durability. As stronger guarantees are desired, more effort can be put into making agreements among the various owners of data sources, and opening up certain interfaces (e.g., for commit protocols). [6] [7]
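The base functionality described above, keyword search over all sources regardless of how integrated they are, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any actual DSSP: the function names `text_values` and `keyword_search`, the format labels, and the sample sources are all hypothetical. Sources are searched without any schema mapping, and a source that cannot be read is skipped rather than causing the query to fail, which is one simple reading of "best-effort".

```python
import csv
import io
import json


def text_values(fmt, raw):
    """Yield the text values found in a source, ignoring its schema."""
    if fmt == "csv":
        for row in csv.reader(io.StringIO(raw)):
            yield from row
    elif fmt == "json":
        # Walk nested objects and arrays, yielding only leaf values.
        stack = [json.loads(raw)]
        while stack:
            item = stack.pop()
            if isinstance(item, dict):
                stack.extend(item.values())
            elif isinstance(item, list):
                stack.extend(item)
            else:
                yield str(item)
    else:
        # Unknown format: fall back to treating the source as plain text.
        yield raw


def keyword_search(sources, keyword):
    """Return names of sources containing the keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for name, (fmt, raw) in sources.items():
        try:
            if any(needle in value.lower() for value in text_values(fmt, raw)):
                hits.append(name)
        except (csv.Error, json.JSONDecodeError):
            # Best effort: skip a source that cannot be read.
            continue
    return sorted(hits)


sources = {
    "staff.csv": ("csv", "name,city\nAda,London\nLinus,Helsinki\n"),
    "projects.json": ("json", '{"title": "Compiler", "lead": "Ada"}'),
    "notes.txt": ("text", "Meeting moved to Helsinki."),
    "broken.json": ("json", "{not valid json"),
}
print(keyword_search(sources, "ada"))       # ['projects.json', 'staff.csv']
print(keyword_search(sources, "helsinki"))  # ['notes.txt', 'staff.csv']
```

In a pay-as-you-go setting, closer integration, such as mapping `staff.csv` and `projects.json` to a shared schema so that relational-style queries become possible, would be added later and only for the sources that need it.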

History

According to a cyclic model of technology development, new technologies first pass through a phase of design competition, in which the technology is explored and experiments are carried out, until the industry settles on a dominant design and iterates less. [1] As of 2019, dataspaces have already undergone a "first wave" of adoption, composed of exploratory and proof-of-concept projects, and have begun a "second wave" in which they are being adapted for more general and less niche use cases. [1]

Since February 2020, the European Commission has been working on the development of shared dataspaces for various industries, called "Common European Data Spaces". [8] Dataspaces are planned for the agriculture, energy, finance, health, media, manufacturing, mobility, and tourism industries as well as for the European Green Deal, languages, public administration, research and innovation, and skills. [8] [9] The first concrete steps taken were a number of research and innovation initiatives funded as part of the European Public-Private Partnership on Big Data Value (Big Data Value PPP). [10]

References

  1. Curry, Edward (2020), Curry, Edward (ed.), "Dataspaces: Fundamentals, Principles, and Techniques", Real-time Linked Dataspaces: Enabling Data Ecosystems for Intelligent Systems, Cham: Springer International Publishing, pp. 45–62, doi:10.1007/978-3-030-29665-0_3, ISBN 978-3-030-29665-0.
  2. Belhajjame, K.; Paton, N. W.; Embury, S. M.; Fernandes, A. A. A.; Hedeler, C. (2013). "Incrementally improving dataspaces based on user feedback". Information Systems. 38 (5): 656. CiteSeerX 10.1.1.303.1957. doi:10.1016/j.is.2013.01.006.
  3. Belhajjame, K.; Paton, N. W.; Embury, S. M.; Fernandes, A. A. A.; Hedeler, C. (2010). "Feedback-based annotation, selection and refinement of schema mappings for dataspaces". Proceedings of the 13th International Conference on Extending Database Technology - EDBT '10. p. 573. CiteSeerX 10.1.1.298.3519. doi:10.1145/1739041.1739110. ISBN 9781605589459.
  4. Dong, X.; Halevy, A. (2007). "Indexing dataspaces". Proceedings of the 2007 ACM SIGMOD international conference on Management of data - SIGMOD '07. p. 43. doi:10.1145/1247480.1247487. ISBN 9781595936868. S2CID 1184444.
  5. Howe, B.; Maier, D.; Rayner, N.; Rucker, J. (2008). "Quarrying dataspaces: Schemaless profiling of unfamiliar information sources". 2008 IEEE 24th International Conference on Data Engineering Workshop. p. 270. doi:10.1109/ICDEW.2008.4498331. ISBN 978-1-4244-2161-9. S2CID 14039616.
  6. Sarma, A. D.; Dong, X. L.; Halevy, A. Y. (2009). "Data Modeling in Dataspace Support Platforms". Conceptual Modeling: Foundations and Applications. Lecture Notes in Computer Science. Vol. 5600. pp. 122–138. doi:10.1007/978-3-642-02463-4_8. ISBN 978-3-642-02462-7.
  7. Franklin, M.; Halevy, A.; Maier, D. (2005). "From databases to dataspaces". ACM SIGMOD Record. 34 (4): 27. doi:10.1145/1107499.1107502. S2CID 14092111.
  8. "Shaping Europe's digital future: Common European Data Spaces". European Commission. Retrieved 2024-08-24.
  9. "A view from Brussels: European strategy for data takes shape". International Association of Privacy Professionals. 11 January 2024. Retrieved 2024-08-24.
  10. Scerri, Simon; Tuikka, Tuomo; de Vallejo, Irene Lopez; Curry, Edward (2022), Curry, Edward; Scerri, Simon; Tuikka, Tuomo (eds.), "Common European Data Spaces: Challenges and Opportunities", Data Spaces: Design, Deployment and Future Directions, Cham: Springer International Publishing, pp. 337–357, doi:10.1007/978-3-030-98636-0_16, ISBN 978-3-030-98636-0.
