Schema matching

Last updated

The terms schema matching and mapping are often used interchangeably for a database process. For this article, we differentiate the two as follows: schema matching is the process of identifying that two objects are semantically related (scope of this article) while mapping refers to the transformations between the objects. For example, in the two schemas DB1.Student (Name, SSN, Level, Major, Marks) and DB2.Grad-Student (Name, ID, Major, Grades); possible matches would be: DB1.Student ≈ DB2.Grad-Student; DB1.SSN = DB2.ID etc. and possible transformations or mappings would be: DB1.Marks to DB2.Grades (100–90 A; 90–80 B: etc.).

Contents

Automating these two approaches has been one of the fundamental tasks of data integration. In general, it is not possible to determine fully automatically the different correspondences between two schemas — primarily because of the differing and often not explicated or documented semantics of the two schemas.

Impediments

Among others, common challenges to automating matching and mapping have been previously classified in [1] especially for relational DB schemas; and in [2] – a fairly comprehensive list of heterogeneity not limited to the relational model recognizing schematic vs semantic differences/heterogeneity. Most of these heterogeneities exist because schemas use different representations or definitions to represent the same information (schema conflicts); OR different expressions, units, and precision result in conflicting representations of the same data (data conflicts). [1] Research in schema matching seeks to provide automated support to the process of finding semantic matches between two schemas. This process is made harder due to heterogeneities at the following levels [3]

Schema matching

[4] [5] [6] [7] [8]

Methodology

Discusses a generic methodology for the task of schema integration or the activities involved. [5] According to the authors, one can view the integration.

Approaches

Approaches to schema integration can be broadly classified as ones that exploit either just schema information or schema and instance level information. [4] [5]

Schema-level matchers only consider schema information, not instance data. The available information includes the usual properties of schema elements, such as name, description, data type, relationship types (part-of, is-a, etc.), constraints, and schema structure. Working at the element (atomic elements like attributes of objects) or structure level (matching combinations of elements that appear together in a structure), these properties are used to identify matching elements in two schemas. Language-based or linguistic matchers use names and text (i.e., words or sentences) to find semantically similar schema elements. Constraint based matchers exploit constraints often contained in schemas. Such constraints are used to define data types and value ranges, uniqueness, optionality, relationship types and cardinalities, etc. Constraints in two input schemas are matched to determine the similarity of the schema elements.

Instance-level matchers use instance-level data to gather important insight into the contents and meaning of the schema elements. These are typically used in addition to schema level matches in order to boost the confidence in match results, more so when the information available at the schema level is insufficient. Matchers at this level use linguistic and constraint based characterization of instances. For example, using linguistic techniques, it might be possible to look at the Dept, DeptName and EmpName instances to conclude that DeptName is a better match candidate for Dept than EmpName. Constraints like zipcodes must be 5 digits long or format of phone numbers may allow matching of such types of instance data. [9]

Hybrid matchers directly combine several matching approaches to determine match candidates based on multiple criteria or information sources. Most of these techniques also employ additional information such as dictionaries, thesauri, and user-provided match or mismatch information [10]

Reusing matching information Another initiative has been to re-use previous matching information as auxiliary information for future matching tasks. The motivation for this work is that structures or substructures often repeat, for example in schemas in the E-commerce domain. Such a reuse of previous matches however needs to be a careful choice. It is possible that such a reuse makes sense only for some part of a new schema or only in some domains. For example, Salary and Income may be considered identical in a payroll application but not in a tax reporting application. There are several open ended challenges in such reuse that deserves further work.

Sample Prototypes Typically, the implementation of such matching techniques can be classified as being either rule based or learner based systems. The complementary nature of these different approaches has instigated a number of applications using a combination of techniques depending on the nature of the domain or application under consideration. [4] [5]

Identified relationships

The relationship types between objects that are identified at the end of a matching process are typically those with set semantics such as overlap, disjointness, exclusion, equivalence, or subsumption. The logical encodings of these relationships are what they mean. Among others, an early attempt to use description logics for schema integration and identifying such relationships was presented. [11] Several state of the art matching tools today [4] [7] and those benchmarked in the Ontology Alignment Evaluation Initiative [12] are capable of identifying many such simple (1:1 / 1:n / n:1 element level matches) and complex matches (n:1 / n:m element or structure level matches) between objects.

Evaluation of quality

The quality of schema matching is commonly measured by precision and recall. While precision measures the number of correctly matched pairs out of all pairs that were matched, recall measures how many of the actual pairs have been matched.

See also

Related Research Articles

<span class="mw-page-title-main">Semantic Web</span> Extension of the Web to facilitate data exchange

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

A conceptual schema or conceptual data model is a high-level description of informational needs underlying the design of a database. It typically includes only the core concepts and the main relationships among them. This is a high-level model with insufficient detail to build a complete, functional database. It describes the structure of the whole database for a group of users. The conceptual model is also known as the data model that can be used to describe the conceptual schema when a database system is implemented. It hides the internal details of physical storage and targets the description of entities, datatypes, relationships and constraints.

<span class="mw-page-title-main">Data model</span> Abstract model

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

<span class="mw-page-title-main">Object–role modeling</span> Programming technique

Object–role modeling (ORM) is used to model the semantics of a universe of discourse. ORM is often used for data modeling and software engineering.

A federated database system (FDBS) is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.

The semantic spectrum, sometimes referred to as the ontology spectrum, the smart data continuum, or semantic precision, is a series of increasingly precise or rather semantically expressive definitions for data elements in knowledge representations, especially for machine use.

<span class="mw-page-title-main">IDEF1X</span>

Integration DEFinition for information modeling (IDEF1X) is a data modeling language for the development of semantic data models. IDEF1X is used to produce a graphical information model which represents the structure and semantics of information within an environment or system.

Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information, documents of all sorts, contacts, search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.

<span class="mw-page-title-main">Semantic technology</span> Technology to help machines understand data

The ultimate goal of semantic technology is to help machines understand data. To enable the encoding of semantics with the data, well-known technologies are RDF and OWL. These technologies formally represent the meaning involved in information. For example, ontology can describe concepts, relationships between things, and categories of things. These embedded semantics with the data offer significant advantages such as reasoning over data and dealing with heterogeneous data sources.

Ontology alignment, or ontology matching, is the process of determining correspondences between concepts in ontologies. A set of correspondences is also called an alignment. The phrase takes on a slightly different meaning, in computer science, cognitive science or philosophy.

Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.

The Semantics of Business Vocabulary and Business Rules (SBVR) is an adopted standard of the Object Management Group (OMG) intended to be the basis for formal and detailed natural language declarative description of a complex entity, such as a business. SBVR is intended to formalize complex compliance rules, such as operational rules for an enterprise, security policy, standard compliance, or regulatory compliance rules. Such formal vocabularies and rules can be interpreted and used by computer systems. SBVR is an integral part of the OMG's model-driven architecture (MDA).

Amit Sheth is a computer scientist at University of South Carolina in Columbia, South Carolina. He is the founding Director of the Artificial Intelligence Institute, and a Professor of Computer Science and Engineering. From 2007 to June 2019, he was the Lexis Nexis Ohio Eminent Scholar, director of the Ohio Center of Excellence in Knowledge-enabled Computing, and a Professor of Computer Science at Wright State University. Sheth's work has been cited by over 48,800 publications. He has an h-index of 117, which puts him among the top 100 computer scientists with the highest h-index. Prior to founding the Kno.e.sis Center, he served as the director of the Large Scale Distributed Information Systems Lab at the University of Georgia in Athens, Georgia.

<span class="mw-page-title-main">Semantic data model</span> Database model

A semantic data model (SDM) is a high-level semantics-based database description and structuring formalism for databases. This database model is designed to capture more of the meaning of an application environment than is possible with contemporary database models. An SDM specification describes a database in terms of the kinds of entities that exist in the application environment, the classifications and groupings of those entities, and the structural interconnections among them. SDM provides a collection of high-level modeling primitives to capture the semantics of an application environment. By accommodating derived information in a database structural specification, SDM allows the same information to be viewed in several ways; this makes it possible to directly accommodate the variety of needs and processing requirements typically present in database applications. The design of the present SDM is based on our experience in using a preliminary version of it. SDM is designed to enhance the effectiveness and usability of database systems. An SDM database description can serve as a formal specification and documentation tool for a database; it can provide a basis for supporting a variety of powerful user interface facilities, it can serve as a conceptual database model in the database design process; and, it can be used as the database model for a new kind of database management system.

Business semantics management (BSM) encompasses the technology, methodology, organization, and culture that brings business stakeholders together to collaboratively realize the reconciliation of their heterogeneous metadata; and consequently the application of the derived business semantics patterns to establish semantic alignment between the underlying data structures.

Minimal mappings are the result of an advanced technique of semantic matching, a technique used in computer science to identify information which is semantically related.

Semantic matching is a technique used in computer science to identify information which is semantically related.

Semantic heterogeneity is when database schema or datasets for the same domain are developed by independent parties, resulting in differences in meaning and interpretation of data values. Beyond structured data, the problem of semantic heterogeneity is compounded due to the flexibility of semi-structured data and various tagging methods applied to documents or unstructured data. Semantic heterogeneity is one of the more important sources of differences in heterogeneous datasets.

Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results or to answer more fuzzy and wide open questions through pattern matching and digital reasoning.

Schema-agnostic databases or vocabulary-independent databases aim at supporting users to be abstracted from the representation of the data, supporting the automatic semantic matching between queries and databases. Schema-agnosticism is the property of a database of mapping a query issued with the user terminology and structure, automatically mapping it to the dataset vocabulary.

References

  1. 1 2 Kim, W. & Seo, J. (Dec 1991). "Classifying Schematic and Data Heterogeneity in Multidatabase Systems.". Computer 24, 12.
  2. Sheth, A. P. & Kashyap, V. (1993). "So Far (Schematically) yet So Near (Semantically)". In Proceedings of the IFIP WG 2.6 Database Semantics Conference on interoperable Database Systems.
  3. Sheth, A. P. (1999). "Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics". In Interoperating Geographic Information Systems. M. F. Goodchild, M. J. Egenhofer, R. Fegeas, and C. A. Kottman (eds.), Kluwer, Academic Publishers.
  4. 1 2 3 4 Rahm, E. & Bernstein, P (2001). "A survey of approaches to automatic schema matching". The VLDB Journal 10, 4.
  5. 1 2 3 4 Batini, C., Lenzerini, M., and Navathe, S. B. (1986). "A comparative analysis of methodologies for database schema integration.". ACM Comput. Surv. 18, 4.{{cite conference}}: CS1 maint: multiple names: authors list (link)
  6. Doan, A. & Halevy, A. (2005). "Semantic-integration research in the database community". AI Mag. 26, 1.
  7. 1 2 Kalfoglou, Y. & Schorlemmer, M. (2003). "Ontology mapping: the state of the art". Knowl. Eng. Rev. 18, 1.
  8. Choi, N., Song, I., and Han, H. (2006). "A survey on ontology mapping". SIGMOD Rec. 35, 3.{{cite conference}}: CS1 maint: multiple names: authors list (link)
  9. Pereira Nunes, Bernardo; Mera, Alexander; Casanova, Marco Antonio; P. Paes Leme, Luis Andre; Dietze, Stefan (2013). "Complex Matching of RDF Datatype Properties". Database and Expert Systems Applications. Lecture Notes in Computer Science. Vol. 8055. pp. 195–208. doi:10.1007/978-3-642-40285-2_18. ISBN   978-3-642-40284-5.
  10. Hamdaqa, Mohammad; Tahvildari, Ladan (2014). "Prison Break: A Generic Schema Matching Solution to the Cloud Vendor Lock-in Problem". 2014 IEEE 8th International Symposium on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems. pp. 37–46. doi:10.1109/MESOCA.2014.13. ISBN   978-1-4799-6152-8. S2CID   14499875.
  11. Ashoka Savasere; Amit P. Sheth; Sunit K. Gala; Shamkant B. Navathe; H. Markus (1993). "On Applying Classification to Schema Integration". RIDE-IMS.
  12. Ontology Alignment Evaluation Initiative::2006