Semantic heterogeneity

Semantic heterogeneity arises when database schemas or datasets for the same domain are developed by independent parties, resulting in differences in the meaning and interpretation of data values. [1] Beyond structured data, the problem is compounded by the flexibility of semi-structured data and the variety of tagging methods applied to documents and unstructured data. Semantic heterogeneity is one of the most important sources of differences between heterogeneous datasets.

Yet, for multiple data sources to interoperate with one another, it is essential to reconcile these semantic differences. Decomposing the various sources of semantic heterogeneities provides a basis for understanding how to map and transform data to overcome these differences.
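As an illustration of what such reconciliation involves, consider two hypothetical sources that describe the same fact with different field names, units and vocabulary. All names, values and conversion rules below are invented for the sketch; real pipelines would drive the same three decisions (schema mapping, unit conversion, synonym resolution) from curated mapping tables.

```python
# Two hypothetical sources describing the same person, developed independently.
source_a = {"name": "Jane Doe", "height_cm": 170, "country": "USA"}
source_b = {"full_name": "Jane Doe", "height": 1.70, "nation": "United States"}

# Reconciliation requires three explicit decisions: which fields correspond
# (schema mapping), which units are canonical (unit conversion), and which
# vocabulary is canonical (synonym resolution).
FIELD_MAP = {"full_name": "name", "height": "height_cm", "nation": "country"}
COUNTRY_SYNONYMS = {"United States": "USA", "America": "USA"}

def reconcile(record: dict) -> dict:
    out = {}
    for key, value in record.items():
        key = FIELD_MAP.get(key, key)        # map field names onto source A's schema
        if key == "height_cm" and isinstance(value, float):
            value = round(value * 100)       # assume meters; convert to centimeters
        if key == "country":
            value = COUNTRY_SYNONYMS.get(value, value)
        out[key] = value
    return out

assert reconcile(source_b) == source_a
```

The point of the sketch is that none of these correspondences can be discovered from the data alone; each one encodes a human judgment about meaning.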

Classification

One of the first known classification schemes applied to data semantics is from William Kent, dating to 1989. [2] Kent's approach dealt more with structural mapping issues than with differences in meaning, which he suggested data dictionaries might resolve.

One of the most comprehensive classifications is from Pluempitiwiriyawej and Hammer, "A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources". [3] They classify heterogeneities into three broad classes: structural conflicts, domain conflicts and data conflicts.

Moreover, mismatches or conflicts can occur between set elements (a "population" mismatch) or attributes (a "description" mismatch).

Michael Bergman expanded upon this scheme by adding a fourth major explicit category of language, and also added examples of each kind of semantic heterogeneity, resulting in about 40 distinct potential categories. [4] [5] The following table shows the combined 40 possible sources of semantic heterogeneities across sources:

| Class | Category | Subcategory | Examples |
| --- | --- | --- | --- |
| Language | Encoding | Ingest Encoding Mismatch | For example, ASCII v UTF-8 |
| Language | Encoding | Ingest Encoding Lacking | Mis-recognition of tokens because not being parsed with the proper encoding |
| Language | Encoding | Query Encoding Mismatch | For example, ASCII v UTF-8 in search |
| Language | Encoding | Query Encoding Lacking | Mis-recognition of search tokens because not being parsed with the proper encoding |
| Language | Languages | Script Mismatch | Variations in how parsers handle, say, stemming, white spaces or hyphens |
| Language | Languages | Parsing / Morphological Analysis Errors (many) | Arabic languages (right-to-left) v Romance languages (left-to-right) |
| Language | Languages | Syntactical Errors (many) | Ambiguous sentence references, such as "I'm glad I'm a man, and so is Lola" ("Lola" by Ray Davies and the Kinks) |
| Language | Languages | Semantics Errors (many) | River bank v money bank v billiards bank shot |
| Conceptual | Naming | Case Sensitivity | Uppercase v lower case v Camel case |
| Conceptual | Naming | Synonyms | United States v USA v America v Uncle Sam v Great Satan |
| Conceptual | Naming | Acronyms | United States v USA v US |
| Conceptual | Naming | Homonyms | Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book |
| Conceptual | Naming | Misspellings | As stated |
| Conceptual | Generalization / Specialization | | When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to "phone" but the other schema has multiple elements such as "home phone", "work phone" and "cell phone" |
| Conceptual | Aggregation | Intra-aggregation | When the same population is divided differently (such as Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last) |
| Conceptual | Aggregation | Inter-aggregation | May occur when sums or counts are included as set members |
| Conceptual | Internal Path Discrepancy | | Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are at different levels of remove) |
| Conceptual | Missing Item | Content Discrepancy | Differences in set enumerations, or including items or not (say, US territories) in a listing of US states |
| Conceptual | Missing Item | Missing Content | Differences in scope coverage between two or more datasets for the same concept |
| Conceptual | Missing Item | Attribute List Discrepancy | Differences in attribute completeness between two or more datasets |
| Conceptual | Missing Item | Missing Attribute | Differences in scope coverage between two or more datasets for the same attribute |
| Conceptual | Item Equivalence | | When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state) |
| Conceptual | Item Equivalence | | When two individuals are asserted as being the same when they are actually distinct (for example, John F. Kennedy the president v John F. Kennedy the aircraft carrier) |
| Conceptual | Type Mismatch | | When the same item is characterized by different types, such as a person being typed as an animal v human being v person |
| Conceptual | Constraint Mismatch | | When attributes referring to the same thing have different cardinalities or disjointedness assertions |
| Domain | Schematic Discrepancy | Element-value to Element-label Mapping | One of four errors that may occur when attribute names (say, Hair v Fur) refer to the same attribute; when the same attribute names (say, Hair v Hair) refer to different attribute scopes; when values for these attributes are the same but refer to different actual attributes; or when values differ but are for the same attribute and putative value. Many of the other semantic heterogeneities herein also contribute to schema discrepancies |
| Domain | Schematic Discrepancy | Attribute-value to Element-label Mapping | One of the four mapping errors described above |
| Domain | Schematic Discrepancy | Element-value to Attribute-label Mapping | One of the four mapping errors described above |
| Domain | Schematic Discrepancy | Attribute-value to Attribute-label Mapping | One of the four mapping errors described above |
| Domain | Scale or Units | Measurement Type | Differences, say, in the metric v English measurement systems, or currencies |
| Domain | Scale or Units | Units | Differences, say, in meters v centimeters v millimeters |
| Domain | Scale or Units | Precision | For example, a value of 4.1 inches in one dataset v 4.106 in another dataset |
| Data representation | Primitive Data Type | | Confusion often arises in the use of literals v URIs v object types |
| Data representation | Data Format | | Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions) |
| Data | Naming | Case Sensitivity | Uppercase v lower case v Camel case |
| Data | Naming | Synonyms | For example, centimeters v cm |
| Data | Naming | Acronyms | For example, currency symbols v currency names |
| Data | Naming | Homonyms | Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book |
| Data | Naming | Misspellings | As stated |
| Data | ID Mismatch or Missing ID | | URIs can be a particular problem here, due to actual mismatches but also the use (or not) of name spaces and truncated URIs |
| Data | Missing Data | | A common problem, more acute with closed-world approaches than with open-world ones |
| Data | Element Ordering | | Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ |
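Several of the value-level categories in the table above (data format, units, precision, case sensitivity) can be illustrated with a short normalization sketch. The function name, input format and conversion factors here are hypothetical, chosen only to make the categories concrete:

```python
def normalize_length_cm(raw: str) -> float:
    """Normalize a hypothetical length literal like '4,1 CM' to centimeters.

    Touches four heterogeneities from the table: data format (decimal comma
    v period), case sensitivity ('CM' v 'cm'), units (m v cm v mm), and
    precision (a fixed number of decimals for comparison).
    """
    value, unit = raw.split()
    value = float(value.replace(",", "."))        # data format: "4,1" v "4.1"
    unit = unit.lower()                           # case sensitivity
    factors = {"m": 100.0, "cm": 1.0, "mm": 0.1}  # units
    return round(value * factors[unit], 3)        # precision: fix at 3 decimals

assert normalize_length_cm("4.1 cm") == 4.1
assert normalize_length_cm("4,1 CM") == 4.1    # decimal comma, uppercase unit
assert normalize_length_cm("0.041 m") == 4.1   # different unit, same quantity
```

Note that the language- and concept-level categories in the table (homonyms, item equivalence, schematic discrepancy) cannot be handled by mechanical rules like these; they require external knowledge about what the data means.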

A different approach to classifying semantics and integration approaches is taken by Sheth et al. [6] Under their concept, they split semantics into three forms: implicit, formal and powerful. Implicit semantics are those largely present in the data or easily extracted from it; formal semantics, though relatively scarce, occur in the form of ontologies or other description logics; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.'s main point is that first-order logic (FOL) or description logic alone is inadequate to properly capture the needed semantics.

Relevant applications

Besides data interoperability, relevant areas in information technology that depend on reconciling semantic heterogeneities include data mapping, semantic integration, and enterprise information integration, among many others. From the conceptual level down to the actual data, differences in perspective, vocabularies, measures and conventions emerge once any two data sources are brought together. Explicit attention to these semantic heterogeneities is one means of getting the information to integrate or interoperate.

A mere twenty years ago, information technology systems expressed and stored data in a multitude of formats and systems. The Internet and Web protocols have done much to overcome these sources of difference. While the number of categories of semantic heterogeneity is large, the categories are also patterned and can be anticipated and corrected. These patterned sources inform what kind of work must be done to overcome semantic differences where they still reside.

See also



A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.
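The car example above can be written down as a concrete data model; the element names below are only illustrative, not taken from any particular modeling standard:

```python
from dataclasses import dataclass

# A minimal sketch of the data model described above: the element
# representing a car is composed of further elements for its color and
# size, plus a reference that defines its owner.
@dataclass
class Owner:
    name: str

@dataclass
class Car:
    color: str
    size: str
    owner: Owner

car = Car(color="red", size="compact", owner=Owner(name="Alice"))
assert car.owner.name == "Alice"
```

The same model could equally be expressed as a relational schema or an XML DTD; the data model is the abstract structure, not any one encoding of it.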

A federated database system (FDBS) is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.

A heterogeneous database system is an automated system for the integration of heterogeneous, disparate database management systems to present a user with a single, unified query interface.

In computing and data management, data mapping is the process of creating data element mappings between two distinct data models. Data mapping is used as a first step for a wide variety of data integration tasks.

Object–relational impedance mismatch is a set of difficulties that arise when moving data between relational data stores and domain-driven object models. Relational database management systems (RDBMS) are the standard method for storing data in a dedicated database, while object-oriented (OO) programming is the default method for business-centric design in programming languages. The problem lies neither in relational databases nor in OO programming, but in the conceptual difficulty of mapping between the two logical models. Both models can be implemented using database servers, programming languages, design patterns, or other technologies. Issues range from application to enterprise scale, whenever stored relational data is used in domain-driven object models, and vice versa. Object-oriented data stores can trade this problem for other implementation difficulties.
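A minimal sketch of the mismatch, using Python's built-in sqlite3 module: a nested object graph that is natural in OO code must be flattened into relational tables, and a hand-written mapping layer is needed to reassemble it. The table and class names are hypothetical:

```python
import sqlite3
from dataclasses import dataclass

# Domain object: nested structure, natural in OO code.
@dataclass
class Person:
    name: str
    emails: list  # a one-to-many relationship, held inline on the object

# Relational storage: the same data must be split across two flat tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE email (person_id INTEGER, address TEXT);
    INSERT INTO person VALUES (1, 'Ada');
    INSERT INTO email VALUES (1, 'ada@example.org'), (1, 'ada@work.example');
""")

# The "mapping layer": queries and grouping reassemble the object graph.
def load_person(person_id: int) -> Person:
    name = conn.execute("SELECT name FROM person WHERE id = ?",
                        (person_id,)).fetchone()[0]
    emails = [row[0] for row in conn.execute(
        "SELECT address FROM email WHERE person_id = ?", (person_id,))]
    return Person(name=name, emails=emails)

ada = load_person(1)
assert ada.name == "Ada" and len(ada.emails) == 2
```

Object-relational mappers automate this reassembly, but the conceptual gap between the two models remains; the mapper only hides it.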


Integration DEFinition for information modeling (IDEF1X) is a data modeling language for the development of semantic data models. IDEF1X is used to produce a graphical information model which represents the structure and semantics of information within an environment or system.

Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information, documents of all sorts, contacts, search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value.


The ultimate goal of semantic technology is to help machines understand data. To enable the encoding of semantics with the data, well-known technologies are RDF and OWL. These technologies formally represent the meaning involved in information. For example, ontology can describe concepts, relationships between things, and categories of things. These embedded semantics with the data offer significant advantages such as reasoning over data and dealing with heterogeneous data sources.
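The idea of encoding machine-usable semantics with the data can be sketched without any RDF library, by representing statements as subject-predicate-object triples and applying one simple inference rule. The vocabulary below is invented for the sketch and is not actual RDF or OWL syntax:

```python
# Triples: (subject, predicate, object), in the spirit of RDF statements.
triples = {
    ("Cat", "subClassOf", "Animal"),
    ("Animal", "subClassOf", "LivingThing"),
    ("felix", "type", "Cat"),
}

# One kind of reasoning such semantics enable: if x has type C and C is a
# subclass of D, then x also has type D. Repeat until no new facts appear.
def infer_types(facts: set) -> set:
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        new = {(x, "type", d)
               for (x, p, c) in facts if p == "type"
               for (c2, p2, d) in facts if p2 == "subClassOf" and c2 == c}
        if not new <= facts:
            facts |= new
            changed = True
    return facts

inferred = infer_types(triples)
assert ("felix", "type", "LivingThing") in inferred
```

Real semantic technologies express the same pattern with standardized vocabularies and far richer rule sets, which is what makes the inferences shareable across heterogeneous data sources.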

Ontology alignment, or ontology matching, is the process of determining correspondences between concepts in ontologies. A set of correspondences is also called an alignment. The phrase takes on a slightly different meaning, in computer science, cognitive science or philosophy.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, both commercial and scientific. Data integration appears with increasing frequency as the volume and complexity of data, and the need to share existing data, explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed into a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business information.

Semantic interoperability is the ability of computer systems to exchange data with unambiguous, shared meaning. Semantic interoperability is a requirement to enable machine computable logic, inferencing, knowledge discovery, and data federation between information systems.

The concept of the Social Semantic Web subsumes developments in which social interactions on the Web lead to the creation of explicit and semantically rich knowledge representations. The Social Semantic Web can be seen as a Web of collective knowledge systems, which are able to provide useful information based on human contributions and which get better as more people participate. The Social Semantic Web combines technologies, strategies and methodologies from the Semantic Web, social software and the Web 2.0.

Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.

The terms schema matching and mapping are often used interchangeably for a database process. For this article, we differentiate the two as follows: schema matching is the process of identifying that two objects are semantically related, while mapping refers to the transformations between the objects. For example, in the two schemas DB1.Student and DB2.Grad-Student, possible matches would be DB1.Student ≈ DB2.Grad-Student and DB1.SSN = DB2.ID, and a possible transformation or mapping would be DB1.Marks to DB2.Grades.
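Following the DB1/DB2 example above, the match step and the mapping step can be sketched separately. The match assertions come from the example; the marks-to-grades conversion rule is hypothetical:

```python
# Match step: assert which elements of the two schemas are semantically related.
matches = {
    "DB1.Student": "DB2.Grad-Student",
    "DB1.SSN": "DB2.ID",
    "DB1.Marks": "DB2.Grades",
}

# Mapping step: a transformation between matched elements. Here, a
# hypothetical rule converting numeric marks (0-100) to letter grades.
def marks_to_grades(marks: int) -> str:
    for threshold, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if marks >= threshold:
            return grade
    return "F"

assert matches["DB1.Marks"] == "DB2.Grades"   # the elements are matched...
assert marks_to_grades(85) == "B"             # ...and the values are mapped
```

The split matters in practice: matching can sometimes be suggested automatically from names and statistics, while the transformation rules usually require domain knowledge.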

Amit Sheth is a computer scientist at University of South Carolina in Columbia, South Carolina. He is the founding Director of the Artificial Intelligence Institute, and a Professor of Computer Science and Engineering. From 2007 to June 2019, he was the Lexis Nexis Ohio Eminent Scholar, director of the Ohio Center of Excellence in Knowledge-enabled Computing, and a Professor of Computer Science at Wright State University. Sheth's work has been cited by over 48,800 publications. He has an h-index of 117, which puts him among the top 100 computer scientists with the highest h-index. Prior to founding the Kno.e.sis Center, he served as the director of the Large Scale Distributed Information Systems Lab at the University of Georgia in Athens, Georgia.

Business semantics management (BSM) encompasses the technology, methodology, organization, and culture that brings business stakeholders together to collaboratively realize the reconciliation of their heterogeneous metadata; and consequently the application of the derived business semantics patterns to establish semantic alignment between the underlying data structures.

Semantic matching is a technique used in computer science to identify information which is semantically related.

Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results or to answer more fuzzy and wide open questions through pattern matching and digital reasoning.


UMBEL is a logically organized knowledge graph of 34,000 concepts and entity types that can be used in information science for relating information from disparate sources to one another. UMBEL was first released in July 2008; version 1.00 was released in February 2011, and its final release was version 1.50. It was retired at the end of 2019.

Schema-agnostic databases, or vocabulary-independent databases, aim to abstract users from the representation of the data by supporting automatic semantic matching between queries and databases. Schema-agnosticism is the property of a database of accepting a query expressed in the user's own terminology and structure and automatically mapping it to the dataset's vocabulary.

References

  1. Alon Halevy (2005). "Why your data won't mix". Queue. 3 (8).
  2. William Kent (February 27 – March 3, 1989). The many forms of a single fact. Proceedings of the IEEE COMPCON. San Francisco. 13 pp.
  3. Charnyote Pluempitiwiriyawej and Joachim Hammer (September 2000). "A classification scheme for semantic and schematic heterogeneities in XML data sources" (PDF). Gainesville, Florida: University of Florida. Technical Report TR00-004.
  4. M.K. Bergman (6 June 2006). "Sources and classification of semantic heterogeneities". AI3:::Adaptive Information. Retrieved 28 September 2014.
  5. M.K. Bergman (12 August 2014). "Big structure and data interoperability". AI3:::Adaptive Information. Retrieved 28 September 2014.
  6. Amit P. Sheth; Cartic Ramakrishnan; Christopher Thomas (2005). "Semantics for the semantic Web: the implicit, the formal and the powerful". International Journal on Semantic Web and Information Systems. 1 (1): 1–18. doi:10.4018/jswis.2005010101.
