Truth discovery

Truth discovery (also known as truth finding) is the process of choosing the actual true value for a data item when different data sources provide conflicting information on it.

Several algorithms have been proposed to tackle this problem, ranging from simple methods such as majority voting to more complex ones that estimate the trustworthiness of the data sources. [1]

Truth discovery problems can be divided into two sub-classes: single-truth and multi-truth. In the first case only one true value is allowed for a data item (e.g. the birthday of a person or the capital city of a country), while in the second case multiple true values are allowed (e.g. the cast of a movie or the authors of a book). [2] [3]

Typically, truth discovery is the last step of a data integration pipeline, after the schemas of the different data sources have been unified and the records referring to the same data item have been detected. [4]

General principles

The abundance of data available on the web makes it increasingly likely that different sources provide (partially or completely) conflicting values for the same data item. This, together with our growing reliance on data to drive important decisions, motivates the need for good truth discovery algorithms. [5]

Many currently available methods rely on a voting strategy to define the true value of a data item. Nevertheless, recent studies have shown that relying on majority voting alone can yield wrong results for up to 30% of the data items. [5]

A common solution to this problem is to assess the trustworthiness of the sources and give more weight to votes coming from trusted sources. [4] [5]

Ideally, supervised learning techniques could be exploited to assign a reliability score to sources after hand-crafted labeling of the provided values; unfortunately, this is not feasible because the number of labeled examples needed would be proportional to the number of sources, and in many applications the number of sources can be prohibitively large. [2] [6]

Single-truth vs multi-truth discovery

Single-truth and multi-truth discovery are two very different problems. [2]

Single-truth discovery is characterized by the fact that each data item admits exactly one true value: any two different values provided for the same data item are therefore in direct conflict, and a source providing a value other than the truth is necessarily wrong.

In the multi-truth case, by contrast, a data item may have several true values, their number is generally not known in advance, and a source that provides only some of them supplies a partial truth rather than an error.

These unique features make the multi-truth problem more complex and should be taken into consideration when developing truth-discovery solutions. [2]

The examples below illustrate the main differences between the two settings. Knowing that in both examples the truth is provided by source 1, in the single-truth case (first table) sources 2 and 3 oppose the truth and as a result provide wrong values. In the multi-truth case (second table), on the other hand, sources 2 and 3 are neither correct nor erroneous: they provide a subset of the true values and at the same time do not oppose the truth.

When was George Washington born?

Source   Name                 Birthdate     Assessment
S1       George Washington    1732-02-22    Correct
S2       George Washington    1738-09-17    Erroneous
S3       George Washington    1734-10-23    Erroneous

Who wrote “The Nature of Space and Time”?

Source   Title                           Authors                          Assessment
S1       The Nature of Space and Time    Stephen Hawking, Roger Penrose   Correct
S2       The Nature of Space and Time    Stephen Hawking                  Partial truth
S3       The Nature of Space and Time    Roger Penrose                    Partial truth
S4       The Nature of Space and Time    J. K. Rowling                    Erroneous

Source trustworthiness

The vast majority of truth discovery methods are based on a voting approach: each source votes for a value of a certain data item and, at the end, the value with the highest vote is selected as the true one. In the more sophisticated methods votes do not all have the same weight: more importance is given to votes coming from trusted sources. [5]

Source trustworthiness is usually not known a priori but estimated with an iterative approach. At each step of the truth discovery algorithm the trustworthiness score of each data source is refined, improving the assessment of the true values, which in turn leads to a better estimation of the trustworthiness of the sources. This process usually ends when all the values reach convergence. [5]

Source trustworthiness can be based on different metrics, such as the accuracy of the provided values, copying behaviour with respect to other sources, and domain coverage. [1]
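
The iterative scheme described above can be sketched as follows, here using accuracy against the current truth estimates as the trust metric. The input layout (a hypothetical claims[source][item] = value mapping), the fixed iteration count and the uniform initial trust are assumptions of this sketch, not part of any specific published method.

```python
from collections import defaultdict

def iterative_truth_discovery(claims, iterations=20):
    """claims[source][item] = value provided by that source (hypothetical layout)."""
    sources = list(claims)
    trust = {s: 1.0 for s in sources}          # start by trusting every source equally
    truths = {}
    for _ in range(iterations):
        # 1) weighted voting: the value with the largest total trust wins each item
        votes = defaultdict(lambda: defaultdict(float))
        for s in sources:
            for item, value in claims[s].items():
                votes[item][value] += trust[s]
        truths = {item: max(vals, key=vals.get) for item, vals in votes.items()}
        # 2) re-estimate trust as the fraction of a source's claims that match the
        #    current truth estimates (accuracy-based trustworthiness)
        for s in sources:
            matches = sum(1 for item, v in claims[s].items() if truths.get(item) == v)
            trust[s] = matches / max(len(claims[s]), 1)
    return truths, trust
```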

Detecting copying behaviors is very important: copying allows false values to spread easily, making truth discovery very hard, since many sources would end up voting for the wrong values. Systems usually decrease the weight of votes associated with copied values, or do not count them at all. [7]
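
As a hedged illustration of the down-weighting idea, one could scale a source's vote by its estimated probability of having copied the value; the copy probabilities are assumed to come from a separate copy-detection step not shown here, and all names are hypothetical.

```python
def discounted_vote_weight(source_trust, copy_probability):
    """Weight of a single vote: full source trust when the value is believed to be
    independently provided, scaled down (possibly to zero) when it is likely copied."""
    return source_trust * (1.0 - copy_probability)
```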

Single-truth methods

Most of the currently available truth discovery methods have been designed to work well only in the single-truth case. [1] [3]

Below are some characteristics of the most relevant families of single-truth methods, together with the way different systems model source trustworthiness. [5]

Majority voting

Majority voting is the simplest method: the most popular value is selected as the true one. It is commonly used as a baseline when assessing the performance of more complex methods.
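
A minimal majority-voting baseline, assuming the same hypothetical claims[source][item] = value layout as above:

```python
from collections import Counter, defaultdict

def majority_vote(claims):
    ballots = defaultdict(list)
    for source_claims in claims.values():
        for item, value in source_claims.items():
            ballots[item].append(value)
    # the most frequent value for each data item is taken as the truth
    return {item: Counter(values).most_common(1)[0][0]
            for item, values in ballots.items()}
```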

Web-link based

These methods estimate source trustworthiness by exploiting a technique similar to the one used to measure the authority of web pages based on web links. The vote assigned to a value is computed as the sum of the trustworthiness of the sources that provide that particular value, while the trustworthiness of a source is computed as the sum of the votes assigned to the values that the source provides. [5] [8]
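
A sketch of this mutual-reinforcement update (in the spirit of Hubs and Authorities) might look as follows; the per-round normalisation, which keeps the scores from growing without bound, is a choice of this sketch rather than something prescribed by the cited methods.

```python
from collections import defaultdict

def sums_truth_discovery(claims, iterations=20):
    """claims[source][item] = value (hypothetical layout)."""
    trust = {s: 1.0 for s in claims}
    votes = {}
    for _ in range(iterations):
        # vote of a value = sum of the trustworthiness of the sources providing it
        votes = defaultdict(float)
        for s, source_claims in claims.items():
            for item, value in source_claims.items():
                votes[(item, value)] += trust[s]
        # trustworthiness of a source = sum of the votes of the values it provides
        trust = {s: sum(votes[(item, value)] for item, value in source_claims.items())
                 for s, source_claims in claims.items()}
        # normalise both score vectors to keep the iteration stable
        max_vote = max(votes.values()) or 1.0
        votes = {k: v / max_vote for k, v in votes.items()}
        max_trust = max(trust.values()) or 1.0
        trust = {s: t / max_trust for s, t in trust.items()}
    # pick, for every item, the value with the highest vote
    best = {}
    for (item, value), v in votes.items():
        if item not in best or v > votes[(item, best[item])]:
            best[item] = value
    return best, trust
```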

Information-retrieval based

These methods estimate source trustworthiness using similarity measures typically used in information retrieval. Source trustworthiness is computed as the cosine similarity (or other similarity measures) between the set of values provided by the source and the set of values considered true (either selected in a probabilistic way or obtained from a ground truth). [5] [9]
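
For instance, if the values provided by a source and the values currently believed true are represented as 0/1 indicator vectors over (data item, value) pairs, the cosine similarity reduces to the following sketch (the representation is an assumption made here for illustration):

```python
import math

def cosine_trust(source_claims, believed_truths):
    """source_claims / believed_truths: dicts mapping data item -> value."""
    provided = set(source_claims.items())
    true_set = set(believed_truths.items())
    if not provided or not true_set:
        return 0.0
    # with 0/1 indicator vectors the dot product is the size of the intersection
    return len(provided & true_set) / math.sqrt(len(provided) * len(true_set))
```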

Bayesian based

These methods use Bayesian inference to define the probability of a value being true conditioned on the values provided by all the sources.

P(v | V_d) ∝ P(V_d | v) · P(v)

where v is a value provided for a data item d and V_d is the set of the observed values provided by all the sources for that specific data item.

The trustworthiness of a source is then computed based on the accuracy of the values that it provides. [7] [10] Other, more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness. [7]
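
A hedged sketch of this kind of Bayesian scoring for a single data item is given below. It assumes each source s has an estimated accuracy A(s), that sources err independently, and that a wrong source picks one of n alternative false values uniformly at random; the uniform-error model and the parameter n are simplifications of this sketch, not the exact models of the cited works.

```python
def bayesian_posteriors(observations, accuracy, n_false_values=10, prior=None):
    """observations: dict source -> value claimed for one data item.
    accuracy: dict source -> estimated accuracy A(s) in (0, 1).
    Returns the posterior probability of each candidate value being true."""
    candidates = set(observations.values())
    prior = prior or {v: 1.0 / len(candidates) for v in candidates}
    scores = {}
    for v in candidates:
        likelihood = 1.0
        for s, claimed in observations.items():
            if claimed == v:
                likelihood *= accuracy[s]                      # source supports v
            else:
                likelihood *= (1.0 - accuracy[s]) / n_false_values  # source errs
        scores[v] = prior[v] * likelihood
    total = sum(scores.values())
    return {v: score / total for v, score in scores.items()}
```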

Multi-truth methods

Due to its complexity, less attention has been devoted to the study of multi-truth discovery. [2] [3]

Below are two families of multi-truth methods and their characteristics.

Bayesian based

These methods use Bayesian inference to define the probability of a group of values being true conditioned on the values provided by all the data sources. In this case, since there could be multiple true values for each data item, and sources can provide multiple values for a single data item, it is not possible to consider values individually. An alternative is to consider mappings and relations between sets of provided values and the sources providing them. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. [2]
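
The following is a strongly simplified illustration of the multi-truth setting, not the model of [2]: every candidate value is scored independently, a source is read as voting for the values it provides, and an omission counts only as weak evidence against a value. The per-source recall and specificity parameters and the independence assumption are inventions of this sketch.

```python
def multi_truth_scores(observations, recall, specificity, prior=0.5):
    """observations: dict source -> set of values provided for one data item.
    recall[s]: probability that source s reports a value that is actually true.
    specificity[s]: probability that source s omits a value that is actually false."""
    candidates = set().union(*observations.values())
    posteriors = {}
    for v in candidates:
        p_true, p_false = prior, 1.0 - prior
        for s, provided in observations.items():
            if v in provided:
                p_true *= recall[s]
                p_false *= 1.0 - specificity[s]
            else:
                p_true *= 1.0 - recall[s]
                p_false *= specificity[s]
        posteriors[v] = p_true / (p_true + p_false)
    # values whose posterior exceeds 0.5 can then be kept as true
    return posteriors
```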

More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness. [2] [3]

Probabilistic Graphical Models based

These methods use probabilistic graphical models to automatically define the set of true values of a given data item and to assess source quality without the need for any supervision. [11]

Applications

Many real-world applications can benefit from the use of truth discovery algorithms. Typical domains of application include: healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction and knowledge base construction. [1]

Truth discovery algorithms could also be used to revolutionize the way web pages are ranked in search engines, moving from current methods based on link analysis, such as PageRank, to procedures that rank web pages based on the accuracy of the information they provide. [12]

References

  1. Li, Yaliang; Gao, Jing; Meng, Chuishi; Li, Qi; Su, Lu; Zhao, Bo; Fan, Wei; Han, Jiawei (2016-02-25). "A Survey on Truth Discovery". ACM SIGKDD Explorations Newsletter. 17 (2): 1–16. doi:10.1145/2897350.2897352. S2CID 9060471.
  2. Wang, Xianzhi; Sheng, Quan Z.; Fang, Xiu Susie; Yao, Lina; Xu, Xiaofei; Li, Xue (2015). "An Integrated Bayesian Approach for Effective Multi-Truth Discovery". Proceedings of the 24th ACM International Conference on Information and Knowledge Management. Melbourne, Australia: ACM Press. pp. 493–502. doi:10.1145/2806416.2806443. hdl:2440/110033. ISBN 9781450337946. S2CID 16207808.
  3. Lin, Xueling; Chen, Lei (2018). "Domain-aware Multi-truth Discovery from Conflicting Sources". Proceedings of the VLDB Endowment. 11 (5): 635–647. doi:10.1145/3187009.3177739.
  4. Dong, Xin Luna; Srivastava, Divesh (2015-02-15). "Big Data Integration". Synthesis Lectures on Data Management. 7 (1): 1–198. doi:10.2200/S00578ED1V01Y201404DTM040. ISSN 2153-5418.
  5. Li, Xian; Dong, Xin Luna; Lyons, Kenneth; Meng, Weiyi; Srivastava, Divesh (2012-12-01). "Truth finding on the deep web: is the problem solved?". Proceedings of the VLDB Endowment. 6 (2): 97–108. arXiv:1503.00303. doi:10.14778/2535568.2448943. S2CID 3133027.
  6. Ng, Andrew Y.; Jordan, Michael I. (2001). "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes". Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic: 841–848.
  7. Dong, Xin Luna; Berti-Equille, Laure; Srivastava, Divesh (2009-08-01). "Integrating conflicting data: the role of source dependence". Proceedings of the VLDB Endowment. 2 (1): 550–561. doi:10.14778/1687627.1687690. S2CID 9664056.
  8. Kleinberg, Jon M. (1999-09-01). "Authoritative sources in a hyperlinked environment". Journal of the ACM. 46 (5): 604–632. doi:10.1145/324133.324140. S2CID 221584113.
  9. Galland, Alban; Abiteboul, Serge; Marian, Amélie; Senellart, Pierre (2010). "Corroborating information from disagreeing views". Proceedings of the Third ACM International Conference on Web Search and Data Mining. New York, New York, USA: ACM Press. pp. 131–140. doi:10.1145/1718487.1718504. ISBN 9781605588896. S2CID 1761360.
  10. Yin, Xiaoxin; Han, Jiawei; Yu, P. S. (2008). "Truth Discovery with Multiple Conflicting Information Providers on the Web". IEEE Transactions on Knowledge and Data Engineering. 20 (6): 796–808. doi:10.1109/TKDE.2007.190745. ISSN 1041-4347.
  11. Zhao, Bo; Rubinstein, Benjamin I. P.; Gemmell, Jim; Han, Jiawei (2012-02-01). "A Bayesian approach to discovering truth from conflicting sources for data integration". Proceedings of the VLDB Endowment. 5 (6): 550–561. arXiv:1203.0058. doi:10.14778/2168651.2168656. S2CID 8837716.
  12. "The huge implications of Google's idea to rank sites based on their accuracy". www.washingtonpost.com. 2015.