Truth Discovery

Truth discovery (also known as truth finding) is the process of choosing the actual ''true value'' for a data item when different data sources provide conflicting information on it. Several algorithms have been proposed to tackle this problem, ranging from simple methods like majority voting to more complex ones that are able to estimate the trustworthiness of the data sources. Truth discovery problems can be divided into two sub-classes: single-truth and multi-truth. In the first case only one true value is allowed for a data item (e.g. the birthday of a person, the capital city of a country), while in the second case multiple true values are allowed (e.g. the cast of a movie, the authors of a book). Typically, truth discovery is the last step of a data integration pipeline, performed once the schemas of the different data sources have been unified and the records referring to the same data item have been detected.
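
As a minimal illustration of the input to such algorithms (all source names, data items and values below are hypothetical), the conflicting claims can be represented as (source, data item, value) triples:

    # Hypothetical claims produced by a data integration pipeline: three
    # sources disagree on a single-truth data item and overlap on a
    # multi-truth one.
    claims = [
        ("source_1", "birthday_of_A", "1985-03-02"),   # single-truth data item
        ("source_2", "birthday_of_A", "1984-07-13"),
        ("source_3", "birthday_of_A", "1985-03-02"),
        ("source_1", "authors_of_B", frozenset({"Alice", "Bob"})),  # multi-truth
        ("source_2", "authors_of_B", frozenset({"Alice"})),
    ]

    # The goal of truth discovery is to pick, for every data item, the value
    # (or set of values) most likely to be true given the conflicting claims.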


General principles

The abundance of data available on the web makes it more and more likely that different sources provide (partially or completely) different values for the same data item. This, together with the fact that we increasingly rely on data to make important decisions, motivates the need for good truth discovery algorithms.

Many currently available methods rely on a voting strategy to define the true value of a data item. Nevertheless, recent studies have shown that relying only on majority voting can produce wrong results for up to 30% of the data items. The solution to this problem is to assess the trustworthiness of the sources and to give more importance to votes coming from trusted ones. Ideally, supervised learning techniques could be exploited to assign a reliability score to sources after hand-crafted labeling of the provided values; unfortunately, this is not feasible, since the number of labeled examples needed grows with the number of sources, and in many applications the number of sources can be prohibitive.


Single-truth vs multi-truth discovery

Single-truth and multi-truth discovery are two very different problems.

Single-truth discovery is characterized by the following properties:
* only one true value is allowed for each data item;
* different values provided for a given data item oppose each other;
* values and sources can either be correct or erroneous.

In the multi-truth case, instead, the following properties hold:
* the truth is composed of a set of values;
* different values can provide a partial truth;
* claiming one value for a given data item does not imply opposing all the other values;
* the number of true values for each data item is not known ''a priori''.

Multi-truth discovery has unique features that make the problem more complex, and they should be taken into consideration when developing truth-discovery solutions. The examples below illustrate the main differences between the two settings. Assuming that in both examples the truth is provided by source 1, in the single-truth case (first table) sources 2 and 3 oppose the truth and, as a result, provide wrong values. In the multi-truth case (second table), sources 2 and 3 are neither correct nor erroneous: they provide a subset of the true values, and at the same time they do not oppose the truth.
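
For concreteness, the following hypothetical tables illustrate the two cases (the specific values are invented for the sake of the example).

Single-truth case (birthday of person A):

    Source     Claimed birthday
    Source 1   1985-03-02   (true value)
    Source 2   1984-07-13
    Source 3   1982-11-25

Multi-truth case (authors of book B):

    Source     Claimed authors
    Source 1   Alice, Bob, Carol   (complete set of true values)
    Source 2   Alice, Bob
    Source 3   Carol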


Source trustworthiness

The vast majority of truth discovery methods are based on a voting approach: each source votes for a value of a certain data item and, at the end, the value with the highest vote is selected as the true one. In the more sophisticated methods, votes do not have the same weight for all the data sources: more importance is given to votes coming from trusted sources. Source trustworthiness is usually not known ''a priori'' but estimated with an iterative approach. At each step of the truth discovery algorithm, the trustworthiness score of each data source is refined, improving the assessment of the true values, which in turn leads to a better estimation of the trustworthiness of the sources. This process usually ends when all the values reach a convergence state.

Source trustworthiness can be based on different metrics, such as the accuracy of the provided values, the tendency to copy values from other sources, and domain coverage. Detecting copying behaviors is very important: copying allows false values to spread easily, making truth discovery very hard, since many sources would vote for the wrong values. Systems usually decrease the weight of the votes associated with copied values, or do not count them at all.
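
The iterative scheme can be sketched in a few lines of Python. The update rules below (vote of a value = sum of the trust of its sources; trust of a source = fraction of its claims matching the current truth estimate) and the data are illustrative assumptions, not a specific published algorithm:

    claims = {  # data item -> {source: claimed value}
        "capital_of_X": {"s1": "Foo", "s2": "Foo", "s3": "Bar"},
        "birthday_of_A": {"s1": "1985-03-02", "s2": "1984-07-13", "s3": "1985-03-02"},
    }
    sources = {s for by_source in claims.values() for s in by_source}
    trust = {s: 0.5 for s in sources}  # uniform initial trustworthiness

    for _ in range(20):  # iterate until convergence (or a step cap)
        # 1. Vote: weight each claimed value by the trust of its source.
        votes = {item: {} for item in claims}
        for item, by_source in claims.items():
            for source, value in by_source.items():
                votes[item][value] = votes[item].get(value, 0.0) + trust[source]
        # 2. Truth estimate: the highest-voted value of each data item.
        truth = {item: max(v, key=v.get) for item, v in votes.items()}
        # 3. Re-estimate trust from agreement with the current estimate.
        new_trust = {}
        for s in sources:
            claimed = [(item, bs[s]) for item, bs in claims.items() if s in bs]
            new_trust[s] = sum(truth[item] == v for item, v in claimed) / len(claimed)
        if new_trust == trust:  # scores stabilized
            break
        trust = new_trust

    print(truth)  # {'capital_of_X': 'Foo', 'birthday_of_A': '1985-03-02'}
    print(trust)  # s1 ends up fully trusted, s2 and s3 only partially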


Single-truth methods

Most of the currently available truth discovery methods have been designed to work well only in the single-truth case. Below are described some of the characteristics of the most relevant types of single-truth methods, together with how the different systems model source trustworthiness.


Majority voting

Majority voting is the simplest method: the most popular value is selected as the true one. Majority voting is commonly used as a baseline when assessing the performance of more complex methods.
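
A minimal sketch over hypothetical claims for one data item (every source has the same weight and the most frequent value wins):

    from collections import Counter

    claims = {"s1": "Paris", "s2": "Paris", "s3": "Lyon"}

    value, count = Counter(claims.values()).most_common(1)[0]
    print(value, count)  # Paris 2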


Web-link based

These methods estimate source trustworthiness by exploiting a technique similar to the one used to measure the authority of web pages based on web links. The vote assigned to a value is computed as the sum of the trustworthiness scores of the sources that provide that particular value, while the trustworthiness of a source is computed as the sum of the votes assigned to the values that the source provides.


Information-retrieval based

These methods estimate source trustworthiness using similarity measures typically employed in information retrieval. Source trustworthiness is computed as the cosine similarity (or another similarity measure) between the set of values provided by the source and the set of values considered true (either selected in a probabilistic way or obtained from a ground truth).
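
As a sketch, source trust can be computed as the cosine similarity between the values a source provides and the values currently considered true, both one-hot encoded over (data item, value) pairs; the encoding and the data are assumptions of the example:

    import math

    def cosine(a, b):
        # Cosine similarity between two sparse vectors stored as dicts.
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    truth = {("capital_of_X", "Foo"): 1, ("birthday_of_A", "1985-03-02"): 1}
    provided = {("capital_of_X", "Foo"): 1, ("birthday_of_A", "1984-07-13"): 1}

    print(cosine(provided, truth))  # 0.5: one of the two claims matches the truth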


Bayesian based

These methods use Bayesian inference to define the probability of a value being true conditioned on the values provided by all the sources:

P(v \mid \psi(o)) = \frac{P(\psi(o) \mid v) \, P(v)}{P(\psi(o))}

where \textstyle v is a value provided for a data item \textstyle o and \textstyle \psi(o) is the set of the observed values provided by all the sources for that specific data item. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. Other, more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness.
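
A sketch of this formulation under strong simplifying assumptions: sources err independently, each source has a known accuracy, and an erroneous source picks uniformly among the other candidate values. All numbers are hypothetical:

    acc = {"s1": 0.9, "s2": 0.6, "s3": 0.6}            # assumed source accuracies
    observed = {"s1": "Foo", "s2": "Bar", "s3": "Bar"}  # psi(o) for data item o

    domain = sorted(set(observed.values()))

    def likelihood(v):
        # P(psi(o) | v true) under the independence and uniform-error model.
        p = 1.0
        for s, claimed in observed.items():
            p *= acc[s] if claimed == v else (1 - acc[s]) / (len(domain) - 1)
        return p

    scores = {v: likelihood(v) for v in domain}  # a uniform prior cancels out
    z = sum(scores.values())
    print({v: p / z for v, p in scores.items()})
    # {'Bar': 0.2, 'Foo': 0.8}: one accurate source outweighs two weaker ones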


Multi-truth methods

Due to its complexity, less attention has been devoted to the study of multi-truth discovery. Below are described two types of multi-truth methods and their characteristics.


Bayesian based

These methods use Bayesian inference to define the probability of a group of values being true conditioned on the values provided by all the data sources. In this case, since there can be multiple true values for each data item, and sources can provide multiple values for a single data item, it is not possible to consider values individually. An alternative is to consider mappings and relations between the sets of provided values and the sources providing them. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness.
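
A sketch of the multi-truth idea, judging each candidate value independently: claiming a value is a positive observation, omitting it a negative one. Each source is modeled with a recall (probability of claiming a true value) and a false positive rate (probability of claiming a false one). All numbers are hypothetical, and the model is in the spirit of, not identical to, published approaches such as the Latent Truth Model:

    rec = {"s1": 0.9, "s2": 0.5, "s3": 0.4}  # assumed per-source recall
    fpr = {"s1": 0.1, "s2": 0.1, "s3": 0.1}  # assumed false positive rates
    prior = 0.5  # prior probability that a candidate value is true

    # Sets of authors claimed for one book (a multi-truth data item).
    observed = {"s1": {"Alice", "Bob", "Carol"}, "s2": {"Alice", "Bob"}, "s3": {"Carol"}}
    candidates = set().union(*observed.values())

    posterior = {}
    for v in candidates:
        p_true, p_false = prior, 1 - prior
        for s, values in observed.items():
            if v in values:  # source s claims v
                p_true, p_false = p_true * rec[s], p_false * fpr[s]
            else:            # source s omits v
                p_true, p_false = p_true * (1 - rec[s]), p_false * (1 - fpr[s])
        posterior[v] = p_true / (p_true + p_false)

    print({v: round(p, 3) for v, p in posterior.items() if p > 0.5})
    # all three authors are accepted, even though only source 1 lists them all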


Probabilistic Graphical Models based

These methods use probabilistic graphical models to automatically define the set of true values of a given data item, and also to assess source quality, without the need for any supervision.


Applications

Many real-world applications can benefit from the use of truth discovery algorithms. Typical domains of application include: healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction, and knowledge base construction. Truth discovery algorithms could also be used to revolutionize the way in which web pages are ranked in search engines, moving from current methods based on link analysis, like PageRank, to procedures that rank web pages based on the accuracy of the information they provide.{{Cite web |url=https://www.washingtonpost.com/news/energy-environment/wp/2015/03/11/the-huge-implications-of-googles-idea-to-rank-sites-based-on-their-accuracy/ |title=The huge implications of Google's idea to rank sites based on their accuracy |date=2015 |website=www.washingtonpost.com}}


See also

* Data integration
* Information integration
* Data fusion
* Data quality


References
