Truth Discovery

Truth discovery (also known as truth finding) is the process of choosing the actual ''true value'' for a data item when different data sources provide conflicting information on it. Several algorithms have been proposed to tackle this problem, ranging from simple methods like majority voting to more complex ones that are able to estimate the trustworthiness of the data sources. Truth discovery problems can be divided into two sub-classes: single-truth and multi-truth. In the first case only one true value is allowed for a data item (e.g. the birthday of a person, the capital city of a country), while in the second case multiple true values are allowed (e.g. the cast of a movie, the authors of a book). Typically, truth discovery is the last step of a data integration pipeline, performed once the schemas of the different data sources have been unified and the records referring to the same data item have been detected.
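
As a minimal illustration of the input to such algorithms (all source names, data items and values below are hypothetical), the conflicting claims can be represented as (source, data item, value) triples:

    # Hypothetical claims produced by a data integration pipeline: three
    # sources disagree on a single-truth data item and overlap on a
    # multi-truth one.
    claims = [
        ("source_1", "birthday_of_A", "1985-03-02"),   # single-truth data item
        ("source_2", "birthday_of_A", "1984-07-13"),
        ("source_3", "birthday_of_A", "1985-03-02"),
        ("source_1", "authors_of_B", frozenset({"Alice", "Bob"})),  # multi-truth
        ("source_2", "authors_of_B", frozenset({"Alice"})),
    ]

    # The goal of truth discovery is to pick, for every data item, the value
    # (or set of values) most likely to be true given the conflicting claims.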


General principles

The abundance of data available on the web makes it more and more likely that different sources provide (partially or completely) different values for the same data item. This, together with the fact that we increasingly rely on data to make important decisions, motivates the need for good truth discovery algorithms.

Many currently available methods rely on a voting strategy to define the true value of a data item. Nevertheless, recent studies have shown that relying only on majority voting can produce wrong results for up to 30% of the data items. The solution to this problem is to assess the trustworthiness of the sources and to give more importance to votes coming from trusted ones. Ideally, supervised learning techniques could be exploited to assign a reliability score to sources after hand-crafted labeling of the provided values; unfortunately, this is not feasible, since the number of labeled examples needed grows with the number of sources, and in many applications the number of sources can be prohibitive.


Single-truth vs multi-truth discovery

Single-truth and multi-truth discovery are two very different problems.

Single-truth discovery is characterized by the following properties:
* only one true value is allowed for each data item;
* different values provided for a given data item oppose each other;
* values and sources can either be correct or erroneous.

In the multi-truth case, instead, the following properties hold:
* the truth is composed of a set of values;
* different values can provide a partial truth;
* claiming one value for a given data item does not imply opposing all the other values;
* the number of true values for each data item is not known ''a priori''.

Multi-truth discovery has unique features that make the problem more complex, and they should be taken into consideration when developing truth-discovery solutions. The examples below illustrate the main differences between the two settings. Assuming that in both examples the truth is provided by source 1, in the single-truth case (first table) sources 2 and 3 oppose the truth and, as a result, provide wrong values. In the multi-truth case (second table), sources 2 and 3 are neither correct nor erroneous: they provide a subset of the true values, and at the same time they do not oppose the truth.
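
For concreteness, the following hypothetical tables illustrate the two cases (the specific values are invented for the sake of the example).

Single-truth case (birthday of person A):

    Source     Claimed birthday
    Source 1   1985-03-02   (true value)
    Source 2   1984-07-13
    Source 3   1982-11-25

Multi-truth case (authors of book B):

    Source     Claimed authors
    Source 1   Alice, Bob, Carol   (complete set of true values)
    Source 2   Alice, Bob
    Source 3   Carol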


Source trustworthiness

The vast majority of truth discovery methods are based on a voting approach: each source votes for a value of a certain data item and, at the end, the value with the highest vote is selected as the true one. In the more sophisticated methods, votes do not have the same weight for all the data sources: more importance is given to votes coming from trusted sources. Source trustworthiness is usually not known ''a priori'' but estimated with an iterative approach. At each step of the truth discovery algorithm, the trustworthiness score of each data source is refined, improving the assessment of the true values, which in turn leads to a better estimation of the trustworthiness of the sources. This process usually ends when all the values reach a convergence state.

Source trustworthiness can be based on different metrics, such as the accuracy of the provided values, the tendency to copy values from other sources, and domain coverage. Detecting copying behaviors is very important: copying allows false values to spread easily, making truth discovery very hard, since many sources would vote for the wrong values. Systems usually decrease the weight of the votes associated with copied values, or do not count them at all.
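
The iterative scheme can be sketched in a few lines of Python. The update rules below (vote of a value = sum of the trust of its sources; trust of a source = fraction of its claims matching the current truth estimate) and the data are illustrative assumptions, not a specific published algorithm:

    claims = {  # data item -> {source: claimed value}
        "capital_of_X": {"s1": "Foo", "s2": "Foo", "s3": "Bar"},
        "birthday_of_A": {"s1": "1985-03-02", "s2": "1984-07-13", "s3": "1985-03-02"},
    }
    sources = {s for by_source in claims.values() for s in by_source}
    trust = {s: 0.5 for s in sources}  # uniform initial trustworthiness

    for _ in range(20):  # iterate until convergence (or a step cap)
        # 1. Vote: weight each claimed value by the trust of its source.
        votes = {item: {} for item in claims}
        for item, by_source in claims.items():
            for source, value in by_source.items():
                votes[item][value] = votes[item].get(value, 0.0) + trust[source]
        # 2. Truth estimate: the highest-voted value of each data item.
        truth = {item: max(v, key=v.get) for item, v in votes.items()}
        # 3. Re-estimate trust from agreement with the current estimate.
        new_trust = {}
        for s in sources:
            claimed = [(item, bs[s]) for item, bs in claims.items() if s in bs]
            new_trust[s] = sum(truth[item] == v for item, v in claimed) / len(claimed)
        if new_trust == trust:  # scores stabilized
            break
        trust = new_trust

    print(truth)  # {'capital_of_X': 'Foo', 'birthday_of_A': '1985-03-02'}
    print(trust)  # s1 ends up fully trusted, s2 and s3 only partially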


Single-truth methods

Most of the currently available truth discovery methods have been designed to work well only in the single-truth case. Below are described some of the characteristics of the most relevant types of single-truth methods, together with how the different systems model source trustworthiness.


Majority voting

Majority voting is the simplest method: the most popular value is selected as the true one. Majority voting is commonly used as a baseline when assessing the performance of more complex methods.
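
A minimal sketch over hypothetical claims for one data item (every source has the same weight and the most frequent value wins):

    from collections import Counter

    claims = {"s1": "Paris", "s2": "Paris", "s3": "Lyon"}

    value, count = Counter(claims.values()).most_common(1)[0]
    print(value, count)  # Paris 2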


Web-link based

These methods estimate source trustworthiness by exploiting a technique similar to the one used to measure the authority of web pages based on web links. The vote assigned to a value is computed as the sum of the trustworthiness scores of the sources that provide that particular value, while the trustworthiness of a source is computed as the sum of the votes assigned to the values that the source provides.


Information-retrieval based

These methods estimate source trustworthiness using similarity measures typically employed in information retrieval. Source trustworthiness is computed as the cosine similarity (or another similarity measure) between the set of values provided by the source and the set of values considered true (either selected in a probabilistic way or obtained from a ground truth).
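
As a sketch, source trust can be computed as the cosine similarity between the values a source provides and the values currently considered true, both one-hot encoded over (data item, value) pairs; the encoding and the data are assumptions of the example:

    import math

    def cosine(a, b):
        # Cosine similarity between two sparse vectors stored as dicts.
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    truth = {("capital_of_X", "Foo"): 1, ("birthday_of_A", "1985-03-02"): 1}
    provided = {("capital_of_X", "Foo"): 1, ("birthday_of_A", "1984-07-13"): 1}

    print(cosine(provided, truth))  # 0.5: one of the two claims matches the truth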


Bayesian based

These methods use Bayesian inference to define the probability of a value being true conditioned on the values provided by all the sources:

P(v \mid \psi(o)) = \frac{P(\psi(o) \mid v) \, P(v)}{P(\psi(o))}

where \textstyle v is a value provided for a data item \textstyle o and \textstyle \psi(o) is the set of the observed values provided by all the sources for that specific data item. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. Other, more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness.
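
A sketch of this formulation under strong simplifying assumptions: sources err independently, each source has a known accuracy, and an erroneous source picks uniformly among the other candidate values. All numbers are hypothetical:

    acc = {"s1": 0.9, "s2": 0.6, "s3": 0.6}            # assumed source accuracies
    observed = {"s1": "Foo", "s2": "Bar", "s3": "Bar"}  # psi(o) for data item o

    domain = sorted(set(observed.values()))

    def likelihood(v):
        # P(psi(o) | v true) under the independence and uniform-error model.
        p = 1.0
        for s, claimed in observed.items():
            p *= acc[s] if claimed == v else (1 - acc[s]) / (len(domain) - 1)
        return p

    scores = {v: likelihood(v) for v in domain}  # a uniform prior cancels out
    z = sum(scores.values())
    print({v: p / z for v, p in scores.items()})
    # {'Bar': 0.2, 'Foo': 0.8}: one accurate source outweighs two weaker ones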


Multi-truth methods

Due to its complexity, less attention has been devoted to the study of multi-truth discovery. Below are described two types of multi-truth methods and their characteristics.


Bayesian based

These methods use Bayesian inference to define the probability of a group of values being true conditioned on the values provided by all the data sources. In this case, since there can be multiple true values for each data item, and sources can provide multiple values for a single data item, it is not possible to consider values individually. An alternative is to consider mappings and relations between the sets of provided values and the sources providing them. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness.
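
A sketch of the multi-truth idea, judging each candidate value independently: claiming a value is a positive observation, omitting it a negative one. Each source is modeled with a recall (probability of claiming a true value) and a false positive rate (probability of claiming a false one). All numbers are hypothetical, and the model is in the spirit of, not identical to, published approaches such as the Latent Truth Model:

    rec = {"s1": 0.9, "s2": 0.5, "s3": 0.4}  # assumed per-source recall
    fpr = {"s1": 0.1, "s2": 0.1, "s3": 0.1}  # assumed false positive rates
    prior = 0.5  # prior probability that a candidate value is true

    # Sets of authors claimed for one book (a multi-truth data item).
    observed = {"s1": {"Alice", "Bob", "Carol"}, "s2": {"Alice", "Bob"}, "s3": {"Carol"}}
    candidates = set().union(*observed.values())

    posterior = {}
    for v in candidates:
        p_true, p_false = prior, 1 - prior
        for s, values in observed.items():
            if v in values:  # source s claims v
                p_true, p_false = p_true * rec[s], p_false * fpr[s]
            else:            # source s omits v
                p_true, p_false = p_true * (1 - rec[s]), p_false * (1 - fpr[s])
        posterior[v] = p_true / (p_true + p_false)

    print({v: round(p, 3) for v, p in posterior.items() if p > 0.5})
    # all three authors are accepted, even though only source 1 lists them all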


Probabilistic Graphical Models based

These methods use probabilistic graphical models to automatically define the set of true values of a given data item, and also to assess source quality, without the need for any supervision.


Applications

Many real-world applications can benefit from the use of truth discovery algorithms. Typical domains of application include: healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction, and knowledge base construction. Truth discovery algorithms could also be used to revolutionize the way in which web pages are ranked in search engines, moving from current methods based on link analysis, like PageRank, to procedures that rank web pages based on the accuracy of the information they provide.{{Cite web |url=https://www.washingtonpost.com/news/energy-environment/wp/2015/03/11/the-huge-implications-of-googles-idea-to-rank-sites-based-on-their-accuracy/ |title=The huge implications of Google's idea to rank sites based on their accuracy |date=2015 |website=www.washingtonpost.com}}


See also

* Data integration
* Information integration
* Data fusion
* Data quality


References
