Bradley–Terry Model

The Bradley–Terry model is a probability model that can predict the outcome of a paired comparison. Given a pair of individuals i and j drawn from some population, it estimates the probability that the pairwise comparison i > j turns out true, as

:P(i > j) = \frac{p_i}{p_i + p_j}

where p_i is a positive real-valued score assigned to individual i. The comparison i > j can be read as "i is preferred to j", "i ranks higher than j", or "i beats j", depending on the application. For example, p_i may represent the skill of a team in a sports tournament, estimated from the number of times i has won a match; P(i > j) then represents the probability that i will win a match against j. Another example used to explain the model's purpose is that of scoring products in a certain category by quality. While it is hard for a person to draft a direct ranking of (many) brands of wine, it may be feasible to compare a sample of pairs of wines and say, for each pair, which one is better. The Bradley–Terry model can then be used to derive a full ranking from such pairwise comparisons.
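As a minimal illustration of the formula above (a sketch using hypothetical scores, not values from the text):

def win_probability(p_i, p_j):
    """P(i beats j) under the Bradley-Terry model, given positive scores."""
    return p_i / (p_i + p_j)

# A team with score 2.0 is expected to beat a team with score 1.0
# about two times out of three.
print(win_probability(2.0, 1.0))  # 0.666...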


History and applications

The model is named after R. A. Bradley and M. E. Terry, who presented it in 1952, although it had already been studied by Zermelo in the 1920s. Real-world applications of the model include estimating the influence of statistical journals, or ranking documents by relevance in machine-learned search engines. In the latter application, P(i > j) may reflect that document i is more relevant to the user's query than document j, so i should be displayed earlier in the results list. The individual scores p_i then express the relevance of the documents, and can be estimated from the frequency with which users click particular "hits" when presented with a result list.


Definition

The Bradley–Terry model can be parametrized in various ways. One way to do so is to pick a single positive score p_i per individual, leading to a model with n parameters p_1, ..., p_n. Another variant, in fact the version considered by Bradley and Terry, uses exponential score functions p_i = e^{\beta_i}, so that

:P(i > j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}

or, using the logit (and disallowing ties),

:\operatorname{logit}(P(i > j)) = \log\left(\frac{P(i > j)}{P(j > i)}\right) = \log\left(\frac{e^{\beta_i}}{e^{\beta_j}}\right) = \beta_i - \beta_j

reducing the model to logistic regression on pairs of individuals.
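The agreement between the exponential-score form and the logistic-regression form can be checked numerically. The following sketch uses illustrative beta values that are not taken from the text:

import math

def win_probability(beta_i, beta_j):
    """P(i beats j) with exponential scores p_i = exp(beta_i)."""
    return math.exp(beta_i) / (math.exp(beta_i) + math.exp(beta_j))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative (made-up) strength parameters.
beta_i, beta_j = 0.8, -0.3

# The two formulations agree: P(i > j) is the logistic function of beta_i - beta_j.
assert abs(win_probability(beta_i, beta_j) - logistic(beta_i - beta_j)) < 1e-12
print(win_probability(beta_i, beta_j))  # ~0.75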


Estimating the parameters

The following algorithm computes the parameters p_i of the basic version of the model from a sample of observations. Formally, it computes a maximum likelihood estimate, i.e., it maximizes the likelihood of the observed data. The algorithm dates back to the work of Zermelo. The observations required are the outcomes of previous comparisons, for example, pairs (i, j) where i beats j. Summarizing these outcomes as w_{ij}, the number of times i has beaten j, and writing p_{ij} = P(i > j) = p_i / (p_i + p_j), we obtain the log-likelihood of the parameter vector \mathbf{p} = (p_1, \ldots, p_n) as

:L(\mathbf{p}) = \sum_{i=1}^n \sum_{j=1}^n \ln p_{ij}^{w_{ij}} = \sum_{i=1}^n \sum_{j=1}^n \ln\left(\frac{p_i}{p_i + p_j}\right)^{w_{ij}} = \sum_{i=1}^n \sum_{j=1}^n w_{ij} \ln\left(\frac{p_i}{p_i + p_j}\right) = \sum_{i=1}^n \sum_{j=1}^n \left[ w_{ij} \ln p_i - w_{ij} \ln(p_i + p_j) \right]

Denote the number of comparisons "won" by i as W_i = \sum_j w_{ij}. Starting from an arbitrary vector \mathbf{p}, the algorithm iteratively performs the update

:p'_i = W_i \left( \sum_{j \neq i} \frac{w_{ij} + w_{ji}}{p_i + p_j} \right)^{-1}

for all i. After computing all of the new parameters, they should be renormalized,

:p_i \leftarrow \frac{p'_i}{\sum_{j=1}^n p'_j}.

This estimation procedure improves the log-likelihood in every iteration, and eventually converges to a unique maximum.
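The update rule translates directly into code. The sketch below is illustrative (the function name and structure are our own, not a standard library routine), and it assumes every individual takes part in at least one comparison:

import numpy as np

def bradley_terry_mle(wins, n_iter=100):
    """Iterative maximum-likelihood estimation of Bradley-Terry scores.

    wins[i][j] is the number of times individual i beat individual j.
    Returns a score vector normalized to sum to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    p = np.ones(n)                    # arbitrary starting point
    W = wins.sum(axis=1)              # total number of wins of each individual
    games = wins + wins.T             # games[i][j]: comparisons between i and j
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p_new[i] = W[i] / denom
        p = p_new / p_new.sum()       # renormalize after each full sweep
    return p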


Worked example of the iterated procedure

Suppose there are 4 teams who have played a total of 22 games among themselves. Each team's wins are given in the rows of the table below and the opponents are given as the columns. For example, Team A has beaten Team B twice and lost to Team B three times; it has not played Team C at all; and it has won once and lost four times against Team D.

            A     B     C     D
  A wins    –     2     0     1
  B wins    3     –     5     0
  C wins    0     3     –     1
  D wins    4     0     3     –

We would like to compute the relative strengths of the teams; that is, we want to compute one parameter per team, with higher parameters indicating greater prowess. We initialize the 4 entries in the parameter vector arbitrarily, for example, assigning the value 1 to each team: \mathbf{p} = (1, 1, 1, 1). The first pass of the update then gives

:p_1 = \frac{3}{\frac{5}{1+1} + \frac{0}{1+1} + \frac{5}{1+1}} = \frac{3}{5} = 0.6

:p_2 = \frac{8}{\frac{5}{1+1} + \frac{8}{1+1} + \frac{0}{1+1}} = \frac{8}{6.5} = 1.231

:p_3 = \frac{4}{\frac{0}{1+1} + \frac{8}{1+1} + \frac{4}{1+1}} = \frac{4}{6} = 0.667

:p_4 = \frac{7}{\frac{5}{1+1} + \frac{0}{1+1} + \frac{4}{1+1}} = \frac{7}{4.5} = 1.556

Then, we normalize all the parameters by dividing them by their sum 0.6 + 1.231 + 0.667 + 1.556 = 4.053 to get the estimated parameters \mathbf{p} = (0.148, 0.304, 0.164, 0.384). To get better estimates, we can repeat the process using the new values. For example,

:p_1 = \frac{3}{\frac{5}{0.148 + 0.304} + \frac{0}{0.148 + 0.164} + \frac{5}{0.148 + 0.384}} = \frac{3}{20.46} = 0.147

Repeating this for the remaining parameters and normalizing gives approximately \mathbf{p} = (0.145, 0.280, 0.162, 0.413). Repeating this process a total of 20 times shows rapid convergence to approximately \mathbf{p} = (0.139, 0.226, 0.143, 0.492). This shows that Team D is the strongest and Team B is the second strongest, while Team A and Team C are nearly equal in strength, but below Teams B and D. The Bradley–Terry model thus lets us infer the relationship between all 4 teams, even though not every pair of teams has played each other.
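For reference, this worked example can be reproduced with the bradley_terry_mle sketch given in the previous section (assuming that function is in scope):

import numpy as np

wins = np.array([
    [0, 2, 0, 1],   # Team A's wins against A, B, C, D
    [3, 0, 5, 0],   # Team B
    [0, 3, 0, 1],   # Team C
    [4, 0, 3, 0],   # Team D
])
print(bradley_terry_mle(wins, n_iter=20).round(3))
# approximately [0.139 0.226 0.143 0.492]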


See also

* Ordinal regression
* Rasch model
* Scale (social sciences)
* Elo rating system
* Thurstonian model

