In statistics, Somers’ ''D'', sometimes incorrectly referred to as Somer’s ''D'', is a measure of ordinal association between two possibly dependent random variables X and Y. Somers’ ''D'' takes values between -1, when all pairs of the variables disagree, and 1, when all pairs of the variables agree. Somers’ ''D'' is named after Robert H. Somers, who proposed it in 1962. Somers’ ''D'' plays a central role in rank statistics and is the parameter behind many nonparametric methods. It is also used as a quality measure of binary choice or ordinal regression (e.g., logistic regression) and credit scoring models.

Somers’ ''D'' for sample

We say that two pairs (x_i,y_i) and (x_j,y_j) are concordant if the ranks of both elements agree, that is, if x_i>x_j and y_i>y_j or if x_i<x_j and y_i<y_j. We say that two pairs (x_i,y_i) and (x_j,y_j) are discordant if the ranks of both elements disagree, that is, if x_i>x_j and y_i<y_j or if x_i<x_j and y_i>y_j. If x_i=x_j or y_i=y_j, the pair is neither concordant nor discordant.

Let (x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n) be a set of observations of two possibly dependent random vectors X and Y. Define the Kendall tau rank correlation coefficient \tau as

\tau=\frac{N_C-N_D}{n(n-1)/2},

where N_C is the number of concordant pairs and N_D is the number of discordant pairs. Somers’ ''D'' of Y with respect to X is defined as

D_{YX}=\tau(X,Y)/\tau(X,X).

Note that Kendall's tau is symmetric in X and Y, whereas Somers’ ''D'' is asymmetric in X and Y. As \tau(X,X) quantifies the number of pairs with unequal X values, Somers’ ''D'' is the difference between the number of concordant and discordant pairs, divided by the number of pairs whose X values are unequal.
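The sample definition can be checked by brute force over all pairs. The following is a minimal sketch rather than a reference implementation: the function name `somers_d` and the toy data are illustrative assumptions, and the quadratic loop over pairs is chosen for clarity, not speed.

```python
from itertools import combinations


def somers_d(x, y):
    """Return (N_C, N_D, tau, D_YX) for paired observations x and y.

    tau uses the n(n-1)/2 denominator; D_YX divides the same numerator
    by the number of pairs whose x values differ.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    n_c = n_d = n_pairs = n_x_unequal = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        n_pairs += 1
        if xi != xj:
            n_x_unequal += 1
        s = (xi - xj) * (yi - yj)
        if s > 0:        # both differences have the same sign
            n_c += 1
        elif s < 0:      # differences have opposite signs
            n_d += 1
        # s == 0: tied on x or on y, neither concordant nor discordant

    if n_x_unequal == 0:
        raise ValueError("D_YX is undefined when all x values are equal")

    tau = (n_c - n_d) / n_pairs
    d_yx = (n_c - n_d) / n_x_unequal
    return n_c, n_d, tau, d_yx


# Hypothetical toy data: 6 pairs in total, 5 of them with unequal x.
n_c, n_d, tau, d_yx = somers_d([1, 2, 2, 3], [1, 3, 2, 2])
print(n_c, n_d, tau, d_yx)  # 3 concordant, 1 discordant, D_YX = 2/5
```

Swapping the arguments gives Somers’ ''D'' of X with respect to Y, which generally differs because the normalization then counts pairs with unequal y values instead.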

Somers’ ''D'' for distribution

Let two independent bivariate random variables (X_1, Y_1) and (X_2, Y_2) have the same probability distribution \operatorname{P}_{XY}. Again, Somers’ ''D'', which measures ordinal association of random variables X and Y in \operatorname{P}_{XY}, can be defined through Kendall's tau

\begin{align}
\tau(X,Y) &= \operatorname{E}\Bigl(\sgn(X_1-X_2)\sgn(Y_1-Y_2)\Bigr) \\
  &= \operatorname{P}\Bigl(\sgn(X_1-X_2)\sgn(Y_1-Y_2)=1\Bigr) -
     \operatorname{P}\Bigl(\sgn(X_1-X_2)\sgn(Y_1-Y_2)=-1\Bigr),
\end{align}

or the difference between the probabilities of concordance and discordance. Somers’ ''D'' of Y with respect to X is defined as

D_{YX}=\tau(X,Y)/\tau(X,X).

Thus, D_{YX} is the difference between the two corresponding probabilities, conditional on the X values not being equal. If X has a continuous probability distribution, then \tau(X,X)=1 and Kendall's tau and Somers’ ''D'' coincide. Somers’ ''D'' normalizes Kendall's tau for possible mass points of variable X. If X and Y are both binary with values 0 and 1, then Somers’ ''D'' is the difference between two probabilities:

D_{YX}=\operatorname{P}(Y=1 \mid X=1)-\operatorname{P}(Y=1\mid X=0).
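The binary special case can be verified numerically from a joint distribution. The sketch below is illustrative only: the helper name `somers_d_from_joint` and the probabilities in `joint` are assumptions made up for the check, not values from any source. It evaluates \tau(X,Y) and \tau(X,X) as expectations over two independent draws and compares the ratio with the difference of conditional probabilities.

```python
from itertools import product


def sgn(v):
    """Sign function returning -1, 0, or 1."""
    return (v > 0) - (v < 0)


def somers_d_from_joint(joint):
    """Compute tau(X,Y)/tau(X,X) for a discrete joint distribution.

    joint maps (x, y) to its probability; the two draws are independent.
    """
    tau_xy = 0.0
    tau_xx = 0.0
    for ((x1, y1), p1), ((x2, y2), p2) in product(joint.items(), repeat=2):
        weight = p1 * p2
        tau_xy += weight * sgn(x1 - x2) * sgn(y1 - y2)
        tau_xx += weight * sgn(x1 - x2) ** 2  # probability that X_1 != X_2
    return tau_xy / tau_xx


# Hypothetical joint distribution of two binary variables.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_y1_given_x1 = joint[(1, 1)] / (joint[(1, 0)] + joint[(1, 1)])
p_y1_given_x0 = joint[(0, 1)] / (joint[(0, 0)] + joint[(0, 1)])

print(somers_d_from_joint(joint))       # ratio of the two taus
print(p_y1_given_x1 - p_y1_given_x0)    # difference of conditional probabilities
```

Both printed values agree up to floating-point rounding, since the ratio reduces to [P(0,0)P(1,1) - P(0,1)P(1,0)] / [P(X=0)P(X=1)].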

Somers’ ''D'' for binary dependent variables

In practice, Somers’ ''D'' is most often used when the dependent variable ''Y'' is a binary variable, i.e. for binary classification or prediction of binary outcomes, including binary choice models in econometrics. Methods for fitting such models include logistic and probit regression. Several statistics can be used to quantify the quality of such models: area under the receiver operating characteristic (ROC) curve, Goodman and Kruskal's gamma, Kendall's tau (Tau-a), Somers’ ''D'', etc. Somers’ ''D'' is probably the most widely used of the available ordinal association statistics. Identical to the Gini coefficient, Somers’ ''D'' is related to the area under the receiver operating characteristic curve (AUC) by

\mathrm{AUC}=\frac{D_{XY}+1}{2}.

In the case where the independent (predictor) variable is X and the dependent (outcome) variable Y is binary, Somers’ ''D'' equals

D_{XY}=\frac{N_C-N_D}{N_C+N_D+N_T},

where N_T is the number of neither concordant nor discordant pairs that are tied on variable X and not on variable Y.
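The relation between AUC and Somers’ ''D'' can be illustrated by counting pairs directly. This is a sketch under stated assumptions: the function name `auc_and_somers_d`, the 0/1 label coding, and the scores are hypothetical, and AUC is taken here as the probability that a positive case scores higher than a negative one, with ties counted as one half.

```python
def auc_and_somers_d(scores, labels):
    """Return (AUC, D_XY) for predictor scores and binary 0/1 labels.

    Only pairs made of one positive and one negative case are compared,
    i.e. the pairs that are not tied on the outcome.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")

    n_c = n_d = n_t = 0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                n_c += 1      # concordant: positive case scored higher
            elif sp < sn:
                n_d += 1      # discordant: positive case scored lower
            else:
                n_t += 1      # tied on the predictor, not on the outcome

    total = n_c + n_d + n_t
    d_xy = (n_c - n_d) / total
    auc = (n_c + 0.5 * n_t) / total
    return auc, d_xy


# Hypothetical scores and outcomes.
auc, d_xy = auc_and_somers_d([0.1, 0.4, 0.35, 0.8, 0.4], [0, 0, 1, 1, 1])
print(auc, (d_xy + 1) / 2)  # the two values coincide
```

The identity follows from the counts: (D_{XY}+1)/2 = (2N_C+N_T)/(2(N_C+N_D+N_T)), which is exactly the tie-corrected proportion of correctly ordered pairs.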

Example

Suppose that the independent (predictor) variable X takes three values and the dependent (outcome) variable Y takes two values. The calculations that follow correspond to these observed counts of combinations of X and Y, with the values of each variable listed in increasing order:

Y \ X     lowest   middle   highest
lower     3        5        2
higher    1        7        6

The number of concordant pairs equals

N_C = 3 \times 7 + 3 \times 6 + 5 \times 6 = 69.

The number of discordant pairs equals

N_D = 1 \times 5 + 1 \times 2 + 7 \times 2 = 21.

The number of pairs tied on X but not on Y equals the number of pairs whose Y values differ minus the concordant and discordant pairs:

N_T = (3+5+2) \times (1+7+6) - 69 - 21 = 50.

Thus, Somers’ ''D'' equals

D_{XY} = \frac{69-21}{69+21+50} \approx 0.34.
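The arithmetic of this example can be reproduced from the cell counts that appear in the products above, namely 3, 5, 2 for the lower value of Y and 1, 7, 6 for the higher value, ordered by increasing X. The following sketch is illustrative; the function name `pair_counts` and the list layout are assumptions chosen for the example.

```python
def pair_counts(row_low, row_high):
    """Count concordant, discordant and X-tied pairs from a 2 x k table.

    row_low[i] and row_high[i] are the counts at the i-th smallest X value
    for the lower and higher Y value, respectively.
    """
    # Concordant: lower Y at a smaller X paired with higher Y at a larger X.
    n_c = sum(a * sum(row_high[i + 1:]) for i, a in enumerate(row_low))
    # Discordant: higher Y at a smaller X paired with lower Y at a larger X.
    n_d = sum(b * sum(row_low[i + 1:]) for i, b in enumerate(row_high))
    # Tied on X but not on Y: both members fall in the same column.
    n_t = sum(a * b for a, b in zip(row_low, row_high))
    return n_c, n_d, n_t


n_c, n_d, n_t = pair_counts([3, 5, 2], [1, 7, 6])
d_xy = (n_c - n_d) / (n_c + n_d + n_t)
print(n_c, n_d, n_t, round(d_xy, 2))  # 69 21 50 0.34
```

Counting the tied pairs column by column gives the same 50 as subtracting the concordant and discordant pairs from the 10 × 14 pairs that differ in Y.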
