In statistics, the score (or informant) is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular value of the parameter vector, the score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values. If the log-likelihood function is continuous over the parameter space, the score will vanish at a local maximum or minimum; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function. Since the score is a function of the observations, which are subject to sampling error, it lends itself to a test statistic known as the ''score test'', in which the parameter is held at a particular value. Further, the ratio of two likelihood functions evaluated at two distinct parameter values can be understood as a definite integral of the score function.


Definition

The score is the gradient (the vector of partial derivatives) of \log \mathcal{L}(\theta; x), the natural logarithm of the likelihood function, with respect to an m-dimensional parameter vector \theta:

: s(\theta; x) \equiv \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta}

This differentiation yields a (1 \times m) row vector at each value of \theta and x, and indicates the sensitivity of the likelihood (its derivative normalized by its value). In older literature, "linear score" may refer to the score with respect to an infinitesimal translation of a given density. This convention arises from a time when the primary parameter of interest was the mean or median of a distribution. In this case, the likelihood of an observation is given by a density of the form \mathcal{L}(\theta; X) = f(X + \theta). The "linear score" is then defined as

: s_{\text{linear}} = \frac{\partial}{\partial X} \log f(X)
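
As an illustration (not part of the original text), the following Python sketch evaluates the score of a normal model with unknown mean \mu and known variance, once from the closed-form derivative and once by a central finite difference; the model, sample, and function names are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

def log_likelihood(mu, x, sigma=1.0):
    """Log-likelihood of i.i.d. N(mu, sigma^2) observations x."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

def score_analytic(mu, x, sigma=1.0):
    """Closed-form score: derivative of the log-likelihood with respect to mu."""
    return np.sum(x - mu) / sigma**2

def score_numeric(mu, x, sigma=1.0, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    return (log_likelihood(mu + h, x, sigma) - log_likelihood(mu - h, x, sigma)) / (2 * h)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)
print(score_analytic(1.5, x), score_numeric(1.5, x))  # the two values agree
</syntaxhighlight>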


Properties


Mean

While the score is a function of \theta, it also depends on the observations \mathbf{x} = (x_1, x_2, \ldots, x_T) at which the likelihood function is evaluated, and in view of the random character of sampling one may take its expected value over the sample space. Under certain regularity conditions on the density functions of the random variables, the expected value of the score, evaluated at the true parameter value \theta, is zero. To see this, rewrite the likelihood function \mathcal{L} as a probability density function \mathcal{L}(\theta; x) = f(x; \theta), and denote the sample space \mathcal{X}. Then:

: \begin{align} \operatorname{E}(s \mid \theta) & = \int_{\mathcal{X}} f(x; \theta) \frac{\partial}{\partial \theta} \log \mathcal{L}(\theta; x) \, dx \\ & = \int_{\mathcal{X}} f(x; \theta) \frac{1}{f(x; \theta)} \frac{\partial f(x; \theta)}{\partial \theta} \, dx = \int_{\mathcal{X}} \frac{\partial f(x; \theta)}{\partial \theta} \, dx \end{align}

The assumed regularity conditions allow the interchange of derivative and integral (see Leibniz integral rule), hence the above expression may be rewritten as

: \frac{\partial}{\partial \theta} \int_{\mathcal{X}} f(x; \theta) \, dx = \frac{\partial}{\partial \theta} 1 = 0.

It is worth restating the above result in words: the expected value of the score, evaluated at the true parameter value \theta, is zero. Thus, if one were to repeatedly sample from some distribution and repeatedly calculate the score, then the mean value of the scores would tend to zero asymptotically.
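
The zero-mean property can be checked numerically. The following sketch (an illustration with an assumed normal model, true mean, and sample size, not taken from the article) draws repeated samples, evaluates the score at the true parameter, and averages:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma, n = 2.0, 1.0, 50

def score(mu, x, sigma=1.0):
    """Score of i.i.d. N(mu, sigma^2) observations with respect to mu."""
    return np.sum(x - mu) / sigma**2

scores = [score(mu_true, rng.normal(mu_true, sigma, size=n)) for _ in range(100_000)]
print(np.mean(scores))  # close to 0: the score has zero expectation at the true parameter
</syntaxhighlight>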


Variance

The variance of the score, \operatorname{Var}(s(\theta)) = \operatorname{E}(s(\theta) s(\theta)^{\mathsf T}), can be derived from the above expression for the expected value:

: \begin{align} 0 & = \frac{\partial}{\partial \theta} \operatorname{E}(s \mid \theta) \\ & = \frac{\partial}{\partial \theta} \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} f(x; \theta) \, dx \\ & = \int_{\mathcal{X}} \frac{\partial}{\partial \theta} \left\{ \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} f(x; \theta) \right\} \, dx \\ & = \int_{\mathcal{X}} \left\{ \frac{\partial^2 \log \mathcal{L}(\theta; x)}{\partial \theta^2} f(x; \theta) + \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \frac{\partial f(x; \theta)}{\partial \theta} \right\} \, dx \\ & = \int_{\mathcal{X}} \frac{\partial^2 \log \mathcal{L}(\theta; x)}{\partial \theta^2} f(x; \theta) \, dx + \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \frac{\partial f(x; \theta)}{\partial \theta} \, dx \\ & = \int_{\mathcal{X}} \frac{\partial^2 \log \mathcal{L}(\theta; x)}{\partial \theta^2} f(x; \theta) \, dx + \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \left[ \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \right]^{\mathsf T} f(x; \theta) \, dx \\ & = \operatorname{E}\left( \frac{\partial^2 \log \mathcal{L}(\theta; x)}{\partial \theta^2} \right) + \operatorname{E}\left( \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \left[ \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \right]^{\mathsf T} \right) \end{align}

Hence the variance of the score is equal to the negative expected value of the Hessian matrix of the log-likelihood:

: \operatorname{E}(s(\theta) s(\theta)^{\mathsf T}) = - \operatorname{E}\left( \frac{\partial^2 \log \mathcal{L}(\theta; x)}{\partial \theta^2} \right)

The latter is known as the Fisher information and is written \mathcal{I}(\theta). Note that the Fisher information is not a function of any particular observation, as the random variable X has been averaged out. This concept of information is useful when comparing two methods of observation of some random process.
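
Continuing the illustrative normal model used above (an example added here, not from the article): for n i.i.d. N(\mu, \sigma^2) observations the Hessian of the log-likelihood with respect to \mu is the constant -n/\sigma^2, so the Fisher information is n/\sigma^2, and the sampling variance of the score should match it.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma, n = 2.0, 1.0, 50

def score(mu, x, sigma=1.0):
    """Score of i.i.d. N(mu, sigma^2) observations with respect to mu."""
    return np.sum(x - mu) / sigma**2

scores = np.array([score(mu_true, rng.normal(mu_true, sigma, size=n))
                   for _ in range(100_000)])

fisher_information = n / sigma**2           # equals -E[Hessian] for this model
print(np.var(scores), fisher_information)   # the simulated variance is close to n / sigma^2
</syntaxhighlight>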


Examples


Bernoulli process

Consider observing the first ''n'' trials of a Bernoulli process, and seeing that ''A'' of them are successes and the remaining ''B'' are failures, where the probability of success is ''θ''. Then the likelihood \mathcal{L} is

: \mathcal{L}(\theta; A, B) = \frac{(A+B)!}{A! \, B!} \theta^A (1 - \theta)^B,

so the score ''s'' is

: s = \frac{\partial \log \mathcal{L}}{\partial \theta} = \frac{1}{\mathcal{L}} \frac{\partial \mathcal{L}}{\partial \theta} = \frac{A}{\theta} - \frac{B}{1 - \theta}.

We can now verify that the expectation of the score is zero. Noting that the expectation of ''A'' is ''nθ'' and the expectation of ''B'' is ''n''(1 − ''θ'') (recall that ''A'' and ''B'' are random variables), we can see that the expectation of ''s'' is

: \operatorname{E}(s) = \frac{n\theta}{\theta} - \frac{n(1-\theta)}{1-\theta} = n - n = 0.

We can also check the variance of ''s''. We know that ''A'' + ''B'' = ''n'' (so ''B'' = ''n'' − ''A'') and the variance of ''A'' is ''nθ''(1 − ''θ''), so (dropping the constant term, which does not affect the variance) the variance of ''s'' is

: \begin{align} \operatorname{Var}(s) & = \operatorname{Var}\left( \frac{A}{\theta} - \frac{B}{1-\theta} \right) = \operatorname{Var}\left( A \left( \frac{1}{\theta} + \frac{1}{1-\theta} \right) \right) \\ & = \left( \frac{1}{\theta} + \frac{1}{1-\theta} \right)^2 \operatorname{Var}(A) = \frac{n}{\theta(1-\theta)}. \end{align}
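
The formulas above can be verified numerically. In the sketch below (an illustration; the counts, seed, and helper names are assumptions), the analytic score A/\theta - B/(1-\theta) is compared with a finite-difference derivative of the log-likelihood, and the simulated variance of the score is compared with n/(\theta(1-\theta)):

<syntaxhighlight lang="python">
import numpy as np
from math import lgamma, log

def log_likelihood(theta, a, b):
    """Log of the binomial likelihood  (A+B)!/(A! B!) * theta^A * (1-theta)^B."""
    return (lgamma(a + b + 1) - lgamma(a + 1) - lgamma(b + 1)
            + a * log(theta) + b * log(1 - theta))

def score(theta, a, b):
    """Analytic score  A/theta - B/(1-theta)."""
    return a / theta - b / (1 - theta)

theta, n = 0.3, 50
a, b, h = 17, 33, 1e-6
fd = (log_likelihood(theta + h, a, b) - log_likelihood(theta - h, a, b)) / (2 * h)
print(score(theta, a, b), fd)                     # analytic and numeric scores agree

rng = np.random.default_rng(3)
A = rng.binomial(n, theta, size=100_000)          # simulated success counts
scores = score(theta, A, n - A)
print(np.var(scores), n / (theta * (1 - theta)))  # simulated vs theoretical variance
</syntaxhighlight>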


Binary outcome model

For models with binary outcomes (''Y'' = 1 or 0), the model can be scored with the logarithm of predictions

: S = Y \log(p) + (1 - Y) \log(1 - p),

where ''p'' is the probability in the model to be estimated and ''S'' is the score.
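
As an added illustration (the logistic link and all names below are assumptions; the article does not specify how ''p'' is parameterized), if p = \sigma(x^{\mathsf T} \beta) for a logistic model, the score of this log-likelihood with respect to \beta is X^{\mathsf T}(Y - p):

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Sum over observations of  y log(p) + (1 - y) log(1 - p)  with p = sigmoid(X beta)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def score(beta, X, y):
    """Gradient of the log-likelihood with respect to beta: X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, sigmoid(X @ np.array([0.5, -1.0, 2.0])))
print(score(np.zeros(3), X, y))  # score evaluated at the zero coefficient vector
</syntaxhighlight>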


Applications


Scoring algorithm

The scoring algorithm is an iterative method for numerically determining the maximum likelihood estimator.
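
A minimal one-dimensional sketch of such an iteration (assuming user-supplied score and Fisher-information functions; the Bernoulli example and all names are illustrative): the update is \theta_{k+1} = \theta_k + \mathcal{I}(\theta_k)^{-1} s(\theta_k).

<syntaxhighlight lang="python">
def fisher_scoring(theta0, score, information, tol=1e-10, max_iter=100):
    """Iterate theta <- theta + score(theta) / information(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / information(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta

# Example: A successes in n Bernoulli trials.
A, n = 17, 50
score = lambda t: A / t - (n - A) / (1 - t)     # score of the Bernoulli likelihood
information = lambda t: n / (t * (1 - t))       # Fisher information
print(fisher_scoring(0.5, score, information))  # converges to the MLE A/n = 0.34
</syntaxhighlight>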


Score test

Note that s is a function of \theta and the observation \mathbf{x} = (x_1, x_2, \ldots, x_T), so that, in general, it is not a statistic. However, in certain applications, such as the score test, the score is evaluated at a specific value of \theta (such as a null-hypothesis value), in which case the result is a statistic. Intuitively, if the restricted estimator is near the maximum of the likelihood function, the score should not differ from zero by more than sampling error. In 1948, C. R. Rao first proved that the square of the score divided by the information matrix follows an asymptotic χ²-distribution under the null hypothesis. Further note that the likelihood-ratio test statistic is given by

: -2 \left[ \log \mathcal{L}(\theta_0) - \log \mathcal{L}(\hat\theta) \right] = 2 \int_{\theta_0}^{\hat\theta} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \, d\theta = 2 \int_{\theta_0}^{\hat\theta} s(\theta) \, d\theta,

which means that the likelihood-ratio test can be understood as twice the area under the score function between \theta_0 and \hat\theta.
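
For a single parameter, the resulting score test statistic is s(\theta_0)^2 / \mathcal{I}(\theta_0), compared against a χ²-distribution with one degree of freedom. The sketch below (an illustration added here; the Bernoulli data and null value are assumptions) computes it for a Bernoulli sample:

<syntaxhighlight lang="python">
from scipy.stats import chi2

# Observed Bernoulli data: A successes in n trials; null hypothesis theta = theta0.
A, n, theta0 = 17, 50, 0.5

score = A / theta0 - (n - A) / (1 - theta0)   # s(theta0)
information = n / (theta0 * (1 - theta0))     # Fisher information I(theta0)

test_statistic = score**2 / information       # asymptotically chi-squared with 1 df
p_value = chi2.sf(test_statistic, df=1)
print(test_statistic, p_value)
</syntaxhighlight>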


History

The term "score function" may initially seem unrelated to its contemporary meaning, which centers around the derivative of the log-likelihood function in statistical models. This apparent discrepancy can be traced back to the term's historical origins. The concept of the "score function" was first introduced by British statistician
Ronald Fisher
in his 1935 paper titled "The Detection of Linkage with 'Dominant' Abnormalities" (Fisher, Ronald Aylmer. "The detection of linkage with 'dominant' abnormalities." ''Annals of Eugenics'' 6.2 (1935): 187-201). Fisher employed the term in the context of genetic analysis, specifically for families where a parent had a dominant genetic abnormality. Over time, the application and meaning of the "score function" have evolved, diverging from its original context but retaining its foundational principles. Fisher's initial use of the term was in analyzing genetic attributes in families with a parent possessing a genetic abnormality. He categorized the children of such parents into four classes based on two binary traits: whether they had inherited the abnormality or not, and their
zygosity
status as either homozygous or heterozygous. Fisher devised a method to assign each family a "score," calculated based on the number of children falling into each of the four categories. This score was used to estimate what he referred to as the "linkage parameter," which described the probability of the genetic abnormality being inherited. Fisher evaluated the efficacy of his scoring rule by comparing it with an alternative rule and against what he termed the "ideal score." The ideal score was defined as the derivative of the logarithm of the sampling density, as mentioned on page 193 of his work. The term "score" later evolved through subsequent research, notably expanding beyond the specific application in genetics that Fisher had initially addressed. Various authors adapted Fisher's original methodology to more generalized statistical contexts. In these broader applications, the term "score" or "efficient score" started to refer more commonly to the derivative of the log-likelihood function of the statistical model in question. This conceptual expansion was significantly influenced by a 1948 paper by C. R. Rao, which introduced "efficient score tests" that employed the derivative of the log-likelihood function (Radhakrishna Rao, C. (1948). "Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation." ''Mathematical Proceedings of the Cambridge Philosophical Society'', 44(1), 50-57. doi:10.1017/S0305004100023987). Thus, what began as a specialized term in the realm of genetic statistics has evolved to become a fundamental concept in broader statistical theory, often associated with the derivative of the log-likelihood function.






References

* Schervish, Mark J. (1995). ''Theory of Statistics''. New York: Springer. Section 2.3.1. ISBN 0-387-94546-6.