Scoring rule

In decision theory, a scoring rule provides a summary measure for the evaluation of probabilistic predictions or forecasts. It is applicable to tasks in which predictions assign probabilities to events, i.e. one issues a probability distribution F as a prediction. This includes probabilistic classification of a set of mutually exclusive outcomes or classes. A scoring function, on the other hand, provides a summary measure for the evaluation of point predictions, i.e. one predicts a property or functional T(F), such as the expectation or the median. Scoring rules and scoring functions can be thought of as cost functions or loss functions. They are evaluated as the empirical mean over a given sample, simply called the score. Scores of different predictions or models can then be compared to decide which model is best. If a cost is levied in proportion to a proper scoring rule, the minimal expected cost corresponds to reporting the true set of probabilities. Proper scoring rules are used in meteorology, finance, and pattern classification, where a forecaster or algorithm attempts to minimize the average score in order to yield refined, calibrated probabilities (i.e. accurate probabilities).


Definition

Consider a sample space \Omega, a σ-algebra \mathcal{A} of subsets of \Omega, and a convex class \mathcal{F} of probability measures on (\Omega, \mathcal{A}). A function defined on \Omega and taking values in the extended real line \overline{\mathbb{R}} = [-\infty, \infty] is \mathcal{F}-quasi-integrable if it is measurable with respect to \mathcal{A} and is quasi-integrable with respect to all F \in \mathcal{F}.


Probabilistic Forecast

A probabilistic forecast is any probability measure F \in \mathcal{F}.


Scoring rule

A scoring rule is any extended real-valued function \mathbf{S}: \mathcal{F} \times \Omega \rightarrow \overline{\mathbb{R}} such that \mathbf{S}(F, \cdot) is \mathcal{F}-quasi-integrable for all F \in \mathcal{F}. \mathbf{S}(F, y) represents the loss or penalty when the forecast F \in \mathcal{F} is issued and the observation y \in \Omega materializes.


Point forecast

A point forecast is a functional, i.e. a potentially set-valued mapping F \rightarrow T(F) \subseteq \Omega.


Scoring function

A scoring function is any real-valued function S: \Omega \times \Omega \rightarrow \mathbb{R}, where S(x, y) represents the loss or penalty when the point forecast x \in \Omega is issued and the observation y \in \Omega materializes.


Orientation

Scoring rules \mathbf{S}(F, y) and scoring functions S(x, y) are negatively (positively) oriented if smaller (larger) values are better. Here we adhere to negative orientation, hence the association with "loss".


Sample average score

Given a sample y_i, i = 1, \ldots, n, and corresponding forecasts F_i or x_i (e.g. forecasts from a single model), one calculates the average score as
: \bar{\mathbf{S}} = \frac{1}{n}\sum_{i=1}^n \mathbf{S}(F_i, y_i)
or
: \bar{S} = \frac{1}{n}\sum_{i=1}^n S(x_i, y_i).
Average scores are used to compare and rank different forecast(er)s or models.
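
For illustration, a minimal Python sketch of the average score; the forecast probabilities below are hypothetical, and the negatively oriented logarithmic score -\ln(r) is used:

    import numpy as np

    # Hypothetical probabilities each forecast assigned to the outcome
    # that actually occurred, scored with the negatively oriented log score.
    probs_of_observed = np.array([0.8, 0.6, 0.9, 0.3])
    scores = -np.log(probs_of_observed)   # S(F_i, y_i) for each case
    average_score = scores.mean()         # the average score
    print(average_score)                  # lower is better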


Propriety and consistency

Strictly proper scoring rules and strictly consistent scoring functions encourage honest forecasts by maximization of the expected reward: if a forecaster is given a reward of -\mathbf{S}(F, y) when y realizes (e.g. y = rain), then the highest expected reward (lowest expected score) is obtained by reporting the true probability distribution.


Proper scoring rules

We write the expected score under Q \in \mathcal{F} as
: \mathbf{S}(F, Q) = \int \mathbf{S}(F, \omega) \, \mathrm{d}Q(\omega).
A scoring rule \mathbf{S} is proper relative to \mathcal{F} if (assuming negative orientation)
: \mathbf{S}(Q, Q) \leq \mathbf{S}(F, Q) \quad \text{for all } F, Q \in \mathcal{F}.
It is strictly proper if the above inequality holds with equality if and only if F = Q.
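
Propriety can be checked numerically. The sketch below assumes a binary outcome and the one-dimensional Brier score (y - p)^2, and confirms that the expected score under Q is minimized by reporting F = Q:

    import numpy as np

    def expected_brier(p, q):
        # E_Q[(Y - p)^2] with Y ~ Bernoulli(q): the expected score S(F, Q)
        return q * (1 - p) ** 2 + (1 - q) * p ** 2

    q = 0.3                               # true event probability
    grid = np.linspace(0, 1, 101)
    best = grid[np.argmin([expected_brier(p, q) for p in grid])]
    print(best)                           # 0.3: honesty minimizes the expected score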


Consistent scoring functions

A scoring function S is consistent for the functional T relative to the class \mathcal{F} if
: \operatorname{E}_F[S(t, Y)] \leq \operatorname{E}_F[S(x, Y)]
for all F \in \mathcal{F}, all t \in T(F) and all x \in \Omega. It is strictly consistent if it is consistent and equality in the above equation implies that x \in T(F).
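
For example, the squared error S(x, y) = (x - y)^2 is consistent for the mean functional T(F) = \operatorname{E}_F[Y], and the absolute error is consistent for the median. A minimal numerical sketch, using an arbitrary skewed distribution for F:

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.gamma(shape=2.0, scale=3.0, size=200_000)  # skewed F, mean 6

    grid = np.linspace(0.0, 12.0, 241)
    exp_sq = [np.mean((x - sample) ** 2) for x in grid]     # estimates of E_F[S(x, Y)]
    print(grid[np.argmin(exp_sq)])                          # close to 6.0, the mean of F
    # Replacing (x - y)^2 with |x - y| recovers the median of F instead.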


Example application of scoring rules

An example of probabilistic forecasting is in meteorology, where a weather forecaster may give the probability of rain on the next day. One could note the number of times that a 25% probability was quoted over a long period and compare this with the actual proportion of times that rain fell. If the actual percentage was substantially different from the stated probability, we say that the forecaster is poorly calibrated. A poorly calibrated forecaster might be encouraged to do better by a bonus system. A bonus system designed around a proper scoring rule will incentivize the forecaster to report probabilities equal to their personal beliefs. In addition to the simple case of a binary decision, such as assigning probabilities to 'rain' or 'no rain', scoring rules may be used for multiple classes, such as 'rain', 'snow', or 'clear', or continuous responses like the amount of rain per day. As an illustration, consider the logarithmic scoring rule evaluated as a function of the probability reported for the event that actually occurred. One way to use this rule would be as a cost based on the probability that a forecaster or algorithm assigns, then checking which event actually occurs.


Examples of proper scoring rules

There are infinitely many scoring rules, including entire parameterized families of strictly proper scoring rules. The ones shown below are simply popular examples.


Categorical variables

For a categorical response variable with m mutually exclusive events, Y \in \Omega = \{1, \ldots, m\}, a probabilistic forecaster or algorithm will return a probability vector \mathbf{r} with a probability for each of the m outcomes.


Logarithmic score

The logarithmic scoring rule is a local strictly proper scoring rule. It is also the negative of the surprisal, which is commonly used as a scoring criterion in Bayesian inference; the goal is to minimize expected surprisal. This scoring rule has strong foundations in information theory.
:L(\mathbf{r}, i) = \ln(r_i)
Here, the score is calculated as the logarithm of the probability estimate for the actual outcome. For example, a prediction of 80% that correctly proved true would receive a score of \ln(0.8) \approx -0.22. This same prediction also assigns 20% likelihood to the opposite case, so if the prediction proves false, it would receive a score of \ln(0.2) \approx -1.6. The goal of a forecaster is to maximize the score, and −0.22 is indeed larger than −1.6. If one treats the truth or falsity of the prediction as a variable x with value 1 or 0 respectively, and the expressed probability as p, then one can write the logarithmic scoring rule as x\ln(p) + (1-x)\ln(1-p). Note that any logarithmic base may be used, since strictly proper scoring rules remain strictly proper under linear transformation. That is,
:L(\mathbf{r}, i) = \log_b(r_i)
is strictly proper for all b > 1.
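
A short numerical check of the example above (the two scores are the natural logarithms of 0.8 and 0.2):

    import math

    def log_score(r, i):
        # L(r, i) = ln(r_i): the log of the probability assigned
        # to the outcome i that actually occurred (to be maximized)
        return math.log(r[i])

    r = [0.8, 0.2]          # 80% rain, 20% no rain
    print(log_score(r, 0))  # about -0.22 if it rains
    print(log_score(r, 1))  # about -1.61 if it does not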


Brier/Quadratic score

The quadratic scoring rule is a strictly proper scoring rule
:Q(\mathbf{r}, i) = 2r_i - \mathbf{r}\cdot\mathbf{r} = 2r_i - \sum_{j=1}^C r_j^2
where r_i is the probability assigned to the correct answer and C is the number of classes. The Brier score, originally proposed by Glenn W. Brier in 1950, can be obtained from the quadratic scoring rule by an affine transformation:
:B(\mathbf{r}, i) = \sum_{j=1}^C (y_j - r_j)^2
where y_j = 1 when the jth event is correct and y_j = 0 otherwise, and C is the number of classes. An important difference between these two rules is that a forecaster should strive to maximize the quadratic score yet minimize the Brier score. This is due to a negative sign in the linear transformation between them.
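
A sketch of both rules on a hypothetical three-class forecast, illustrating the affine relation B = 1 - Q (which holds because \sum_j y_j^2 = 1 for a one-hot truth vector):

    import numpy as np

    def quadratic_score(r, i):
        # Q(r, i) = 2 r_i - sum_j r_j^2  (to be maximized)
        return 2 * r[i] - np.sum(r ** 2)

    def brier_score(r, i):
        # B(r, i) = sum_j (y_j - r_j)^2 with y one-hot at i  (to be minimized)
        y = np.zeros_like(r)
        y[i] = 1.0
        return np.sum((y - r) ** 2)

    r = np.array([0.7, 0.2, 0.1])                    # hypothetical 3-class forecast
    print(quadratic_score(r, 0), brier_score(r, 0))  # 0.86 and 0.14, i.e. B = 1 - Q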


Hyvärinen scoring rule

The Hyvärinen scoring function (of a density p) is defined by
: s(p) = 2\Delta_y \log p(y) + \left\| \nabla_y \log p(y) \right\|_2^2
It can be used to computationally simplify parameter inference and address Bayesian model comparison with arbitrarily vague priors. It has also been used to introduce new information-theoretic quantities beyond existing information theory.
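
Because the score depends on p only through \nabla_y \log p and \Delta_y \log p, normalizing constants cancel, which is the source of the computational simplification. A worked sketch for a Gaussian density N(\mu, \sigma^2), where \nabla_y \log p = -(y - \mu)/\sigma^2 and \Delta_y \log p = -1/\sigma^2:

    def hyvarinen_gaussian(y, mu, sigma):
        # s(p) = 2 * laplacian(log p) + ||grad(log p)||^2
        #      = -2 / sigma**2 + (y - mu)**2 / sigma**4   for p = N(mu, sigma^2)
        return -2.0 / sigma**2 + (y - mu) ** 2 / sigma**4

    print(hyvarinen_gaussian(y=1.0, mu=0.0, sigma=1.0))  # -1.0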


Spherical score

The spherical scoring rule is also a strictly proper scoring rule
:S(\mathbf{r}, i) = \frac{r_i}{\|\mathbf{r}\|_2} = \frac{r_i}{\sqrt{\sum_{j=1}^C r_j^2}}
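
A one-line sketch of the rule on the same hypothetical forecast vector as above:

    import numpy as np

    def spherical_score(r, i):
        # S(r, i) = r_i / ||r||_2  (to be maximized)
        return r[i] / np.linalg.norm(r)

    r = np.array([0.7, 0.2, 0.1])
    print(spherical_score(r, 0))  # about 0.95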


Continuous variables


Continuous ranked probability score

The continuous ranked probability score (CRPS) is a strictly proper scoring rule much used in meteorology. It is defined as
: \mathrm{CRPS}(F, y) = \int_{\mathbb{R}} \left( F(x) - \mathbb{1}(x \ge y) \right)^2 \, dx
where F is the forecast cumulative distribution function and y \in \mathbb{R} is the observation.
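
A numerical sketch of the CRPS for a Gaussian forecast, computed by quadrature and compared against the known closed form for Gaussian predictive distributions (with z = (y - \mu)/\sigma, \Phi the standard normal CDF, and \varphi its density):

    import numpy as np
    from scipy.stats import norm

    def crps_gaussian_numeric(y, mu, sigma, lo=-20.0, hi=20.0, n=200_001):
        # CRPS(F, y) = integral of (F(x) - 1{x >= y})^2 dx, trapezoidal rule
        x = np.linspace(lo, hi, n)
        integrand = (norm.cdf(x, mu, sigma) - (x >= y)) ** 2
        return np.trapz(integrand, x)

    y, mu, sigma = 0.5, 0.0, 1.0
    print(crps_gaussian_numeric(y, mu, sigma))  # about 0.331

    # Closed form: CRPS = sigma * ( z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) )
    z = (y - mu) / sigma
    print(sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))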


Interpretation of proper scoring rules

All proper scoring rules are equal to weighted sums (integrals with a non-negative weighting functional) of the losses in a set of simple two-alternative decision problems that ''use'' the probabilistic prediction, each such decision problem having a particular combination of associated cost parameters for false positive and false negative decisions. A ''strictly'' proper scoring rule corresponds to having a nonzero weighting for all possible decision thresholds.

Any given proper scoring rule is equal to the expected losses with respect to a particular probability distribution over the decision thresholds; thus the choice of a scoring rule corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed. For example, the quadratic loss (or Brier) scoring rule corresponds to a uniform probability of the decision threshold being anywhere between zero and one.

The classification accuracy score (percent classified correctly), a single-threshold scoring rule which is zero or one depending on whether the predicted probability is on the appropriate side of 0.5, is a proper scoring rule but not a strictly proper scoring rule, because it is optimized (in expectation) not only by predicting the true probability but by predicting ''any'' probability on the same side of 0.5 as the true probability (Hernandez-Orallo, Flach, and Ferri 2012).
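
The non-strictness of the accuracy score can be checked directly. In the sketch below (binary outcome, with a hypothetical true probability q = 0.7), every report on the correct side of 0.5 attains the same expected score:

    def expected_accuracy(p_reported, q_true):
        # Predict the event iff p_reported > 0.5; the prediction is then
        # correct with probability q_true (or 1 - q_true otherwise).
        return q_true if p_reported > 0.5 else 1.0 - q_true

    q = 0.7
    print([expected_accuracy(p, q) for p in (0.55, 0.7, 0.99)])  # all 0.7
    print(expected_accuracy(0.4, q))                             # 0.3: strictly worse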


Comparison of strictly proper scoring rules

The logarithmic, quadratic, and spherical scoring rules for a binary classification problem can be compared graphically by plotting each score against the reported probability for the event that actually occurred. The scores have different magnitudes and locations, but the magnitude differences are not relevant, since scores remain proper under affine transformation. To compare different scores, it is therefore necessary to move them to a common scale. A reasonable choice of normalization is to require all scores to pass through the points (0.5, 0) and (1, 1). Each normalized score then yields 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution, and yields 1 when the true class is assigned a probability of 1.
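
This normalization is a simple affine map. A sketch for the three binary scores, each written as a function of the probability p assigned to the realized event:

    import numpy as np

    def normalize(score):
        # Affine map so the normalized score passes through (0.5, 0) and (1, 1)
        s_half, s_one = score(0.5), score(1.0)
        return lambda p: (score(p) - s_half) / (s_one - s_half)

    log_s  = normalize(np.log)                                  # logarithmic
    quad_s = normalize(lambda p: 2 * p - (p**2 + (1 - p)**2))   # quadratic
    sph_s  = normalize(lambda p: p / np.hypot(p, 1 - p))        # spherical

    for p in (0.5, 0.8, 1.0):
        print(p, round(log_s(p), 3), round(quad_s(p), 3), round(sph_s(p), 3))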


Characteristics


Affine transformation

A strictly proper scoring rule, whether binary or multiclass, remains strictly proper after an affine transformation. That is, if S(\mathbf{r}, i) is a strictly proper scoring rule then a + bS(\mathbf{r}, i) with b \neq 0 is also a strictly proper scoring rule, though if b < 0 the optimization sense of the scoring rule switches between maximization and minimization.


Locality

A proper scoring rule is said to be ''local'' if its value for the probability of a specific event depends only on the probability assigned to that event. Put differently, the optimal solution of the scoring problem at a specific event is invariant to all changes in the forecast distribution that leave the probability of that event unchanged. All binary scoring rules are local, because the probability assigned to the event that did not occur is fully determined by the probability of the one that did, leaving no degree of freedom to vary. Affine functions of the logarithmic scoring rule are the only strictly proper local scoring rules on a finite set with more than two elements.


Decomposition

The expected value of a proper scoring rule S can be decomposed into the sum of three components, called ''uncertainty'', ''reliability'', and ''resolution'', which characterize different attributes of probabilistic forecasts:
: E(S) = \mathrm{UNC} + \mathrm{REL} - \mathrm{RES}.
If a score is proper and negatively oriented (such as the Brier score), all three terms are non-negative. The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency. The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies. The equations for the individual components depend on the particular scoring rule. For the Brier score, they are given by
: \mathrm{UNC} = \bar{x}(1 - \bar{x})
: \mathrm{REL} = E(p - \pi(p))^2
: \mathrm{RES} = E(\pi(p) - \bar{x})^2
where \bar{x} is the average probability of occurrence of the binary event x, and \pi(p) is the conditional event probability given p, i.e. \pi(p) = P(x = 1 \mid p).
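
A sketch of this decomposition for the Brier score, assuming forecasts take finitely many distinct values so that \pi(p) can be estimated by conditioning on each value; the data below are made up, and the identity E(S) = UNC + REL - RES then holds exactly on the sample:

    import numpy as np

    def brier_decomposition(p, x):
        # p: forecast probabilities, x: binary outcomes (0 or 1)
        x_bar = x.mean()
        unc = x_bar * (1 - x_bar)
        rel = res = 0.0
        for v in np.unique(p):        # condition on each distinct forecast value
            mask = p == v
            w = mask.mean()           # empirical P(p = v)
            pi_v = x[mask].mean()     # conditional event frequency pi(p)
            rel += w * (v - pi_v) ** 2
            res += w * (pi_v - x_bar) ** 2
        return unc, rel, res

    p = np.array([0.1, 0.1, 0.8, 0.8, 0.8, 0.5])   # made-up forecasts
    x = np.array([0,   0,   1,   1,   0,   1  ])   # made-up outcomes
    unc, rel, res = brier_decomposition(p, x)
    print(np.isclose(np.mean((p - x) ** 2), unc + rel - res))  # True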


See also

* Decision rule


References

Hernandez-Orallo, Jose; Flach, Peter; and Ferri, Cesar (2012). "A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss." ''Journal of Machine Learning Research'' 13, 2813–2869. http://www.jmlr.org/papers/volume13/hernandez-orallo12a/hernandez-orallo12a.pdf


External links


* Video comparing spherical, quadratic and logarithmic scoring rules
* Local Proper Scoring Rules
* Scoring Rules and Decision Analysis Education
* Strictly Proper Scoring Rules
* Scoring Rules and uncertainty