In statistical inference, specifically predictive inference, a prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.
Prediction intervals are used in both frequentist statistics and Bayesian statistics: a prediction interval bears the same relationship to a future observation that a frequentist confidence interval or Bayesian credible interval bears to an unobservable population parameter. Prediction intervals predict the distribution of individual future points, whereas confidence intervals and credible intervals of parameters predict the distribution of estimates of the true population mean or other quantity of interest that cannot be observed.
Introduction
For example, if one makes the parametric assumption that the underlying distribution is a normal distribution, and has a sample set {''X''<sub>1</sub>, ..., ''X''<sub>''n''</sub>}, then confidence intervals and credible intervals may be used to estimate the population mean ''μ'' and population standard deviation ''σ'' of the underlying population, while prediction intervals may be used to estimate the value of the next sample variable, ''X''<sub>''n''+1</sub>.

Alternatively, in Bayesian terms, a prediction interval can be described as a credible interval for the variable itself, rather than for a parameter of the distribution thereof.
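
The distinction between an interval for the mean and an interval for the next observation can be checked by simulation. The following is a minimal sketch, not a definitive implementation: it assumes normally distributed data with a known standard deviation (a simplification chosen so that only normal quantiles are needed), and the function name, parameter values and seed are illustrative choices that do not come from the text above.

```python
import random
from statistics import NormalDist, fmean


def coverage_of_next_observation(mu=10.0, sigma=2.0, n=20, level=0.95,
                                 trials=20000, seed=0):
    """Estimate how often each interval contains the next draw X_(n+1)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided normal quantile
    ci_half = z * sigma / n ** 0.5               # half-width: CI for the mean
    pi_half = z * sigma * (1 + 1 / n) ** 0.5     # half-width: PI for X_(n+1)
    ci_hits = pi_hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_next = rng.gauss(mu, sigma)            # the future observation
        centre = fmean(sample)                   # both intervals are centred here
        ci_hits += abs(x_next - centre) <= ci_half
        pi_hits += abs(x_next - centre) <= pi_half
    return ci_hits / trials, pi_hits / trials


ci_cov, pi_cov = coverage_of_next_observation()
print(f"CI for the mean contains X_(n+1): {ci_cov:.3f}")
print(f"Prediction interval contains X_(n+1): {pi_cov:.3f}")
```

Under these assumptions the prediction interval should contain the next observation close to the nominal 95% of the time, while the confidence interval for the mean, being much narrower, does so far less often.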
The concept of prediction intervals need not be restricted to inference about a single future sample value but can be extended to more complicated cases. For example, in the context of river flooding where analyses are often based on annual values of the largest flow within the year, there may be interest in making inferences about the largest flood likely to be experienced within the next 50 years.
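
As a rough illustration of such an extension, the sketch below computes an upper bound for the largest of the next ''m'' annual values. It rests on strong assumptions that are not part of the text: the annual values are taken to be independent draws from a normal distribution whose parameters are treated as known (so parameter uncertainty is ignored, and real annual flood maxima are usually far from normal), and the numbers are hypothetical.

```python
from statistics import NormalDist


def upper_bound_for_future_max(dist, m, level=0.95):
    """Value u with P(max of m future i.i.d. draws <= u) = level."""
    # The maximum is below u only if every one of the m draws is,
    # so F(u) ** m = level and u = F^{-1}(level ** (1 / m)).
    return dist.inv_cdf(level ** (1.0 / m))


annual_peak = NormalDist(mu=1000.0, sigma=200.0)  # hypothetical, treated as known
one_year = upper_bound_for_future_max(annual_peak, m=1)
fifty_years = upper_bound_for_future_max(annual_peak, m=50)
print(f"95% upper bound, next year: {one_year:.0f}")
print(f"95% upper bound, largest of next 50 years: {fifty_years:.0f}")
```

The bound for the 50-year maximum is necessarily higher than the bound for a single year, because all 50 future values must fall below it.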
Since prediction intervals are only concerned with past and future observations, rather than unobservable population parameters, they are advocated as a better method than confidence intervals by some statisticians, such as Seymour Geisser, following the focus on observables by Bruno de Finetti.
Normal distribution
Given a sample from a normal distribution whose parameters are unknown, it is possible to give prediction intervals in the frequentist sense, i.e., an interval [''a'', ''b''] based on statistics of the sample such that on repeated experiments, ''X''<sub>''n''+1</sub> falls in the interval the desired percentage of the time; one may call these "predictive confidence intervals".
A general technique for frequentist prediction intervals is to find and compute a pivotal quantity of the observables ''X''<sub>1</sub>, ..., ''X''<sub>''n''</sub>, ''X''<sub>''n''+1</sub> (meaning a function of observables and parameters whose probability distribution does not depend on the parameters) that can be inverted to give a probability of the future observation ''X''<sub>''n''+1</sub> falling in some interval computed in terms of the observed values so far, ''X''<sub>1</sub>, ..., ''X''<sub>''n''</sub>. Such a pivotal quantity, depending only on observables, is called an ancillary statistic. The usual method of constructing pivotal quantities is to take the difference of two variables that depend on location, so that location cancels out, and then take the ratio of two variables that depend on scale, so that scale cancels out. The most familiar pivotal quantity is the Student's t-statistic, which can be derived by this method and is used in the sequel.
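
The cancellation of location and scale can be seen numerically. The sketch below is an illustration under stated assumptions rather than the article's own derivation: it uses the studentized prediction error (''X''<sub>''n''+1</sub> minus the sample mean, divided by the sample standard deviation times √(1 + 1/''n'')) as the pivot, and the helper names `pivot` and `simulate_pivots`, the sample size and the seed are hypothetical choices. The same standard normal noise is shifted and scaled with two different parameter settings, and the resulting pivot values are compared.

```python
import random
from statistics import fmean, stdev


def pivot(sample, x_next):
    """Studentized prediction error (X_(n+1) - mean) / (s * sqrt(1 + 1/n))."""
    n = len(sample)
    return (x_next - fmean(sample)) / (stdev(sample) * (1 + 1 / n) ** 0.5)


def simulate_pivots(mu, sigma, n=10, trials=2000, seed=1):
    """Pivot values for normal data with the given location and scale."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        # Draw standard normal noise, then shift by mu and scale by sigma.
        z = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
        x = [mu + sigma * zi for zi in z]
        values.append(pivot(x[:n], x[n]))   # first n observed, last one "future"
    return values


a = simulate_pivots(mu=0.0, sigma=1.0)
b = simulate_pivots(mu=100.0, sigma=5.0)
max_gap = max(abs(u - v) for u, v in zip(a, b))
print(f"largest difference between the two runs: {max_gap:.2e}")
```

Because the difference in the numerator removes the location and the ratio removes the scale, the two runs agree up to floating-point rounding, which is what it means for the distribution of the pivot not to depend on ''μ'' or ''σ''.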
Known mean, known variance
A prediction interval [''ℓ'', ''u''] for a future observation ''X'' in a normal distribution ''N''(''µ'', ''σ''<sup>2</sup>) with known mean and variance may be calculated from
: