A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.
A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).
All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference.
Introduction
Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.
The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is $\tfrac{1}{6}$. From that assumption, we can calculate the probability of both dice coming up 5: $\tfrac{1}{6} \times \tfrac{1}{6} = \tfrac{1}{36}$. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6).
The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is $\tfrac{1}{8}$ (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: $\tfrac{1}{8} \times \tfrac{1}{8} = \tfrac{1}{64}$. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.
The first statistical assumption constitutes a statistical model, because with the assumption alone we can calculate the probability of any event. The alternative statistical assumption does ''not'' constitute a statistical model, because with the assumption alone we cannot calculate the probability of every event.
In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
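The fair-dice case can be checked by enumerating outcomes. The following is a minimal sketch, assuming each face has probability 1/6 and that the two dice are independent; the names `FACE_PROB`, `outcome_prob` and `event_prob` are illustrative, and the event "(1 and 2) or (3 and 3) or (5 and 6)" is read here as a set of ordered outcomes.

```python
from fractions import Fraction
from itertools import product

# Fair-dice assumption: every face of each die has probability 1/6.
FACE_PROB = {face: Fraction(1, 6) for face in range(1, 7)}

def outcome_prob(first, second):
    """Probability of one ordered outcome, assuming the dice are independent."""
    return FACE_PROB[first] * FACE_PROB[second]

def event_prob(event):
    """Probability of an event, given as a predicate on (first, second)."""
    return sum(
        (outcome_prob(a, b) for a, b in product(FACE_PROB, repeat=2) if event(a, b)),
        Fraction(0),
    )

both_five = event_prob(lambda a, b: a == 5 and b == 5)
# The event "(1 and 2) or (3 and 3) or (5 and 6)", read as ordered outcomes.
listed = event_prob(lambda a, b: (a, b) in {(1, 2), (3, 3), (5, 6)})
total = event_prob(lambda a, b: True)

print(both_five, listed, total)  # prints: 1/36 1/12 1
```

Under the weighted-dice assumption no such table of face probabilities can be written down, since only the probability of the face 5 is specified; that is the sense in which the assumption fails to be a statistical model.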
Formal definition
In mathematical terms, a statistical model is usually thought of as a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$.
The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose $\mathcal{P}$ to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution.
Note that we do not require that $\mathcal{P}$ contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality"; hence the saying "all models are wrong".
The set $\mathcal{P}$ is almost always parameterized: $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. $F_{\theta_1} = F_{\theta_2} \Rightarrow \theta_1 = \theta_2$ must hold (in other words, it must be injective). A parameterization that meets the requirement is said to be ''identifiable''.
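To make the injectivity requirement concrete, the sketch below contrasts two hypothetical parameterizations of Gaussian families, neither of which comes from the text: `identifiable_family` maps (mu, sigma) to N(mu, sigma^2), while `redundant_family` maps (a, b) to N(a + b, 1), so that different parameter values can yield the same distribution. It relies on `statistics.NormalDist` comparing equal when mean and standard deviation coincide. Comparing one pair of parameter values only exhibits a failure of identifiability; it does not prove identifiability.

```python
from statistics import NormalDist

def identifiable_family(theta):
    """theta = (mu, sigma) -> N(mu, sigma^2); distinct theta give distinct distributions."""
    mu, sigma = theta
    return NormalDist(mu, sigma)

def redundant_family(theta):
    """theta = (a, b) -> N(a + b, 1); only the sum a + b affects the distribution."""
    a, b = theta
    return NormalDist(a + b, 1.0)

# Two different parameter values, (1, 2) and (2, 1), fed to each family.
same_redundant = redundant_family((1.0, 2.0)) == redundant_family((2.0, 1.0))
same_identifiable = identifiable_family((1.0, 2.0)) == identifiable_family((2.0, 1.0))

print(same_redundant, same_identifiable)  # prints: True False
```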
An example
Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.
An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be the equation for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points.
To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution.
We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $F_\theta$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)
In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
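The two assumptions can be turned into a small simulation. The sketch below is illustrative only: the age range (2 to 12 years), the parameter values passed to `simulate_children`, and the function names are hypothetical choices, not values from the text. It draws (age, height) pairs from the model and then recovers the three parameters by maximum likelihood, which for Gaussian errors coincides with ordinary least squares for the intercept and slope.

```python
import random

def simulate_children(n, b0, b1, sigma, seed=0):
    """Draw (age, height) pairs from height = b0 + b1 * age + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age = rng.uniform(2.0, 12.0)   # ages uniform on a hypothetical range
        eps = rng.gauss(0.0, sigma)    # i.i.d. zero-mean Gaussian error
        data.append((age, b0 + b1 * age + eps))
    return data

def fit_linear_model(data):
    """Maximum-likelihood estimates of (b0, b1, sigma^2) under Gaussian errors."""
    n = len(data)
    mean_age = sum(a for a, _ in data) / n
    mean_height = sum(h for _, h in data) / n
    sxx = sum((a - mean_age) ** 2 for a, _ in data)
    sxy = sum((a - mean_age) * (h - mean_height) for a, h in data)
    b1_hat = sxy / sxx
    b0_hat = mean_height - b1_hat * mean_age
    rss = sum((h - (b0_hat + b1_hat * a)) ** 2 for a, h in data)
    return b0_hat, b1_hat, rss / n     # the sigma^2 MLE divides by n, not n - 2

# Hypothetical "true" parameters: heights in meters, ages in years.
data = simulate_children(n=2000, b0=0.75, b1=0.06, sigma=0.05)
b0_hat, b1_hat, sigma2_hat = fit_linear_model(data)
print(b0_hat, b1_hat, sigma2_hat)
```

With a sample of this size the estimates should land close to the values used for simulation, though the exact numbers depend on the seed.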
General remarks
A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, $\varepsilon$ is a stochastic variable; without that stochastic variable, the model would be deterministic.
Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).
Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".
There are three purposes for a statistical model, according to Konishi & Kitagawa:
* Predictions
* Extraction of information
* Description of stochastic structures
Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description. The three purposes correspond with the three kinds of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.
Dimension of a model
Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. The model is said to be ''parametric'' if $\Theta$ has a finite dimension. In notation, we write that $\Theta \subseteq \mathbb{R}^k$ where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, $k$ is called the dimension of the model.
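As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that
: $\mathcal{P} = \left\{ F_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}$.
In this example, the dimension, $k$, equals 2.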
As another example, suppose that the data consists of points ($x$, $y$) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a straight line has dimension 1.)
Although formally $\theta \in \Theta$ is a single parameter that has dimension $k$, it is sometimes regarded as comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters: the mean and the standard deviation.
A statistical model is ''nonparametric'' if the parameter set $\Theta$ is infinite dimensional. A statistical model is ''semiparametric'' if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $k \rightarrow \infty$ as $n \rightarrow \infty$. If $k/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.
Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".
Nested models
Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model
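: $y = b_0 + b_1 x + b_2 x^2 + \varepsilon,\ \varepsilon \sim \mathcal{N}(0, \sigma^2)$
has, nested within it, the linear model
: $y = b_0 + b_1 x + \varepsilon,\ \varepsilon \sim \mathcal{N}(0, \sigma^2)$
Here we constrain the parameter $b_2$ to equal 0.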
In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.
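The first nesting example can be illustrated numerically. The sketch below, with hypothetical function names and simulated data, fits both the family of all Gaussians and the zero-mean family by maximum likelihood. Because the zero-mean family is obtained by constraining the mean, its best log-likelihood can never exceed that of the full family on the same data.

```python
import math
import random

def gaussian_loglik(xs, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

def fit_full(xs):
    """MLE over all Gaussians: mean and variance both free (dimension 2)."""
    mu = sum(xs) / len(xs)
    sigma2 = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, sigma2

def fit_zero_mean(xs):
    """MLE over zero-mean Gaussians: mean constrained to 0 (dimension 1)."""
    return 0.0, sum(x ** 2 for x in xs) / len(xs)

rng = random.Random(1)
xs = [rng.gauss(0.3, 1.0) for _ in range(500)]  # hypothetical sample

ll_full = gaussian_loglik(xs, *fit_full(xs))
ll_zero = gaussian_loglik(xs, *fit_zero_mean(xs))
gap = ll_full - ll_zero  # non-negative because the zero-mean family is nested

print(ll_full, ll_zero, gap)
```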
Comparing models
Comparing statistical models is fundamental for much of statistical inference. Indeed, Konishi & Kitagawa state this: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models."
Common criteria for comparing models include the following: $R^2$, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
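As a rough illustration of three of these criteria, the sketch below compares an intercept-only model with a straight-line model on simulated data, assuming i.i.d. zero-mean Gaussian errors in both. The data, the parameter values and the helper names are hypothetical. It uses the standard closed form for the maximised Gaussian log-likelihood and the usual definition AIC = 2k − 2 log L, where k counts the variance as a parameter.

```python
import math
import random

def gaussian_max_loglik(rss, n):
    """Maximised Gaussian log-likelihood given residual sum of squares (sigma^2 = rss / n)."""
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L; lower values are preferred."""
    return 2 * k - 2 * loglik

rng = random.Random(2)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]  # hypothetical data
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Model A (intercept only): y = b0 + eps, parameters (b0, sigma^2).
rss_a = sum((y - mean_y) ** 2 for y in ys)

# Model B (straight line): y = b0 + b1 * x + eps, parameters (b0, b1, sigma^2).
sxx = sum((x - mean_x) ** 2 for x in xs)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
b0 = mean_y - b1 * mean_x
rss_b = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - rss_b / rss_a  # R^2 of model B
aic_a = aic(gaussian_max_loglik(rss_a, n), k=2)
aic_b = aic(gaussian_max_loglik(rss_b, n), k=3)
# Likelihood-ratio statistic for the nested pair (A inside B).
lr_statistic = 2 * (gaussian_max_loglik(rss_b, n) - gaussian_max_loglik(rss_a, n))

print(r_squared, aic_a, aic_b, lr_statistic)
```

Model A is nested within model B (constrain the slope to 0), which is why the likelihood-ratio statistic is meaningful here; AIC can also be used to compare models that are not nested.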
See also
* All models are wrong
* Blockmodel
* Conceptual model
* Design of experiments
* Deterministic model
* Effective theory
* Predictive model
* Response modeling methodology
* Scientific model
* Statistical inference
* Statistical model specification
* Statistical model validation
* Statistical theory
* Stochastic process
Further reading
* Davison, A. C. (2008), ''Statistical Models'', Cambridge University Press
* Freedman, D. A. (2009), ''Statistical Models'', Cambridge University Press
* Helland, I. S. (2010), ''Steps Towards a Unified Basis for Scientific Models and Methods'', World Scientific
* Kroese, D. P.; Chan, J. C. C. (2014), ''Statistical Modeling and Computation'', Springer