
A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen). All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference.


Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6).

The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model, because with the assumption alone we can calculate the probability of any event. The alternative statistical assumption does ''not'' constitute a statistical model, because with the assumption alone we cannot calculate the probability of every event.

In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
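The claim that the fair-dice assumption lets us calculate the probability of any event can be checked by direct enumeration. The following is a minimal sketch, not part of the original text: the helper name `event_probability` is made up for illustration, the two dice are assumed to be independent (which is what multiplying the face probabilities presumes), and the event "(1 and 2) or (3 and 3) or (5 and 6)" is read as referring to the unordered pair of faces shown.

```python
from fractions import Fraction
from itertools import product

# Assumption 1: every face of each die has probability 1/6.
# Independence of the two dice is an extra assumption made explicit here.
FACE_PROB = {face: Fraction(1, 6) for face in range(1, 7)}


def event_probability(event):
    """Sum the probabilities of all ordered outcomes (d1, d2) that satisfy `event`."""
    return sum(
        FACE_PROB[d1] * FACE_PROB[d2]
        for d1, d2 in product(FACE_PROB, repeat=2)
        if event(d1, d2)
    )


# Probability that both dice come up 5.
both_five = event_probability(lambda d1, d2: d1 == 5 and d2 == 5)

# Probability of "(1 and 2) or (3 and 3) or (5 and 6)", ignoring which die shows which face.
targets = {(1, 2), (3, 3), (5, 6)}
mixed = event_probability(lambda d1, d2: tuple(sorted((d1, d2))) in targets)

print(both_five, mixed)
```

Under the alternative (weighted-dice) assumption, the table `FACE_PROB` could not be filled in for faces other than 5, so a function like this could not be evaluated for most events. That is the sense in which the alternative assumption fails to be a statistical model.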


Formal definition

In mathematical terms, a statistical model is usually thought of as a pair (S, \mathcal{P}), where S is the set of possible observations, i.e. the sample space, and \mathcal{P} is a set of probability distributions on S.

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose \mathcal{P} to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution. Note that we do not require that \mathcal{P} contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality"; hence the saying "all models are wrong".

The set \mathcal{P} is almost always parameterized: \mathcal{P} = \{P_\theta : \theta \in \Theta\}. The set \Theta defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. P_{\theta_1} = P_{\theta_2} \Rightarrow \theta_1 = \theta_2 must hold (in other words, the map \theta \mapsto P_\theta must be injective). A parameterization that meets the requirement is said to be ''identifiable''.
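To make the identifiability requirement concrete, here is a small sketch built on a hypothetical example that does not appear in the original text: a Bernoulli model on S = {0, 1}, parameterized once directly by the success probability and once by a signed value whose square is the success probability. The function names are invented for illustration, and checking injectivity on a finite grid can only expose a failure of identifiability; it does not prove identifiability over the whole parameter set.

```python
from fractions import Fraction
from itertools import combinations


def bernoulli(p):
    """Distribution on S = {0, 1}, represented as the tuple (P(0), P(1))."""
    return (1 - p, p)


def direct(theta):
    """Parameterization theta -> Bernoulli(theta), for theta in [0, 1]."""
    return bernoulli(theta)


def squared(theta):
    """Parameterization theta -> Bernoulli(theta^2), for theta in [-1, 1]."""
    return bernoulli(theta ** 2)


def is_injective_on(grid, parameterization):
    """True if distinct parameter values in `grid` map to distinct distributions."""
    return all(
        parameterization(a) != parameterization(b)
        for a, b in combinations(grid, 2)
    )


grid_unit = [Fraction(i, 10) for i in range(11)]         # 0, 0.1, ..., 1
grid_signed = [Fraction(i, 10) for i in range(-10, 11)]  # -1, -0.9, ..., 1

print(is_injective_on(grid_unit, direct))     # no collisions found on the grid
print(is_injective_on(grid_signed, squared))  # theta and -theta collide
```

In the second parameterization, \theta and -\theta give the same distribution, so P_{\theta_1} = P_{\theta_2} does not imply \theta_1 = \theta_2 and the parameterization is not identifiable. Restricting \Theta to [0, 1] removes the collision.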


An example

Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: height_i = b_0 + b_1 age_i + ε_i, where b_0 is the intercept, b_1 is a parameter that age is multiplied by to obtain a prediction of height, ε_i is the error term, and i identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line (height_i = b_0 + b_1 age_i) cannot be the equation for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, ε_i, must be included in the equation, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we might assume that the ε_i are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: b_0, b_1, and the variance of the Gaussian distribution.

We can formally specify the model in the form (S, \mathcal{P}) as follows. The sample space, S, of our model comprises the set of all possible pairs (age, height). Each possible value of \theta = (b_0, b_1, \sigma^2) determines a distribution on S; denote that distribution by P_\theta. If \Theta is the set of all possible values of \theta, then \mathcal{P} = \{P_\theta : \theta \in \Theta\}. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying S and (2) making some assumptions relevant to \mathcal{P}. There are two assumptions: that height can be approximated by a linear function of age, and that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify \mathcal{P}, as they are required to do.
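The regression model above can be sketched in code by simulating data from one member of the family and then recovering the three parameters. Everything numeric here is an assumption made for illustration: the "true" intercept, slope, and error standard deviation, the age range of 2 to 12 years, and the sample size are all invented. Ordinary least squares is used as the estimator; the original text does not prescribe an estimation method.

```python
import random


def simulate(n, b0, b1, sigma, rng):
    """Draw n (age, height) pairs from height = b0 + b1 * age + eps."""
    data = []
    for _ in range(n):
        age = rng.uniform(2.0, 12.0)   # ages uniform over a hypothetical range
        eps = rng.gauss(0.0, sigma)    # i.i.d. zero-mean Gaussian error term
        data.append((age, b0 + b1 * age + eps))
    return data


def fit_ols(data):
    """Least-squares estimates of (b0, b1) plus a residual-variance estimate."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in data)
    sigma2 = rss / (n - 2)             # divides by n - 2 (two fitted coefficients)
    return b0, b1, sigma2


# Hypothetical "true" parameter theta = (b0, b1, sigma^2); heights in meters.
TRUE_B0, TRUE_B1, TRUE_SIGMA = 0.75, 0.06, 0.05

rng = random.Random(0)                 # fixed seed so the run is reproducible
data = simulate(2000, TRUE_B0, TRUE_B1, TRUE_SIGMA, rng)
b0_hat, b1_hat, sigma2_hat = fit_ols(data)
print(b0_hat, b1_hat, sigma2_hat)
```

The simulated points do not lie on a straight line, which is why the error term is needed for the model to be consistent with the data. The fitted triple is an estimate of \theta, i.e. a choice of one distribution P_\theta from \mathcal{P}.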


General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic.

Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).

Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".

There are three purposes for a statistical model, according to Konishi & Kitagawa.
* Predictions
* Extraction of information
* Description of stochastic structures
Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description. The three purposes correspond with the three kinds of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.


Dimension of a model

Suppose that we have a statistical model (S, \mathcal{P}) with \mathcal{P} = \{P_\theta : \theta \in \Theta\}. The model is said to be ''parametric'' if \Theta has a finite dimension. In notation, we write that \Theta \subseteq \mathbb{R}^k where k is a positive integer (\mathbb{R} denotes the real numbers; other sets can be used, in principle). Here, k is called the dimension of the model.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that
: \mathcal{P} = \left\{ F_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) : \mu \in \mathbb{R}, \sigma > 0 \right\}.
In this example, the dimension, k, equals 2.

As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a straight line has dimension 1.)

Although formally \theta \in \Theta is a single parameter that has dimension k, it is sometimes regarded as comprising k separate parameters. For example, with the univariate Gaussian distribution, \theta is formally a single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters: the mean and the standard deviation.

A statistical model is ''nonparametric'' if the parameter set \Theta is infinite dimensional. A statistical model is ''semiparametric'' if it has both finite-dimensional and infinite-dimensional parameters. Formally, if k is the dimension of \Theta and n is the number of samples, both semiparametric and nonparametric models have k \rightarrow \infty as n \rightarrow \infty. If k/n \rightarrow 0 as n \rightarrow \infty, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".
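The univariate Gaussian family gives a compact illustration of a parametric model of dimension 2. In the sketch below, each \theta = (\mu, \sigma) indexes one density, and a maximum-likelihood fit picks a single member of the family for a given sample. The function names and the five-point sample are made up for illustration, and maximum likelihood is one possible estimator, not something the text above prescribes.

```python
import math
import statistics


def gaussian_density(theta):
    """Return the density x -> f(x) of the family member indexed by theta = (mu, sigma)."""
    mu, sigma = theta
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    def density(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

    return density


def fit_gaussian(sample):
    """Maximum-likelihood choice of theta = (mu, sigma) for the sample."""
    mu = statistics.fmean(sample)
    sigma = statistics.pstdev(sample, mu)  # population form: divides by n, not n - 1
    return (mu, sigma)


sample = [1.0, 2.0, 3.0, 4.0, 5.0]         # made-up data
theta_hat = fit_gaussian(sample)
k = len(theta_hat)                          # dimension of the model
f_hat = gaussian_density(theta_hat)
print(theta_hat, k, f_hat(theta_hat[0]))
```

However large the sample grows, `theta_hat` stays a pair of numbers, which is what makes the model parametric. A nonparametric approach would instead let the description of the distribution grow with the amount of data.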


Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model
: y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)
has, nested within it, the linear model
: y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)
which is obtained by constraining the parameter b_2 to equal 0. In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is of