In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.
Intuition
In a regression model setting, the goal is to establish whether or not a relationship exists between a response variable and a set of predictor variables and, if a relationship does exist, to describe it as well as possible. A main assumption in
linear regression
is constant variance, or homoscedasticity, meaning that different response variables have the same variance in their errors at every predictor level. This assumption works well when the response variable and the predictor variable are jointly Normal (see
Normal distribution). As we will see later, the variance function in the Normal setting is constant; however, we must find a way to quantify heteroscedasticity (non-constant variance) in the absence of joint Normality.
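To make this concrete, the following minimal sketch (an illustration added here, not part of the original text) simulates a response whose variance grows with its mean, using Poisson counts as an assumed example; the variable names and the choice of distribution are assumptions for the demonstration. Under homoscedasticity the sample variances would be roughly equal across predictor levels, whereas here they track the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Predictor levels and a mean that increases with the predictor.
x_levels = np.array([1.0, 2.0, 4.0, 8.0])
n_per_level = 5000

for x in x_levels:
    mu = 2.0 * x                              # mean of the response at this level
    y = rng.poisson(mu, size=n_per_level)     # Poisson response: Var(y) = E(y) = mu
    print(f"x = {x:4.1f}  sample mean = {y.mean():6.2f}  sample variance = {y.var():6.2f}")

# For a homoscedastic (e.g., Normal) model the printed variances would be
# roughly constant; here they grow with the mean, i.e. V(mu) = mu.
```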
When it is likely that the response follows a distribution that is a member of the exponential family, a
generalized linear model
may be more appropriate to use; moreover, when we wish not to force a parametric model onto our data, a
non-parametric regression approach can be useful. The importance of being able to model the variance as a function of the mean lies in improved inference (in a parametric setting) and in improved estimation of the regression function in general, for any setting.
Variance functions play a very important role in parameter estimation and inference. In general, maximum likelihood estimation requires that a likelihood function be defined. This requirement then implies that one must first specify the distribution of the response variables observed. However, to define a quasi-likelihood, one need only specify a relationship between the mean and the variance of the observations to then be able to use the quasi-likelihood function for estimation.
Quasi-likelihood estimation is particularly useful when there is
overdispersion. Overdispersion occurs when there is more variability in the data than would be expected under the assumed distribution of the data.
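As a hedged illustration of quasi-likelihood estimation under overdispersion (a sketch added here, not taken from the article), the code below fits a Poisson-family GLM with statsmodels to simulated overdispersed counts and estimates a dispersion parameter from the Pearson chi-squared statistic, the usual quasi-Poisson adjustment; the simulated data, coefficients and variable names are assumptions made for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulate overdispersed counts: a negative binomial response whose
# variance exceeds its mean, then fit a Poisson-family GLM to it.
n = 500
x = rng.uniform(0, 2, size=n)
mu = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # Var(y) = mu + mu**2 / 2 > mu

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Quasi-Poisson style dispersion estimate: Pearson chi-squared / residual df.
dispersion = fit.pearson_chi2 / fit.df_resid
print("estimated dispersion:", dispersion)       # noticeably larger than 1

# Refit with the scale set to the Pearson estimate; the coefficients are
# unchanged but their standard errors are inflated by sqrt(dispersion).
quasi_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")
print(quasi_fit.summary())
```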
In summary, to ensure efficient inference of the regression parameters and the regression function, the heteroscedasticity must be accounted for. Variance functions quantify the relationship between the variance and the mean of the observed data and hence play a significant role in regression estimation and inference.
Types
The variance function and its applications come up in many areas of statistical analysis. A very important use of this function is in the framework of
generalized linear models
and
non-parametric regression.
Generalized linear model
When a member of the
exponential family has been specified, the variance function can easily be derived. The general form of the variance function is presented in the exponential-family context, as well as specific forms for the Normal, Bernoulli, Poisson, and Gamma distributions. In addition, we describe the applications and use of variance functions in maximum likelihood estimation and quasi-likelihood estimation.
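For reference, the variance functions for the distributions just named take the following well-known forms (with mean $\mu$ and dispersion parameter $\phi$ where relevant, and a single trial in the Bernoulli case); this summary follows the standard exponential-family results and is added here for convenience rather than quoted from this section.

```latex
% Common variance functions V(\mu) for exponential-family distributions
% (standard results; \phi denotes the dispersion parameter where relevant).
\begin{align*}
  \text{Normal:}    \quad & V(\mu) = 1           & \operatorname{Var}(y) &= \phi \\
  \text{Bernoulli:} \quad & V(\mu) = \mu(1-\mu)  & \operatorname{Var}(y) &= \mu(1-\mu) \\
  \text{Poisson:}   \quad & V(\mu) = \mu         & \operatorname{Var}(y) &= \mu \\
  \text{Gamma:}     \quad & V(\mu) = \mu^{2}     & \operatorname{Var}(y) &= \phi\,\mu^{2}
\end{align*}
```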
Derivation
The generalized linear model (GLM) is a generalization of ordinary regression analysis that extends to any member of the
exponential family. It is particularly useful when the response variable is categorical, binary or subject to a constraint (e.g. only positive responses make sense). A quick summary of the components of a GLM is given on this page, but for more details and information see the page on
generalized linear models
.
A GLM consists of three main ingredients:
:1. Random Component: a distribution of $y$ from the exponential family, $\operatorname{E}[y \mid X] = \mu$
:2. Linear predictor: $\eta = X\beta$
:3. Link function: $g(\mu) = \eta$
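As a brief illustration of these three ingredients (a standard textbook instance added here, not text from this article), a Poisson regression with its canonical log link can be written as follows, where $\beta_0$, $\beta_1$ and $x_i$ are illustrative names:

```latex
% A Poisson log-linear model as an instance of the three GLM ingredients.
\begin{align*}
  \text{Random component:} \quad & y_i \sim \operatorname{Poisson}(\mu_i),
      && \operatorname{E}[y_i \mid x_i] = \mu_i, \\
  \text{Linear predictor:} \quad & \eta_i = \beta_0 + \beta_1 x_i, \\
  \text{Link function:}    \quad & g(\mu_i) = \log(\mu_i) = \eta_i .
\end{align*}
```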
First it is important to derive a couple of key properties of the exponential family.
Any random variable $y$ in the exponential family has a probability density function of the form,
: $f(y; \theta, \phi) = \exp\!\left(\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right),$
with log-likelihood,
: $\ell(\theta; y, \phi) = \log f(y; \theta, \phi) = \frac{y\theta - b(\theta)}{\phi} + c(y, \phi).$
Here, $\theta$ is the canonical parameter and the parameter of interest, and $\phi$ is a nuisance parameter which plays a role in the variance.
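For instance (a standard worked example added here, not text from the article), the Poisson distribution with mean $\mu$ fits this form with canonical parameter $\theta = \log\mu$, cumulant function $b(\theta) = e^{\theta}$ and dispersion $\phi = 1$:

```latex
% Writing the Poisson pmf in canonical exponential-family form.
\begin{align*}
  f(y;\mu) &= \frac{\mu^{y} e^{-\mu}}{y!}
            = \exp\bigl(y\log\mu - \mu - \log y!\bigr) \\
           &= \exp\!\left(\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right),
  \qquad \theta = \log\mu,\; b(\theta) = e^{\theta},\; \phi = 1,\; c(y,\phi) = -\log y! .
\end{align*}
```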
We use Bartlett's identities to derive a general expression for the variance function.
The first and second Bartlett results ensure that, under suitable conditions (see
Leibniz integral rule), for a density function $f_\theta(y)$ dependent on $\theta$,
: $\operatorname{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log f_\theta(y)\right] = 0,$
: $\operatorname{Var}_\theta\!\left[\frac{\partial}{\partial\theta}\log f_\theta(y)\right] + \operatorname{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(y)\right] = 0.$
These identities lead to simple calculations of the expected value and variance of any random variable $y$ in the exponential family.
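As a brief sketch of how those calculations go (standard exponential-family results, added here for completeness rather than quoted from the article): differentiating the log-likelihood above gives $\partial\ell/\partial\theta = (y - b'(\theta))/\phi$, and applying the two identities yields the mean and variance, and hence the variance function, in terms of $b$:

```latex
% Mean and variance of an exponential-family variable via Bartlett's identities.
\begin{align*}
  0 &= \operatorname{E}\!\left[\frac{\partial\ell}{\partial\theta}\right]
     = \operatorname{E}\!\left[\frac{y - b'(\theta)}{\phi}\right]
     &&\Longrightarrow\quad \operatorname{E}[y] = b'(\theta) = \mu, \\
  0 &= \operatorname{Var}\!\left[\frac{y - b'(\theta)}{\phi}\right]
       + \operatorname{E}\!\left[-\frac{b''(\theta)}{\phi}\right]
     &&\Longrightarrow\quad \operatorname{Var}[y] = \phi\, b''(\theta),
\end{align*}
```

so that the variance function is $V(\mu) = b''\!\bigl(b'^{-1}(\mu)\bigr)$.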