Variance Function

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.


Intuition

In a regression model setting, the goal is to establish whether or not a relationship exists between a response variable and a set of predictor variables, and if such a relationship exists, to describe it as well as possible. A main assumption in linear regression is constant variance, or homoscedasticity: different responses have the same error variance at every level of the predictors. This assumption works well when the response variable and the predictor variables are jointly normal (see normal distribution). As we will see later, the variance function in the normal setting is constant; in the absence of joint normality, however, we must find a way to quantify heteroscedasticity (non-constant variance). When the response is likely to follow a distribution in the exponential family, a generalized linear model may be more appropriate, and when we do not wish to force a parametric model onto the data, a non-parametric regression approach is useful. Being able to model the variance as a function of the mean improves inference in a parametric setting and improves estimation of the regression function in general, in any setting.

Variance functions also play a very important role in parameter estimation and inference. In general, maximum likelihood estimation requires that a likelihood function be defined, which in turn requires that one first specify the distribution of the observed responses. To define a quasi-likelihood, by contrast, one need only specify a relationship between the mean and the variance of the observations in order to use the quasi-likelihood function for estimation. Quasi-likelihood estimation is particularly useful in the presence of overdispersion: more variability in the data than would be expected under the assumed distribution. In summary, to ensure efficient inference of the regression parameters and the regression function, heteroscedasticity must be accounted for. Variance functions quantify the relationship between the variance and the mean of the observed data and hence play a significant role in regression estimation and inference.
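As a quick sketch of what overdispersion looks like in data (an illustration, not part of the original article; the rate 5.0 and the Gamma(2, 2.5) mixing distribution are arbitrary choices), the snippet below compares plain Poisson counts, whose variance roughly equals their mean, with Gamma-mixed Poisson counts, whose variance clearly exceeds it:

```python
import math
import random
import statistics

random.seed(42)

def poisson_draw(lam):
    # Knuth's method: multiply uniforms until the product drops below e^{-lam}.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Plain Poisson counts: variance should roughly equal the mean.
plain = [poisson_draw(5.0) for _ in range(20000)]

# Overdispersed counts: each observation gets its own Gamma-distributed rate,
# so the marginal variance exceeds the marginal mean.
over = [poisson_draw(random.gammavariate(2.0, 2.5)) for _ in range(20000)]

print(statistics.mean(plain), statistics.variance(plain))  # variance close to mean
print(statistics.mean(over), statistics.variance(over))    # variance well above mean
```

The Gamma mixing leaves the marginal mean unchanged while inflating the marginal variance, which is exactly the situation quasi-likelihood methods are designed for.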


Types

The variance function and its applications come up in many areas of statistical analysis. A very important use of this function is in the framework of generalized linear models and non-parametric regression.


Generalized linear model

When a member of the exponential family has been specified, the variance function can easily be derived. The general form of the variance function is presented under the exponential family context, as well as specific forms for the normal, Bernoulli, Poisson, and Gamma distributions. In addition, we describe the applications and use of variance functions in maximum likelihood estimation and quasi-likelihood estimation.


Derivation

The generalized linear model (GLM) is a generalization of ordinary regression analysis that extends to any member of the exponential family. It is particularly useful when the response variable is categorical, binary or subject to a constraint (e.g. only positive responses make sense). A quick summary of the components of a GLM is given here; for more details and information see the page on generalized linear models. A GLM consists of three main ingredients:

:1. Random component: a distribution of y from the exponential family, with \operatorname{E}[y \mid X] = \mu
:2. Linear predictor: \eta = X\beta = \sum_{j=1}^p X_j^T \beta_j
:3. Link function: \eta = g(\mu), \mu = g^{-1}(\eta)

First it is important to derive a couple of key properties of the exponential family. Any random variable y in the exponential family has a probability density function of the form,

:f(y,\theta,\phi) = \exp\left(\frac{\theta y - b(\theta)}{\phi} - c(y,\phi)\right)

with log-likelihood,

:\ell(\theta,y,\phi)=\log f(y,\theta,\phi) = \frac{\theta y - b(\theta)}{\phi} - c(y,\phi)

Here, \theta is the canonical parameter and the parameter of interest, and \phi is a nuisance parameter which plays a role in the variance. We use the Bartlett identities to derive a general expression for the variance function. The first and second Bartlett results ensure that, under suitable conditions (see the Leibniz integral rule), for a density function f_\theta depending on \theta,

:\operatorname{E}_\theta\left[\frac{\partial}{\partial\theta} \log f_\theta(y) \right]= 0
:\operatorname{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(y)\right] + \operatorname{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \log f_\theta(y)\right)^2\right]= 0

These identities lead to simple calculations of the expected value \operatorname{E}_\theta[y] and variance \operatorname{Var}_\theta[y] of any random variable y in the exponential family.

Expected value of Y: Taking the first derivative with respect to \theta of the log of the density in the exponential family form described above, we have

:\frac{\partial}{\partial\theta}\log f(y,\theta,\phi)= \frac{\partial}{\partial\theta}\left[\frac{\theta y - b(\theta)}{\phi} - c(y,\phi)\right]= \frac{y - b'(\theta)}{\phi}

Then taking the expected value and setting it equal to zero leads to,

:\operatorname{E}_\theta\left[\frac{y - b'(\theta)}{\phi}\right]= \frac{\operatorname{E}_\theta[y]-b'(\theta)}{\phi}=0
:\operatorname{E}_\theta[y] = b'(\theta)

Variance of Y: To compute the variance we use the second Bartlett identity,

:\operatorname{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\left(\frac{\theta y - b(\theta)}{\phi} - c(y,\phi)\right)\right] + \operatorname{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\left(\frac{\theta y - b(\theta)}{\phi} - c(y,\phi)\right)\right)^2\right]= 0
:\operatorname{E}_\theta\left[\frac{-b''(\theta)}{\phi}\right] + \operatorname{E}_\theta\left[\left(\frac{y-b'(\theta)}{\phi}\right)^2\right]= 0
:\operatorname{Var}_\theta[y] = b''(\theta)\phi

We now have a relationship between \mu and \theta, namely

:\mu = b'(\theta) and \theta = b'^{-1}(\mu),

which allows for a relationship between \mu and the variance:

:V(\theta) = b''(\theta)
:V(\mu) = b''(b'^{-1}(\mu)).

Note that because \operatorname{Var}_\theta[y] > 0, we have b''(\theta)>0, so b': \theta \rightarrow \mu is invertible. We derive the variance function for a few common distributions.
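As a numerical sanity check of these identities (a sketch, not part of the original text; the Poisson choice b(\theta)=e^\theta with \phi=1 and the value \theta=0.7 are illustrative assumptions), finite differences of b recover the mean b'(\theta) and variance b''(\theta)\phi:

```python
import math

# Exponential-family identities E[Y] = b'(theta), Var(Y) = b''(theta) * phi,
# checked for the Poisson cumulant function b(theta) = exp(theta), phi = 1.
b = math.exp
theta = 0.7  # illustrative value; lambda = e^theta
h = 1e-5

# Central finite differences approximate b'(theta) and b''(theta).
b1 = (b(theta + h) - b(theta - h)) / (2 * h)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2

lam = math.exp(theta)
print(b1, lam)  # mean: b'(theta) = lambda
print(b2, lam)  # variance: b''(theta) * phi = lambda
```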


Example – normal

The normal distribution is a special case where the variance function is constant. Let y \sim N(\mu,\sigma^2); then we put the density function of y in the form of the exponential family described above:

:f(y) = \exp\left(\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right)

where

: \theta = \mu,
: b(\theta) = \frac{\theta^2}{2},
: \phi=\sigma^2,
: c(y,\phi)= \frac{y^2}{2\phi} + \frac{1}{2}\ln(2\pi\phi)

To calculate the variance function V(\mu), we first express \theta as a function of \mu, and then transform V(\theta) into a function of \mu:

:\theta=\mu
:b'(\theta) = \theta = \operatorname{E}[y] = \mu
:V(\mu) = b''(\theta) = 1

Therefore, the variance function is constant.
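The constancy of V(\mu) = 1 can be seen empirically (a sketch under the illustrative choice \sigma = 2; not from the original article): samples centred at different means share one variance.

```python
import random
import statistics

random.seed(0)
sigma = 2.0  # illustrative choice

# With V(mu) = 1 for the Normal family, Var(Y) = phi * V(mu) = sigma^2
# no matter the mean: the sample variance is the same at every mean level.
vars_by_mu = {}
for mu in (-3.0, 0.0, 5.0):
    ys = [random.gauss(mu, sigma) for _ in range(50000)]
    vars_by_mu[mu] = statistics.variance(ys)
    print(mu, vars_by_mu[mu])  # each close to sigma^2 = 4
```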


Example – Bernoulli

Let y \sim \operatorname{Bernoulli}(p); then we express the density of the Bernoulli distribution in exponential family form,

: f(y) = \exp\left(y\ln\frac{p}{1-p} + \ln(1-p)\right)
:\theta = \ln\frac{p}{1-p} = \operatorname{logit}(p), which gives us p = \frac{e^\theta}{1+e^\theta} = \operatorname{expit}(\theta)
:b(\theta) = \ln(1+e^\theta)

and

: b'(\theta) = \frac{e^\theta}{1+e^\theta} = \operatorname{expit}(\theta) = p = \mu
: b''(\theta) = \frac{e^\theta}{1+e^\theta} - \left(\frac{e^\theta}{1+e^\theta}\right)^2 = \mu(1-\mu)

This gives us

:V(\mu) = \mu(1-\mu)
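The Bernoulli variance function can be verified by simulation (a sketch; the probabilities 0.1, 0.5, 0.8 are arbitrary illustrative values): the sample variance tracks p(1-p) at every success probability.

```python
import random
import statistics

random.seed(1)

# Check V(mu) = mu(1 - mu): sample variance of Bernoulli(p) draws
# should be close to p(1 - p) for each p.
results = {}
for p in (0.1, 0.5, 0.8):
    ys = [1 if random.random() < p else 0 for _ in range(100000)]
    results[p] = statistics.variance(ys)
    print(p, results[p], p * (1 - p))
```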


Example – Poisson

Let y \sim \operatorname{Poisson}(\lambda); then we express the density of the Poisson distribution in exponential family form,

: f(y) = \exp\left(y\ln\lambda - \lambda - \ln(y!)\right)
:\theta = \ln\lambda, which gives us \lambda = e^\theta
:b(\theta) = e^\theta

and

: b'(\theta) = e^\theta = \lambda = \mu
: b''(\theta) = e^\theta = \lambda = \mu

This gives us

:V(\mu) = \mu

Here we see the central property of Poisson data: the variance is equal to the mean.
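The mean–variance equality can again be checked by simulation (a sketch; the rates 1, 4, 9 and the stdlib sampler are illustrative choices, since Python's standard library has no built-in Poisson generator):

```python
import math
import random
import statistics

random.seed(2)

def poisson_draw(lam):
    # Knuth's method: multiply uniforms until the product drops below e^{-lam}.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# For Poisson data, V(mu) = mu: sample mean and sample variance agree.
checks = {}
for lam in (1.0, 4.0, 9.0):
    ys = [poisson_draw(lam) for _ in range(30000)]
    checks[lam] = (statistics.mean(ys), statistics.variance(ys))
    print(lam, checks[lam])  # mean and variance both close to lambda
```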


Example – Gamma

The Gamma distribution and its density function can be expressed under different parametrizations. We will use the form of the gamma with parameters (\mu,\nu):

:f_{\mu,\nu}(y) = \frac{1}{y\,\Gamma(\nu)}\left(\frac{y\nu}{\mu}\right)^\nu e^{-\frac{y\nu}{\mu}}

Then in exponential family form we have

:f_{\mu,\nu}(y) = \exp\left(\frac{-\frac{y}{\mu} - \ln\mu}{\frac{1}{\nu}} + \nu\ln(\nu y) - \ln y - \ln\Gamma(\nu)\right)
: \theta = -\frac{1}{\mu} \rightarrow \mu = -\frac{1}{\theta}
: \phi = \frac{1}{\nu}
: b(\theta) = -\ln(-\theta)
: b'(\theta) = -\frac{1}{\theta} = \mu
: b''(\theta) = \frac{1}{\theta^2} = \mu^2

And we have V(\mu) = \mu^2.
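In the (\mu,\nu) parametrization, \operatorname{Var}(Y) = \phi V(\mu) = \mu^2/\nu. A quick simulation check (a sketch; \nu = 4 and the mean values are arbitrary illustrative choices) uses the stdlib gamma sampler with shape \nu and scale \mu/\nu:

```python
import random
import statistics

random.seed(3)

# Gamma in the (mu, nu) parametrization: shape nu and scale mu/nu give
# E[Y] = mu and Var(Y) = mu^2 / nu, i.e. V(mu) = mu^2 with phi = 1/nu.
nu = 4.0
gamma_checks = {}
for mu in (1.0, 3.0, 10.0):
    ys = [random.gammavariate(nu, mu / nu) for _ in range(50000)]
    gamma_checks[mu] = statistics.variance(ys)
    print(mu, gamma_checks[mu], mu ** 2 / nu)  # sample variance close to mu^2 / nu
```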


Application – weighted least squares

A very important application of the variance function is its use in parameter estimation and inference when the response variable is of the required exponential family form, as well as in some cases when it is not (which we will discuss in quasi-likelihood). Weighted least squares (WLS) is a special case of generalized least squares. Each term in the WLS criterion includes a weight that determines the influence each observation has on the final parameter estimates. As in regular least squares, the goal is to estimate the unknown parameters in the regression function by finding parameter values that minimize the sum of squared deviations between the observed responses and the functional portion of the model. While WLS assumes independence of observations, it does not assume equal variance and is therefore a solution for parameter estimation in the presence of heteroscedasticity. The Gauss–Markov theorem and its generalization by Aitken show that the best linear unbiased estimator (BLUE), the unbiased estimator with minimum variance, weights each observation by the reciprocal of the variance of its measurement.

In the GLM framework, our goal is to estimate parameters \beta for which Z = g(\operatorname{E}[y \mid X]) = X\beta. Therefore, we would like to minimize (Z-X\beta)^T W (Z-X\beta), and if we define the weight matrix W as the diagonal matrix

:W_{n \times n} = \operatorname{diag}\left(\frac{1}{\phi V(\mu_1)[g'(\mu_1)]^2}, \frac{1}{\phi V(\mu_2)[g'(\mu_2)]^2}, \ldots, \frac{1}{\phi V(\mu_n)[g'(\mu_n)]^2}\right),

where \phi, V(\mu) and g(\mu) are defined in the previous section, this allows for iteratively reweighted least squares (IRLS) estimation of the parameters. See the section on iteratively reweighted least squares for more derivation and information. It is also important to note that when the weight matrix is of the form described here, minimizing the expression (Z-X\beta)^T W (Z-X\beta) also minimizes the Pearson distance.

The matrix W falls right out of the estimating equations for estimation of \beta. Maximum likelihood estimation for each parameter \beta_r, 1\leq r \leq p, requires

:\sum_{i=1}^n \frac{\partial \ell_i}{\partial \beta_r} = 0,

where \ell(\theta,y,\phi)=\log f(y,\theta,\phi) = \frac{\theta y - b(\theta)}{\phi} - c(y,\phi) is the log-likelihood. Looking at a single observation we have, by the chain rule,

:\frac{\partial \ell}{\partial \beta_r} = \frac{\partial \ell}{\partial \theta}\frac{\partial \theta}{\partial \mu}\frac{\partial \mu}{\partial \eta}\frac{\partial \eta}{\partial \beta_r}
:\frac{\partial \eta}{\partial \beta_r} = x_r
:\frac{\partial \ell}{\partial \theta} = \frac{y-b'(\theta)}{\phi}=\frac{y-\mu}{\phi}
:\frac{\partial \theta}{\partial \mu} = \frac{\partial b'^{-1}(\mu)}{\partial \mu} = \frac{1}{b''(b'^{-1}(\mu))} = \frac{1}{V(\mu)}

This gives us

:\frac{\partial \ell}{\partial \beta_r} =\frac{y-\mu}{\phi V(\mu)} \frac{\partial \mu}{\partial \eta}x_r,

and noting that

:\frac{\partial \eta}{\partial \mu} = g'(\mu),

we have that

:\frac{\partial \ell}{\partial \beta_r} = (y-\mu)\, W g'(\mu)\, x_r

The Hessian matrix is determined in a similar manner; its first term involves the residual (y-\mu) through the derivative of W\,\partial\eta/\partial\mu and therefore vanishes in expectation, leaving -X^TWX. Noticing that the Fisher information (FI),

:\operatorname{FI} = -\operatorname{E}[H] = X^TWX,

allows for the asymptotic approximation

:\hat\beta \sim N_p\left(\beta,(X^TWX)^{-1}\right),

and hence inference can be performed.
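The IRLS scheme above can be sketched for a Poisson GLM with log link (an illustrative simulation, not the article's own example; the data, the coefficients beta_true, and the iteration count are all assumptions). For the log link, g'(\mu) = 1/\mu and V(\mu) = \mu, so the weights 1/(\phi V(\mu)[g'(\mu)]^2) reduce to \mu itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated Poisson-regression data with log link; beta_true is arbitrary.
n = 5000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

# IRLS: weights W = mu, working response z = eta + (y - mu) * g'(mu).
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                  # working response
    W = mu                                   # diagonal of the weight matrix
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

# Asymptotic covariance from the Fisher information (X^T W X)^{-1}.
mu = np.exp(X @ beta)
cov = np.linalg.inv(X.T @ (mu[:, None] * X))
print(beta)                    # close to beta_true
print(np.sqrt(np.diag(cov)))   # standard errors
```

Each IRLS step is an ordinary weighted least squares solve, with the weights recomputed from the current fitted means.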


Application – quasi-likelihood

Because most features of GLMs only depend on the first two moments of the distribution, rather than the entire distribution, the quasi-likelihood can be developed by just specifying a link function and a variance function. That is, we need to specify

: – Link function: \operatorname{E}[y] = \mu = g^{-1}(\eta)
: – Variance function: V(\mu), with \operatorname{Var}_\theta(y) = \sigma^2V(\mu)

With a specified variance function and link function we can develop, as alternatives to the log-likelihood function, the score function, and the Fisher information, a quasi-likelihood, a quasi-score, and the quasi-information. This allows for full inference of \beta.

Quasi-likelihood (QL)

Though called a quasi-likelihood, this is in fact a quasi-log-likelihood. The QL for one observation is

: Q_i(\mu_i,y_i) = \int_{y_i}^{\mu_i} \frac{y_i - t}{\sigma^2 V(t)} \, dt

and therefore the QL for all n observations is

:Q(\mu,y) = \sum_{i=1}^n Q_i(\mu_i,y_i) = \sum_{i=1}^n \int_{y_i}^{\mu_i} \frac{y_i - t}{\sigma^2 V(t)} \, dt

From the QL we have the quasi-score.

Quasi-score (QS)

Recall that the score function, U, for data with log-likelihood \ell(\mu\mid y) is

:U = \frac{\partial \ell}{\partial \mu}.

We obtain the quasi-score in an identical manner,

:U = \frac{y-\mu}{\sigma^2 V(\mu)},

noting that, for one observation, the score is

:\frac{\partial Q}{\partial \mu} = \frac{y-\mu}{\sigma^2 V(\mu)}

The first two Bartlett equations are satisfied for the quasi-score, namely

: \operatorname{E}[U] = 0

and

: \operatorname{Var}(U) + \operatorname{E}\left[\frac{\partial U}{\partial \mu}\right] = 0.

In addition, the quasi-score is linear in y. Ultimately the goal is to find information about the parameters of interest \beta. Both the QS and the QL are actually functions of \beta. Recall that \mu = g^{-1}(\eta) and \eta = X\beta; therefore,

:\mu = g^{-1}(X\beta).

Quasi-information (QI)

The quasi-information is similar to the Fisher information,

:i_b = -\operatorname{E}\left[\frac{\partial U}{\partial \beta}\right]

QL, QS, QI as functions of \beta

The QL, QS and QI all provide the building blocks for inference about the parameters of interest, and therefore it is important to express them all as functions of \beta. Recalling again that \mu = g^{-1}(X\beta), we derive the expressions for QL, QS and QI parametrized under \beta. The quasi-likelihood in \beta for a single observation is

: Q(\beta,y) = \int_y^{\mu(\beta)} \frac{y - t}{\sigma^2 V(t)} \, dt

The QS as a function of \beta is therefore

:U_j(\beta) = \frac{\partial}{\partial \beta_j} Q(\beta,y) = \sum_{i=1}^n \frac{\partial \mu_i}{\partial \beta_j} \frac{y_i-\mu_i}{\sigma^2 V(\mu_i)}
:U(\beta) = \begin{pmatrix} U_1(\beta)\\ \vdots\\ U_p(\beta) \end{pmatrix} = D^TV^{-1}\frac{y-\mu}{\sigma^2}

where D_{n \times p} = \left[\partial \mu_i/\partial \beta_j\right] is the matrix of partial derivatives of the means with respect to the parameters, and

:V_{n \times n} = \operatorname{diag}(V(\mu_1),V(\mu_2),\ldots,V(\mu_n))

The quasi-information matrix in \beta is

:i_b = -\frac{\partial U}{\partial \beta} = \operatorname{Var}(U(\beta)) = \frac{D^TV^{-1}D}{\sigma^2}

Obtaining the score function and the information of \beta allows for parameter estimation and inference in a similar manner as described in Application – weighted least squares.
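A quasi-Poisson fit can be sketched as follows (an illustrative simulation; the data-generating choices, coefficients, and the Gamma mixing are all assumptions). The quasi-score equations with V(\mu) = \mu coincide with the Poisson likelihood equations, so IRLS still gives consistent estimates of \beta, while the dispersion \sigma^2 is estimated separately from the Pearson statistic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Mean model E[y] = exp(X beta) with Var(y) = sigma^2 * mu; overdispersed
# counts generated via a Gamma-mixed Poisson (mean-preserving mixing).
n = 4000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
beta_true = np.array([1.0, 0.5])
mu_true = np.exp(X @ beta_true)
y = rng.poisson(mu_true * rng.gamma(2.0, 0.5, n))  # keeps the mean, inflates variance

# IRLS with V(mu) = mu solves the quasi-score equations D^T V^{-1} (y - mu) = 0.
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))

# Dispersion sigma^2 estimated by the Pearson statistic over its d.o.f.
mu = np.exp(X @ beta)
dispersion = np.sum((y - mu) ** 2 / mu) / (n - X.shape[1])
print(beta)        # close to beta_true despite overdispersion
print(dispersion)  # well above 1, signalling overdispersion
```

Standard errors from the Poisson fit would then be inflated by the square root of the estimated dispersion.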


Non-parametric regression analysis

Non-parametric estimation of the variance function and its importance have been discussed widely in the literature. In non-parametric regression analysis, the goal is to express the expected value of the response variable (y) as a function of the predictors (X). That is, we are looking to estimate the mean function, g(x) = \operatorname{E}[y \mid X=x], without assuming a parametric form. There are many forms of non-parametric smoothing methods to help estimate the function g(x). An interesting approach is to also look at a non-parametric variance function, g_v(x) = \operatorname{Var}(Y\mid X=x). A non-parametric variance function allows one to look at the mean function as it relates to the variance function and to notice patterns in the data:

:g_v(x) = \operatorname{Var}(Y\mid X=x) = \operatorname{E}[y^2\mid X=x] - \left[\operatorname{E}[y\mid X=x]\right]^2

An example is detailed in the pictures to the right. The goal of the project was to determine (among other things) whether or not the predictor, number of years in the major leagues (baseball), had an effect on the response, the salary a player made. An initial scatter plot of the data indicates that there is heteroscedasticity, as the variance is not constant at each level of the predictor. Because we can visually detect the non-constant variance, it is useful to plot g_v(x) and see whether its shape is indicative of any known distribution. One can estimate \operatorname{E}[y^2\mid X=x] and \left[\operatorname{E}[y\mid X=x]\right]^2 using a general smoothing method. The plot of the non-parametric smoothed variance function can give the researcher an idea of the relationship between the variance and the mean. The picture to the right indicates a quadratic relationship between the mean and the variance. As we saw above, the Gamma variance function is quadratic in the mean.
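The smoothing recipe above can be sketched on synthetic data (an illustration, not the baseball data; the mean function 2x, the noise model, and the Nadaraya–Watson smoother with a Gaussian kernel are all assumptions). Here the standard deviation is proportional to the mean, so the true variance function is quadratic in the mean, as for the Gamma family:

```python
import numpy as np

rng = np.random.default_rng(2)

# Heteroscedastic toy data: mean 2x, standard deviation proportional to the mean.
n = 3000
x = rng.uniform(1, 10, n)
y = 2 * x + (0.5 * 2 * x) * rng.standard_normal(n)

def ksmooth(x0, xs, ys, h=0.7):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((xs - x0) / h) ** 2)
    return np.sum(w * ys) / np.sum(w)

# Smooth y and y^2 separately, then combine: g_v(x) = E[y^2|x] - (E[y|x])^2.
grid = np.linspace(2, 9, 15)
m_hat = np.array([ksmooth(g, x, y) for g in grid])
m2_hat = np.array([ksmooth(g, x, y ** 2) for g in grid])
v_hat = m2_hat - m_hat ** 2

print(v_hat)  # grows roughly like x^2, the true variance function
```

Plotting v_hat against m_hat would display the quadratic mean–variance relationship directly.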

