In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.


Details

Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an ''m''-length vector of correlated real numbers. As in the standard regression setup, there are ''n'' observations, where each observation ''i'' consists of ''k''−1 explanatory variables, grouped into a vector \mathbf{x}_i of length ''k'' (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of ''m'' related regression problems for each observation ''i'':

\begin{align} y_{i,1} &= \mathbf{x}_i^\mathsf{T}\boldsymbol\beta_{1} + \epsilon_{i,1} \\ &\;\;\vdots \\ y_{i,m} &= \mathbf{x}_i^\mathsf{T}\boldsymbol\beta_{m} + \epsilon_{i,m} \end{align}

where the set of errors \{\epsilon_{i,1},\ldots,\epsilon_{i,m}\} are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector \mathbf{y}_i^\mathsf{T} and the regression coefficient vectors are stacked next to each other, as follows:

\mathbf{y}_i^\mathsf{T} = \mathbf{x}_i^\mathsf{T}\mathbf{B} + \boldsymbol\epsilon_{i}^\mathsf{T}.

The coefficient matrix \mathbf{B} is a k \times m matrix where the coefficient vectors \boldsymbol\beta_1,\ldots,\boldsymbol\beta_m for each regression problem are stacked horizontally:

\mathbf{B} = \begin{bmatrix} \boldsymbol\beta_1 & \cdots & \boldsymbol\beta_m \end{bmatrix} = \begin{bmatrix} \beta_{1,1} & \cdots & \beta_{1,m} \\ \vdots & \ddots & \vdots \\ \beta_{k,1} & \cdots & \beta_{k,m} \end{bmatrix}.

The noise vector \boldsymbol\epsilon_i for each observation ''i'' is jointly normal, so that the outcomes for a given observation are correlated:

\boldsymbol\epsilon_i \sim N(0, \boldsymbol\Sigma_{\epsilon}).

We can write the entire regression problem in matrix form as:

\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E},

where \mathbf{Y} and \mathbf{E} are n \times m matrices.
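As a concrete illustration, the following sketch simulates a data set from this model (Python with NumPy is an assumption of the example, not part of the article; the dimensions and parameter values are arbitrary illustrative choices). The design matrix is built with the dummy column of 1s described above:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, m = 200, 3, 2                   # observations, predictors (incl. intercept), outcomes

    # Design matrix: a dummy column of 1s for the intercept, then k-1 explanatory variables
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    B_true = rng.normal(size=(k, m))      # true k x m coefficient matrix B
    Sigma_eps = np.array([[1.0, 0.6],
                          [0.6, 2.0]])    # m x m noise covariance: correlated outcomes

    E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)   # rows are eps_i^T
    Y = X @ B_true + E                    # n x m matrix of outcomes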
The design matrix \mathbf{X} is an n \times k matrix with the observations stacked vertically, as in the standard linear regression setup:

\mathbf{X} = \begin{bmatrix} \mathbf{x}^\mathsf{T}_1 \\ \mathbf{x}^\mathsf{T}_2 \\ \vdots \\ \mathbf{x}^\mathsf{T}_n \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,k} \\ x_{2,1} & \cdots & x_{2,k} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{bmatrix}.

The classical, frequentist linear least squares solution is to simply estimate the matrix of regression coefficients \hat{\mathbf{B}} using the Moore–Penrose pseudoinverse:

\hat{\mathbf{B}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{Y}.

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent). Let us write our conditional likelihood as (Rossi, Allenby and McCulloch 2012, p. 32)

\rho(\mathbf{E}|\boldsymbol\Sigma_{\epsilon}) \propto |\boldsymbol\Sigma_{\epsilon}|^{-n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left(\mathbf{E}^\mathsf{T} \mathbf{E} \boldsymbol\Sigma_{\epsilon}^{-1}\right) \right).

Writing the error \mathbf{E} in terms of \mathbf{Y}, \mathbf{X}, and \mathbf{B} yields

\rho(\mathbf{Y}|\mathbf{X},\mathbf{B},\boldsymbol\Sigma_{\epsilon}) \propto |\boldsymbol\Sigma_{\epsilon}|^{-n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^\mathsf{T} (\mathbf{Y}-\mathbf{X}\mathbf{B}) \boldsymbol\Sigma_{\epsilon}^{-1}\right)\right).

We seek a natural conjugate prior, i.e. a joint density \rho(\mathbf{B},\boldsymbol\Sigma_{\epsilon}) which is of the same functional form as the likelihood. Since the likelihood is quadratic in \mathbf{B}, we re-write the likelihood so it is normal in (\mathbf{B}-\hat{\mathbf{B}}) (the deviation from the classical sample estimate).
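Continuing the simulated data above, a minimal sketch of the classical estimate and of evaluating the likelihood kernel just given (the helper name log_likelihood_kernel is an illustrative choice):

    # Classical estimate: minimum-norm least squares, equal to pinv(X) @ Y,
    # which is (X^T X)^-1 X^T Y when X has full column rank.
    B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

    def log_likelihood_kernel(B, Sigma):
        # log of |Sigma|^(-n/2) * exp(-1/2 tr(E^T E Sigma^-1)), up to an additive constant
        R = Y - X @ B                     # residual matrix E = Y - X B
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * X.shape[0] * logdet - 0.5 * np.trace(R.T @ R @ np.linalg.inv(Sigma))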
Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix-form of the sum-of-squares technique. Here, however, we will also need to use matrix differential calculus (the Kronecker product and vectorization transformations).

First, let us apply sum-of-squares to obtain a new expression for the likelihood:

\rho(\mathbf{Y}|\mathbf{X},\mathbf{B},\boldsymbol\Sigma_{\epsilon}) \propto |\boldsymbol\Sigma_{\epsilon}|^{-(n-k)/2} \exp\left(-\operatorname{tr}\left(\tfrac{1}{2}\mathbf{S}^\mathsf{T} \mathbf{S} \boldsymbol\Sigma_{\epsilon}^{-1}\right)\right) |\boldsymbol\Sigma_{\epsilon}|^{-k/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^\mathsf{T} \mathbf{X}^\mathsf{T} \mathbf{X}(\mathbf{B}-\hat{\mathbf{B}}) \boldsymbol\Sigma_{\epsilon}^{-1}\right)\right),

where \mathbf{S} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}.

We would like to develop a conditional form for the priors:

\rho(\mathbf{B},\boldsymbol\Sigma_{\epsilon}) = \rho(\boldsymbol\Sigma_{\epsilon})\rho(\mathbf{B}|\boldsymbol\Sigma_{\epsilon}),

where \rho(\boldsymbol\Sigma_{\epsilon}) is an inverse-Wishart distribution and \rho(\mathbf{B}|\boldsymbol\Sigma_{\epsilon}) is some form of normal distribution in the matrix \mathbf{B}. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices \mathbf{B}, \hat{\mathbf{B}} to a function of the vectors \boldsymbol\beta = \operatorname{vec}(\mathbf{B}), \hat{\boldsymbol\beta} = \operatorname{vec}(\hat{\mathbf{B}}).

Write

\operatorname{tr}\left((\mathbf{B} - \hat{\mathbf{B}})^\mathsf{T}\mathbf{X}^\mathsf{T} \mathbf{X}(\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_{\epsilon}^{-1}\right) = \operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}})^\mathsf{T} \operatorname{vec}\left(\mathbf{X}^\mathsf{T} \mathbf{X}(\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_{\epsilon}^{-1}\right).

Let

\operatorname{vec}\left(\mathbf{X}^\mathsf{T} \mathbf{X}(\mathbf{B} - \hat{\mathbf{B}}) \boldsymbol\Sigma_{\epsilon}^{-1}\right) = (\boldsymbol\Sigma_{\epsilon}^{-1} \otimes \mathbf{X}^\mathsf{T}\mathbf{X})\operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}}),

where \mathbf{A} \otimes \mathbf{B} denotes the Kronecker product of matrices A and B, a generalization of the outer product which multiplies an m \times n matrix by a p \times q matrix to generate an mp \times nq matrix, consisting of every combination of products of elements from the two matrices. Then

\begin{align} &\operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}})^\mathsf{T} (\boldsymbol\Sigma_{\epsilon}^{-1} \otimes \mathbf{X}^\mathsf{T}\mathbf{X})\operatorname{vec}(\mathbf{B} - \hat{\mathbf{B}}) \\ &= (\boldsymbol\beta - \hat{\boldsymbol\beta})^\mathsf{T}(\boldsymbol\Sigma_{\epsilon}^{-1} \otimes \mathbf{X}^\mathsf{T}\mathbf{X})(\boldsymbol\beta-\hat{\boldsymbol\beta}), \end{align}

which will lead to a likelihood which is normal in (\boldsymbol\beta - \hat{\boldsymbol\beta}).

With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.
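Before moving on, the vec–Kronecker identity above is easy to check numerically; this sketch (continuing the variables from the previous sketches) uses NumPy's column-major flatten, which matches the column-stacking convention of \operatorname{vec}:

    D = B_true - B_hat                    # any k x m deviation matrix will do
    Sinv = np.linalg.inv(Sigma_eps)

    lhs = np.trace(D.T @ X.T @ X @ D @ Sinv)
    vecD = D.flatten(order="F")           # vec() stacks columns, hence order="F"
    rhs = vecD @ np.kron(Sinv, X.T @ X) @ vecD
    assert np.isclose(lhs, rhs)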


Conjugate prior distribution

The natural conjugate prior using the vectorized variable \boldsymbol\beta is of the form:

\rho(\boldsymbol\beta, \boldsymbol\Sigma_{\epsilon}) = \rho(\boldsymbol\Sigma_{\epsilon})\rho(\boldsymbol\beta|\boldsymbol\Sigma_{\epsilon}),

where

\rho(\boldsymbol\Sigma_{\epsilon}) \sim \mathcal{W}^{-1}(\mathbf{V}_0,\boldsymbol\nu_0)

and

\rho(\boldsymbol\beta|\boldsymbol\Sigma_{\epsilon}) \sim N(\boldsymbol\beta_0, \boldsymbol\Sigma_{\epsilon} \otimes \boldsymbol\Lambda_0^{-1}).
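A sketch of drawing (\boldsymbol\Sigma_{\epsilon}, \boldsymbol\beta) from this prior, assuming SciPy's invwishart for the inverse-Wishart draw; the hyperparameter values below are arbitrary illustrative choices, not recommendations:

    from scipy.stats import invwishart

    V_0 = np.eye(m)                       # inverse-Wishart scale matrix
    nu_0 = m + 2                          # inverse-Wishart degrees of freedom
    B_0 = np.zeros((k, m))                # prior mean of B (beta_0 = vec(B_0))
    Lambda_0 = np.eye(k)                  # prior precision in the coefficient space

    Sigma_draw = np.atleast_2d(invwishart.rvs(df=nu_0, scale=V_0, random_state=rng))
    cov = np.kron(Sigma_draw, np.linalg.inv(Lambda_0))   # Sigma_eps (x) Lambda_0^-1
    beta_draw = rng.multivariate_normal(B_0.flatten(order="F"), cov)
    B_draw = beta_draw.reshape((k, m), order="F")        # undo vec() to recover B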


Posterior distribution

Using the above prior and likelihood, the posterior distribution can be expressed as:

\begin{align} \rho(\boldsymbol\beta,\boldsymbol\Sigma_{\epsilon}|\mathbf{Y},\mathbf{X}) \propto{}& |\boldsymbol\Sigma_{\epsilon}|^{-(\nu_0+m+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{V}_0\boldsymbol\Sigma_{\epsilon}^{-1}\right)\right) \\ &\times|\boldsymbol\Sigma_{\epsilon}|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_0)^\mathsf{T}\boldsymbol\Lambda_0(\mathbf{B}-\mathbf{B}_0)\boldsymbol\Sigma_{\epsilon}^{-1}\right)\right) \\ &\times|\boldsymbol\Sigma_{\epsilon}|^{-n/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\boldsymbol\Sigma_{\epsilon}^{-1}\right)\right), \end{align}

where \operatorname{vec}(\mathbf{B}_0) = \boldsymbol\beta_0. The terms involving \mathbf{B} can be grouped (with \boldsymbol\Lambda_0 = \mathbf{U}^\mathsf{T}\mathbf{U}, where \mathbf{U} is a square root of \boldsymbol\Lambda_0, e.g. its Cholesky factor) using:

\begin{align} &(\mathbf{B} - \mathbf{B}_0)^\mathsf{T} \boldsymbol\Lambda_0 (\mathbf{B} - \mathbf{B}_0) + (\mathbf{Y} - \mathbf{X}\mathbf{B})^\mathsf{T}(\mathbf{Y} - \mathbf{X}\mathbf{B}) \\ ={}& \left(\begin{bmatrix}\mathbf{Y} \\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right)^\mathsf{T} \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix}-\begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right) \\ ={}& \left(\begin{bmatrix}\mathbf{Y} \\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}_n\right)^\mathsf{T}\left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix}-\begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}_n\right) + (\mathbf{B} - \mathbf{B}_n)^\mathsf{T}(\mathbf{X}^\mathsf{T}\mathbf{X} + \boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n) \\ ={}& (\mathbf{Y} - \mathbf{X}\mathbf{B}_n)^\mathsf{T}(\mathbf{Y} - \mathbf{X}\mathbf{B}_n) + (\mathbf{B}_0 - \mathbf{B}_n)^\mathsf{T}\boldsymbol\Lambda_0(\mathbf{B}_0 - \mathbf{B}_n) + (\mathbf{B} - \mathbf{B}_n)^\mathsf{T}(\mathbf{X}^\mathsf{T}\mathbf{X} + \boldsymbol\Lambda_0)(\mathbf{B} - \mathbf{B}_n), \end{align}

with

\mathbf{B}_n = (\mathbf{X}^\mathsf{T}\mathbf{X} + \boldsymbol\Lambda_0)^{-1}(\mathbf{X}^\mathsf{T}\mathbf{X}\hat{\mathbf{B}} + \boldsymbol\Lambda_0\mathbf{B}_0) = (\mathbf{X}^\mathsf{T}\mathbf{X} + \boldsymbol\Lambda_0)^{-1}(\mathbf{X}^\mathsf{T}\mathbf{Y} + \boldsymbol\Lambda_0\mathbf{B}_0).

This now allows us to write the posterior in a more useful form:

\begin{align} \rho(\boldsymbol\beta,\boldsymbol\Sigma_{\epsilon}|\mathbf{Y},\mathbf{X}) \propto{}& |\boldsymbol\Sigma_{\epsilon}|^{-(\nu_0+m+n+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\left(\mathbf{V}_0 + (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n-\mathbf{B}_0)^\mathsf{T}\boldsymbol\Lambda_0(\mathbf{B}_n-\mathbf{B}_0)\right)\boldsymbol\Sigma_{\epsilon}^{-1}\right)\right) \\ &\times|\boldsymbol\Sigma_{\epsilon}|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_n)^\mathsf{T}(\mathbf{X}^\mathsf{T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n)\boldsymbol\Sigma_{\epsilon}^{-1}\right)\right). \end{align}

This takes the form of an inverse-Wishart distribution times a matrix normal distribution:

\rho(\boldsymbol\Sigma_{\epsilon}|\mathbf{Y},\mathbf{X}) \sim \mathcal{W}^{-1}(\mathbf{V}_n,\boldsymbol\nu_n)

and

\rho(\mathbf{B}|\mathbf{Y},\mathbf{X},\boldsymbol\Sigma_{\epsilon}) \sim \mathcal{MN}_{k,m}(\mathbf{B}_n, \boldsymbol\Lambda_n^{-1}, \boldsymbol\Sigma_{\epsilon}).

The parameters of this posterior are given by:

\mathbf{V}_n = \mathbf{V}_0 + (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^\mathsf{T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n - \mathbf{B}_0)^\mathsf{T}\boldsymbol\Lambda_0(\mathbf{B}_n-\mathbf{B}_0),
\boldsymbol\nu_n = \boldsymbol\nu_0 + n,
\mathbf{B}_n = (\mathbf{X}^\mathsf{T}\mathbf{X} + \boldsymbol\Lambda_0)^{-1}(\mathbf{X}^\mathsf{T}\mathbf{Y} + \boldsymbol\Lambda_0\mathbf{B}_0),
\boldsymbol\Lambda_n = \mathbf{X}^\mathsf{T}\mathbf{X} + \boldsymbol\Lambda_0.
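These closed-form updates translate directly into code; the sketch below (continuing the variables from the previous sketches) computes the posterior parameters and draws one joint posterior sample:

    Lambda_n = X.T @ X + Lambda_0
    B_n = np.linalg.solve(Lambda_n, X.T @ Y + Lambda_0 @ B_0)    # posterior mean of B
    V_n = (V_0
           + (Y - X @ B_n).T @ (Y - X @ B_n)
           + (B_n - B_0).T @ Lambda_0 @ (B_n - B_0))
    nu_n = nu_0 + n

    # One joint draw: Sigma_eps | Y, X first, then B | Y, X, Sigma_eps
    Sigma_post = np.atleast_2d(invwishart.rvs(df=nu_n, scale=V_n, random_state=rng))
    cov_post = np.kron(Sigma_post, np.linalg.inv(Lambda_n))
    beta_post = rng.multivariate_normal(B_n.flatten(order="F"), cov_post)
    B_post = beta_post.reshape((k, m), order="F")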


See also

* Bayesian linear regression
* Matrix normal distribution


References

* Rossi, Peter E.; Allenby, Greg M.; McCulloch, Rob. ''Bayesian Statistics and Marketing''. John Wiley & Sons, 2012.