Model specification
   HOME

TheInfoList



OR:

In statistics, model specification is part of the process of building a statistical model: specification consists of selecting an appropriate functional form for the model and choosing which variables to include. For example, given
personal income In economics, personal income refers to an individual's total earnings from wages, investment enterprises, and other ventures. It is the sum of all the incomes received by all the individuals or household during a given period. Personal income is ...
y together with years of schooling s and on-the-job experience x, we might specify a functional relationship y = f(s,x) as follows: : \ln y = \ln y_0 + \rho s + \beta_1 x + \beta_2 x^2 + \varepsilon where \varepsilon is the unexplained
error term In mathematics and statistics, an error term is an additive type of error. Common examples include: * errors and residuals in statistics, e.g. in linear regression * the error term in numerical integration In analysis, numerical integration ...
that is supposed to comprise
independent and identically distributed In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is usual ...
Gaussian variables. The statistician Sir David Cox has said, "How hetranslation from subject-matter problem to statistical model is done is often the most critical part of an analysis".


Specification error and bias

Specification error occurs when the functional form or the choice of independent variables poorly represent relevant aspects of the true data-generating process. In particular,
bias Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group ...
(the expected value of the difference of an estimated
parameter A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...
and the true underlying value) occurs if an independent variable is correlated with the errors inherent in the underlying process. There are several different possible causes of specification error; some are listed below. *An inappropriate functional form could be employed. *A variable omitted from the model may have a relationship with both the dependent variable and one or more of the independent variables (causing
omitted-variable bias In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to those that were included. More specifically, OV ...
). *An irrelevant variable may be included in the model (although this does not create bias, it involves overfitting and so can lead to poor predictive performance). *The dependent variable may be part of a system of
simultaneous equations In mathematics, a set of simultaneous equations, also known as a system of equations or an equation system, is a finite set of equations for which common solutions are sought. An equation system is usually classified in the same manner as single e ...
(giving simultaneity bias). Additionally, measurement errors may affect the independent variables: while this is not a specification error, it can create statistical bias. Note that all models will have some specification error. Indeed, in statistics there is a common aphorism that "
all models are wrong All or ALL may refer to: Language * All, an indefinite pronoun in English * All, one of the English determiners * Allar language (ISO 639-3 code) * Allative case (abbreviated ALL) Music * All (band), an American punk rock band * ''All'' (All ...
". In the words of Burnham & Anderson, "Modeling is an art as well as a science and is directed toward finding a good approximating model ... as the basis for statistical inference".


Detection of misspecification

The
Ramsey RESET test In statistics, the Ramsey Regression Equation Specification Error Test (RESET) test is a general specification test for the linear regression model. More specifically, it tests whether non-linear combinations of the fitted values help explain the ...
can help test for specification error in
regression analysis In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
. In the example given above relating personal income to schooling and job experience, if the assumptions of the model are correct, then the least squares estimates of the parameters \rho and \beta will be efficient and
unbiased Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, ...
. Hence specification diagnostics usually involve testing the first to fourth moment of the residuals.


Model building

Building a model involves finding a set of relationships to represent the process that is generating the data. This requires avoiding all the sources of misspecification mentioned above. One approach is to start with a model in general form that relies on a theoretical understanding of the data-generating process. Then the model can be fit to the data and checked for the various sources of misspecification, in a task called '' statistical model validation''. Theoretical understanding can then guide the modification of the model in such a way as to retain theoretical validity while removing the sources of misspecification. But if it proves impossible to find a theoretically acceptable specification that fits the data, the theoretical model may have to be rejected and replaced with another one. A quotation from Karl Popper is apposite here: "Whenever a theory appears to you as the only possible one, take this as a sign that you have neither understood the theory nor the problem which it was intended to solve".. Another approach to model building is to specify several different models as candidates, and then compare those candidate models to each other. The purpose of the comparison is to determine which candidate model is most appropriate for statistical inference. Common criteria for comparing models include the following: ''R''2,
Bayes factor The Bayes factor is a ratio of two competing statistical models represented by their marginal likelihood, and is used to quantify the support for one model over the other. The models in questions can have a common set of parameters, such as a nul ...
, and the
likelihood-ratio test In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after im ...
together with its generalization
relative likelihood In statistics, suppose that we have been given some data, and we are selecting a statistical model for that data. The relative likelihood compares the relative plausibilities of different candidate models or of different values of a parameter of ...
. For more on this topic, see '' statistical model selection''.


See also

* Abductive reasoning * Conceptual model * Data analysis *
Data transformation (statistics) In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set—that is, each data point ''zi'' is replaced with the transformed value ''yi'' = ''f''(''zi''), where ''f'' is a functi ...
*
Design of experiments The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associ ...
*
Durbin–Wu–Hausman test The Durbin–Wu–Hausman test (also called Hausman specification test) is a statistical hypothesis test in econometrics named after James Durbin, De-Min Wu, and Jerry A. Hausman. The test evaluates the consistency of an estimator when compared t ...
*
Exploratory data analysis In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but pri ...
*
Feature selection In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...
* Information matrix test *
Model identification In statistics, identifiability is a property which a model must satisfy for precise inference to be possible. A model is identifiable if it is theoretically possible to learn the true values of this model's underlying parameters after obtaining ...
*
Principle of Parsimony Occam's razor, Ockham's razor, or Ocham's razor ( la, novacula Occami), also known as the principle of parsimony or the law of parsimony ( la, lex parsimoniae), is the problem-solving principle that "entities should not be multiplied beyond neces ...
*
Spurious relationship In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but '' not'' causally related, due to either coincidence or the presence of a certain third, u ...
*
Statistical conclusion validity Statistical conclusion validity is the degree to which conclusions about the relationship among variables based on the data are correct or "reasonable". This began as being solely about whether the statistical conclusion about the relationship of ...
* Statistical inference *
Statistical learning theory Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on dat ...


Notes


Further reading

* . * * * * . * * * * * {{cite journal, last = Sapra, first = Sunil, title = A regression error specification test (RESET) for generalized linear models, journal =
Economics Bulletin The ''Economics Bulletin'' is a peer-reviewed open access academic journal that publishes concise notes, comments, and preliminary results in all areas of economics. The journal does not accept appeals and new versions of previously declined ma ...
, volume = 3, issue = 1, year = 2005, pages = 1–6, url = http://economicsbulletin.vanderbilt.edu/2005/volume3/EB-04C50033A.pdf Regression variable selection Statistical models