In
statistics
Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
, missing data, or missing values, occur when no
data
In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
value
Value or values may refer to:
Ethics and social
* Value (ethics) wherein said concept may be construed as treating actions themselves as abstract objects, associating value to them
** Values (Western philosophy) expands the notion of value beyo ...
is stored for the
variable
Variable may refer to:
* Variable (computer science), a symbolic name associated with a value and whose associated value may be changed
* Variable (mathematics), a symbol that represents a quantity in a mathematical expression, as used in many ...
in an
observation
Observation is the active acquisition of information from a primary source. In living beings, observation employs the senses. In science, observation can also involve the perception and recording of data via the use of scientific instruments. The ...
. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject"). Some items are more likely to generate a nonresponse than others: for example items about private subjects such as income.
Attrition
Attrition may refer to
*Attrition warfare, the military strategy of wearing down the enemy by continual losses in personnel and material
**War of Attrition, fought between Egypt and Israel from 1968 to 1970
**War of attrition (game), a model of agg ...
is a type of missingness that can occur in longitudinal studies—for instance studying development where a measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the test ends and one or more measurements are missing.
Data often are missing in research in
economics
Economics () is the social science that studies the Production (economics), production, distribution (economics), distribution, and Consumption (economics), consumption of goods and services.
Economics focuses on the behaviour and intera ...
,
sociology
Sociology is a social science that focuses on society, human social behavior, patterns of Interpersonal ties, social relationships, social interaction, and aspects of culture associated with everyday life. It uses various methods of Empirical ...
, and
political science
Political science is the scientific study of politics. It is a social science dealing with systems of governance and power, and the analysis of political activities, political thought, political behavior, and associated constitutions and la ...
because governments or private entities choose not to, or fail to, report critical statistics, or because the information is not available. Sometimes missing values are caused by the researcher—for example, when data collection is done improperly or mistakes are made in data entry.
These forms of missingness take different types, with different impacts on the validity of conclusions from research: Missing completely at random, missing at random, and missing not at random. Missing data can be handled similarly as
censored data.
Types
Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, analysis may be biased. For example, in a study of the relation between IQ and income, if participants with an above-average IQ tend to skip the question ‘What is your salary?’, analyses that do not take into account this missing at random (MAR pattern (see below)) may falsely fail to find a positive association between IQ and salary. Because of these problems, methodologists routinely advise researchers to design studies to minimize the occurrence of missing values.
Graphical models can be used to describe the missing data mechanism in detail.
Missing completely at random
Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random.
When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR.
In the case of MCAR, the missingness of data is unrelated to any study variable: thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice.
Missing at random
Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. An example is that males are less likely to fill in a depression survey but this has nothing to do with their level of depression, after accounting for maleness. Depending on the analysis method, these data can still induce parameter bias in analyses due to the contingent emptiness of cells (male, very high depression may have zero entries). However, if the parameter is estimated with Full Information Maximum Likelihood, MAR will provide asymptotically unbiased estimates.
Missing not at random
Missing not at random (MNAR) (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR (i.e. the value of the variable that's missing is related to the reason it's missing).
To extend the previous example, this would occur if men failed to fill in a depression survey ''because'' of their level of depression.
Samuelson and Spirer (1992) discussed how missing and/or distorted data about demographics, law enforcement, and health could be indicators of patterns of human rights violations. They gave several fairly well documented examples.
Techniques of dealing with missing data
Missing data reduces the representativeness of the sample and can therefore distort inferences about the population. Generally speaking, there are three main approaches to handle missing data: (1) ''Imputation''—where values are filled in the place of missing data, (2) ''omission''—where samples with invalid data are discarded from further analysis and (3) ''analysis''—by directly applying methods unaffected by the missing values. One systematic review addressing the prevention and handling of missing data for patient-centered outcomes research identified 10 standards as necessary for the prevention and handling of missing data. These include standards for study design, study conduct, analysis, and reporting.
In some practical application, the experimenters can control the level of missingness, and prevent missing values before gathering the data. For example, in computer questionnaires, it is often not possible to skip a question. A question has to be answered, otherwise one cannot continue to the next. So missing values due to the participant are eliminated by this type of questionnaire, though this method may not be permitted by an ethics board overseeing the research. In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to attempt to persuade those who have decided not to participate to change their minds.
However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the kind of people who are willing to be persuaded to participate after initially refusing or not being home are likely to be significantly different from the kinds of people who will still refuse or remain unreachable after additional effort.
In situations where missing values are likely to occur, the researcher is often advised on planning to use methods of data analysis methods that are
robust
Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
to missingness. An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no
bias
Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, ...
, or distortion in the conclusions drawn about the population.
Imputation
Some
data analysis
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enco ...
techniques are not robust to missingness, and require to "fill in", or
impute the missing data. Rubin (1987) argued that repeating imputation even a few times (5 or less) enormously improves the quality of estimation.
For many practical purposes, 2 or 3 imputations capture most of the relative efficiency that could be captured with a larger number of imputations. However, a too-small number of imputations can lead to a substantial loss of
statistical power
In statistics, the power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H_0) when a specific alternative hypothesis (H_1) is true. It is commonly denoted by 1-\beta, and represents the chances ...
, and some scholars now recommend 20 to 100 or more. Any multiply-imputed data analysis must be repeated for each of the imputed data sets and, in some cases, the relevant statistics must be combined in a relatively complicated way.
The
expectation-maximization algorithm is an approach in which values of the statistics which would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data-items are not usually imputed.
Interpolation
In the mathematical field of numerical analysis,
interpolation
In the mathematical field of numerical analysis, interpolation is a type of estimation, a method of constructing (finding) new data points based on the range of a discrete set of known data points.
In engineering and science, one often has a n ...
is a method of constructing new data points within the range of a discrete set of known data points.
In the comparison of two paired samples with missing data, a test statistic that uses all available data without the need for imputation is the partially overlapping samples t-test.
This is valid under normality and assuming MCAR
Partial deletion
Methods which involve reducing the data available to a dataset having no missing values include:
*
Listwise deletion In statistics, listwise deletion is a method for handling missing data. In this method, an entire record is excluded from analysis if any single value is missing.
Example
For example, consider the following questionnaire, as answered by 10 subjects ...
/casewise deletion
* Pairwise deletion
Full analysis
Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:
* Generative approaches:
**The
expectation-maximization algorithm
** full information
maximum likelihood
In statistics, maximum likelihood estimation (MLE) is a method of estimation theory, estimating the Statistical parameter, parameters of an assumed probability distribution, given some observed data. This is achieved by Mathematical optimization, ...
estimation
*Discriminative approaches:
**Max-margin classification of data with absent features
Partial identification In statistics and econometrics, set identification (or partial identification) extends the concept of identifiability (or "point identification") in statistical models to situations where the distribution of observable variables is not informative o ...
methods may also be used.
Model-based techniques
Model based techniques, often using graphs, offer
additional tools for testing missing data types (MCAR, MAR, MNAR) and for estimating parameters under missing data conditions. For example, a test for refuting MAR/MCAR reads as follows:
For any three variables ''X,Y'', and ''Z'' where ''Z'' is fully observed and ''X'' and ''Y'' partially observed, the data should satisfy:
.
In words, the observed portion of ''X'' should be independent on the missingness status of ''Y,'' conditional on every value of ''Z''.
Failure to satisfy this condition indicates that the problem belongs to the MNAR category.
(Remark:
These tests are necessary for variable-based MAR which is a slight variation of event-based MAR.
)
When data falls into MNAR category techniques are available for consistently estimating parameters when certain conditions hold in the model.
For example, if ''Y'' explains the reason for missingness in ''X'' and ''Y'' itself has missing values, the
joint probability distribution
Given two random variables that are defined on the same probability space, the joint probability distribution is the corresponding probability distribution on all possible pairs of outputs. The joint distribution can just as well be considered ...
of ''X'' and ''Y'' can still be estimated if the
missingness of ''Y'' is random.
The estimand in this case will be:
:
where
and
denote the observed portions of their respective variables.
Different model structures may yield different estimands and different procedures of estimation whenever consistent estimation is possible. The preceding estimand calls for first
estimating
from complete data and multiplying it by
estimated from cases in which ''Y'' is observed regardless of the status of ''X''. Moreover, in order to
obtain a consistent estimate it is crucial that the first term be
as opposed to
.
In many cases model based techniques permit the model structure to undergo refutation tests.
Any model which implies the independence between a partially observed variable ''X'' and the missingness indicator of another variable ''Y'' (i.e.
), conditional
on
can be submitted to the following refutation test:
.
Finally, the estimands that emerge from these techniques are derived in closed form and do not require iterative procedures such as Expectation Maximization that
are susceptible to local optima.
A special class of problems appears when the probability of the missingness depends on time. For example, in the trauma databases the probability to lose data about the trauma outcome depends on the day after trauma. In these cases various non-stationary
Markov chain
A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happe ...
models are applied.
See also
*
Censoring
*
Expectation–maximization algorithm
In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variabl ...
*
Imputation
*
Indicator variable
In regression analysis, a dummy variable (also known as indicator variable or just dummy) is one that takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. For example, i ...
*
Inverse probability weighting Inverse probability weighting is a statistical technique for calculating statistics standardized to a pseudo-population different from that in which the data was collected. Study designs with a disparate sampling population and population of target ...
*
Latent variable
In statistics, latent variables (from Latin: present participle of ''lateo'', “lie hidden”) are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or me ...
*
Matrix completion
Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the mo ...
References
Further reading
*
*
*
*
*
*
*
*
*
*
*
External links
Background
Missing Data Department of Medical Statistics,
London School of Hygiene & Tropical Medicine
The London School of Hygiene and Tropical Medicine (LSHTM) is a public research university in Bloomsbury, central London, and a member institution of the University of London that specialises in public health and tropical medicine.
The inst ...
Spatial and temporal Trend Analysis of Long Term rainfall records in data-poor catchments with missing data, a case study of Lower Shire floodplain in Malawi for the period 1953–2010 R-miss-tastic A unified platform for missing values methods and workflows.
Software
MplusPROC MI and PROC MIANALYZE - SASSPSS
{{Statistics, collection, state=collapsed
Statistical data types