Types
Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, analysis may be biased. For example, in a study of the relation between IQ and income, if participants with an above-average IQ tend to skip the question ‘What is your salary?’, analyses that do not take into account this missing at random (MAR) pattern (see below) may falsely fail to find a positive association between IQ and salary. Because of these problems, methodologists routinely advise researchers to design studies to minimize the occurrence of missing values. Graphical models can be used to describe the missing data mechanism in detail.
Missing completely at random
Values in a data set are missing completely at random (MCAR) if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR. In the case of MCAR, the missingness of data is unrelated to any study variable: thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice.
Missing at random
Missing at random (MAR) occurs when the missingness is not random, but can be fully accounted for by variables for which there is complete information. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. An example is that males are less likely to fill in a depression survey, but this has nothing to do with their level of depression after accounting for maleness. Depending on the analysis method, these data can still induce parameter bias in analyses due to the contingent emptiness of cells (for example, the cell for males with very high depression may have zero entries). However, if the parameter is estimated with Full Information Maximum Likelihood, the estimates are asymptotically unbiased under MAR.
Missing not at random
Missing not at random (MNAR), also known as nonignorable nonresponse, describes data that are neither MAR nor MCAR; that is, the value of the variable that is missing is related to the reason it is missing. To extend the previous example, this would occur if men failed to fill in a depression survey ''because'' of their level of depression. Samuelson and Spirer (1992) discussed how missing and/or distorted data about demographics, law enforcement, and health could be indicators of patterns of human rights violations. They gave several fairly well documented examples.
Techniques of dealing with missing data
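To make the practical consequences of the three mechanisms concrete before turning to specific techniques, the following is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the variables `x` and `y`, the logistic missingness probabilities, and the regression-based adjustment are hypothetical and do not come from any study cited here. It compares a naive complete-case mean with an estimate that adjusts for the fully observed variable.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def simulate_mechanisms(n=100_000, seed=0):
    """Estimate the mean of y (true value 0) under three missingness mechanisms."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)        # fully observed covariate
    y = x + rng.normal(size=n)    # variable that will have missing values

    # Hypothetical probabilities that y is missing under each mechanism.
    p_missing = {
        "MCAR": np.full(n, 0.5),    # unrelated to x and y
        "MAR": sigmoid(2.0 * x),    # depends only on the observed x
        "MNAR": sigmoid(2.0 * y),   # depends on the value of y itself
    }

    results = {}
    for name, p in p_missing.items():
        observed = rng.random(n) >= p
        # Naive estimate: mean of the observed y values only.
        complete_case = y[observed].mean()
        # Adjusted estimate: regress y on x among the observed cases,
        # then average the predictions over all cases.
        slope, intercept = np.polyfit(x[observed], y[observed], 1)
        adjusted = (intercept + slope * x).mean()
        results[name] = {
            "complete_case": float(complete_case),
            "adjusted": float(adjusted),
        }
    return results


if __name__ == "__main__":
    for name, estimates in simulate_mechanisms().items():
        print(name, estimates)
```

In this sketch both estimates should be close to the true mean under MCAR, only the adjusted estimate should be close under MAR, and neither should be close under MNAR, which mirrors the distinctions drawn above.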
Missing data reduces the representativeness of the sample and can therefore distort inferences about the population. Generally speaking, there are three main approaches to handling missing data: (1) ''imputation'', where values are filled in in place of missing data; (2) ''omission'', where samples with invalid data are discarded from further analysis; and (3) ''analysis'', where methods unaffected by the missing values are applied directly. One systematic review addressing the prevention and handling of missing data for patient-centered outcomes research identified 10 standards as necessary for the prevention and handling of missing data. These include standards for study design, study conduct, analysis, and reporting.
In some practical applications, the experimenters can control the level of missingness and prevent missing values before gathering the data. For example, computer questionnaires often do not allow a question to be skipped: a question has to be answered before one can continue to the next. Missing values due to the participant are thus eliminated by this type of questionnaire, though this method may not be permitted by an ethics board overseeing the research. In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to attempt to persuade those who have decided not to participate to change their minds. However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the kind of people who are willing to be persuaded to participate after initially refusing or not being home are likely to be significantly different from the kinds of people who will still refuse or remain unreachable after additional effort.
In situations where missing values are likely to occur, the researcher is often advised to plan to use methods of data analysis that are robust to missingness.
An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias or distortion in the conclusions drawn about the population.
Imputation
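As a minimal sketch of imputation in the sense used above (filling in values in place of missing data), the following example replaces each missing entry with the mean of the observed entries. Mean imputation is only one simple choice, picked here for illustration; the function name `mean_impute` and the sample values are hypothetical, and missing entries are assumed to be encoded as NaN.

```python
import numpy as np


def mean_impute(values):
    """Replace NaN entries with the mean of the observed entries."""
    values = np.asarray(values, dtype=float)
    observed_mean = np.nanmean(values)   # mean that ignores NaN entries
    return np.where(np.isnan(values), observed_mean, values)


data = np.array([2.0, np.nan, 4.0, 6.0, np.nan])
filled = mean_impute(data)

# The mean is preserved, but the variance of the filled-in data is smaller
# than the variance of the observed values alone.
print(filled)
print(np.nanvar(data), np.var(filled))
```

The shrinking variance shows one way in which treating imputed values as if they were actually observed can distort later analysis.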
Some
Interpolation
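As a minimal sketch of interpolation applied to missing values, the following example fills gaps by linear interpolation between neighbouring observed points. It assumes the data are ordered along a fully observed, increasing axis such as time, and that missing entries are encoded as NaN; the function name `interpolate_missing` and the sample values are hypothetical.

```python
import numpy as np


def interpolate_missing(t, y):
    """Fill NaN entries of y by linear interpolation over the axis t."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    filled = y.copy()
    # np.interp evaluates the piecewise-linear function through the observed
    # points; outside their range it repeats the nearest observed value.
    filled[missing] = np.interp(t[missing], t[~missing], y[~missing])
    return filled


t = [0, 1, 2, 3, 4]
y = [1.0, np.nan, 3.0, np.nan, 7.0]
print(interpolate_missing(t, y))
```

Linear interpolation is only one possible scheme; it is used here because it is the simplest to check by hand.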
In the mathematical field of numerical analysis,
Partial deletion
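As an illustration of the two deletion strategies named in this section, the following sketch applies them to a small array in which missing entries are encoded as NaN. The function names and the sample data are hypothetical. Listwise deletion drops every row that has any missing value, while pairwise deletion computes each correlation from the rows in which both variables of that pair are observed.

```python
import numpy as np


def listwise_delete(data):
    """Keep only the rows that have no missing values."""
    data = np.asarray(data, dtype=float)
    return data[~np.isnan(data).any(axis=1)]


def pairwise_corr(data):
    """Correlation matrix in which each entry uses all rows observed for that pair."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    corr = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            both = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            r = np.corrcoef(data[both, i], data[both, j])[0, 1]
            corr[i, j] = corr[j, i] = r
    return corr


data = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 6.0, np.nan],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 5.0],
])

print(listwise_delete(data).shape)   # rows remaining after listwise deletion
print(pairwise_corr(data))
```

Pairwise deletion uses more of the data for each pair, but because different entries rest on different subsets of rows, the resulting matrix is not guaranteed to be a valid (positive semi-definite) correlation matrix.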
Methods which involve reducing the data available to a dataset having no missing values include:
* Listwise deletion/casewise deletion
* Pairwise deletion
Full analysis
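The expectation-maximization algorithm, one of the generative approaches listed in this section, can be sketched for a deliberately simple case. The example below assumes a bivariate normal model for two variables `x` and `y`, with `x` fully observed and missing entries of `y` encoded as NaN; the function name `em_bivariate_normal`, the number of iterations, and the simulated data are assumptions made for illustration rather than a reference implementation.

```python
import numpy as np


def em_bivariate_normal(x, y, n_iter=200):
    """EM estimates of the mean vector and covariance matrix of (x, y).

    x must be fully observed; missing entries of y are encoded as NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)

    # x is complete, so its mean and variance are estimated directly.
    mu_x, s_xx = x.mean(), x.var()

    # Starting values for the y-related parameters from the complete cases.
    mu_y, s_yy = y[~missing].mean(), y[~missing].var()
    s_xy = np.mean((x[~missing] - x[~missing].mean()) * (y[~missing] - mu_y))

    for _ in range(n_iter):
        # E-step: expected y and y^2 for missing entries given x and the
        # current parameters; observed entries contribute their actual values.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        e_y = np.where(missing, mu_y + beta * (x - mu_x), y)
        e_y2 = np.where(missing, e_y ** 2 + resid_var, y ** 2)

        # M-step: update the parameters from the expected sufficient statistics.
        mu_y = e_y.mean()
        s_xy = np.mean(x * e_y) - mu_x * mu_y
        s_yy = e_y2.mean() - mu_y ** 2

    mean = np.array([mu_x, mu_y])
    cov = np.array([[s_xx, s_xy], [s_xy, s_yy]])
    return mean, cov


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=20_000)
    y = 1.0 + 2.0 * x + rng.normal(size=20_000)   # true mean of y is 1
    # MAR: the chance that y is missing depends only on the observed x.
    y[rng.random(20_000) < 1.0 / (1.0 + np.exp(-2.0 * x))] = np.nan
    mean, cov = em_bivariate_normal(x, y)
    print("complete-case mean of y:", np.nanmean(y))
    print("EM estimate of mean of y:", mean[1])
```

Unlike imputation followed by a standard analysis, the missing values are never filled in and then treated as observed; only their expected contributions to the likelihood are used, which is why the estimate of the mean of `y` can remain close to the true value under MAR even when the complete-case mean does not.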
Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:
* Generative approaches:
** The expectation-maximization algorithm
** Full information maximum likelihood estimation
* Discriminative approaches:
** Max-margin classification of data with absent features
Partial identification methods may also be used.
Model-based techniques
Model-based techniques, often using graphs, offer additional tools for testing missing data types (MCAR, MAR, MNAR) and for estimating parameters under missing data conditions. For example, a test for refuting MAR/MCAR reads as follows: for any three variables ''X'', ''Y'', and ''Z'' where ''Z'' is fully observed and ''X'' and ''Y'' are partially observed, the data should satisfy $X \perp\!\!\!\perp R_y \mid (R_x, Z)$, where $R_x$ and $R_y$ denote the missingness indicators of ''X'' and ''Y''. In words, the observed portion of ''X'' should be independent of the missingness status of ''Y'', conditional on every value of ''Z''. Failure to satisfy this condition indicates that the problem belongs to the MNAR category. (Remark: These tests are necessary for variable-based MAR, which is a slight variation of event-based MAR.) When data fall into the MNAR category, techniques are available for consistently estimating parameters when certain conditions hold in the model. For example, if ''Y'' explains the reason for missingness in ''X'' and ''Y'' itself has missing values, the
See also
* Censoring
* Expectation–maximization algorithm
* Imputation
* Indicator variable
* Inverse probability weighting
* Latent variable
* Matrix completion