
In statistics, censoring is a condition in which the value of a measurement or observation is only partially known.

For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a study, it may be known that an individual's age at death is ''at least'' 75 years (but may be more). Such a situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the age of 75.

Censoring also occurs when a value falls outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 140 kg. If a 160-kg individual is weighed using the scale, the observer would only know that the individual's weight is at least 140 kg.

The problem of censored data, in which the observed value of some variable is partially known, is related to the problem of missing data, where the observed value of some variable is unknown.

Censoring should not be confused with the related idea of truncation. With censoring, observations result either in knowing the exact value that applies, or in knowing that the value lies within an interval. With truncation, observations never result in values outside a given range: values in the population outside the range are never seen, or never recorded if they are seen. Note that in statistics, truncation is not the same as rounding.
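The difference between censoring and truncation can be illustrated with the bathroom-scale example. The following is a minimal sketch, not a standard API: the function names `censor_right` and `truncate_right` and the list of true weights are hypothetical and chosen only for illustration.

```python
def censor_right(values, limit):
    """Record each value, capping readings at the limit.

    Returns (observed, is_censored) pairs; a censored pair means only
    "the true value is at least `limit`" is known.
    """
    return [(min(v, limit), v > limit) for v in values]


def truncate_right(values, limit):
    """Keep only values within range; out-of-range values are never recorded."""
    return [v for v in values if v <= limit]


# Hypothetical true weights in kg; the scale reads at most 140 kg.
SCALE_LIMIT = 140.0
true_weights = [62.0, 95.5, 160.0, 141.2, 78.3]

censored = censor_right(true_weights, SCALE_LIMIT)
truncated = truncate_right(true_weights, SCALE_LIMIT)

print(censored)   # every individual appears; two are flagged as censored
print(truncated)  # the two heaviest individuals are missing entirely
```

Under censoring the sample size is preserved and the analyst knows which readings are lower bounds; under truncation the analyst may not even know that observations were lost.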


Types

* ''Left censoring'' – a data point is below a certain value but it is unknown by how much.
* ''Interval censoring'' – a data point is somewhere on an interval between two values.
* ''Right censoring'' – a data point is above a certain value but it is unknown by how much.
* ''Type I censoring'' occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored.
* ''Type II censoring'' occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored.
* ''Random'' (or ''non-informative'') ''censoring'' is when each subject has a censoring time that is statistically independent of their failure time. The observed value is the minimum of the censoring and failure times; subjects whose failure time is greater than their censoring time are right-censored.

Interval censoring can occur when observing a value requires follow-ups or inspections. Left and right censoring are special cases of interval censoring, with the beginning of the interval at zero or the end at infinity, respectively.

Estimation methods for left-censored data vary, and not every method is applicable to, or the most reliable for, all data sets.

A common misconception with time-interval data is to class as ''left censored'' intervals whose start time is unknown. In these cases we have a lower bound on the time ''interval'', so the data are ''right censored'' (even though the missing start point lies to the left of the known interval when viewed on a timeline).
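The Type I, Type II, and random censoring schemes described above can be sketched as simple transformations of a list of true failure times. This is an illustrative sketch only: the function names and the sample data are hypothetical, and the Type II version assumes there are no tied failure times.

```python
def type_i(failure_times, stop_time):
    """Type I: stop at a predetermined time; survivors are right-censored."""
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def type_ii(failure_times, n_failures):
    """Type II: stop once `n_failures` failures have been observed.

    Assumes distinct failure times, so the stopping time is the
    n_failures-th smallest failure time.
    """
    stop_time = sorted(failure_times)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def random_censoring(failure_times, censoring_times):
    """Random censoring: observe the minimum of failure and censoring time."""
    return [(min(t, c), t <= c) for t, c in zip(failure_times, censoring_times)]


# Hypothetical true failure times (only partly observable in practice).
failures = [5.0, 12.0, 3.0, 20.0, 8.0]

# Each result is a list of (observed_time, failure_observed) pairs.
print(type_i(failures, stop_time=10.0))
print(type_ii(failures, n_failures=2))
print(random_censoring(failures, [7.0, 7.0, 1.0, 25.0, 8.0]))
```

In all three cases a pair whose second element is `False` is right-censored: the recorded time is only a lower bound on the failure time.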


Analysis

Special techniques may be used to handle censored data. Tests with specific failure times are coded as actual failures; censored data are coded for the type of censoring and the known interval or limit. Specialized software (often reliability-oriented) can then carry out maximum likelihood estimation to obtain summary statistics, confidence intervals, etc.


Epidemiology

One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's 1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of vaccination. An early paper to use the Kaplan–Meier estimator for estimating censored costs was Quesenberry et al. (1989). However, Lin et al. found this approach to be invalid unless all patients accumulate costs with a common deterministic rate function over time, and proposed an alternative estimation technique known as the Lin estimator.
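For reference, the Kaplan–Meier (product-limit) estimator itself, applied to right-censored survival times rather than to costs, can be sketched in a few lines. This is a minimal illustration and not the procedure of any of the papers cited above; the function name `kaplan_meier` and the sample data are hypothetical. It follows the usual convention that subjects censored at a given time are still counted as at risk at that time.

```python
def kaplan_meier(times, events):
    """Estimate the survival function from right-censored data.

    times  -- observed times (failure or censoring time)
    events -- True if the failure was observed, False if right-censored
    Returns a list of (t, S(t)) pairs, one per distinct failure time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical data: five subjects, two of them right-censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, False, True, True, False])
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects never trigger a drop in the curve, but they do shrink the risk set for later failure times, which is how the estimator uses their partial information.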


Operating life testing

Reliability testing often consists of conducting a test on an item (under specified conditions) to determine the time it takes for a failure to occur.
* Sometimes a failure is planned and expected but does not occur: operator error, equipment malfunction, test anomaly, etc. The test result was not the desired time-to-failure but can be (and should be) used as a time-to-termination. The use of censored data is unintentional but necessary.
* Sometimes engineers plan a test program so that, after a certain time limit or number of failures, all other tests will be terminated. These suspended times are treated as right-censored data. The use of censored data is intentional.
An analysis of the data from replicate tests includes both the times-to-failure for the items that failed and the time-of-test-termination for those that did not fail.


Censored regression

An early model for censored regression, the tobit model, was proposed by James Tobin in 1958.


Likelihood

The likelihood is the probability or probability density of what was observed, viewed as a function of parameters in an assumed model. To incorporate censored data points in the likelihood, each censored data point is represented by the probability of the censored observation as a function of the model parameters, i.e. a function of the CDF(s) instead of the density or probability mass. The most general censoring case is interval censoring:
:\Pr(a < x \leqslant b) = F(b) - F(a),
where F(x) is the CDF of the probability distribution. The two special cases are:
* left censoring: \Pr(-\infty < x \leqslant b) = F(b) - F(-\infty) = F(b) - 0 = F(b) = \Pr(x \leqslant b)
* right censoring: \Pr(a < x \leqslant \infty) = F(\infty) - F(a) = 1 - F(a) = 1 - \Pr(x \leqslant a) = \Pr(x > a)
For continuous probability distributions:
:\Pr(a < x \leqslant b) = \Pr(a < x < b)
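The interval formula and its two special cases can be checked numerically. The sketch below assumes an exponential model purely for illustration (the text does not prescribe a distribution); the helper names `exp_cdf`, `exp_pdf`, `interval_probability`, and `log_likelihood`, and the sample observations, are hypothetical. Each observation is stored as a pair (a, b): equal endpoints mean an exactly observed value, while left and right censoring are intervals with an infinite endpoint.

```python
import math

INF = math.inf


def exp_cdf(x, rate):
    """CDF F(x) of an exponential distribution; 0 for x <= 0, including -inf."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-rate * x)  # evaluates to 1.0 when x is +inf


def exp_pdf(x, rate):
    """Density f(x) of an exponential distribution."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def interval_probability(a, b, rate):
    """Pr(a < x <= b) = F(b) - F(a)."""
    return exp_cdf(b, rate) - exp_cdf(a, rate)


def log_likelihood(observations, rate):
    """Sum of log contributions: density if exact, CDF difference if censored."""
    total = 0.0
    for a, b in observations:
        if a == b:
            total += math.log(exp_pdf(a, rate))
        else:
            total += math.log(interval_probability(a, b, rate))
    return total


# Hypothetical data: exact, left-censored, right-censored, interval-censored.
data = [(1.5, 1.5), (-INF, 2.0), (3.0, INF), (1.0, 4.0)]
print(log_likelihood(data, rate=0.5))
```

Treating left and right censoring as intervals with an infinite endpoint means a single code path covers all three censoring types, mirroring the derivation above.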


Example

Suppose we are interested in survival times, T_1, T_2, \dots, T_n, but we don't observe T_i for all i. Instead, we observe
:(U_i, \delta_i), with U_i = T_i and \delta_i = 1 if T_i is actually observed, and
:(U_i, \delta_i), with U_i < T_i and \delta_i = 0 if all we know is that T_i is longer than U_i.
When T_i > U_i, U_i is called the ''censoring time''. If the censoring times are all known constants, then the likelihood is
:L = \prod_{i:\delta_i = 1} f(u_i) \prod_{i:\delta_i = 0} S(u_i)
where f(u_i) is the probability density function evaluated at u_i, and S(u_i) is the probability that T_i is greater than u_i, called the ''survival function''.

This can be simplified by defining the hazard function, the instantaneous force of mortality, as
:\lambda(u) = f(u)/S(u)
so
:f(u) = \lambda(u)S(u).
Then
:L = \prod_i \lambda(u_i)^{\delta_i} S(u_i).
For the exponential distribution, this becomes even simpler, because the hazard rate, \lambda, is constant, and S(u) = \exp(-\lambda u). Then:
:L(\lambda) = \lambda^k \exp\left(-\lambda \sum u_i\right),
where k = \sum \delta_i. From this we easily compute \hat\lambda, the maximum likelihood estimate (MLE) of \lambda, as follows:
:l(\lambda) = \log(L(\lambda)) = k \log(\lambda) - \lambda \sum u_i.
Then
:dl / d\lambda = k/\lambda - \sum u_i.
We set this to 0 and solve for \lambda to get:
:\hat\lambda = k / \sum u_i.
Equivalently, the mean time to failure is:
:1 / \hat\lambda = \sum u_i / k.
This differs from the standard MLE for the exponential distribution in that any censored observations are counted only in the numerator: their times contribute to \sum u_i, but they do not count toward k.
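The closed-form estimate derived above is straightforward to compute. The following is a minimal sketch with hypothetical data; the function name `exponential_mle` is not from any library.

```python
def exponential_mle(u, delta):
    """MLE of the exponential rate from right-censored data.

    u     -- observed times u_i (failure or censoring times)
    delta -- indicators: 1 if the failure was observed, 0 if censored
    Returns lambda_hat = k / sum(u_i), where k = sum(delta_i).
    """
    k = sum(delta)
    if k == 0:
        raise ValueError("at least one observed failure is required")
    return k / sum(u)


# Hypothetical sample: three observed failures and two censored times.
u = [2.0, 3.0, 5.0, 7.0, 8.0]
delta = [1, 0, 1, 1, 0]

rate_hat = exponential_mle(u, delta)
mean_time_to_failure = 1.0 / rate_hat

# Naive estimate that discards the censored observations altogether.
naive_rate = sum(delta) / sum(t for t, d in zip(u, delta) if d == 1)

print(rate_hat, mean_time_to_failure, naive_rate)
```

With these numbers k = 3 and the total observed time is 25, so the estimated rate is 3/25. Discarding the censored observations would give 3/14 instead, overstating the failure rate because the time those subjects survived is ignored.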


See also

* Data analysis
* Detection limit
* Imputation (statistics)
* Inverse probability weighting
* Sampling bias
* Saturation arithmetic
* Survival analysis
* Winsorising


Further reading

* Blower, S. (2004), D. Bernoulli's " ", ''Reviews of Medical Virology'', 14: 275–288
* Bagdonavicius, V., Kruopis, J., Nikulin, M.S. (2011), "Non-parametric Tests for Censored Data", London, ISTE/WILEY.


External links

* "Engineering Statistics Handbook", NIST/SEMATECH
