Design effect


In survey methodology, the design effect (generally denoted as $D_{\text{eff}}$ or $D_{\text{eff}}^2$) is a measure of the expected impact of a sampling design on the variance of an estimator for some parameter. It is calculated as the ratio of the variance of an estimator based on a sample from an (often) complex sampling design, to the variance of an alternative estimator based on a simple random sample (SRS) of the same number of elements. The Deff (be it estimated, or known a priori) can be used to adjust the variance of an estimator in cases where the sample is not drawn using simple random sampling. It may also be useful in sample size calculations and for quantifying the representativeness of a sample (relative to a target population).

The design effect is a positive real number that indicates an inflation ($D_{\text{eff}}>1$) or deflation ($D_{\text{eff}}<1$) in the variance of an estimator for some parameter, due to the study not using SRS (with $D_{\text{eff}}=1$ when the variances are identical). Complex sampling features that could introduce a Deff different from 1 include: cluster sampling (such as when there is correlation between observations), stratified sampling, cluster randomized controlled trials, disproportional (unequal probability) samples, non-coverage, non-response, statistical adjustments of the data, etc.

The term "design effect" was coined by Leslie Kish in 1965. Ever since, many calculations (and estimators) have been proposed in the literature for describing the effect of a known sampling design on the increase/decrease in the variance of estimators of interest. In general, the design effect varies between statistics of interest, such as the total, ratio, or mean; it also matters whether the design features (e.g.: selection probabilities) are correlated with the outcome of interest. Lastly, it is influenced by the distribution of the outcome itself. All of these should be considered when estimating and using the design effect in practice.

# Definitions

## Deff

The design effect (Deff, or $D_{\text{eff}}$) is the ratio of two theoretical variances of estimators for some parameter ($\theta$):
:* In the numerator is the actual variance of an estimator of the parameter ($\hat \theta_w$) under a given sampling design $p$;
:* In the denominator is the variance, assuming the same sample size, of the estimator we would use had the sample been obtained by simple random sampling ''without'' replacement ($\hat \theta_{srs}$).

So that:

: $Deff_p\left(\hat \theta\right) = \frac{\operatorname{Var}_p\left[\hat \theta_w\right]}{\operatorname{Var}_{srswor}\left[\hat \theta_{srs}\right]}$

Put differently, $D_{\text{eff}}$ is by how much the variance has increased (or, in some cases, decreased) because the sample was drawn and adjusted to a specific sampling design (e.g.: using weights, or other measures), compared to what it would have been had the sample come from simple random sampling (without replacement).

There are many ways of calculating $D_{\text{eff}}$, depending on the parameter of interest (e.g.: population total, population mean, quantiles, ratios of quantities, etc.), the estimator used, and the sampling design (e.g.: clustered sampling, stratified sampling, post-stratification, multi-stage sampling, etc.). For estimating the population mean, the Deff (for some sampling design p) is:

: $Deff_p = \frac{\operatorname{Var}_p\left(\bar y_w\right)}{\frac{1-f}{n} S^2_y}$

Where n is the sample size, f = n/N is the fraction of the sample from the population, (1-f) is the (squared) finite population correction (FPC), and $S^2_y = \frac{\sum_{i=1}^N \left(y_i - \bar Y\right)^2}{N-1}$ is the unbiased sample variance. A unit-level (element) variance estimate is obtained by multiplying Deff by the element variance, so as to incorporate all the complexities of the sample design.

Notice that the definition of Deff is based on parameters of the population that we often do not know (i.e.: the variances of estimators under two different sampling designs). The process of estimating Deff for specific designs is described in the following sections.[Kalton, G., J. M. Brick, and T. Le. "Estimating components of design effects for use in sample design." In ''Household Sample Surveys in Developing and Transition Countries'' (Sales No. E.05.XVII.6). Department of Economic and Social Affairs, Statistics Division, United Nations, New York (2005).] A general formula for the (theoretical) design effect of estimating a total (rather than the mean), for some design, is given in Cochran 1977.
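Because Deff is defined through variances we rarely know, a small Monte Carlo sketch can make the definition concrete. The following is a hypothetical illustration (the population structure, cluster counts, and all numbers are assumptions, not from the source): it simulates a population with intra-class correlation, then compares the variance of the sample mean under one-stage cluster sampling against SRS of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 200 clusters of 50 units each, with a shared
# cluster effect that induces intra-class correlation (rho ~ 0.5).
n_clusters, cluster_size = 200, 50
cluster_effect = rng.normal(0, 1, n_clusters)
population = cluster_effect[:, None] + rng.normal(0, 1, (n_clusters, cluster_size))

n_sample_clusters = 20
n = n_sample_clusters * cluster_size  # matching SRS sample size

def mean_cluster_sample():
    # One-stage cluster sampling: pick whole clusters, measure every unit.
    picked = rng.choice(n_clusters, n_sample_clusters, replace=False)
    return population[picked].mean()

def mean_srs():
    # Simple random sample (without replacement) of the same n elements.
    return rng.choice(population.ravel(), n, replace=False).mean()

reps = 2000
var_cluster = np.var([mean_cluster_sample() for _ in range(reps)])
var_srs = np.var([mean_srs() for _ in range(reps)])
deff = var_cluster / var_srs  # ratio of the two estimator variances
print(f"estimated Deff ~ {deff:.1f}")  # well above 1: clustering inflates variance
```

With these assumed parameters the intra-class correlation is high, so the simulated Deff comes out far above 1, matching the rough rule $1 + (m-1)\rho$ for cluster size $m$.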

## Deft

A related quantity to Deff, proposed by Kish in 1995, is called Deft (Design Effect Factor). It is defined as the square root of the variance ratio, with the denominator using a simple random sample ''with'' replacement (srswr) instead of ''without'' replacement (srswor):

: $D_{\text{eft}} = \sqrt{\frac{\operatorname{Var}_p\left[\hat \theta_w\right]}{\operatorname{Var}_{srswr}\left[\hat \theta_{srs}\right]}}$

In this later definition (proposed in 1995, vs. 1965) it was argued that srs "without replacement" (with its variance-reducing effect) should be captured in the definition of the design effect, since it is part of the sampling design. Deft is also more directly related to its use in inference, since confidence intervals are typically built as estimate $\pm z \cdot D_{\text{eft}} \cdot SE$ (i.e.: the standard error, not the variance, is what gets inflated); in addition, the finite population correction (FPC) is harder to compute in some situations. In many cases, when the population is very large, Deft is (almost) the square root of Deff ($D_{\text{eft}} \approx \sqrt{D_{\text{eff}}}$).

The original intention for ''Deft'' was to have it "express the effects of sample design beyond the elemental variability $\frac{S^2_y}{n}$, removing both the unit of measurement and sample size as nuisance parameters", in order to make the design effect generalizable to (relevant for) many statistics and variables within the same survey (and even between surveys). However, followup works have shown that the calculation of the design effect, for parameters such as a population total or mean, depends on the variability of the outcome measure, which limits Kish's original aspiration for this measure. That said, the statement may loosely (i.e.: under some conditions) be true for the weighted mean.
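A toy numeric check (all variance values below are assumed for illustration) of the claim that Deft is nearly the square root of Deff when the population is large, since the FPC $(1 - n/N)$ is then close to 1:

```python
import math

# Assumed quantities for illustration only.
n, N = 500, 1_000_000          # sample size and (large) population size
var_design = 4.0               # Var_p of the weighted estimator (assumed)
s2 = 2.0                       # unit variance S_y^2 (assumed)

var_srswr = s2 / n                    # variance under SRS with replacement
var_srswor = (1 - n / N) * s2 / n     # under SRS without replacement (FPC)

deff = var_design / var_srswor            # Kish's 1965 definition
deft = math.sqrt(var_design / var_srswr)  # Kish's 1995 definition

print(deft, math.sqrt(deff))  # nearly identical since N >> n
```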

## Effective sample size

The effective sample size, also defined by Kish in 1965, is the original sample size divided by the design effect.[Kish, Leslie. "Weighting for unequal $P_i$." Journal of Official Statistics 8.2 (1992): 183–200.] This quantity reflects the sample size that would be needed to achieve the current variance of the estimator (for some parameter), under the existing design, if the sample had instead been a simple random sample (analyzed with the corresponding estimator). Namely:

: $n_{\text{eff}} = \frac{n}{D_{\text{eff}}}$

Put differently, it says how many responses' worth of information we are left with when using an estimator that correctly adjusts for the design effect of the sampling design - for example, when using the weighted mean with inverse probability weighting instead of the simple mean. It is also possible to get the effective sample size ratio by taking the inverse of Deff (i.e.: $\frac{n_{\text{eff}}}{n} = \frac{1}{D_{\text{eff}}}$). When using Kish's design effect for unequal weights, the following simplified formula gives "Kish's effective sample size":

: $n_{\text{eff}} = \frac{n}{D_{\text{eff}}} = \frac{n}{n \frac{\sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2}} = \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2} = \frac{\left(n \bar w\right)^2}{n \overline{w^2}} = \frac{n \bar w^2}{\overline{w^2}}$
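Kish's effective sample size is straightforward to compute from a weight vector; a minimal sketch (with assumed toy weights) also shows that the result is invariant to rescaling the weights, so it works the same for normalized and unnormalized weights:

```python
import numpy as np

def kish_neff(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w @ w)

print(kish_neff([1, 1, 1, 1]))      # equal weights: n_eff == n == 4.0

w = np.array([1.0, 1.0, 1.0, 5.0])  # one over-weighted element
print(round(kish_neff(w), 2))       # 2.29: unequal weights shrink n_eff
print(round(kish_neff(10 * w), 2))  # 2.29: invariant to rescaling
```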

# Design effect for well-known sampling designs

## Sampling design dictates how design effect should be calculated

Different sampling designs differ substantially in their impact on estimators (such as the mean), both in bias and in variance. For example, in the cluster sampling case the units may have equal or unequal selection probabilities, irrespective of their intra-class correlation (and of its negative effect of increasing the variance of our estimators). In the case of stratified sampling, the probabilities may be equal (EPSEM) or unequal; regardless, using prior information on the stratum sizes in the population during the sampling stage can improve the statistical efficiency of our estimators. For example: if we know that gender is correlated with our outcome of interest, and also know that the male-female ratio in some population is 50%-50%, then by sampling exactly half of each gender we reduce the variance of the estimators, because we remove the variability caused by an unequal proportion of males and females in the sample. Lastly, when adjusting for non-coverage, non-response, or some stratum split of the population (unavailable during the sampling stage), we may use statistical procedures (e.g.: post-stratification and others). The result of such procedures may be estimates of the sampling probabilities that are similar to, or very different from, the true sampling probabilities of the units. The quality of these estimators depends on the quality of the auxiliary information and on the missing-at-random assumptions used in creating them. Even when these sampling probability estimators (propensity scores) manage to capture most of the phenomena that produced them, the impact of the variable selection probabilities on the estimators may be small or large, depending on the data (details in the next section).
Due to the large variety of sampling designs (with or without effects on unequal selection probabilities), different formulas have been developed to capture the potential design effect, as well as to estimate the correct variance of estimators. Sometimes these design effects can compound (as in the case of unequal selection probability combined with cluster sampling; more details in the following sections). Whether to use these formulas, or just assume SRS, depends on the expected amount of bias reduced versus the increase in estimator variance (and on the overhead of methodological and technical complexity).

## Unequal selection probabilities

### Sources for unequal selection probabilities

There are various ways to sample units so that each unit has exactly the same probability of selection. Such methods are called equal probability sampling (EPSEM) methods. Some of the more basic methods include simple random sampling (SRS, either with or without replacement) and systematic sampling, both of which yield a fixed sample size; there is also Bernoulli sampling, with a random sample size. More advanced techniques, such as stratified sampling and cluster sampling, can also be designed to be EPSEM. For example, in cluster sampling we can sample each cluster with probability proportional to its size, and then measure all the units inside each sampled cluster. A more complex method for cluster sampling is two-stage sampling, in which we sample clusters at the first stage (as before, proportionally to cluster size), and then sample from each selected cluster at the second stage using SRS with a fixed proportion (e.g.: sample half of the cluster).[Frerichs, R. R. ''Rapid Surveys'' (unpublished), 2004, chapter 4 - Equal Probability of Selection.]
In their works, Kish and others highlight several known reasons that lead to unequal selection probabilities: # Disproportional sampling due to the selection frame or procedure. This happens when a researcher purposefully designs their sample so as to over/under sample specific sub-populations or clusters. There are many cases in which this might happen. For example: #:* In stratified sampling, when units from some strata are known to have a larger variance than other strata. In such cases, the intention of the researcher may be to use this prior knowledge about the variance between strata in order to reduce the overall variance of an estimator of some population-level parameter of interest (e.g.: the mean). This can be achieved by a strategy known as ''optimum allocation'', in which stratum $h$ is over-sampled proportionally to a higher standard deviation and a lower sampling cost (i.e.: $f_h \propto \frac{S_h}{\sqrt{C_h}}$, where $S_h$ is the standard deviation of the outcome in $h$, and $C_h$ relates to the cost of recruiting one element from $h$). An example of an optimum allocation is ''Neyman's optimal allocation'' in which, when the cost of recruiting is fixed for each stratum, the sample size is: $n_h = n\frac{W_h S_h}{\sum_{h=1}^H W_h S_h}$. Where the summation is over all strata; ''n'' is the total sample size; $n_h$ is the sample size for stratum ''h''; $W_h = \frac{N_h}{N}$ is the relative size of stratum ''h'' as compared to the entire population ''N''; and $S_h$ is the standard deviation of the outcome in stratum ''h''. A related concept to optimum design is optimal experimental design. #:* If there is interest in comparing two strata (e.g.: people from two specific socio-demographic groups, or from two regions, etc.), the smaller group may be over-sampled. This way, the variance of the estimator that compares the two groups is reduced. #:* In cluster sampling, when there are clusters of different sizes but the procedure samples from all clusters using SRS and measures all elements in each sampled cluster (for example, if the cluster sizes are not known upfront at the sampling stage).
#:* When using two-stage sampling in which the clusters are sampled at the first stage proportionally to their size (a.k.a. PPS, probability proportional to size), but at the second stage only a specific fixed number of units (e.g.: one or two) are selected from each cluster - this may happen due to convenience/budget considerations. A similar case is when the first stage attempts to sample using PPS, but the counts of elements in each unit are inaccurate (so that a smaller cluster may have a higher-than-it-should chance of being selected, and vice versa for larger clusters with a too-small chance of being sampled). In such cases, the larger the errors in the first-stage sampling frame, the larger the resulting inequality in selection probabilities. #:* When the frame used for sampling includes duplicates of some items, leading some items to have a larger probability than others of being sampled (e.g.: if the sampling frame was created by merging several lists; or if recruiting users from several ad channels, where some users are reachable through several channels while others are reachable through only one). In each of these cases, different units have different sampling probabilities, so the procedure is not EPSEM. #:* When several different samples/frames are combined. For example, when running different ad campaigns for recruiting respondents, or when combining results from several studies done by different researchers and/or at different times (i.e.:
meta-analysis
). #: When disproportional sampling happens due to sampling design decisions, the researcher may (sometimes) be able to trace back the decision and accurately calculate the exact inclusion probability. When these selection probabilities are hard to trace back, they may be estimated using some propensity score model combined with information from auxiliary variables (e.g.: age, gender, etc.). # Non-coverage. This happens, for example, if people are sampled based on some pre-defined list that doesn't include all the people in the population (e.g.: a phone book, or using ads to recruit people to a survey). These units are missing due to some failure in creating the
sampling frame
, as opposed to deliberate exclusion of some people (e.g.: minors, people who cannot vote, etc.). The effect of non-coverage on sampling probability is considered difficult to measure (and adjust for) in various survey situations, unless strong assumptions are made. # Non-response. This refers to the failure to obtain measurements on sampled units that were intended to be measured. Reasons for non-response are varied and depend on the context. A person may be temporarily unavailable, for example not available to pick up the phone when the survey is conducted. A person may also refuse to answer the survey for a variety of reasons, e.g.: different tendencies of people from different ethnic/demographic/socio-economic groups to respond in general; insufficient incentive to spend the time or share data; the identity of the institution running the survey; inability to respond (e.g.: due to illness, illiteracy, or a language barrier); the respondent not being found (e.g.: they have moved apartments); or the response being lost/destroyed during encoding or transmission (i.e.: measurement error). In the context of surveys, these reasons may relate to answering the entire survey or just specific questions. # Statistical adjustments. These may include methods such as post-stratification, raking, or propensity score (estimation) models - used to perform an ad-hoc adjustment of the sample to some known (or estimated) stratum sizes. Such procedures are used to mitigate sampling issues ranging from
sampling error
, under-coverage of the sampling frame, to non-response.[Kott, Phillip S. "Using calibration weighting to adjust for nonresponse and coverage errors." Survey Methodology 32.2 (2006): 133.] For example, if a simple random sample is used, post-stratification (using some auxiliary information) does not offer an estimator that is uniformly better than the unweighted estimator; however, it can be viewed as a more "robust" estimator. Alternatively, these methods can be used to make the sample more similar to some target "controls" (i.e.: a population of interest), a process also known as "standardization". In such cases, these adjustments help provide unbiased estimators (often at the cost of increased variance, as seen in the following sections). If the original sample is a
nonprobability sample
, then post-stratification adjustments are just similar to an ad-hoc quota sampling. When the sampling design is fully known (leading to some probability $p_h$ of selecting an element from stratum h), and the non-response is measurable (i.e.: we know that only $r_h$ of the $n_h$ sampled observations responded in stratum h), then an exactly known inverse probability weight can be calculated for each element i from stratum h using: $w_i = \frac{1}{p_h \frac{r_h}{n_h}}$. Sometimes a statistical adjustment, such as post-stratification or raking, is used for estimating the selection probability, e.g.: when comparing our sample with some target population (also known as matching to controls). The estimation process may be focused only on adjusting the existing population to an alternative population (for example, when extrapolating from a panel drawn from several regions to an entire country). In such a case, the adjustment might be focused on some calibration factor $c_i$, and the weights calculated as $w_i = \frac{c_i}{p_i}$. In other cases, both the under-coverage and the non-response are modeled in one go as part of the statistical adjustment, which leads to an estimate of the overall sampling probability (say, $p_i'$); in such a case, the weights are simply $w_i = \frac{1}{p_i'}$. Notice that when statistical adjustments are used, $w_i$ is often estimated based on some model. The formulations in the following sections assume $w_i$ is known, which is not true for statistical adjustments (where we only have $\widehat w_i$). However, if the estimation error of $\widehat w_i$ is assumed to be very small, then the following sections can be used as if the weights were known. Whether this assumption holds depends on the size of the sample used for modeling, and is worth keeping in mind during analysis. When the selection probabilities may differ, the sample size is random, and the selections of different units are pairwise independent, the design is called Poisson sampling.
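For the fully-known case, the weight computation is mechanical. A minimal sketch (the stratum names, selection probabilities, and response counts below are assumed for illustration) computes an inverse probability weight from a known per-stratum selection probability and an observed response rate:

```python
# stratum: (selection probability p_h, sampled n_h, responded r_h)
strata = {
    "urban": (0.02, 400, 320),
    "rural": (0.05, 100, 60),
}

weights = {}
for h, (p_h, n_h, r_h) in strata.items():
    response_rate = r_h / n_h
    # w_i = 1 / (p_h * r_h / n_h): inverse of the overall probability
    # of being selected AND responding.
    weights[h] = 1.0 / (p_h * response_rate)

print(weights)  # {'urban': 62.5, 'rural': 33.33...}
```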

### "Design based" vs "model based" for describing properties of estimators

When adjusting for unequal probability selection through "individual case weights" (e.g.: inverse probability weighting), we get various types of estimators for quantities of interest. Estimators such as the Horvitz–Thompson estimator yield unbiased estimates of the population total and mean (provided the selection probabilities are indeed known, or approximately known). Deville and Särndal (1992) coined the term "calibration estimator" for estimators using weights that satisfy some condition, such as having the sum of weights equal the population size - or, more generally, that a weighted sum over an auxiliary variable equals a known quantity: $\sum_i w_i x_i = X$ (e.g.: that the weighted count of respondents in each age bucket equals the known population size of that bucket).[Deville, Jean-Claude, and Carl-Erik Särndal. "Calibration estimators in survey sampling." Journal of the American Statistical Association 87.418 (1992): 376–382.] The two primary frameworks for describing the properties of calibration estimators are: # Randomization based (or sampling design based) - in these cases, the weights ($w_i$) and the values of the outcome of interest ($y_i$) measured in the sample are all treated as known. In this framework, there is variability in the (known) values of the outcome (Y), but the only randomness comes from which elements of the population were picked into the sample (often denoted $I_i$, equal to 1 if element $i$ is in the sample and 0 otherwise). For a
simple random sample
, each $I_i$ will be an i.i.d. Bernoulli random variable with some parameter $p$. For general EPSEM (equal probability sampling), each $I_i$ will still be Bernoulli with some parameter $p$, but the $I_i$ will no longer be
independent
random variables. For something like post-stratification, the number of elements in each stratum can be modeled as a multinomial distribution, with a different inclusion probability $p_h$ for elements belonging to stratum $h$. In these cases the sample size itself can be a random variable. # Model based - in these cases the sample is fixed and the weights are fixed, but the outcome of interest is treated as a random variable. For example, in the case of post-stratification, the outcome can be modeled as some linear regression function in which the independent variables are indicator variables mapping each observation to its stratum, and the variability comes from the error term. As we will see later, some proofs in the literature rely on the randomization-based framework, while others focus on the model-based perspective. Moving from the mean to the weighted mean adds more complexity. For example, in the context of survey methodology the population size itself is often considered an unknown quantity to be estimated, so the calculation of the weighted mean is in fact based on a ratio estimator, with an estimator of the total in the numerator and an estimator of the population size in the denominator (making the variance calculation more complex).
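Post-stratification is the simplest calibration estimator: rescale the design weights within each stratum so that the weighted counts match known population totals. A minimal sketch (the stratum labels, base weights, and population counts are all assumed for illustration):

```python
import numpy as np

strata = np.array(["a", "a", "b", "b", "b"])       # stratum of each respondent
base_w = np.array([10.0, 10.0, 10.0, 10.0, 10.0])  # design weights (assumed)
N_h = {"a": 30.0, "b": 20.0}                       # known population counts

w = base_w.copy()
for h, total in N_h.items():
    mask = strata == h
    w[mask] *= total / base_w[mask].sum()  # rescale to hit the known total

# Calibration constraint satisfied: weighted counts now match N_h exactly.
print(w[strata == "a"].sum(), w[strata == "b"].sum())  # 30.0 20.0
```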

### Common types of weights

There are many types (and subtypes) of weights, with different ways to use and interpret them. For some weights the absolute value carries important meaning, while for others what matters is the relative values of the weights to each other. This section presents some of the more common types of weights so that they can be referenced in follow-up sections. * Frequency weights are a basic type of weighting, presented in introductory statistics courses. With these, each weight is an integer that indicates the absolute frequency of an item in the sample. These are also sometimes termed repeat (or occurrence) weights. The specific value has an absolute meaning that is lost if the weights are transformed (e.g.:
scaling
). For example: if we have the numbers 10 and 20 with frequency weights of 2 and 3, then "spreading" our data gives: 10, 10, 20, 20, 20 (each with weight 1). Frequency weights encode the amount of information contained in a dataset, and thus allow things like unbiased weighted variance estimation using Bessel's correction. Notice that such weights are often
random variable
s, since the specific number of items we will see from each value in the dataset is random. * inverse-variance weighting is when each element is assigned a weight that is the inverse of its (known) variance. When all elements have the same expectancy, using such weights for calculating weighted average has the least variance among all weighted averages. In the common formulation, these weights are known and not random (this seems related to reliability weights). * Normalized (convex) weights is a set of weights that form a convex combination. I.e.: each weight is a number between 0 and 1, and the sum of all weights is equal to 1. Any set of (non negative) weights can be turned into normalized weights by dividing each weight with the sum of all weights, making these weights normalized to sum to 1. : A related form are weights normalized to sum to sample size (n). These (non-negative) weights sum to the sample size (n), and their mean is 1. Any set of weights can be normalized to sample size by dividing each weight with the average of all weights. These weights have a nice relative interpretation where elements with weight larger than 1 are more "important" (in terms of their relative influence on, say, the weighted mean) then the average observation, while weights smaller than 1 are less "important" than the average observation. * Inverse probability weighting is when each element is given a weight that is (proportional) to the inverse probability of selecting that element. E.g., by using $w_i = \frac$. With inverse probability weights, we learn how many items each element "represents" in the target population. Hence, the sum of such weights returns the size of the target population of interest. Inverse probability weights can be normalized to sum to 1 or normalized to sum to the sample size (n), and many of the calculations from the following sections will yield the same results. 
: When a sample is EPSEM then all the selection probabilities are equal, and the inverse of the selection probability yields weights that are all equal to one another (they are all equal to $\frac{1}{p} = \frac{N}{n}$, where $n$ is the sample size and $N$ is the population size). Such a sample is called a self-weighting sample. There are also indirect ways of applying "weighted" adjustments. For example, the existing cases may be duplicated to impute missing observations (e.g.: from non-response), with variance estimated using methods such as multiple imputation. A complementary approach is to remove (i.e.: give a weight of 0 to) some cases. For example, when wanting to reduce the influence of over-sampled groups that are less essential for some analysis. Both cases are similar in nature to inverse probability weighting, but in practice they add or remove rows of data (making the input potentially simpler to use in some software implementations), instead of applying an extra column of weights. Nevertheless, the consequences of such implementations are similar to just using weights. So while in the case of removing observations the data can easily be handled by common software implementations, the case of adding rows requires special adjustments for the uncertainty estimations. Not doing so may lead to erroneous conclusions (i.e.: there is no free lunch when using alternative representations of the underlying issues). The term "haphazard weights", coined by Kish, is used to refer to weights that correspond to unequal selection probabilities, but ones that are not related to the expectancy or variance of the selected elements.
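The weight normalizations described above are easy to state in code. Here is a minimal Python sketch (the weight values are hypothetical inverse-probability weights, made up for illustration):

```python
import numpy as np

# Hypothetical inverse-probability weights: each value says how many
# population elements the sampled element "represents".
w = np.array([10.0, 20.0, 20.0, 50.0])

# The sum of inverse-probability weights estimates the target population size.
population_size_estimate = w.sum()

# Normalized (convex) weights: non-negative and summing to 1.
w_convex = w / w.sum()

# Weights normalized to the sample size n: they sum to n and have mean 1,
# so elements with weight > 1 are more "important" than the average observation.
w_sample = w / w.mean()

print(population_size_estimate, w_convex.sum(), w_sample.mean())
```

Note that dividing by the sum versus the mean differs only by the constant factor n, which is why many Deff calculations below give the same result for either normalization.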

### Formula

When taking an unrestricted sample of $n$ elements, we can then randomly split these elements into $H$ disjoint strata, each of them containing some $n_h$ elements so that $\sum_{h=1}^H n_h = n$. All elements in each stratum $h$ have some (known) non-negative weight assigned to them ($w_h$). The weight $w_h$ can be produced by the inverse of some unequal selection probability for elements in stratum $h$ (i.e.: inverse probability weighting following something like post-stratification). In this setting, Kish's design effect, for the increase in variance of the sample weighted mean due to this design (reflected in the weights), versus SRS of some outcome variable y (when there is no correlation between the weights and the outcome, i.e.: haphazard weights) is: : $D_{eff} = \frac{n \sum_{h=1}^H n_h w_h^2}{\left(\sum_{h=1}^H n_h w_h\right)^2}$ By treating each item as coming from its own stratum ($\forall h: n_h=1$), Kish (in 1992) simplified the above formula to the (well known) following version:Henry, Kimberly A., and Richard Valliant. "A design effect measure for calibration weighting in single-stage samples." Survey Methodology 41.2 (2015): 315-331
(pdf)
/ref> : $D_{eff} = \frac{n \sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2} = \frac{n^{-1} \sum_{i=1}^n w_i^2}{\bar w^2} = \frac{\overline{w^2}}{\bar w^2}$ This version of the formula is valid when one stratum has several observations taken from it (i.e.: each having the same weight), or when there are simply many strata where each one had one observation taken from it, but several of them had the same probability of selection. While the interpretation is slightly different, the calculation of the two scenarios comes out to be the same. Notice that Kish's definition of the design effect is closely tied to the coefficient of variation (also termed ''relative variance'', ''relvariance'' or ''relvar'' for short) of the weights (when using the uncorrected (population level) sample standard deviation for estimation). This has several notations in the literature: : $D_{eff} = 1 + L = 1 + cv_w^2 = 1 + relvar\left(w\right) = 1 + \frac{V\left(w\right)}{\bar w^2}$. Where $V\left(w\right) = \frac{\sum_{i=1}^n \left(w_i - \bar w\right)^2}{n}$ is the population variance of $w$, and $\bar w = \frac{\sum_{i=1}^n w_i}{n}$ is the mean. When the weights are normalized to sample size (so that their sum is equal to n and their mean is equal to 1), then $cv_w^2 = V\left(w\right)$ and the formula reduces to $D_{eff} = 1 + V\left(w\right)$. While it is true we assume the weights are fixed, we can think of their variance as the variance of an empirical distribution defined by sampling (with equal probability) one weight from our set of weights (similar to how we would think about the correlation of x and y in a
simple linear regression
). $cv_w^2 = \left(\frac{sd\left(w\right)}{\bar w}\right)^2 = \frac{V\left(w\right)}{\bar w^2} = \frac{n^{-1} \sum_{i=1}^n \left(w_i - \bar w\right)^2}{\bar w^2} = \frac{n^{-1} \sum_{i=1}^n w_i^2 - \bar w^2}{\bar w^2} = \frac{n \sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2} - 1 = D_{eff} - 1 \implies D_{eff} = 1 + cv_w^2$
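Kish's design effect can be computed either directly from the weights or as one plus the squared coefficient of variation of the weights. A quick numerical check in Python (the weight values are arbitrary, chosen only for illustration):

```python
import numpy as np

w = np.array([1.0, 1.0, 2.0, 4.0])  # arbitrary illustrative weights
n = len(w)

# Kish's design effect: n * sum(w_i^2) / (sum(w_i))^2
deff = n * np.sum(w**2) / np.sum(w)**2

# Equivalent form: 1 + cv_w^2, using the uncorrected (population-level)
# variance of the weights, which is what numpy's var() computes by default.
cv2 = w.var() / w.mean()**2

print(deff, 1 + cv2)  # both equal 1.375 for these weights
```

The agreement holds for any set of non-negative weights, and is unaffected by rescaling the weights (both forms are scale invariant).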

### Assumptions and proofs

The above formula gives the increase in the variance of the weighted mean based on "haphazard" weights, where y are observations selected using unequal selection probabilities (with no within-cluster correlation, and no relationship to the expectancy or variance of the outcome measurement), and y' are the observations we would have had if we got them from a
simple random sample
, then: $D_{eff} = \frac{var\left(\bar y_w\right)}{var\left(\bar y'\right)} = \frac{var\left(\bar y_w\right)}{\sigma^2 / n}$ From a model based perspective,Gabler, Siegfried, Sabine Häder, and Partha Lahiri. "A model based justification of Kish's formula for design effects for weighting and clustering." Survey Methodology 25 (1999): 105–106.
pdf
this formula holds when all n observations ($y_1, ..., y_n$) are (at least approximately)
uncorrelated
($\forall \left(i \neq j\right): cor\left(y_i, y_j\right) = 0$), with the same
variance
($\sigma^2$) in the response variable of interest (y). It also assumes the weights themselves are not a
random variable
but rather some known constants (E.g.: the inverse of probability of selection, for some pre-determined and known
sampling design
). The following is a simplified proof for when there are no clusters (i.e.: no intraclass correlation between elements of the sample) and each stratum includes only one observation: $\begin{align} var\left(\bar y_w\right) & \overset{1}{=} var\left(\frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} \right) \overset{2}{=} var\left( \sum_{i=1}^n w_i' y_i \right) \overset{3}{=} \sum_{i=1}^n var\left( w_i' y_i \right) \\ & \overset{4}{=} \sum_{i=1}^n w_i'^2 var\left( y_i \right) \overset{5}{=} \sum_{i=1}^n w_i'^2 \sigma^2 = \sigma^2 \sum_{i=1}^n w_i'^2 = \sigma^2 \frac{\sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2} \\ & = \frac{\sigma^2}{n} \frac{n \sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2} = \frac{\sigma^2}{n} D_{eff} = var\left(\bar y'\right) D_{eff} \implies D_{eff} = \frac{var\left(\bar y_w\right)}{var\left(\bar y'\right)} \\ \end{align}$ Transitions: # from the definition of the weighted mean. # using the normalized (convex) weights definition (weights that sum to 1): $w_i' = \frac{w_i}{\sum_{j=1}^n w_j}$. # sum of uncorrelated random variables. # if the weights are constants (from the basic properties of the variance). Another way to say it is that the weights are known upfront for each observation i, namely that we are actually calculating $var\left(\bar y_w \mid w \right)$. # when all observations have the same
variance
($\sigma^2$). The conditions on y trivially hold if the y observations are i.i.d. with the same expectation and
variance
. In such a case we have $y=y'$, and we can estimate $var\left(\bar y_w\right)$ by using $\widehat{var}\left(\bar y_w\right) = \widehat{var}\left(\bar y'\right) \times D_{eff}$. If the y's do not all have the same expectation then we cannot use the estimated variance for this calculation, since that estimation assumes that all $y_i$s have the same expectation. Specifically, if there is a correlation between the weights and the outcome variable y, then it means that the expectation of y is not the same for all observations (but rather, dependent on the specific weight value of each observation). In such a case, while the design effect formula might still be correct (if the other conditions are met), it would require a different estimator for the variance of the weighted mean. For example, it might be better to use a weighted variance estimator. If different $y_i$s have different variances, then while the weighted variance could capture the correct population-level variance, Kish's formula for the design effect may no longer be true. A similar issue happens if there is some correlation structure in the samples (such as when using
cluster sampling
).
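Under the assumptions of the proof above (i.i.d. outcomes, fixed weights unrelated to the outcome), a short simulation confirms that the variance of the weighted mean is inflated by $D_{eff}$ relative to the unweighted mean. This is only a sketch; the weight distribution, sample size, and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
w = rng.uniform(0.5, 2.0, size=n)        # fixed "haphazard" weights
deff = n * np.sum(w**2) / np.sum(w)**2   # Kish's design effect for these weights

reps = 200_000
y = rng.normal(size=(reps, n))           # i.i.d. outcomes with sigma^2 = 1
var_srs = y.mean(axis=1).var()           # Monte Carlo variance of the plain mean
var_wtd = (y @ (w / w.sum())).var()      # Monte Carlo variance of the weighted mean

print(var_wtd / var_srs, deff)           # the two numbers should nearly agree
```

Because the outcomes are i.i.d. and the weights are held fixed, the simulated ratio converges to Kish's formula; introducing a correlation between `w` and `y` would break this agreement, as discussed above.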

### Alternative definitions in the literature

It is worth noting that some sources in the literature give the following alternative definition to Kish's design effect, stating it is: "the ratio of the variance of the weighted survey mean under disproportionate stratified sampling to the variance under proportionate stratified sampling when all stratum unit variances are equal". This definition can be slightly misleading, since it might be interpreted to mean that "proportionate stratified sampling" was achieved via
stratified sampling
, in which a pre-determined number of units is selected from each stratum. Such selection will yield reduced variance (as compared with
simple random sample
), since it removes some of the uncertainty in the specific number of elements per stratum. This is different than Kish's original definition which compared the variance of the design to a
simple random sample
(which would yield an approximately proportionate allocation, but not exactly, due to the variance in the sample sizes in each stratum). Park and Lee (2006) reflect on this by stating that "The rationale behind the above derivation is that the loss in precision of [the weighted mean] due to haphazard unequal weighting can be approximated by the ratio of the variance under disproportionate stratified sampling to that under the proportionate stratified sampling". How far these two definitions differ from each other is not discussed in the literature. In his book from 1977, Cochran provides a formula for the proportional increase in variance due to deviation from optimum allocation (what, in Kish's formulas, would be called ''L''). However, the connection from that formula to Kish's ''L'' is not apparent.

### Alternative naming conventions

Earlier papers would use the term $Deff$. As more definitions of design effect appeared, Kish's design effect for unequal selection probabilities was denoted $Deff_K$ (or $Deft_K^2$), or simply $deff_K$ for short.Valliant, Richard, Jill A. Dever, and Frauke Kreuter. Practical tools for designing and weighting survey samples. New York: Springer, 2013. Kish's design effect is also known as the "Unequal Weighting Effect" (or just UWE), a term coined by Liu et al. in 2002.Liu, Jun, Vince Iannacchione, and Margie Byron. "Decomposing design effects for stratified sampling." Proceedings of the survey research methods section, American Statistical Association. 2002
(pdf)
/ref>

### Spencer's Deff for estimated total ($\hat Y$)

The estimator for the total is the "p-expanded with replacement" estimator (a.k.a.: the ''pwr-estimator'', or the Hansen and Hurwitz estimator). It is based on a
simple random sample
(with replacement, denoted ''SIR'') of ''m'' items ($y_k$) from a population of size N. Each item has a probability of $p_k$ (k from 1 to N) of being drawn in a single draw ($\sum_U p_k = 1$, i.e.: it's a
multinomial distribution
). The probability that a specific $y_k$ will appear in our sample is $1 - \left(1 - p_k\right)^m$. The "p-expanded with replacement" value is $Z_i = \frac{y_i}{p_i}$ with the following expectancy: $E\left(Z_i\right) = \sum_U p_k \frac{y_k}{p_k} = Y$. Hence $\hat Y_{pwr} = \frac{1}{m} \sum_{i=1}^m Z_i$, the pwr-estimator, is an unbiased estimator for the sum total of y. In 2000, Bruce D. Spencer proposed a formula for estimating the design effect for the variance of estimating the total (not the mean) of some quantity ($\hat Y$), when there is correlation between the selection probabilities of the elements and the outcome variable of interest.Spencer, Bruce D. "An approximate design effect for unequal weighting when measurements may correlate with selection probabilities." Survey Methodology 26 (2000): 137-138
(pdf)
/ref> In this setup, a sample of size ''n'' is drawn (with replacement) from a population of size ''N''. Each item is drawn with probability $P_i$ (where $\sum_{i=1}^N P_i = 1$, i.e.:
multinomial distribution
). The selection probabilities are used to define the normalized (convex) weights: $w_i = \frac{1}{n N P_i}$. Notice that for some random set of ''n'' items, the sum of weights will be equal to 1 only in expectation ($E\left(\sum_{i=1}^n w_i\right) = 1$), with some variability of the sum around it (i.e.: the sum of elements from a Poisson binomial distribution). The relationship between $y_i$ and $P_i$ is defined by the following (population)
simple linear regression
: : $y_i = \alpha + \beta P_i + \epsilon_i$ Where $y_i$ is the outcome of element ''i'', which linearly depends on $P_i$ with the intercept $\alpha$ and slope $\beta$. The residual from the fitted line is $\epsilon_i = y_i - \left(\alpha + \beta P_i\right)$. We can also define the population variances of the outcome and the residuals as $\sigma^2_y$ and $\sigma^2_\epsilon$. The correlation between $P_i$ and $y_i$ is $\rho_{yP}$. Spencer's (approximate) design effect, for estimating the total of ''y'', is:Park, Inho, and Hyunshik Lee. "The design effect: do we know all about it." Proceedings of the Annual Meeting of the American Statistical Association. 2001
(pdf)
/ref> : $Deff_{Spencer} = \left(1- \hat \rho^2_{yP}\right)\left(1 + L\right) + \left(\frac{\hat \alpha}{\hat \sigma_y}\right)^2 L$ Where: * $\hat \rho^2_{yP}$ estimates $\rho^2_{yP}$ * $\hat \alpha$ estimates the intercept $\alpha$ * $\hat \sigma_y$ estimates the population standard deviation $\sigma_y$, and * L is the relative variance of the weights, as defined in Kish's formula: : $L = cv_w^2 = relvar\left(w\right) = \frac{V\left(w\right)}{\bar w^2}$. This assumes that the regression model fits well so that the probability of selection and the residuals are
independent
, since this leads to the residuals, and the squared residuals, being uncorrelated with the weights, i.e.: $\rho_{w,\epsilon} = 0$ and also $\rho_{w,\epsilon^2} = 0$. When the population size (N) is very large, the formula can be written as: : $Deff_{Spencer} = \left(1 - \hat \rho^2_{yP}\right)\left(1 + cv_w^2\right) + \left(\frac{\bar y}{\hat \sigma_y}\right)^2 cv_w^2$ (since $\alpha = \bar Y - \beta \times \bar P = \bar Y - \beta \times \frac{1}{N} \approx \bar Y$, where $cv_y^2 = \frac{\hat \sigma_y^2}{\bar y^2}$) This approximation assumes that the linear relationship between ''P'' and ''y'' holds. And also that the correlations of the weights with the errors, and with the errors squared, are both zero. I.e.: $\rho_{w,\epsilon} = 0$ and $\rho_{w,\epsilon^2} = 0$. We notice that if $\hat \rho_{yP} \approx 0$, then $\hat \alpha \approx \bar y$ (i.e.: the average of ''y''). In such a case, the formula reduces to : $Deff_{Spencer} = \left(1 + L\right) + \left(\frac{\bar y}{\hat \sigma_y}\right)^2 L$ Only if the variance of ''y'' is much larger than its mean is the right-most term close to 0 (i.e.: $\frac{\bar y^2}{\hat \sigma_y^2} \approx 0$), which reduces Spencer's design effect (for the estimated total) to Kish's design effect (for the mean): $Deff_{Spencer} \approx \left(1 + L\right) = Deff_{Kish}$. Otherwise, the two formulas will yield different results, which demonstrates the difference between the design effect of the total vs that of the mean.
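Spencer's approximation can be computed directly from a vector of outcomes and a vector of single-draw selection probabilities. A hedged Python sketch (the data in `y` and `P` are made up; population-level, uncorrected moments are used throughout, matching the estimator described above):

```python
import numpy as np

y = np.array([3.0, 5.0, 8.0, 12.0, 20.0])    # hypothetical outcomes
P = np.array([0.10, 0.15, 0.20, 0.25, 0.30]) # single-draw selection probabilities

w = 1.0 / P                       # weights proportional to inverse probabilities
L = w.var() / w.mean()**2         # relvariance of the weights (Kish's L)

rho = np.corrcoef(y, P)[0, 1]                    # correlation of y with P
beta = np.cov(y, P, bias=True)[0, 1] / P.var()   # slope of the regression of y on P
alpha = y.mean() - beta * P.mean()               # intercept of that regression
sigma_y = y.std()                                # population sd of y

deff_spencer = (1 - rho**2) * (1 + L) + (alpha / sigma_y)**2 * L
print(deff_spencer)
```

With the strong positive correlation between `y` and `P` in this toy data, the result comes out well below Kish's $1 + L$, illustrating how the correlation term changes the picture for the estimated total.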

### Park and Lee's Deff for estimated ratio-mean ($\hat{\bar Y}$)

In 2001, Park and Lee extended Spencer's formula to the case of the ratio-mean (i.e.: estimating the mean by dividing the estimator of the total by the estimator of the population size). It is: : $Deff_{Park-Lee} = \left(1 - \hat \rho^2_{yP}\right)\left(1 + cv_w^2\right) + \frac{\hat \rho^2_{yP}}{cv_P^2} cv_w^2$ Where: * $cv_P^2$ is the (estimated) squared coefficient of variation of the probabilities of selection. Park and Lee's formula is exactly equal to Kish's formula when $\hat \rho_{yP}^2 = 0$. Both formulas relate to the design effect of the mean of ''y'' (while Spencer's Deff relates to the estimation of the total). In general, the Deff for the total ($\hat Y$) tends to be less efficient than the Deff for the ratio mean ($\hat{\bar Y}$) when $\rho_{yP}$ is small. And in general, $\rho_{yP}$ impacts the efficiency of both design effects.

## Cluster sampling

For data collected using
cluster sampling
we assume the following structure: * $n_k$ observations in each cluster, K clusters, and a total of $n = \sum_{k=1}^K n_k$ observations. * The observations have a block correlation matrix in which every pair of observations from the same cluster is correlated with an intra-class correlation of $\rho$, while every pair from different clusters is uncorrelated. I.e., for every pair of observations $i$ and $j$, if they belong to the same cluster $k$, we get $cov\left(y_i, y_j\right) = \rho \sigma^2$; and two items from two different clusters are not correlated, i.e.: $cov\left(y_i, y_j\right) = 0$. * An element from any cluster is assumed to have the same variance: $var\left(y_i\right) = \sigma^2$. When clusters are all of the same size $n^*$, the design effect ''D''eff, proposed by Kish in 1965 (and later re-visited by others), is given by: :$D_{eff} = 1 + \left(n^* - 1\right) \rho .$ It is sometimes also denoted as $Deff_C$. In various papers, when cluster sizes are not equal, the above formula is also used with $n^*$ as the average cluster size (which is also sometimes denoted as $\bar b$).Kish, L. (1987). Weighting in $Deft^2$. The Survey Statistician, June 1987. (this paper doesn't seem to be available online, but is referenced in several places as the original source of this formula) In such cases, Kish's formula (using the average cluster size) serves as a conservative (upper bound) estimate of the exact design effect. Alternative formulas exist for unequal cluster sizes. Follow-up work has discussed the sensitivity of using the average cluster size under various assumptions.
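The formula $D_{eff} = 1 + (n^* - 1)\rho$ is straightforward to compute; the example below (with made-up values) shows how quickly even a small intraclass correlation inflates the variance once clusters are large:

```python
def cluster_design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish's design effect for cluster sampling: 1 + (n* - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

# A modest ICC of 0.02 with 30 observations per cluster already inflates
# the variance of the mean by 58%:
print(cluster_design_effect(30, 0.02))  # ≈ 1.58
```

Equivalently, halving the cluster size (taking more, smaller clusters) roughly halves the excess variance, which is why cluster size is a key lever in design planning.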

## Unequal selection probabilities $\times$ Cluster sampling

In his paper from 1987, Kish proposed a combined design effect that incorporates both the effects due to weighting that accounts for unequal selection probabilities as well as cluster sampling:Gabler, Siegfried, Sabine Hader, and Peter Lynn. Design effects for multiple design samples. No. 2005-12. ISER Working Paper Series, 2005
(pdf)
/ref> : $Deff_{Kish} = \frac{n \sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2} \left( 1 + \left(n^* - 1\right) \rho \right) = deff_K \times deff_C$ With notations similar to the above. This formula received a model based justification, proposed in 1999 by Gabler et al.
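The combined formula is simply the product of the two component design effects. A small sketch (the weights and clustering parameters are illustrative):

```python
import numpy as np

def deff_weighting(w):
    """Kish's design effect from unequal weights: n * sum(w^2) / (sum(w))^2."""
    w = np.asarray(w, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w)**2

def deff_combined(w, avg_cluster_size, icc):
    """Kish's 1987 combined design effect: deff_K * deff_C."""
    return deff_weighting(w) * (1 + (avg_cluster_size - 1) * icc)

w = [1.0, 1.5, 0.5, 2.0]                 # illustrative weights
print(deff_weighting(w))                 # → 1.2
print(deff_combined(w, avg_cluster_size=20, icc=0.01))  # ≈ 1.43
```

The multiplicative form means that even a moderate weighting effect compounds with the clustering effect, rather than adding to it.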

## Stratified sampling $\times$ unequal selection probabilities $\times$ Cluster sampling

In 2000, Liu and Aragon proposed a decomposition of the unequal selection probabilities design effect for different strata in stratified sampling. In 2002, Liu et al. extended that work to account for stratified samples where, within each stratum, there is a set of unequal selection probability weights. The cluster sampling is either global or per stratum. Similar work was done by Park et al. in 2003.

# Uses

Deff is primarily used for several purposes:Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Nashville, TN: John Wiley & Sons. * When developing the design - to evaluate its efficiency, i.e.: whether there is potentially "too much" increase in variance due to some decision, or whether the new design is more efficient (e.g.: as in stratified sampling). * As a way of guiding sample size (overall, per stratum, per cluster, etc.), and also * When evaluating potential problems with a post-hoc weighting analysis (e.g.: from non-response adjustments). There is no universal rule-of-thumb for which design effect value is "too high", but the literature indicates that $Deff > 1.5$ is likely to warrant some attention. In his 1995 paper, Kish proposed the following categorization of when Deff is, and is not, useful: * Design effect is ''unnecessary'' when: the source population is close to i.i.d., or when the sample design of the data was drawn as a
simple random sample
. It is also less useful when the sample size is relatively small (at least partially, for practical reasons). And also if only
descriptive statistics
are of interest (i.e.: point estimation). It is also suggested that if standard errors are needed for only a handful of statistics, it may be acceptable to ignore Deff. * Design effect is ''necessary'' when: averaging sampling errors for different variables measured on the same survey; or when averaging the same measured quantity from several surveys over a period of time; or when extrapolating from the error of simple statistics (e.g.: the mean) to more complex ones (e.g.: regression coefficients); when designing a future survey (but with proper caution); or as an aiding statistic to identify glaring issues with the data or its analysis (e.g.: ranging from mistakes to the presence of outliers). When planning the sample size, work has been done to correct the design effect so as to separate the interviewer effect (measurement error) from the effects of the sampling design on the sampling variance. While Kish originally hoped the design effect could be as agnostic as possible to the underlying distribution of the data, the sampling probabilities, their correlations, and the statistics of interest - follow-up research has shown that these do influence the design effect. Hence, careful attention to these properties should be taken when deciding which Deff calculation to use, and how to use it.
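A common way to use Deff in sample size planning is through the effective sample size, $n_{eff} = n / Deff$, and its inverse (inflating a target SRS sample size by Deff). A minimal sketch:

```python
def effective_sample_size(n: int, deff: float) -> float:
    """SRS-equivalent size of a complex-design sample of size n."""
    return n / deff

def required_sample_size(n_srs: int, deff: float) -> float:
    """Inflate an SRS-based target sample size to account for the design."""
    return n_srs * deff

# With Deff = 1.5, a complex sample of 1,000 carries about as much
# information on the mean as an SRS of ~667; conversely, matching the
# precision of an SRS of 400 requires about 600 complex-design interviews.
print(effective_sample_size(1000, 1.5))  # ≈ 666.7
print(required_sample_size(400, 1.5))    # → 600.0
```

As noted above, this is only a planning heuristic: the anticipated Deff must itself be guessed (e.g.: from past surveys), and the realized design effect can differ.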

# History

The term "Design effect" was introduced by Leslie Kish in 1965 in his book "Survey Sampling". In his paper from 1995,Kish, Leslie. "Methods for design effects." Journal of Official Statistics 11.1 (1995): 55
pdf
Kish mentions that a similar concept, termed "Lexis ratio", was described at the end of the 19th century. The closely related intraclass correlation was described by Fisher in 1950, while computations of ratios of variances were already published by Kish and others from the late 1940s to the 1950s. One of the precursors to Kish's definition was the work done by Cornfield in 1951.Cochran, William G. "Modern methods in the sampling of human populations." American Journal of Public Health and the Nation's Health 41.6 (1951): 647–668.Park, Inho, and Hyunshik Lee. "Design effects for the weighted mean and total estimators under complex survey sampling." Quality Control and Applied Statistics 51.4 (2006): 381–384; also Survey Methodology 30.2 (December 2004): 183–193, Statistics Canada, Catalogue No. 12-001.
pdf
In his original book from 1965, Kish proposed the general definition for the design effect (ratio of variances of two estimators, one from a sample with some design and the other from a simple random sample). In his book, Kish proposed the formula for the design effect of cluster sampling (with intraclass correlation); as well as the famous design effect formula for unequal probability sampling. These are often known as "Kish's design effect", and have been merged later into a single formula.