In molecular biology, a batch effect occurs when non-biological factors in an experiment cause changes in the data produced by the experiment. Such effects can lead to inaccurate conclusions when their causes are correlated with one or more outcomes of interest in an experiment. They are common in many types of high-throughput sequencing experiments, including those using microarrays, mass spectrometers, and single-cell RNA-sequencing data. They are most commonly discussed in the context of genomics and high-throughput sequencing research, but they exist in other fields of science as well.
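The risk posed by a batch factor that is correlated with the outcome of interest can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a method taken from the literature: the function name `simulate`, the batch labels, the shift of 2.0 units and the noise model are all hypothetical choices made for illustration. The simulated feature has no true difference between cases and controls, so any observed difference is purely technical.

```python
import random
import statistics

def simulate(confounded, n_per_group=50, batch_shift=2.0, seed=0):
    """Return the observed case-minus-control mean difference for one
    feature that has NO true biological difference between the groups.
    All names and parameter values here are illustrative assumptions."""
    rng = random.Random(seed)
    values = {"case": [], "control": []}
    for group in ("case", "control"):
        for i in range(n_per_group):
            if confounded:
                # Every case is processed in batch B, every control in batch A.
                batch = "B" if group == "case" else "A"
            else:
                # Balanced design: each group is split evenly across batches.
                batch = "B" if i % 2 == 0 else "A"
            # Batch B adds a constant technical offset to the measurement.
            technical = batch_shift if batch == "B" else 0.0
            values[group].append(rng.gauss(10.0, 1.0) + technical)
    return statistics.mean(values["case"]) - statistics.mean(values["control"])

confounded_diff = simulate(confounded=True)
balanced_diff = simulate(confounded=False)
print(f"confounded design: {confounded_diff:.2f}")
print(f"balanced design:   {balanced_diff:.2f}")
```

In the confounded design the group difference is close to the batch shift, even though no biological effect was simulated. In the balanced design the shift contributes equally to both groups and cancels out of the comparison.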
Definitions
Multiple definitions of the term "batch effect" have been proposed in the literature. Lazar et al. (2013) noted, "Providing a complete and unambiguous definition of the so-called batch effect is a challenging task, especially because its origins and the way it manifests in the data are not completely known or not recorded." Focusing on microarray experiments, they proposed a new definition based on several previous ones: "[T]he batch effect represents the systematic technical differences when samples are processed and measured in different batches and which are unrelated to any biological variation recorded during the MAGE [microarray gene expression] experiment."
Causes
Many variable factors have been identified as potential causes of batch effects, including the following:
*Laboratory conditions
*Choice of reagent lot or batch
*Personnel differences
*Time of day when the experiment was conducted
*Atmospheric ozone levels
*Instruments used to conduct the experiment
Correction
Various statistical techniques have been developed to attempt to correct for batch effects in high-throughput experiments. These techniques are intended for use during experimental design and data analysis. Historically they have focused mostly on genomics experiments, and they have only recently begun to expand into other scientific fields such as proteomics. One problem associated with such techniques is that they may unintentionally remove actual biological variation. Some techniques that have been used to detect or correct for batch effects include the following:
*For microarray data, linear mixed models have been used, with confounding factors included as random intercepts.
*In 2007, Johnson et al. proposed an empirical Bayesian technique for correcting for batch effects. This approach represented an improvement over previous methods in that it could be effectively used with small batch sizes.
*In 2012, the sva software package was introduced. It provides multiple functions to adjust for batch effects, including surrogate variable estimation, which had previously been shown to improve reproducibility and reduce dependence in high-throughput experiments.
*Haghverdi et al. (2018) proposed a technique designed for single-cell RNA-seq data, based on the detection of mutual nearest neighbors in the data.
*Papiez et al. (2019) proposed a dynamic programming algorithm to identify batch effects of unknown value in high-throughput data.
*Voß et al. (2022) proposed an algorithm called HarmonizR which enables data harmonization across independent proteomic datasets with appropriate handling of missing values.
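
The general idea behind correction can be illustrated with a deliberately simple per-feature location and scale adjustment. This is a sketch under stated assumptions and does not reproduce any of the published methods listed above: in particular it is not the empirical Bayesian technique of Johnson et al., which as its name indicates relies on empirical Bayes estimation rather than plain per-batch statistics, and it is unrelated to the sva package. The function name `adjust_feature` and the example measurements are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def adjust_feature(values, batches):
    """Standardize one feature within each batch, then map it back to the
    grand mean and the pooled within-batch standard deviation.
    Illustrative sketch only; not a published correction method."""
    by_batch = defaultdict(list)
    for x, b in zip(values, batches):
        by_batch[b].append(x)
    if any(len(v) < 2 for v in by_batch.values()):
        raise ValueError("each batch needs at least two samples")

    grand_mean = statistics.mean(values)
    means = {b: statistics.mean(v) for b, v in by_batch.items()}
    sds = {b: statistics.stdev(v) for b, v in by_batch.items()}

    # Pooled within-batch variance, weighted by degrees of freedom.
    dof = sum(len(v) - 1 for v in by_batch.values())
    pooled_var = sum((len(v) - 1) * sds[b] ** 2 for b, v in by_batch.items()) / dof
    pooled_sd = math.sqrt(pooled_var)

    adjusted = []
    for x, b in zip(values, batches):
        # Remove the batch-specific location and scale (guarding against zero SD).
        z = (x - means[b]) / sds[b] if sds[b] > 0 else 0.0
        adjusted.append(grand_mean + z * pooled_sd)
    return adjusted

# Hypothetical measurements of one feature: batch B reads about 3 units higher.
values = [5.1, 4.9, 5.3, 4.7, 8.2, 7.6, 8.4, 7.8]
batches = ["A"] * 4 + ["B"] * 4
adjusted = adjust_feature(values, batches)
print([round(x, 2) for x in adjusted])
```

After adjustment both batches share the same mean for this feature. The sketch also shows why such techniques can remove actual biological variation: if batch B had contained only samples from one biological group, the real difference between groups would have been removed along with the technical one.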