Biostatistics (also known as biometry) are the development and application of

statistical Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industr ...

methods to a wide range of topics in

biology Biology is the scientific study of life. It is a natural science with a broad scope but has several unifying themes that tie it together as a single, coherent field. For instance, all organisms are made up of cells that process hereditary ...

. It encompasses the design of biological

experiment An experiment is a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs whe ...

s, the collection and analysis of data from those experiments and the interpretation of the results.

History

Biostatistics and genetics

Biostatistical modeling forms an important part of numerous modern biological theories.

Genetics Genetics is the study of genes, genetic variation, and heredity in organisms.Hartl D, Jones E (2005) It is an important branch in biology because heredity is vital to organisms' evolution. Gregor Mendel, a Moravian Augustinian friar work ...

studies, since its beginning, used statistical concepts to understand observed experimental results. Some genetics scientists even contributed with statistical advances with the development of methods and tools. Gregor Mendel started the genetics studies investigating genetics segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's Mendelian inheritance work, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model with fractions of the heredity coming from each ancestral composing an infinite series. He called this the theory of " Law of Ancestral Heredity". His ideas were strongly disagreed by William Bateson, who followed Mendel's conclusions, that genetic inheritance were exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, as Raphael Weldon, Arthur Dukinfield Darbishire and Karl Pearson, and Mendelians, who supported Bateson's (and Mendel's) ideas, such as

Charles Davenport Charles Benedict Davenport (June 1, 1866 – February 18, 1944) was a biologist and eugenicist influential in the American eugenics movement. Early life and education Davenport was born in Stamford, Connecticut, to Amzi Benedict Davenport, ...

and

Wilhelm Johannsen Wilhelm Johannsen (3 February 1857 – 11 November 1927) was a Danish pharmacist, botanist, plant physiologist, and geneticist. He is best known for coining the terms gene, phenotype and genotype, and for his 1903 "pure line" experiments in ...

. Later, biometricians could not reproduce Galton conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis. Solving these differences also allowed to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of

population genetics Population genetics is a subfield of genetics that deals with genetic differences within and between populations, and is a part of evolutionary biology. Studies in this branch of biology examine such phenomena as adaptation, speciation, and po ...

and this synthesis all relied on statistics and developed its use in biology. * Ronald Fisher worked alongside statistician Betty Allan developing several basic statistical methods in support of his work studying the crop experiments at Rothamsted Research, published in Fisher's books Statistical Methods for Research Workers (1925) and

The Genetical Theory of Natural Selection ''The Genetical Theory of Natural Selection'' is a book by Ronald Fisher which combines Mendelian genetics with Charles Darwin's theory of natural selection, with Fisher being the first to argue that "Mendelism therefore validates Darwinism" and ...

(1930), as well as Allan's scientific papers. Fisher went on to give many contributions to genetics and statistics. Some of them include the ANOVA, p-value concepts, Fisher's exact test and

Fisher's equation In mathematics, Fisher's equation (named after statistician and biologist Ronald Fisher) also known as the Kolmogorov–Petrovsky–Piskunov equation (named after Andrey Kolmogorov, Ivan Petrovsky, and Nikolai Piskunov), KPP equation or Fis ...

for population dynamics. He is credited for the sentence "Natural selection is a mechanism for generating an exceedingly high degree of improbability". *

Sewall G. Wright Sewall Green Wright FRS(For) Honorary FRSE (December 21, 1889March 3, 1988) was an American geneticist known for his influential work on evolutionary theory and also for his work on path analysis. He was a founder of population genetics alongsi ...

developed F-statistics and methods of computing them and defined inbreeding coefficient. *

J. B. S. Haldane John Burdon Sanderson Haldane (; 5 November 18921 December 1964), nicknamed "Jack" or "JBS", was a British-Indian scientist who worked in physiology, genetics, evolutionary biology, and mathematics. With innovative use of statistics in biolo ...

's book, ''The Causes of Evolution'', reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of primordial soup. These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together

evolutionary biology Evolutionary biology is the subfield of biology that studies the evolutionary processes (natural selection, common descent, speciation) that produced the diversity of life on Earth. It is also defined as the study of the history of life ...

and

genetics Genetics is the study of genes, genetic variation, and heredity in organisms.Hartl D, Jones E (2005) It is an important branch in biology because heredity is vital to organisms' evolution. Gregor Mendel, a Moravian Augustinian friar work ...

into a consistent, coherent whole that could begin to be

quantitative Quantitative may refer to: * Quantitative research, scientific investigation of quantitative properties * Quantitative analysis (disambiguation) * Quantitative verse, a metrical system in poetry * Statistics, also known as quantitative analysis ...

ly modeled. In parallel to this overall development, the pioneering work of D'Arcy Thompson in ''On Growth and Form'' also helped to add quantitative discipline to biological study. Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at

Caltech The California Institute of Technology (branded as Caltech or CIT)The university itself only spells its short form as "Caltech"; the institution considers other spellings such a"Cal Tech" and "CalTech" incorrect. The institute is also occasional ...

, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."

Research planning

Any research in life sciences is proposed to answer a scientific question we might have. To answer this question with a high certainty, we need accurate results. The correct definition of the main hypothesis and the research plan will reduce errors while taking a decision in understanding a phenomenon. The research plan might include the research question, the hypothesis to be tested, the

experimental design The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associ ...

, data collection methods, data analysis perspectives and costs involved. It is essential to carry the study based on the three basic principles of experimental statistics: randomization, replication, and local control.

Research question

The research question will define the objective of a study. The research will be headed by the question, so it needs to be concise, at the same time it is focused on interesting and novel topics that may improve science and knowledge and that field. To define the way to ask the scientific question, an exhaustive literature review might be necessary. So the research can be useful to add value to the scientific community.

Hypothesis definition

Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis. The main propose is called

null hypothesis In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is ...

(H₀) and is usually based on a permanent knowledge about the topic or an obvious occurrence of the phenomena, sustained by a deep literature review. We can say it is the standard expected answer for the data under the situation in

test Test(s), testing, or TEST may refer to: * Test (assessment), an educational assessment intended to measure the respondents' knowledge or other abilities Arts and entertainment * ''Test'' (2013 film), an American film * ''Test'' (2014 film), ...

. In general, H_O assumes no association between _treatments. On the other hand, the alternative hypothesis is the denial of H_O. It assumes some degree of association between the treatment and the outcome. Although, the hypothesis is sustained by question research and its expected and unexpected answers. As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H₀ would be that there is no difference between the two diets in mice

metabolism Metabolism (, from el, μεταβολή ''metabolē'', "change") is the set of life-sustaining chemical reactions in organisms. The three main functions of metabolism are: the conversion of the energy in food to energy available to run ...

(H₀: μ₁ = μ₂) and the alternative hypothesis would be that the diets have different effects over animals metabolism (H₁: μ₁ ≠ μ₂). The hypothesis is defined by the researcher, according to his/her interests in answering the main question. Besides that, the alternative hypothesis can be more than one hypothesis. It can assume not only differences across observed parameters, but their degree of differences (''i.e.'' higher or shorter).

Sampling

Usually, a study aims to understand an effect of a phenomenon over a

population Population typically refers to the number of people in a single area, whether it be a city or town, region, country, continent, or the world. Governments typically quantify the size of the resident population within their jurisdiction usi ...

. In

, a

is defined as all the

individuals An individual is that which exists as a distinct entity. Individuality (or self-hood) is the state or quality of being an individual; particularly (in the case of humans) of being a person unique from other people and possessing one's own nee ...

of a given

species In biology, a species is the basic unit of classification and a taxonomic rank of an organism, as well as a unit of biodiversity. A species is often defined as the largest group of organisms in which any two individuals of the appropriat ...

, in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections possible of study. Although, in biostatistics, a

is not only the

, but the total of one specific component of their

organisms In biology, an organism () is any living system that functions as an individual entity. All organisms are composed of cells ( cell theory). Organisms are classified by taxonomy into groups such as multicellular animals, plants, and fu ...

, as the whole

genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...

, or all the sperm

cells Cell most often refers to: * Cell (biology), the functional basic unit of life Cell may also refer to: Locations * Monastic cell, a small room, hut, or cave in which a religious recluse lives, alternatively the small precursor of a monastery w ...

, for animals, or the total leaf area, for a plant, for example. It is not possible to take the measures from all the elements of a

. Because of that, the sampling process is very important for

statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical analysis infers properti ...

. Sampling is defined as to randomly get a representative part of the entire population, to make posterior inferences about the population. So, the

sample Sample or samples may refer to: Base meaning * Sample (statistics), a subset of a population – complete data set * Sample (signal), a digital discrete sample of a continuous analog signal * Sample (material), a specimen or small quantity of ...

might catch the most variability across a population. The

sample size Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a populati ...

is determined by several things, since the scope of the research to the resources available. In

clinical research Clinical research is a branch of healthcare science that determines the safety and effectiveness ( efficacy) of medications, devices, diagnostic products and treatment regimens intended for human use. These may be used for prevention, treat ...

, the trial type, as inferiority, equivalence, and

superior Superior may refer to: *Superior (hierarchy), something which is higher in a hierarchical structure of any kind Places *Superior (proposed U.S. state), an unsuccessful proposal for the Upper Peninsula of Michigan to form a separate state *Lake ...

ity is a key in determining sample

size Size in general is the magnitude or dimensions of a thing. More specifically, ''geometrical size'' (or ''spatial size'') can refer to linear dimensions ( length, width, height, diameter, perimeter), area, or volume. Size can also be me ...

Experimental design

Experimental designs sustain those basic principles of experimental statistics. There are three basic experimental designs to randomly allocate treatments in all plots of the

. They are completely randomized design,

randomized block design In the statistical theory of the design of experiments, blocking is the arranging of experimental units in groups (blocks) that are similar to one another. Blocking can be used to tackle the problem of pseudoreplication. Use Blocking reduces ...

, and

factorial designs In statistics, a full factorial experiment is an experiment whose design consists of two or more factors, each with discrete possible values or "levels", and whose experimental units take on all possible combinations of these levels across all s ...

. Treatments can be arranged in many ways inside the experiment. In

agriculture Agriculture or farming is the practice of cultivating plants and livestock. Agriculture was the key development in the rise of sedentary human civilization, whereby farming of domesticated species created food surpluses that enabled people ...

, the correct

is the root of a good study and the arrangement of treatments within the study is essential because

environment Environment most often refers to: __NOTOC__ * Natural environment, all living and non-living things occurring naturally * Biophysical environment, the physical and biological factors along with their chemical interactions that affect an organism or ...

largely affects the plots ( plants,

livestock Livestock are the domesticated animals raised in an agricultural setting to provide labor and produce diversified products for consumption such as meat, eggs, milk, fur, leather, and wool. The term is sometimes used to refer solely to ani ...

, microorganisms). These main arrangements can be found in the literature under the names of " lattices", "incomplete blocks", " split plot", "augmented blocks", and many others. All of the designs might include control plots, determined by the researcher, to provide an error estimation during inference. In

clinical studies Clinical trials are prospective biomedical or behavioral research studies on human participants designed to answer specific questions about biomedical or behavioral interventions, including new treatments (such as novel vaccines, drugs, diet ...

, the

s are usually smaller than in other biological studies, and in most cases, the

effect can be controlled or measured. It is common to use randomized controlled clinical trials, where results are usually compared with observational study designs such as case–control or

cohort Cohort or cohortes may refer to: * Cohort (educational group), a group of students working together through the same academic curriculum * Cohort (floating point), a set of different encodings of the same numerical value * Cohort (military unit) ...

Data collection

Data collection methods must be considered in research planning, because it highly influences the sample size and experimental design. Data collection varies according to type of data. For qualitative data, collection can be done with structured questionnaires or by observation, considering presence or intensity of disease, using score criterion to categorize levels of occurrence. For quantitative data, collection is done by measuring numerical information using instruments. In agriculture and biology studies, yield data and its components can be obtained by metric measures. However, pest and disease injuries in plats are obtained by observation, considering score scales for levels of damage. Especially, in genetic studies, modern methods for data collection in field and laboratory should be considered, as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments, while turn possible evaluate many plots in lower time than a human-based only method for data collection. Finally, all data collected of interest must be stored in an organized data frame for further analysis.

Analysis and data interpretation

Descriptive tools

Data can be represented through tables or graphical representation, such as line charts, bar charts, histograms, scatter plot. Also, measures of central tendency and variability can be very useful to describe an overview of the data. Follow some examples:

Frequency tables

One type of tables are the

frequency Frequency is the number of occurrences of a repeating event per unit of time. It is also occasionally referred to as ''temporal frequency'' for clarity, and is distinct from ''angular frequency''. Frequency is measured in hertz (Hz) which is eq ...

table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be: Absolute: represents the number of times that a determined value appear;

N = f_1 + f_2 + f_3 + ... + f_n

Relative: obtained by the division of the absolute frequency by the total number;

n_i = \frac

In the next example, we have the number of genes in ten operons of the same organism. :

Line graph

Line graphs represent the variation of a value over another metric, such as time. In general, values are represented in the vertical axis, while the time variation is represented in the horizontal axis.

Bar chart

A bar chart is a graph that shows categorical data as bars presenting heights (vertical bar) or widths (horizontal bar) proportional to represent values. Bar charts provide an image that could also be represented in a tabular format. In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016. The sharp fall in December 2016 reflects the outbreak of Zika virus in the birth rate in Brazil.

Histograms

The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson.

Scatter plot

A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each one presenting the value of one variable determining the position on the horizontal axis and another variable on the vertical axis. They are also called scatter graph, scatter chart, scattergram, or scatter diagram.

Mean

The arithmetic mean is the sum of a collection of values (

) divided by the number of items of this collection (

). :

\bar = \frac\left (\sum_^n\right ) = \frac

Median

The median is the value in the middle of a dataset.

Mode

The

mode Mode ( la, modus meaning "manner, tune, measure, due measure, rhythm, melody") may refer to: Arts and entertainment * '' MO''D''E (magazine)'', a defunct U.S. women's fashion magazine * ''Mode'' magazine, a fictional fashion magazine which is ...

is the value of a set of data that appears most often.

Box plot

Box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines, and the interquartile range (IQR) represent 25–75% of the data. Outliers may be plotted as circles.

Correlation coefficients

Although correlations between two different kinds of data could be inferred by graphs, such as scatter plot, it is necessary validate this though numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association.

Pearson correlation coefficient

Pearson correlation coefficient is a measure of association between two variables, X and Y. This coefficient, usually represented by ''ρ'' (rho) for the population and ''r'' for the sample, assumes values between −1 and 1, where ''ρ'' = 1 represents a perfect positive correlation, ''ρ'' = −1 represents a perfect negative correlation, and ''ρ'' = 0 is no linear correlation.

Inferential statistics

It is used to make inferences about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since the data is limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial to do inferences. * Hypothesis testing Hypothesis testing is essential to make inferences about populations aiming to answer research questions, as settled in "Research planning" section. Authors defined four steps to be set: # ''The hypothesis to be tested'': as stated earlier, we have to work with the definition of a

(H₀), that is going to be tested, and an alternative hypothesis. But they must be defined before the experiment implementation. # ''Significance level and decision rule'': A decision rule depends on the level of significance, or in other words, the acceptable error rate (α). It is easier to think that we define a ''critical value'' that determines the statistical significance when a test statistic is compared with it. So, α also has to be predefined before the experiment. # ''Experiment and statistical analysis'': This is when the experiment is really implemented following the appropriate

, data is collected and the more suitable statistical tests are evaluated. # ''Inference'': Is made when the

is rejected or not rejected, based on the evidence that the comparison of p-values and α brings. It is pointed that the failure to reject H₀ just means that there is not enough evidence to support its rejection, but not that this hypothesis is true. * Confidence intervals A confidence interval is a range of values that can contain the true real parameter value in given a certain level of confidence. The first step is to estimate the best-unbiased estimate of the population parameter. The upper value of the interval is obtained by the sum of this estimate with the multiplication between the standard error of the mean and the confidence level. The calculation of lower value is similar, but instead of a sum, a subtraction must be applied.

Statistical considerations

Power and statistical error

When testing a hypothesis, there are two types of statistic errors possible: Type I error and

Type II error In statistical hypothesis testing, a type I error is the mistaken rejection of an actually true null hypothesis (also known as a "false positive" finding or conclusion; example: "an innocent person is convicted"), while a type II error is the fa ...

. The type I error or false positive is the incorrect rejection of a true null hypothesis and the type II error or false negative is the failure to reject a false

. The significance level denoted by α is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β and statistical power of the test is 1 − β.

p-value

The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the

(H₀) is true. It is also called the calculated probability. It is common to confuse the p-value with the significance level (α), but, the α is a predefined threshold for calling significant results. If p is less than α, the null hypothesis (H₀) is rejected.

Multiple testing

In multiple tests of the same hypothesis, the probability of the occurrence of falses positives (familywise error rate) increase and some strategy are used to control this occurrence. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α* and each test is individually compared with a value of α = α*/m. This ensures that the familywise error rate in all m tests, is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative to the Bonferroni correction is to control the false discovery rate (FDR). The FDR controls the expected proportion of the rejected

null hypotheses In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is d ...

(the so-called discoveries) that are false (incorrect rejections). This procedure ensures that, for independent tests, the false discovery rate is at most q*. Thus, the FDR is less conservative than the Bonferroni correction and have more power, at the cost of more false positives.

Mis-specification and robustness checks

The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, then the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification. Verifying whether the outcome of a statistical test does not change when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.

Model selection criteria

Model criteria selection will select or model that more approximate true model. The Akaike's Information Criterion (AIC) and The Bayesian Information Criterion (BIC) are examples of asymptotically efficient criteria.

Developments and big data

Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale, and the ability to perform much more complex analysis using computational techniques. This comes from the development in areas as sequencing technologies,

Bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

and

Machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

( Machine learning in bioinformatics).

Use in high-throughput data

New biomedical technologies like microarrays, next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously. Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed. Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or

logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression an ...

and linear discriminant analysis do not work well for high dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R²-values despite very low predictive power of the statistical model. These classical statistical techniques (esp. least squares linear regression) were developed for low dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R² of the validation test set, not those of the training set. Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes. These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: It is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.

Bioinformatics advances in databases, data mining, and biological interpretation

The development of

biological database Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genom ...

s enables storage and management of biological data with the possibility of ensuring access for users around the world. They are useful for researchers depositing data, retrieve information and files (raw or processed) originated from other experiments or indexing scientific articles, as

PubMed PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintai ...

. Another possibility is search for the desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to SNPs ( dbSNP), the knowledge on genes characterization and their pathways ( KEGG) and the description of gene function classifying it by cellular component, molecular function and biological process ( Gene Ontology). In addition to databases that contain specific molecular information, there are others that are ample in the sense that they store information about an organism or group of organisms. As an example of a database directed towards just one organism, but that contains much data about it, is the '' Arabidopsis thaliana'' genetic and molecular database – TAIR. Phytozome, in turn, stores the assemblies and annotation files of dozen of plant genomes, also containing visualization and analysis tools. Moreover, there is an interconnection between some databases in the information exchange/sharing and a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC) which relates data from DDBJ, EMBL-EBI, and NCBI. Nowadays, increase in size and complexity of molecular datasets leads to use of powerful statistical methods provided by computer science algorithms which are developed by

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

area. Therefore, data mining and machine learning allow detection of patterns in data with a complex structure, as biological ones, by using methods of supervised and unsupervised learning, regression, detection of clusters and

association rule mining Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.Pi ...

, among others. To indicate some of them, self-organizing maps and ''k''-means are examples of cluster algorithms; neural networks implementation and

support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborat ...

s models are examples of common machine learning algorithms. Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results.

Use of computationally intensive methods

On the other hand, the advent of modern computer technology and relatively cheap computing resources have enabled computer-intensive biostatistical methods like bootstrapping and re-sampling methods. In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that you can draw them and interpret them (even with a basic understanding of mathematics and statistics). Random Forests have thus been used for clinical decision support systems.

Applications

Public health

Public health Public health is "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals". Analyzing the det ...

, including

epidemiology Epidemiology is the study and analysis of the distribution (who, when, and where), patterns and determinants of health and disease conditions in a defined population. It is a cornerstone of public health, and shapes policy decisions and evi ...

, health services research,

nutrition Nutrition is the biochemical and physiological process by which an organism uses food to support its life. It provides organisms with nutrients, which can be metabolized to create energy and chemical structures. Failure to obtain sufficient ...

environmental health Environmental health is the branch of public health concerned with all aspects of the natural and built environment affecting human health. In order to effectively control factors that may affect health, the requirements that must be met in ...

and health care policy & management. In these

medicine Medicine is the science and practice of caring for a patient, managing the diagnosis, prognosis, prevention, treatment, palliation of their injury or disease, and promoting their health. Medicine encompasses a variety of health care pr ...

contents, it's important to consider the design and analysis of the

clinical trial Clinical trials are prospective biomedical or behavioral research studies on human participants designed to answer specific questions about biomedical or behavioral interventions, including new treatments (such as novel vaccines, drugs, diet ...

s. As one example, there is the assessment of severity state of a patient with a prognosis of an outcome of a disease. With new technologies and genetics knowledge, biostatistics are now also used for Systems medicine, which consists in a more personalized medicine. For this, is made an integration of data from different sources, including conventional patient data, clinico-pathological parameters, molecular and genetic data as well as data generated by additional new-omics technologies.

Quantitative genetics

The study of

Population genetics Population genetics is a subfield of genetics that deals with genetic differences within and between populations, and is a part of evolutionary biology. Studies in this branch of biology examine such phenomena as adaptation, speciation, and po ...

and Statistical genetics in order to link variation in genotype with a variation in

phenotype In genetics, the phenotype () is the set of observable characteristics or traits of an organism. The term covers the organism's morphology (biology), morphology or physical form and structure, its Developmental biology, developmental proc ...

. In other words, it is desirable to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region that is responsible for a continuous trait is called

Quantitative trait locus A quantitative trait locus (QTL) is a locus (section of DNA) that correlates with variation of a quantitative trait in the phenotype of a population of organisms. QTLs are mapped by identifying which molecular markers (such as SNPs or AFLPs ...

(QTL). The study of QTLs become feasible by using molecular markers and measuring traits in populations, but their mapping needs the obtaining of a population from an experimental crossing, like an F2 or

Recombinant inbred strain A recombinant inbred strain or recombinant inbred line (RIL) is an organism with chromosomes that incorporate an essentially permanent set of recombination events between chromosomes inherited from two or more inbred strains. F1 and F2 generations ...

s/lines (RILs). To scan for QTLs regions in a genome, a gene map based on linkage have to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping. However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large offspring. Furthermore, allele diversity is restricted to individuals originated from contrasting parents, which limit studies of allele diversity when we have a panel of individuals representing a natural population. For this reason, the Genome-wide association study was proposed in order to identify QTLs based on linkage disequilibrium, that is the non-random association between traits and molecular markers. It was leveraged by the development of high-throughput SNP genotyping. In

animal Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia. With few exceptions, animals consume organic material, breathe oxygen, are able to move, can reproduce sexually, and go through an ontogenetic stage ...

and plant breeding, the use of markers in selection aiming for breeding, mainly the molecular ones, collaborated to the development of

marker-assisted selection Marker assisted selection or marker aided selection (MAS) is an indirect selection process where a trait of interest is selected based on a marker ( morphological, biochemical or DNA/RNA variation) linked to a trait of interest (e.g. productivi ...

. While QTL mapping is limited due resolution, GWAS does not have enough power when rare variants of small effect that are also influenced by environment. So, the concept of Genomic Selection (GS) arises in order to use all molecular markers in the selection and allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population, develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotype and but not phenotype population, called testing population. This kind of study could also include a validation population, thinking in the concept of cross-validation, in which the real phenotype results measured in this population are compared with the phenotype results based on the prediction, what used to check the accuracy of the model. As a summary, some points about the application of quantitative genetics are: * This has been used in agriculture to improve crops ( Plant breeding) and

( Animal breeding). * In biomedical research, this work can assist in finding candidates

gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...

allele An allele (, ; ; modern formation from Greek ἄλλος ''állos'', "other") is a variation of the same sequence of nucleotides at the same place on a long DNA molecule, as described in leading textbooks on genetics and evolution. ::"The chrom ...

s that can cause or influence predisposition to diseases in human genetics

Expression data

Studies for differential expression of genes from RNA-Seq data, as for

RT-qPCR A real-time polymerase chain reaction (real-time PCR, or qPCR) is a laboratory technique of molecular biology based on the polymerase chain reaction (PCR). It monitors the amplification of a targeted DNA molecule during the PCR (i.e., in real ...

and microarrays, demands comparison of conditions. The goal is to identify genes which have a significant change in abundance between different conditions. Then, experiments are designed appropriately, with replicates for each condition/treatment, randomization and blocking, when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads that are summarized in some genetic unit, as exons that are part of a gene sequence. As microarray results can be approximated by a normal distribution, RNA-Seq counts data are better explained by other distributions. The first used distribution was the Poisson one, but it underestimate the sample error, leading to false positives. Currently, biological variation is considered by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests for statistical significance and as the number of genes is high, multiple tests correction have to be considered. Some examples of other analysis on genomics data comes from microarray or proteomics experiments. Often concerning diseases or disease stages.

Other studies

Ecology Ecology () is the study of the relationships between living organisms, including humans, and their physical environment. Ecology considers organisms at the individual, population, community, ecosystem, and biosphere level. Ecology overl ...

, ecological forecasting * Biological sequence analysis * Systems biology for gene network inference or pathways analysis. *

Clinical research Clinical research is a branch of healthcare science that determines the safety and effectiveness ( efficacy) of medications, devices, diagnostic products and treatment regimens intended for human use. These may be used for prevention, treat ...

and pharmaceutical development * Population dynamics, especially in regards to

fisheries science Fisheries science is the academic discipline of managing and understanding fisheries. It is a multidisciplinary science, which draws on the disciplines of limnology, oceanography, freshwater biology, marine biology, meteorology, conservation, ...

. * Phylogenetics and

evolution Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...

Pharmacodynamics Pharmacodynamics (PD) is the study of the biochemical and physiologic effects of drugs (especially pharmaceutical drugs). The effects can include those manifested within animals (including humans), microorganisms, or combinations of organisms ...

* Pharmacokinetics *

Neuroimaging Neuroimaging is the use of quantitative (computational) techniques to study the structure and function of the central nervous system, developed as an objective way of scientifically studying the healthy human brain in a non-invasive manner. Incr ...

Tools

There are a lot of tools that can be used to do statistical analysis in biological data. Most of them are useful in other areas of knowledge, covering a large number of applications (alphabetical). Here are brief descriptions of some of them: *

ASReml ASReml is a statistical software package for fitting linear mixed models using restricted maximum likelihood, a technique commonly used in plant Plants are predominantly photosynthetic eukaryotes of the kingdom Plantae. Historically, the ...

: Another software developed by VSNi that can be used also in R environment as a package. It is developed to estimate variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed effects and random effects and nested or crossed ones are allowed. Gives the possibility to investigate different variance-covariance matrix structures. *CycDesigN: A computer package developed by VSNi that helps the researchers create experimental designs and analyze data coming from a design present in one of three classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and crossover designs. It includes less used designs the Latinized ones, as t-Latinized design. * Orange: A programming interface for high-level data processing, data mining and data visualization. Include tools for gene expression and genomics. * R: An open source environment and programming language dedicated to statistical computing and graphics. It is an implementation of S language maintained by CRAN. In addition to its functions to read data tables, take descriptive statistics, develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data that comes from specific applications. In the case of Bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, as

Bioconductor Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology. Bioconductor is based primarily on the statistical R program ...

. It is also possible to use packages under development that are shared in hosting-services as

GitHub GitHub, Inc. () is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, cont ...

. * SAS: A data analysis software widely used, going through universities, services and industry. Developed by a company with the same name (

SAS Institute SAS Institute (or SAS, pronounced "sass") is an American multinational developer of analytics software based in Cary, North Carolina. SAS develops and markets a suite of analytics software ( also called SAS), which helps access, manage, ana ...

), it uses SAS language for programming. * PLA 3.0: Is a biostatistical analysis software for regulated environments (e.g. drug testing) which supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data. * Weka: A

Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...

software for

and data mining, including tools and methods for visualization, clustering, regression, association rule, and classification. There are tools for cross-validation, bootstrapping and a module of algorithm comparison. Weka also can be run in other programming languages as Perl or R. *

Python (programming language) Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically-typed and garbage-collected. It supports multiple programming p ...

image analysis, deep-learning, machine-learning * SQL databases * NoSQL * NumPy numerical python *

SciPy SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, ...

* SageMath * LAPACK linear algebra *

MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementat ...

* Apache Hadoop * Apache Spark * Amazon Web Services

Scope and training programs

Almost all educational programmes in biostatistics are at

postgraduate Postgraduate or graduate education refers to academic or professional degrees, certificates, diplomas, or other qualifications pursued by post-secondary students who have earned an undergraduate ( bachelor's) degree. The organization and ...

level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics. In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as

. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on

bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

and computational biology, whereas older departments, typically affiliated with schools of

public health Public health is "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals". Analyzing the det ...

, will have more traditional lines of research involving epidemiological studies and

s as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research which are less common in biostatistics programs and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry ( quality control), business and

economics Economics () is the social science that studies the production, distribution, and consumption of goods and services. Economics focuses on the behaviour and interactions of economic agents and how economies work. Microeconomics anal ...

and biological areas other than medicine.

Specialized journals

* Biostatistics * International Journal of Biostatistics * Journal of Epidemiology and Biostatistics * Biostatistics and Public Health * Biometrics * Biometrika * Biometrical Journal * Communications in Biometry and Crop Science * Statistical Applications in Genetics and Molecular Biology * Statistical Methods in Medical Research * Pharmaceutical Statistics * Statistics in Medicine

References

External links

The International Biometric Society

The Collection of Biostatistics Research Archive

Guide to Biostatistics (MedPageToday.com)

Biomedical Statistics
{{Authority control Bioinformatics