
Computational statistics, or statistical computing, is the study of the intersection of statistics and computer science, and refers to the statistical methods that are enabled by using computational methods. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics. This area is developing rapidly, and the view that the broader concept of computing must be taught as part of general statistical education is gaining momentum.
As in traditional statistics, the goal is to transform raw data into knowledge (Wegman, Edward J., "Computational Statistics: A New Agenda for Statistical Theory and Practice", ''Journal of the Washington Academy of Sciences'', vol. 78, no. 4, 1988, pp. 310–322), but the focus lies on computer-intensive statistical methods, such as cases with very large sample sizes and non-homogeneous data sets.
The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics", and 'computational statistics' as "aiming at the design of algorithm [''sic''] for implementing statistical methods on computers, including the ones unthinkable before the computer age (e.g. bootstrap, simulation), as well as to cope with analytically intractable problems".
The term 'computational statistics' may also be used to refer to computationally ''intensive'' statistical methods, including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.
History
Though computational statistics is widely used today, it has a relatively short history of acceptance in the statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology.
In 1908, William Sealy Gosset performed his now well-known Monte Carlo simulation, which led to the discovery of Student's t-distribution. With the help of computational methods, he also produced plots of empirical distributions overlaid on the corresponding theoretical distributions. The computer has revolutionized simulation and has made the replication of Gosset's experiment little more than an exercise.
Later on, scientists put forward computational ways of generating pseudo-random deviates, developed methods to convert uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance–rejection methods, and developed state-space methodology for Markov chain Monte Carlo. One of the first efforts to generate random digits in a fully automated way was undertaken by the RAND Corporation in 1947. The tables produced were published as a book in 1955, and also as a series of punch cards.
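The inverse-transform idea mentioned above remains the standard first example of such methods. Below is a minimal sketch in Python, assuming an exponential target distribution, whose inverse CDF has a closed form; the rate parameter is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                      # rate of the target exponential distribution
u = rng.uniform(size=100_000)  # uniform deviates on [0, 1)

# Inverse transform: apply the inverse CDF of the exponential,
# F^{-1}(u) = -ln(1 - u) / lam, to each uniform deviate.
x = -np.log1p(-u) / lam

# The sample mean should approximate the theoretical mean 1/lam = 0.5.
print(f"sample mean = {x.mean():.4f}, theoretical mean = {1 / lam:.4f}")
```

Acceptance–rejection methods are used instead when the inverse CDF has no convenient closed form.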
By the mid-1950s, several articles and patents for random number generator devices had been proposed. The development of these devices was motivated by the need to use random digits to perform simulations and other fundamental components of statistical analysis. One of the best known of such devices is ERNIE, which produces the random numbers that determine the winners of Premium Bonds, a lottery bond issued in the United Kingdom. In 1958, John Tukey's jackknife was developed as a method to reduce the bias of parameter estimates in samples under nonstandard conditions; it requires computers for practical implementation. To this point, computers have made many tedious statistical studies feasible.
Methods
Maximum likelihood estimation
Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that the observed data are most probable under the assumed statistical model.
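In practice the maximizer rarely has a closed form, so the likelihood is maximized numerically. Below is a minimal sketch in Python, assuming normally distributed observations with unknown mean and standard deviation; the data are simulated, and SciPy's general-purpose optimizer is applied to the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated observations

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:  # keep the scale parameter in its valid range
        return np.inf
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Maximize the likelihood by minimizing its negative, from a rough starting guess.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

For the normal model the result can be checked against the closed-form answers (the sample mean and the maximum-likelihood standard deviation), which makes this a convenient test of the numerical routine.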
Monte Carlo method
A Monte Carlo method is a statistical method that relies on repeated random sampling to obtain numerical results. The concept is to use randomness to solve problems that might be deterministic in principle. Monte Carlo methods are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. They are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
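As an illustration of the numerical-integration use, the sketch below estimates the integral of exp(-x^2) over [0, 1] by averaging the integrand at uniformly random points; the integrand and sample size are purely illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Draw uniform samples on [0, 1] and average the integrand;
# by the law of large numbers, this average converges to the integral.
x = rng.uniform(0.0, 1.0, size=n)
values = np.exp(-x**2)
estimate = values.mean()

# The standard error of the estimate shrinks like 1/sqrt(n).
std_error = values.std(ddof=1) / np.sqrt(n)
print(f"integral = {estimate:.4f} +/- {std_error:.4f}")
```

The 1/sqrt(n) error rate holds regardless of the dimension of the integral, which is why Monte Carlo integration is often the only practical approach in high dimensions.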
Markov chain Monte Carlo
The Markov chain Monte Carlo (MCMC) method creates samples from a continuous random variable with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, such as its expected value or variance. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
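A minimal sketch of the random-walk Metropolis algorithm, one of the simplest MCMC methods, is given below. It requires the target density only up to a normalizing constant; the unnormalized standard normal used here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Unnormalized target density: proportional to a standard normal.
    return np.exp(-0.5 * x**2)

n_steps, step_size = 10_000, 1.0
samples = np.empty(n_steps)
x = 0.0  # arbitrary starting state

for i in range(n_steps):
    proposal = x + rng.normal(scale=step_size)  # symmetric random-walk proposal
    # Accept with probability min(1, target(proposal) / target(x)).
    if rng.uniform() < target(proposal) / target(x):
        x = proposal
    samples[i] = x

# Discard early iterations ("burn-in") taken before the chain reaches equilibrium.
kept = samples[1000:]
print(f"mean = {kept.mean():.3f}, variance = {kept.var():.3f}")
```

Because the proposal is symmetric, the acceptance probability reduces to the ratio of target densities, so the unknown normalizing constant cancels.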
Bootstrapping
The bootstrap is a resampling technique used to generate samples from the empirical probability distribution defined by an original sample of the population. It can be used to find a bootstrapped estimator of a population parameter, to estimate the standard error of an estimator, and to generate bootstrapped confidence intervals. The jackknife is a related technique.
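A minimal sketch of the nonparametric bootstrap, estimating the standard error and a 95% percentile confidence interval for a sample median; the statistic, data, and number of replicates are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)  # original sample

n_boot = 5000
medians = np.empty(n_boot)
for b in range(n_boot):
    # Resample with replacement, i.e. draw from the empirical
    # distribution of the data, and recompute the statistic.
    resample = rng.choice(data, size=data.size, replace=True)
    medians[b] = np.median(resample)

std_error = medians.std(ddof=1)
ci_low, ci_high = np.percentile(medians, [2.5, 97.5])
print(f"median = {np.median(data):.3f}, SE = {std_error:.3f}, "
      f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The number of replicates trades accuracy of the bootstrap distribution against computing time; a few thousand is a common choice for standard errors and percentile intervals.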
Applications
* Computational biology
* Computational linguistics
* Computational mathematics
* Computational materials science
* Computational physics
* Computational psychometrics
* Computational social science
* Computational sociology
* Data journalism
* Econometrics
* Machine learning
Computational statistics journals
*''Communications in Statistics - Simulation and Computation''
*''Computational Statistics''
*''Computational Statistics & Data Analysis''
*''Journal of Computational and Graphical Statistics''
*''Journal of Statistical Computation and Simulation''
*''Journal of Statistical Software''
*''The R Journal''
*''The Stata Journal''
*''Statistics and Computing''
*''Wiley Interdisciplinary Reviews: Computational Statistics''
Associations
* International Association for Statistical Computing
See also
* Algorithms for statistical classification
* Data science
* Statistical methods in artificial intelligence
* Free statistical software
* List of statistical algorithms
* List of statistical packages
* Machine learning
External links
Associations
* International Association for Statistical Computing
* Statistical Computing section of the American Statistical Association
Journals
* ''Computational Statistics & Data Analysis''
* ''Journal of Computational & Graphical Statistics''
* ''Statistics and Computing''