Computational statistics, or statistical computing, is the bond between statistics and computer science. It means statistical methods that are enabled by using computational methods. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics. This area is developing rapidly, leading to calls that a broader concept of computing should be taught as part of general statistical education. As in traditional statistics, the goal is to transform raw data into knowledge (Wegman, Edward J. "Computational Statistics: A New Agenda for Statistical Theory and Practice." ''Journal of the Washington Academy of Sciences'', vol. 78, no. 4, 1988, pp. 310–322, ''JSTOR''), but the focus lies on computer-intensive statistical methods, such as cases with very large sample sizes and non-homogeneous data sets.

The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics", and 'computational statistics' as "aiming at the design of algorithm for implementing statistical methods on computers, including the ones unthinkable before the computer age (e.g. bootstrap, simulation), as well as to cope with analytically intractable problems" [''sic'']. The term 'computational statistics' may also be used to refer to computationally ''intensive'' statistical methods, including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.


History

Though computational statistics is widely used today, it actually has a relatively short history of acceptance in the statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology.

In the statistical field, the first use of the term "computer" comes in an article in the ''Journal of the American Statistical Association'' archives by Robert P. Porter in 1891. The article discusses the use of Herman Hollerith's machine in the 11th Census of the United States. Hollerith's machine, also called the tabulating machine, was an electromechanical machine designed to assist in summarizing information stored on punched cards. It was invented by Herman Hollerith (February 29, 1860 – November 17, 1929), an American businessman, inventor, and statistician. His punched card tabulating machine was patented in 1884 and was later used in the 1890 Census of the United States. The advantages of the technology were immediately apparent: the 1880 Census, covering about 50 million people, took over seven years to tabulate, while the 1890 Census, covering over 62 million people, took less than a year. This marks the beginning of the era of mechanized computational statistics and semiautomatic data processing systems.

In 1908, William Sealy Gosset performed his now well-known Monte Carlo method simulation, which led to the discovery of the Student's t-distribution. With the help of computational methods, he also plotted empirical distributions overlaid on the corresponding theoretical distributions. The computer has revolutionized simulation and has made the replication of Gosset's experiment little more than an exercise.

Later on, scientists put forward computational ways of generating pseudo-random deviates, performed methods to convert uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance-rejection methods, and developed state-space methodology for Markov chain Monte Carlo. By the mid-1950s, a great deal of work was being done on testing generators for randomness, and most computers could by then refer to random number tables. In 1958, John Tukey's jackknife was developed. It is a method to reduce the bias of parameter estimates in samples under nonstandard conditions, and it requires computers for practical implementation. To this point, computers have made many tedious statistical studies feasible.
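
As an illustration of why the jackknife demands computing power (the estimator must be recomputed once per observation), here is a minimal sketch of jackknife bias correction, assuming NumPy is available; the biased variance estimator and the variable names are assumptions made purely for this example.

    import numpy as np

    def jackknife_bias_corrected(data, estimator):
        # Recompute the estimator with each observation left out, then use
        # the spread of the leave-one-out estimates to correct the bias.
        n = len(data)
        theta_hat = estimator(data)   # estimate on the full sample
        leave_one_out = np.array([estimator(np.delete(data, i)) for i in range(n)])
        bias = (n - 1) * (leave_one_out.mean() - theta_hat)
        return theta_hat - bias

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=2.0, size=50)

    # The 1/n variance estimator is biased; for this estimator the jackknife
    # correction recovers exactly the unbiased 1/(n-1) version.
    print(jackknife_bias_corrected(sample, lambda x: np.var(x)))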


Methods


Maximum likelihood estimation

Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. It is achieved by maximizing a likelihood function so that the observed data is most probable under the assumed statistical model.
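
For illustration, here is a minimal sketch of numerical maximum likelihood for a normal model, assuming NumPy and SciPy are available; the synthetic data, starting values, and the log-sigma parameterization (used to keep the scale positive) are assumptions made for this example.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    data = rng.normal(loc=5.0, scale=2.0, size=200)   # synthetic observations

    def negative_log_likelihood(params):
        # Minimizing the negative log-likelihood maximizes the likelihood.
        mu, log_sigma = params
        sigma = np.exp(log_sigma)   # optimize log(sigma) so sigma stays positive
        return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

    result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
    mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
    print(mu_hat, sigma_hat)   # estimates should land near 5.0 and 2.0

Here the closed-form answers (the sample mean and standard deviation) are known, but the same numerical approach applies to models whose likelihoods have no analytic maximum.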


Monte Carlo method

A Monte Carlo method is a statistical method that relies on repeated random sampling to obtain numerical results. The concept is to use randomness to solve problems that might be deterministic in principle. Monte Carlo methods are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. They are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
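
As a minimal sketch of the numerical-integration use, the following estimates the integral of exp(-x^2) over [0, 1] by averaging the integrand at uniform random draws, assuming NumPy; the integrand and sample size are chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    x = rng.uniform(0.0, 1.0, size=n)   # uniform draws on the integration interval

    # For X ~ Uniform(0, 1), E[f(X)] equals the integral of f over [0, 1],
    # so the sample mean of f(X) is a Monte Carlo estimate of the integral.
    values = np.exp(-x**2)
    estimate = values.mean()
    standard_error = values.std(ddof=1) / np.sqrt(n)
    print(estimate, standard_error)     # true value is about 0.7468

The estimate's error shrinks like 1/sqrt(n) regardless of dimension, which is why Monte Carlo integration is attractive for high-dimensional problems.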


Markov chain Monte Carlo

The Markov chain Monte Carlo (MCMC) method creates samples from a continuous random variable, with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, such as its expected value or variance. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
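
A minimal sketch of a random-walk Metropolis sampler, one common MCMC algorithm, assuming NumPy; the target (an unnormalized standard normal density), the proposal scale, and the chain length are assumptions made for this example.

    import numpy as np

    rng = np.random.default_rng(3)

    def target(x):
        # Unnormalized target density, proportional to a standard normal.
        return np.exp(-0.5 * x**2)

    def metropolis(n_steps, step_size=1.0):
        # Random-walk Metropolis: propose a nearby state, accept it with
        # probability min(1, target(proposal) / target(current)).
        samples = np.empty(n_steps)
        x = 0.0   # arbitrary starting state
        for i in range(n_steps):
            proposal = x + rng.normal(scale=step_size)
            if rng.uniform() < target(proposal) / target(x):
                x = proposal          # accept the move
            samples[i] = x            # on rejection, the current state repeats
        return samples

    chain = metropolis(50_000)
    print(chain[1000:].mean(), chain[1000:].std())   # near 0 and 1 after burn-in

Note that only the ratio of target densities is needed, so the normalizing constant of the distribution never has to be computed.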


Applications

* Computational biology
* Computational linguistics
* Computational physics
* Computational mathematics
* Computational materials science


Computational statistics journals

*''Communications in Statistics - Simulation and Computation''
*''Computational Statistics''
*''Computational Statistics & Data Analysis''
*''Journal of Computational and Graphical Statistics''
*''Journal of Statistical Computation and Simulation''
*''Journal of Statistical Software''
*''The R Journal''
*''The Stata Journal''
*''Statistics and Computing''
*''Wiley Interdisciplinary Reviews: Computational Statistics''


Associations

* International Association for Statistical Computing


See also

* Algorithms for statistical classification
* Data science
* Statistical methods in artificial intelligence
* Free statistical software
* List of statistical algorithms
* List of statistical packages
* Machine learning


References


Further reading




Books

* Gharieb, Reda R. (2017). ''Data Science: Scientific and Statistical Computing''. Noor Publishing. ISBN 978-3-330-97256-8.


External links


Associations


* International Association for Statistical Computing
* Statistical Computing section of the American Statistical Association


Journals


* Computational Statistics & Data Analysis
* Journal of Computational & Graphical Statistics
* Statistics and Computing