data science
   HOME

TheInfoList



OR:

Data science is an
interdisciplinary Interdisciplinarity or interdisciplinary studies involves the combination of multiple academic disciplines into one activity (e.g., a research project). It draws knowledge from several other fields like sociology, anthropology, psychology, ec ...
field that uses
scientific method The scientific method is an Empirical evidence, empirical method for acquiring knowledge that has characterized the development of science since at least the 17th century (with notable practitioners in previous centuries; see the article hist ...
s, processes,
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
s and systems to extract or extrapolate
knowledge Knowledge can be defined as awareness of facts or as practical skills, and may also refer to familiarity with objects or situations. Knowledge of facts, also called propositional knowledge, is often defined as true belief that is disti ...
and insights from noisy, structured and
unstructured data Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, n ...
, and apply knowledge from data across a broad range of application domains. Data science is related to data mining,
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
, big data, computational statistics and
analytics Analytics is the systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and communication of meaningful patterns in data. It also entails applying data patterns toward effective decision-making. It ...
. Data science is a "concept to unify statistics,
data analysis Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, en ...
, informatics, and their related methods" in order to "understand and analyse actual
phenomena A phenomenon ( : phenomena) is an observable event. The term came into its modern philosophical usage through Immanuel Kant, who contrasted it with the noumenon, which ''cannot'' be directly observed. Kant was heavily influenced by Gottfried ...
" with
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpret ...
. It uses techniques and theories drawn from many fields within the context of mathematics, statistics,
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
,
information science Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information. ...
, and domain knowledge. However, data science is different from
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
and information science.
Turing Award The ACM A. M. Turing Award is an annual prize given by the Association for Computing Machinery (ACM) for contributions of lasting and major technical importance to computer science. It is generally recognized as the highest distinction in compu ...
winner Jim Gray imagined data science as a "fourth paradigm" of science ( empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of
information technology Information technology (IT) is the use of computers to create, process, store, retrieve, and exchange all kinds of data . and information. IT forms part of information and communications technology (ICT). An information technology system ...
" and the data deluge. A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.


Foundations

Data science is an
interdisciplinary Interdisciplinarity or interdisciplinary studies involves the combination of multiple academic disciplines into one activity (e.g., a research project). It draws knowledge from several other fields like sociology, anthropology, psychology, ec ...
field focused on extracting knowledge from typically
large Large means of great size. Large may also refer to: Mathematics * Arbitrarily large, a phrase in mathematics * Large cardinal, a property of certain transfinite numbers * Large category, a category with a proper class of objects and morphisms ...
data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data
integration Integration may refer to: Biology * Multisensory integration * Path integration * Pre-integration complex, viral genetic material used to insert a viral genome into a host genome *DNA integration, by means of site-specific recombinase technolo ...
, graphic design, complex systems,
communication Communication (from la, communicare, meaning "to share" or "to be in relation with") is usually defined as the transmission of information. The term may also refer to the message communicated through such transmissions or the field of inqu ...
and business. Statistician Nathan Yau, drawing on Ben Fry, also links data science to
human–computer interaction Human–computer interaction (HCI) is research in the design and the use of computer technology, which focuses on the interfaces between people ( users) and computers. HCI researchers observe the ways humans interact with computers and design ...
: users should be able to intuitively control and
explore Exploration refers to the historical practice of discovering remote lands. It is studied by geographers and historians. Two major eras of exploration occurred in human history: one of convergence, and one of divergence. The first, covering most ...
data. In 2015, the
American Statistical Association The American Statistical Association (ASA) is the main professional organization for statisticians and related professionals in the United States. It was founded in Boston, Massachusetts on November 27, 1839, and is the second oldest continuous ...
identified
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spa ...
management, statistics and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
, and distributed and parallel systems as the three emerging foundational professional communities.


Relationship to statistics

Many statisticians, including
Nate Silver Nathaniel Read Silver (born January 13, 1978) is an American statistician, writer, and poker player who analyzes baseball (see sabermetrics), basketball, and elections (see psephology). He is the founder and editor-in-chief of ''FiveThirtyEight ...
, have argued that data science is not a new field, but rather another name for statistics. Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data. Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g. from images, text, sensors, transactions or customer information, etc) and emphasizes prediction and action. Andrew Gelman of
Columbia University Columbia University (also known as Columbia, and officially as Columbia University in the City of New York) is a private research university in New York City. Established in 1754 as King's College on the grounds of Trinity Church in Manha ...
has described statistics as a nonessential part of data science. Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. He describes data science as an applied field growing out of traditional statistics.


Etymology


Early usage

In 1962,
John Tukey John Wilder Tukey (; June 16, 1915 – July 26, 2000) was an American mathematician and statistician, best known for the development of the fast Fourier Transform (FFT) algorithm and box plot. The Tukey range test, the Tukey lambda distributi ...
described a field he called "data analysis", which resembles modern data science. In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing. The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis. During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly large) included "knowledge discovery" and " data mining".


Modern usage

In 2012, technologists Thomas H. Davenport and DJ Patil declared "Data Scientist: The Sexiest Job of the 21st Century", a catch-phrase that was picked up even by major-city newspapers like the
New York Times ''The New York Times'' (''the Times'', ''NYT'', or the Gray Lady) is a daily newspaper based in New York City with a worldwide readership reported in 2020 to comprise a declining 840,000 paid print subscribers, and a growing 6 million paid ...
and the
Boston Globe ''The Boston Globe'' is an American daily newspaper founded and based in Boston, Massachusetts. The newspaper has won a total of 27 Pulitzer Prizes, and has a total circulation of close to 300,000 print and digital subscribers. ''The Boston Glob ...
. A decade later, they reaffirmed it, stating "the job is more in demand than ever with employers". The modern conception of data science as an independent discipline is sometimes attributed to
William S. Cleveland William Swain Cleveland II (born 1943) is an American computer scientist and Professor of Statistics and Professor of Computer Science at Purdue University, known for his work on data visualization, particularly on nonparametric regression and loc ...
. In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name. "Data science" became more widely used in the next few years: in 2002, the Committee on Data for Science and Technology launched ''Data Science Journal''. In 2003, Columbia University launched ''The Journal of Data Science''. In 2014, the
American Statistical Association The American Statistical Association (ASA) is the main professional organization for statisticians and related professionals in the United States. It was founded in Boston, Massachusetts on November 27, 1839, and is the second oldest continuous ...
's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science. The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in 2008. Though it was used by the
National Science Board The National Science Board (NSB) of the United States establishes the policies of the National Science Foundation (NSF) within the framework of applicable national policies set forth by the President and the Congress. The NSB also serves as an ...
in their 2005 report "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", it referred broadly to any key role in managing a digital data collection. There is still no consensus on the definition of data science, and it is considered by some to be a
buzzword A buzzword is a word or phrase, new or already existing, that becomes popular for a period of time. Buzzwords often derive from technical terms yet often have much of the original technical meaning removed through fashionable use, being simply used ...
. Big data is a related marketing term. Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations.


See also

* Open Data Science Conference * Scientific Data * Women in Data


References

{{Data Information science Computer occupations Computational fields of study Data analysis