Data Warehouse Overview
   HOME

TheInfoList



OR:

In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality,
fact A fact is a datum about one or more aspects of a circumstance, which, if accepted as true and proven true, allows a logical conclusion to be reached on a true–false evaluation. Standard reference works are often used to check facts. Scient ...
,
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
, other basic units of meaning, or simply sequences of
symbol A symbol is a mark, sign, or word that indicates, signifies, or is understood as representing an idea, object, or relationship. Symbols allow people to go beyond what is known or seen by creating linkages between otherwise very different conc ...
s that may be further interpreted. A datum is an individual value in a collection of data. Data is usually organized into
structure A structure is an arrangement and organization of interrelated elements in a material object or system, or the object or system so organized. Material structures include man-made objects such as buildings and machines and natural objects such as ...
s such as tables that provide additional context and meaning, and which may themselves be used as data in larger structures. Data may be used as variables in a
computational process Computation is any type of arithmetic or non-arithmetic calculation that follows a well-defined model (e.g., an algorithm). Mechanical or electronic devices (or, historically, people) that perform computations are known as ''computers''. An espe ...
. Data may represent abstract ideas or concrete measurements. Data is commonly used in scientific research, economics, and in virtually every other form of human organizational activity. Examples of data sets include price indices (such as
consumer price index A consumer price index (CPI) is a price index, the price of a weighted average market basket of consumer goods and services purchased by households. Changes in measured CPI track changes in prices over time. Overview A CPI is a statistica ...
), unemployment rates, literacy rates, and census data. In this context, data represents the raw facts and figures which can be used in such a manner in order to capture the useful information out of it. Data is collected using techniques such as
measurement Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events. In other words, measurement is a process of determining how large or small a physical quantity is as compared ...
,
observation Observation is the active acquisition of information from a primary source. In living beings, observation employs the senses. In science, observation can also involve the perception and recording of data via the use of scientific instruments. The ...
, query, or analysis, and typically '' represented'' as numbers or
character Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
s which may be further processed. Field data are data that is collected in an uncontrolled in-situ environment. Experimental data is data that is generated in the course of a controlled scientific experiment. Data is analyzed using techniques such as
calculation A calculation is a deliberate mathematical process that transforms one or more inputs into one or more outputs or ''results''. The term is used in a variety of senses, from the very definite arithmetical calculation of using an algorithm, to th ...
, reasoning, discussion, presentation, visualization, or other forms of post-analysis. Prior to analysis, raw data (or unprocessed data) is typically cleaned:
Outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...
s are removed and obvious instrument or data entry errors are corrected. Data can be seen as the smallest units of factual information that can be used as a basis for calculation, reasoning, or discussion. Data can range from abstract ideas to concrete measurements, including but not limited to, statistics. Thematically connected data presented in some relevant context can be viewed as ''information''. Contextually connected pieces of information can then be described as ''data insights'' or ''intelligence''. The stock of insights and intelligence that accumulates over time resulting from the synthesis of data into information, can then be described as ''knowledge''. Data has been described as "the new oil of the
digital economy The digital economy is a portmanteau of digital computing and economy, and is an umbrella term that describes how traditional Brick and mortar, brick-and-mortar economic activities (production, distribution, trade) are being transformed by Interne ...
". Data, as a general concept, refers to the fact that some existing information or knowledge is '' represented'' or ''
code In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication ...
d'' in some form suitable for better usage or processing. Advances in computing technologies have led to the advent of
big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
, which usually refers to very large quantities of data, usually at the petabyte scale. Using traditional data analysis methods and computing, working with such large (and growing) datasets is difficult, even impossible. (Theoretically speaking, infinite data would yield infinite information, which would render extracting insights or intelligence impossible.) In response, the relatively new field of
data science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a br ...
uses machine learning (and other artificial intelligence (AI)) methods that allow for efficient applications of analytic methods to big data.


Etymology and terminology

The Latin word ''data'' is the plural of 'datum', "(thing) given," neuter past participle of ''dare'' "to give". The first English use of the word "data" is from the 1640s. The word "data" was first used to mean "transmissible and storable computer information" in 1946. The expression "data processing" was first used in 1954. When "data" is used more generally as a synonym for "information", it is treated as a
mass noun In linguistics, a mass noun, uncountable noun, non-count noun, uncount noun, or just uncountable, is a noun with the syntactic property that any quantity of it is treated as an undifferentiated unit, rather than as something with discrete elemen ...
in singular form. This usage is common in everyday language and in technical and scientific fields such as
software development Software development is the process of conceiving, specifying, designing, programming, documenting, testing, and bug fixing involved in creating and maintaining applications, frameworks, or other software components. Software development invol ...
and computer science. One example of this usage is the term "
big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated with large body of information that we could not comprehend when used only in smaller am ...
". When used more specifically to refer to the processing and analysis of sets of data, the term retains its plural form. This usage is common in natural sciences, life sciences, social sciences, software development and computer science, and grew in popularity in the 20th and 21st centuries. Some style guides do not recognize the different meanings of the term, and simply recommend the form that best suits the target audience of the guide. For example, APA style as of the 7th edition requires "data" to be treated as a plural form.


Meaning

Data, information, knowledge, and wisdom are closely related concepts, but each has its role concerning the other, and each term has its meaning. According to a common view, data is collected and analyzed; data only becomes information suitable for making decisions once it has been analyzed in some fashion. One can say that the extent to which a set of data is informative to someone depends on the extent to which it is unexpected by that person. The amount of information contained in a data stream may be characterized by its Shannon entropy. Knowledge is the awareness of its environment that some entity possesses, whereas data merely communicate that knowledge. For example, the entry in a database specifying the height of Mount Everest is a datum that communicates a precisely-measured value. This measurement may be included in a book along with other data on Mount Everest to describe the mountain in a manner useful for those who wish to decide on the best method to climb it. An awareness the characteristics represented by these data is knowledge. Data is often assumed to be the least abstract concept, information the next least, and knowledge the most abstract. In this view, data becomes information by interpretation; e.g., the height of Mount Everest is generally considered "data", a book on Mount Everest geological characteristics may be considered "information", and a climber's guidebook containing practical information on the best way to reach Mount Everest's peak may be considered "knowledge". "Information" bears a diversity of meanings that ranges from everyday usage to technical use. This view, however, has also been argued to reverse how data emerges from information, and information from knowledge. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. Beynon-Davies uses the concept of a
sign A sign is an object, quality, event, or entity whose presence or occurrence indicates the probable presence or occurrence of something else. A natural sign bears a causal relation to its object—for instance, thunder is a sign of storm, or me ...
to differentiate between data and information; data is a series of symbols, while information occurs when the symbols are used to refer to something. Before the development of computing devices and machines, people had to manually collect data and impose patterns on it. Since the development of computing devices and machines, these devices can also collect data. In the 2010s, computers are widely used in many fields to collect data and sort or process it, in disciplines ranging from marketing, analysis of social services usage by citizens to scientific research. These patterns in data are seen as information that can be used to enhance knowledge. These patterns may be interpreted as " truth" (though "truth" can be a subjective concept) and may be authorized as aesthetic and ethical criteria in some disciplines or cultures. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken. Mechanical computing devices are classified according to how they represent data. An analog computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer represents a piece of data as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a binary alphabet, that is, an alphabet of two characters typically denoted "0" and "1". More familiar representations, such as numbers or letters, are then constructed from the binary alphabet. Some special forms of data are distinguished. A computer program is a collection of data, which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably
Lisp A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lisping ...
and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
, that is, a description of other data. A similar yet earlier term for metadata is "ancillary data." The prototypical example of metadata is the library catalog, which is a description of the contents of books.


Data documents

Whenever data needs to be registered, data exists in the form of a data document. Kinds of data documents include: * data repository *data study * data set * software * data paper * database *data handbook * data journal Some of these data documents (data repositories, data studies, data sets, and software) are indexed in Data Citation Indexes, while data papers are indexed in traditional bibliographic databases, e.g., Science Citation Index.


Data collection

Gathering data can be accomplished through a primary source (the researcher is the first person to obtain the data) or a secondary source (the researcher obtains the data that has already been collected by other sources, such as data disseminated in a scientific journal). Data analysis methodologies vary and include data triangulation and data percolation. The latter offers an articulate method of collecting, classifying, and analyzing data using five possible angles of analysis (at least three) to maximize the research's objectivity and permit an understanding of the phenomena under investigation as complete as possible: qualitative and quantitative methods, literature reviews (including scholarly articles), interviews with experts, and computer simulation. The data is thereafter "percolated" using a series of pre-determined steps so as to extract the most relevant information.


Data longevity and accessibility

An important field in computer science, technology, and library science is the longevity of data. Scientific research generates huge amounts of data, especially in
genomics Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dim ...
and astronomy, but also in the medical sciences, e.g. in
medical imaging Medical imaging is the technique and process of imaging the interior of a body for clinical analysis and medical intervention, as well as visual representation of the function of some organs or tissues (physiology). Medical imaging seeks to rev ...
. In the past, scientific data has been published in papers and books, stored in libraries, but more recently practically all data is stored on hard drives or optical discs. However, in contrast to paper, these storage devices may become unreadable after a few decades. Scientific publishers and libraries have been struggling with this problem for a few decades, and there is still no satisfactory solution for the long-term storage of data over centuries or even for eternity. Data accessibility. Another problem is that much scientific data is never published or deposited in data repositories such as databases. In a recent survey, data was requested from 516 studies that were published between 2 and 22 years earlier, but less than 1 out of 5 of these studies were able or willing to provide the requested data. Overall, the likelihood of retrieving data dropped by 17% each year after publication. Similarly, a survey of 100 datasets in Dryad found that more than half lacked the details to reproduce the research results from these studies. This shows the dire situation of access to scientific data that is not published or does not have enough details to be reproduced. A solution to the problem of reproducibility is the attempt to require
FAIR data FAIR data are data which meet principles of findability, accessibility, interoperability, and reusability (FAIR). The acronym and principles were defined in a March 2016 paper in the journal ''Scientific Data'' by a consortium of scientists and ...
, that is, data that is Findable, Accessible, Interoperable, and Reusable. Data that fulfills these requirements can be used in subsequent research and thus advances science and technology.


In other fields

Although data is also increasingly used in other fields, it has been suggested that the highly interpretive nature of them might be at odds with the ethos of data as "given". Peter Checkland introduced the term ''capta'' (from the Latin ''capere'', “to take”) to distinguish between an immense number of possible data and a sub-set of them, to which attention is oriented. Johanna Drucker has argued that since the humanities affirm knowledge production as "situated, partial, and constitutive," using ''data'' may introduce assumptions that are counterproductive, for example that phenomena are discrete or are observer-independent. The term ''capta'', which emphasizes the act of observation as constitutive, is offered as an alternative to ''data'' for visual representations in the humanities.


See also

* Biological data * Computer data processing * Computer memory * Dark data * Data acquisition *
Data analysis Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enco ...
* Data bank * Data cable * Data curation * Data domain * Data element * Data farming * Data governance * Data integrity * Data maintenance *
Data management Data management comprises all disciplines related to handling data as a valuable resource. Concept The concept of data management arose in the 1980s as technology moved from sequential processing (first punched cards, then magnetic tape) to r ...
* Data mining *
Data modeling Data modeling in software engineering is the process of creating a data model for an information system by applying certain formal techniques. Overview Data modeling is a process used to define and analyze data requirements needed to suppo ...
* Data point * Data preservation * Data protection * Data publication * Data remanence *
Data science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a br ...
* Data set *
Data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
*
Data visualization Data and information visualization (data viz or info viz) is an interdisciplinary field that deals with the graphic representation of data and information. It is a particularly efficient way of communicating when the data or information is num ...
* Data warehouse * Database * Datasheet * Digital privacy * Environmental data rescue *
Fieldwork Field research, field studies, or fieldwork is the collection of raw data outside a laboratory, library, or workplace setting. The approaches and methods used in field research vary across disciplines. For example, biologists who conduct fie ...
* Information engineering * Machine learning * Open data * Scientific data archiving * Secondary Data *
Statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...


References


External links


Data is a singular noun
(a detailed assessment) {{Authority control Statistical data Data management