
Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

Usually, the term data extraction is applied when (experimental) data is first imported into a computer from primary sources, like measuring or recording devices. Today's electronic devices will usually present an electrical connector (e.g. USB) through which 'raw data' can be streamed into a personal computer.
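
The import, transformation, metadata and export steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "timestamp,value" line format, the calibration factor `scale`, the source name "sensor-A" and all function names are hypothetical and do not come from any particular extraction tool.

```python
import json
from datetime import datetime, timezone


def extract(raw_lines):
    """Import step: parse raw 'timestamp,value' lines, skipping malformed ones."""
    records = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # not a two-field reading
        try:
            records.append({"t": float(parts[0]), "value": float(parts[1])})
        except ValueError:
            continue  # a field was not numeric
    return records


def transform(records, scale=1.0):
    """Transformation step: apply a (hypothetical) calibration factor."""
    return [{"t": r["t"], "value": r["value"] * scale} for r in records]


def add_metadata(records, source):
    """Wrap the records with descriptive metadata before export."""
    return {
        "source": source,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "count": len(records),
        "records": records,
    }


def export(payload):
    """Export step: serialise the payload for the next workflow stage."""
    return json.dumps(payload)


# Hypothetical raw stream from a measuring device, including two bad lines.
raw = ["0.0,1.5", "0.1,1.7", "garbage", "0.2,abc", "0.3,2.0"]
payload = add_metadata(transform(extract(raw), scale=2.0), source="sensor-A")
exported = export(payload)
```

Keeping each stage as a separate function mirrors the description above: the intermediate system first imports the raw data, then transforms it, and only attaches metadata and exports once the records are in a usable shape.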


Data sources

Typical unstructured data sources include web pages, emails, documents, PDFs, social media, scanned text, mainframe reports, spool files and multimedia files. Extracting data from these unstructured sources has grown into a considerable technical challenge: whereas data extraction historically had to deal with changes in physical hardware formats, the majority of current data extraction deals with unstructured data sources and with differing software formats. The growing practice of extracting data from the web is referred to as "Web data extraction" or "Web scraping".

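As a minimal sketch of web data extraction, the following Python example pulls link text and targets out of an HTML fragment using only the standard library's `html.parser` module. The page content, the `LinkExtractor` class name and the report URLs are invented for illustration; a real scraper would also have to fetch the page and cope with far messier markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical page fragment standing in for a downloaded web page.
page = """
<html><body>
  <h1>Reports</h1>
  <ul>
    <li><a href="/reports/2023.pdf">Annual report 2023</a></li>
    <li><a href="/reports/2024.pdf">Annual report <b>2024</b></a></li>
    <li><a name="anchor">No link here</a></li>
  </ul>
</body></html>
"""

parser = LinkExtractor()
parser.feed(page)
parser.close()
links = parser.links
```

The result is a list of structured records, one per hyperlink, which can then be passed on for transformation or storage.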

Imposing structure

The act of adding structure to unstructured data takes a number of forms:
* Using text pattern matching such as regular expressions to identify small- or large-scale structure, e.g. records in a report and their associated data from headers and footers;
* Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications, etc. using a standard set of commonly used headings (these would differ from language to language); for example, education might be found under Education/Qualification/Courses;
* Using text analytics to attempt to understand the text and link it to other information.
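
The first two approaches above can be illustrated with a short Python sketch. It assumes a hypothetical fixed-layout report (a "REPORT DATE:" page header followed by "id, name, amount" rows) and a small, invented English heading table for resumes; neither layout is taken from a real system.

```python
import re

# Pattern matching: a record row and the page header that carries its date.
RECORD = re.compile(r"^(?P<id>\d{4})\s+(?P<name>.+?)\s+(?P<amount>\d+\.\d{2})$")
PAGE_HEADER = re.compile(r"^REPORT DATE:\s*(?P<date>\d{4}-\d{2}-\d{2})$")


def parse_report(text):
    """Return one dict per record, tagged with the date from the latest header."""
    date = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        header = PAGE_HEADER.match(line)
        if header:
            date = header.group("date")
            continue
        record = RECORD.match(line)
        if record:
            rows.append({
                "date": date,
                "id": record.group("id"),
                "name": record.group("name"),
                "amount": float(record.group("amount")),
            })
    return rows


# Table-based approach: map commonly used headings to a canonical section.
HEADINGS = {
    "education": "education",
    "qualifications": "education",
    "courses": "education",
    "skills": "skills",
    "work experience": "experience",
    "employment history": "experience",
}


def split_sections(text):
    """Group non-empty lines under the canonical section of the last heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in HEADINGS:
            current = HEADINGS[key]
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


report = """REPORT DATE: 2024-01-31
ID    NAME            AMOUNT
0001  Alice Smith      120.50
0002  Bob Jones         75.00
--- page 1 ---
REPORT DATE: 2024-02-29
0003  Carol White      310.25
"""
rows = parse_report(report)

resume = """Jane Doe
Qualifications:
BSc Computer Science
Skills
Python, SQL
Work Experience
Data engineer, 2019-2023
"""
sections = split_sections(resume)
```

A heading table like `HEADINGS` would need a separate set of entries for each language, as noted above, whereas the regular expressions are tied to the layout of one specific report.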


See also

* Data mining, discovery of patterns in large data sets using statistics, database knowledge or machine learning
* Data retrieval, obtaining data from a database management system, often using a query with a set of criteria
* Extract, transform, load (ETL), procedure for copying data from one or more sources, transforming the data at the source system, and copying into a destination system
* Information extraction, automated extraction of structured information from unstructured or semi-structured machine-readable data, for example using natural language processing to extract content from images, audio or documents

