Wrapper (data Mining)
   HOME

TheInfoList



OR:

Wrapper in data mining is a procedure that extracts regular subcontent of an unstructured or loosely-structured information source and translates it into a relational form, so it can be processed as structured data. Wrapper induction is the problem of devising extraction procedures on an automatic basis, with minimal reliance on hand-crafted rules. Many web pages are automatically generated from structured data – telephone directories, product catalogs, etc. – wrapped in a loosely structured presentation language (usually some variant of
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
), formatted for human browsing and navigation. Structured data are typically descriptions of objects retrieved from underlying databases and displayed in web pages following fixed templates at a low level, injected into pages where the high-level structure can vary from week to week, per the rapidly evolving fashion of the site's presentation skin. The precise dividing line between the fluid high-level skin and the less fluid structured data templates is rarely documented for public consumption, outside of the content management team at the web property. Software systems using such resources must translate HTML content into a relational form. Wrappers are commonly used as such translators. Formally, a wrapper is a function from a page to the set of
tuple In mathematics, a tuple is a finite ordered list (sequence) of elements. An -tuple is a sequence (or ordered list) of elements, where is a non-negative integer. There is only one 0-tuple, referred to as ''the empty tuple''. An -tuple is defi ...
s it contains.


Wrapper generation

There are two main approaches to wrapper generation: wrapper induction and automated
data extraction Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usual ...
. Wrapper induction uses supervised learning to learn data extraction rules from manually labeled training examples. The disadvantages of wrapper induction are * the time-consuming manual labeling process and * the difficulty of wrapper maintenance. Due to the manual labeling effort, it is hard to extract data from a large number of sites as each site has its own templates and requires separate manual labeling for wrapper learning. Wrapper maintenance is also a major issue because whenever a site changes the wrappers built for the site become obsolete. Due to these shortcomings, researchers have studied automated wrapper generation using unsupervised pattern mining. Automated extraction is possible because most Web data objects follow fixed templates. Discovering such templates or patterns enables the system to perform extraction automatically.Liu, B. Web ''Data Mining: Exploring Hyperlinks, Contents and Usage Data'', Springer, 2007. Wrapper generation on the Web is an important problem with a wide range of applications. Extraction of such data enables one to integrate data/information from multiple Web sites to provide value-added services, e.g., comparative shopping, object search, and information integration.


See also

*
Business intelligence Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical pr ...
(section semi-structured or
unstructured data Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, num ...
) *
Web scraping Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping ...


Sources

Data mining