Data scraping

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.


Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all.

Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done either to interface to a legacy system which has no other mechanism compatible with current hardware, or to interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, for reasons such as increased system load, the loss of advertisement revenue, or the loss of control over the information content.

Data scraping is generally considered an ''ad hoc'', inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program will fail. Depending on the quality and extent of the error handling logic present, such a failure can result in error messages, corrupted output, or even program crashes.
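
As a minimal illustration of the idea, the following Python sketch extracts fields from the human-readable output of another program instead of from a structured interchange format. It assumes a Unix-like system where the df -h command is available; the column positions are those of typical df output.

    import subprocess

    # Run a program whose output is formatted for people, not parsers.
    output = subprocess.run(["df", "-h"], capture_output=True, text=True).stdout

    # Because the output is a display (header row, aligned columns, "%" units),
    # we must pick the data out of the text rather than read structured fields.
    for line in output.splitlines()[1:]:        # skip the human-oriented header
        fields = line.split()
        if len(fields) >= 6:
            filesystem, used_percent = fields[0], fields[4]
            print(filesystem, used_percent)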


Technical variants


Screen scraping

Although the use of physical "dumb terminals" such as the IBM 3270 is slowly diminishing, as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends. Screen scraping is normally associated with the programmatic collection of visual data from a source, rather than the parsing of data as in web scraping. Originally, ''screen scraping'' referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This can mean simple cases where the controlling program navigates through the user interface, or more complex scenarios where the controlling program enters data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized data processing. Computer-to-user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use, for various reasons). The desire to interface such a system to more modern systems is common. A robust solution will often require things that are no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of robotic process automation software, called RPA, or RPAAI for self-guided RPA 2.0 based on artificial intelligence.
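
A minimal sketch of such a scraper follows. The host name, credentials, menu keystrokes, and screen layout are all hypothetical, and Python's telnetlib module (deprecated since Python 3.11 and removed in 3.13) is used for brevity:

    import re
    import telnetlib   # deprecated since Python 3.11, removed in 3.13

    # Hypothetical legacy host and credentials -- placeholders only.
    tn = telnetlib.Telnet("legacy.example.com")
    tn.read_until(b"login: ")
    tn.write(b"operator\n")
    tn.read_until(b"Password: ")
    tn.write(b"secret\n")

    # Emulate the keystrokes a human would type to reach the desired screen.
    tn.read_until(b"MAIN MENU")
    tn.write(b"3\n")                 # hypothetical menu option 3: account inquiry

    # "Scrape" the resulting display text for the field of interest.
    screen = tn.read_until(b"PF3=EXIT", timeout=5).decode("ascii", "replace")
    match = re.search(r"BALANCE:\s*([\d,.]+)", screen)
    if match:
        print("Extracted balance:", match.group(1))
    tn.close()
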
In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data into numeric data for inclusion in calculations for trading decisions without re-keying it. The common term for this practice, especially in the United Kingdom, was ''page shredding'', since the results could be imagined to have passed through a paper shredder. Internally, Reuters used the term "logicized" for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or, for some specialised automated testing systems, matching the screen's bitmap data against expected results. In the case of GUI applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database. Another modern adaptation of these techniques is to use, instead of a sequence of screens, a set of images or PDF files as input, so there is some overlap with generic "document scraping" and report mining techniques. There are many tools that can be used for screen scraping.
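
A sketch of the OCR-based variant, assuming the third-party Pillow and pytesseract packages (and an installed Tesseract OCR engine) are available:

    from PIL import ImageGrab    # Pillow; screen capture support varies by platform
    import pytesseract           # Python wrapper around the Tesseract OCR engine

    # Capture the bitmap of a screen region (left, top, right, bottom).
    # The coordinates are hypothetical; in practice they would be chosen to
    # frame the application window containing the data of interest.
    image = ImageGrab.grab(bbox=(0, 0, 800, 600))

    # Run the bitmap through the OCR engine to recover the displayed text.
    print(pytesseract.image_to_string(image))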


Web scraping

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, toolkits that scrape web content were created. A web scraper is an API or tool to extract data from a website. Companies like Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end-users.
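
A basic scraper of this kind can be written in a few lines. This sketch uses the third-party requests and BeautifulSoup libraries against a placeholder URL:

    import requests
    from bs4 import BeautifulSoup    # pip install requests beautifulsoup4

    # Placeholder URL -- substitute a page you are entitled to scrape.
    html = requests.get("https://example.com/products", timeout=10).text

    # Parse the human-oriented HTML and pull out the parts of interest.
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        print(link["href"], "-", link.get_text(strip=True))
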
Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport and storage mechanism between the client and the web server.
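
In that case the payload is already structured, so a client can read the feed directly, as in this sketch against a hypothetical endpoint (Python standard library only):

    import json
    from urllib.request import urlopen

    # Hypothetical JSON feed exposed by a web server.
    with urlopen("https://example.com/api/quotes.json") as response:
        feed = json.load(response)

    # No display text needs parsing; the fields arrive machine-readable.
    for item in feed.get("quotes", []):
        print(item.get("symbol"), item.get("price"))
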
Recently, companies have developed web scraping systems that rely on techniques in DOM parsing, computer vision, and natural language processing to simulate the human processing that occurs when viewing a webpage, in order to extract useful information automatically. Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests that an IP address or IP network may send. This has caused an ongoing battle between website developers and scraping developers.


Report mining

Report mining is the extraction of data from human-readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying. By using the source system's standard reporting options, and directing the output to a spool file instead of to a printer, static reports can be generated that are suitable for offline analysis via report mining (Steinacher, 1999). This approach can avoid intensive CPU usage during business hours, can minimise end-user licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports.

Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.
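
A sketch of the extraction step, assuming a spool file with a hypothetical fixed-width layout (in practice, the column positions would be read off the actual report):

    # Parse a spooled, fixed-width text report (hypothetical layout):
    # columns 0-9 account number, 10-39 name, 40-54 balance.
    records = []
    with open("monthly_report.spl", encoding="ascii") as spool:
        for line in spool:
            # Skip page headers, separators, and other display-only lines.
            if not line[:10].strip().isdigit():
                continue
            records.append({
                "account": line[0:10].strip(),
                "name": line[10:40].strip(),
                "balance": float(line[40:55].replace(",", "").strip()),
            })

    print(records)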


See also

* Comparison of feed aggregators
* Data cleansing
* Data munging
* Importer (computing)
* Information extraction
* Open data
* Mashup (web application hybrid)
* Metadata
* Web scraping
* Search engine scraping


References

* Steinacher, Scott. "Data Pump transforms host data". ''InfoWorld'', 30 August 1999, p. 55.

Further reading

* Hemenway, Kevin and Calishain, Tara. ''Spidering Hacks''. Cambridge, Massachusetts: O'Reilly, 2003.