Screen scraping
   HOME

TheInfoList



OR:

Data scraping is a technique where a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to execute. Computer programs are one component of software, which also includes documentation and other intangible components. A computer program ...
extracts
data In the pursuit of knowledge, data (; ) is a collection of discrete Value_(semiotics), values that convey information, describing quantity, qualitative property, quality, fact, statistics, other basic units of meaning, or simply sequences of sy ...
from
human-readable A human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans. In computing, ''human-readable'' data is often encoded as ASCII or Unicode text, rather than as binary data. In most c ...
output coming from another program.


Description

Normally,
data transfer Data transmission and data reception or, more broadly, data communication or digital communications is the transfer and reception of data in the form of a digital bitstream or a digitized analog signal transmitted over a point-to-point or ...
between programs is accomplished using
data structures In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
suited for
automated Automation describes a wide range of technologies that reduce human intervention in processes, namely by predetermining decision criteria, subprocess relationships, and related actions, as well as embodying those predeterminations in machines ...
processing by computers, not people. Such interchange formats and
protocols Protocol may refer to: Sociology and politics * Protocol (politics), a formal agreement between nation states * Protocol (diplomacy), the etiquette of diplomacy and affairs of state * Etiquette, a code of personal behavior Science and technology ...
are typically rigidly structured, well-documented, easily
parsed Parsing, syntax analysis, or syntactic analysis is the process of analyzing a String (computer science), string of Symbol (formal), symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal gra ...
, and minimize ambiguity. Very often, these transmissions are not human-readable at all. Thus, the key element that distinguishes data scraping from regular
parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from L ...
is that the output being scraped is intended for display to an
end-user In product development, an end user (sometimes end-user) is a person who ultimately uses or is intended to ultimately use a product. The end user stands in contrast to users who support or maintain the product, such as sysops, system administrato ...
, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring
binary data Binary data is data whose unit can take on only two possible states. These are often labelled as 0 and 1 in accordance with the binary numeral system and Boolean algebra. Binary data occurs in many different technical and scientific fields, wher ...
(usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing. Data scraping is most often done either to interface to a
legacy system In computing, a legacy system is an old method, technology, computer system, or application program, "of, relating to, or being a previous or outdated computer system", yet still in use. Often referencing a system as "legacy" means that it paved ...
, which has no other mechanism which is compatible with current hardware, or to interface to a third-party system which does not provide a more convenient
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement
revenue In accounting, revenue is the total amount of income generated by the sale of goods and services related to the primary operations of the business. Commercial revenue may also be referred to as sales or as turnover. Some companies receive reven ...
, or the loss of control of the information content. Data scraping is generally considered an ''
ad hoc Ad hoc is a Latin phrase meaning literally 'to this'. In English, it typically signifies a solution for a specific purpose, problem, or task rather than a generalized solution adaptable to collateral instances. (Compare with '' a priori''.) C ...
'', inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program will fail. Depending on the quality and the extent of
error handling In computing and computer programming, exception handling is the process of responding to the occurrence of ''exceptions'' – anomalous or exceptional conditions requiring special processing – during the execution of a program. In general, an ...
logic present in the computer, this failure can result in error messages, corrupted output or even program crashes.


Technical variants


Screen scraping

Although the use of physical "
dumb terminal A computer terminal is an electronic or electromechanical hardware device that can be used for entering data into, and transcribing data from, a computer or a computing system. The teletype was an example of an early-day hard-copy terminal a ...
" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends. Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, ''screen scraping'' referred to the practice of reading text data from a computer display
terminal Terminal may refer to: Computing Hardware * Terminal (electronics), a device for joining electrical circuits together * Terminal (telecommunication), a device communicating over a line * Computer terminal, a set of primary input and output dev ...
's
screen Screen or Screens may refer to: Arts * Screen printing (also called ''silkscreening''), a method of printing * Big screen, a nickname associated with the motion picture industry * Split screen (filmmaking), a film composition paradigm in which mul ...
. This was generally done by reading the terminal's
memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered ...
through its auxiliary
port A port is a maritime facility comprising one or more wharves or loading areas, where ships load and discharge cargo and passengers. Although usually situated on a sea coast or estuary, ports can also be found far inland, such as H ...
, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple cases where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human. As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s—the dawn of computerized data processing. Computer to
user interface In the industrial design field of human–computer interaction, a user interface (UI) is the space where interactions between humans and machines occur. The goal of this interaction is to allow effective operation and control of the machine f ...
s from that era were often simply text-based
dumb terminal A computer terminal is an electronic or electromechanical hardware device that can be used for entering data into, and transcribing data from, a computer or a computing system. The teletype was an example of an early-day hard-copy terminal a ...
s which were not much more than virtual
teleprinter A teleprinter (teletypewriter, teletype or TTY) is an electromechanical device that can be used to send and receive typed messages through various communications channels, in both point-to-point and point-to-multipoint configurations. Init ...
s (such systems are still in use , for various reasons). The desire to interface such a system to more modern systems is common. A
robust Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
solution will often require things no longer available, such as
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the w ...
, system documentation,
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
s, or
programmers A computer programmer, sometimes referred to as a software developer, a software engineer, a programmer or a coder, is a person who creates computer programs — often for larger computer software. A programmer is someone who writes/creates ...
with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via
Telnet Telnet is an application protocol used on the Internet or local area network to provide a bidirectional interactive text-oriented communication facility using a virtual terminal connection. User data is interspersed in-band with Telnet contr ...
,
emulate Emulate, Inc. (Emulate) is a biotechnology company that commercialized Organs-on-Chips technology—a human cell-based technology that recreates organ-level function to model organs in healthy and diseased states. The technology has applications ...
the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise—e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management, etc.—could be said to be an example of robotic process automation software, called RPA or RPAAI for self-guided RPA 2.0 based on
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech r ...
. In the 1980s, financial data providers such as
Reuters Reuters ( ) is a news agency owned by Thomson Reuters Corporation. It employs around 2,500 journalists and 600 photojournalists in about 200 locations worldwide. Reuters is one of the largest news agencies in the world. The agency was esta ...
,
Telerate Telerate was a US company providing financial data to market participants, specialising in commercial paper and bond prices. It was a pioneer in the electronic distribution of real-time market information in the 1970s. With its main innovation ...
, and
Quotron Quotron was a Los Angeles-based company that in 1960 became the first financial data technology company to deliver stock market quotes to an electronic screen rather than on a printed ticker tape. The Quotron offered brokers and money managers up ...
displayed data in 24×80 format intended for a human reader. Users of this data, particularly
investment banks Investment banking pertains to certain activities of a financial services company or a corporate division that consist in advisory-based financial transactions on behalf of individuals, corporations, and governments. Traditionally associated wit ...
, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the
United Kingdom The United Kingdom of Great Britain and Northern Ireland, commonly known as the United Kingdom (UK) or Britain, is a country in Europe, off the north-western coast of the European mainland, continental mainland. It comprises England, Scotlan ...
, was ''page shredding'', since the results could be imagined to have passed through a
paper shredder A paper shredder is a mechanical device used to cut sheets of paper into either strips or fine particles. Government organizations, businesses, and private individuals use shredders to destroy private, confidential, or otherwise sensitive docum ...
. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on
VAX/VMS OpenVMS, often referred to as just VMS, is a multi-user, multiprocessing and virtual memory-based operating system. It is designed to support time-sharing, batch processing, transaction processing and workstation applications. Customers using Ope ...
called the Logicizer. More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or for some specialised automated testing systems, matching the screen's bitmap data against expected results. This can be combined in the case of
GUI The GUI ( "UI" by itself is still usually pronounced . or ), graphical user interface, is a form of user interface that allows users to interact with electronic devices through graphical icons and audio indicator such as primary notation, inste ...
applications, with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database. Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there are some overlaps with generic "document scraping" and report mining techniques. There are many tools that can be used for screen scraping.


Web scraping

Web pages are built using text-based mark-up languages (
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaSc ...
and
XHTML Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated. While HTML, prior ...
), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
or tool to extract data from a website. Companies like Amazon AWS and
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
provide web scraping tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the webserver. Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate the human processing that occurs when viewing a webpage to automatically extract useful information. Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send. This has caused an ongoing battle between website developers and scraping developers.


Report mining is the extraction of data from human-readable computer reports. Conventional
data extraction Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage ( data migration). The import into the intermediate extracting system is thus usua ...
requires a connection to a working source system, suitable
connectivity Connectivity may refer to: Computing and technology * Connectivity (media), the ability of the social media to accumulate economic capital from the users connections and activities * Internet connectivity, the means by which individual terminal ...
standards or an
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
, and usually complex querying. By using the source system's standard reporting options, and directing the output to a spool file instead of to a
printer Printer may refer to: Technology * Printer (publishing), a person or a company * Printer (computing), a hardware device * Optical printer for motion picture films People * Nariman Printer (fl. c. 1940), Indian journalist and activist * James ...
, static reports can be generated suitable for offline analysis via report mining.Scott Steinacher
"Data Pump transforms host data"
''
InfoWorld ''InfoWorld'' (abbreviated IW) is an information technology media business. Founded in 1978, it began as a monthly magazine. In 2007, it transitioned to a web-only publication. Its parent company today is International Data Group, and its siste ...
'', 30 August 1999, p55
This approach can avoid intensive CPU usage during business hours, can minimise
end-user In product development, an end user (sometimes end-user) is a person who ultimately uses or is intended to ultimately use a product. The end user stands in contrast to users who support or maintain the product, such as sysops, system administrato ...
licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaSc ...
, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.


See also

*
Comparison of feed aggregators The following is a comparison of RSS feed aggregators. Often e-mail programs and web browsers have the ability to display RSS feeds. They are listed here, too. Many BitTorrent clients support RSS feeds for broadcasting (see Comparison of BitT ...
*
Data cleansing Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the d ...
*
Data munging Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one " raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes ...
*
Importer (computing) An importer is a software application that reads in a data file or metadata information in one format and converts it to another format via special algorithms (such as filters). An importer often is not an entire program by itself, but an extens ...
*
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
*
Open data Open data is data that is openly accessible, exploitable, editable and shared by anyone for any purpose. Open data is licensed under an open license. The goals of the open data movement are similar to those of other "open(-source)" movement ...
*
Mashup (web application hybrid) A mashup (computer industry jargon), in web development, is a web page or web application that uses content from more than one source to create a single new service displayed in a single graphical interface. For example, a user could combine the ...
* Metadata *
Web scraping Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scrapin ...
* Search engine scraping


References


Further reading

* Hemenway, Kevin and Calishain, Tara. ''Spidering Hacks''. Cambridge, Massachusetts: O'Reilly, 2003. . {{DEFAULTSORT:Data Scraping Data processing