A scraper site is a
website
A website (also written as a web site) is a collection of web pages and related content that is identified by a common domain name and published on at least one web server. Examples of notable websites are Google, Facebook, Amazon, and Wikip ...
that copies content from other websites using
web scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scrapin ...
. The content is then mirrored with the goal of creating revenue, usually through advertising and sometimes by selling user data. Scraper sites come in various forms. Some provide little, if any material or information, and are intended to obtain user information such as e-mail addresses, to be targeted for spam e-mail. Price aggregation and shopping sites access multiple listings of a product and allow a user to rapidly compare the prices.
Examples of scraper websites
Search engine
A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
s such as
Google
Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
could be considered a type of scraper site. Search engines gather content from other websites, save it in their own databases, index it and present the scraped content to their search engine's own users. The majority of content scraped by search engines is copyrighted.
The scraping technique has been used on various dating websites as well. These sites often combine their scraping activities with
facial recognition Facial recognition or face recognition may refer to:
* Face detection, often a step done before facial recognition
* Face perception, the process by which the human brain understands and interprets the face
* Pareidolia, which involves, in part, se ...
.
Scraping is also used on general image recognition websites, and websites specifically made to identify images of crops with pests and diseases
Made for advertising
Some scraper sites are created to make money by using advertising programs. In such case, they are called ''Made for
AdSense
Google AdSense is a program run by Google through which website publishers in the Google Network of content sites serve text, images, video, or interactive media advertisements that are targeted to the site content and audience. These adver ...
'' sites or MFA. This derogatory term refers to websites that have no redeeming value except to lure visitors to the website for the sole purpose of clicking on advertisements.
''Made for AdSense'' sites are considered
search engine spam
Spamdexing (also known as search engine spam, search engine poisoning, black-hat search engine optimization, search spam or web spam) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building ...
that dilute the search results with less-than-satisfactory search results. The scraped content is redundant to that which would be shown by the search engine under normal circumstances, had no MFA website been found in the listings.
Some scraper sites link to other sites in order to improve their
search engine ranking
Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic (known as "natural" or " organic" results) rather than dire ...
through a
private blog network. Prior to Google's update to its search algorithm known as
Panda
The giant panda (''Ailuropoda melanoleuca''), also known as the panda bear (or simply the panda), is a bear species endemic to China. It is characterised by its bold black-and-white coat and rotund body. The name "giant panda" is sometimes use ...
, a type of scraper site known as an
auto blog was quite common among black-hat marketers who used a method known as
spamdexing
Spamdexing (also known as search engine spam, search engine poisoning, black-hat search engine optimization, search spam or web spam) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link buildin ...
.
Legality
Scraper sites may violate
copyright law
A copyright is a type of intellectual property that gives its owner the exclusive right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time. The creative work may be in a literary, artistic, educatio ...
. Even taking content from an
open content site can be a
copyright violation
Copyright infringement (at times referred to as piracy) is the use of works protected by copyright without permission for a usage where such permission is required, thereby infringing certain exclusive rights granted to the copyright holder, ...
, if done in a way which does not respect the license. For instance, the
GNU Free Documentation License
The GNU Free Documentation License (GNU FDL or simply GFDL) is a copyleft license for free documentation, designed by the Free Software Foundation (FSF) for the GNU Project. It is similar to the GNU General Public License, giving readers the r ...
(GFDL) and
Creative Commons
Creative Commons (CC) is an American non-profit organization and international network devoted to educational access and expanding the range of creative works available for others to build upon legally and to share. The organization has releas ...
ShareAlike (CC-BY-SA) licenses used on Wikipedia
[
] require that a republisher of Wikipedia inform its readers of the conditions on these licenses, and give credit to the original author.
Techniques
Depending upon the objective of a scraper, the methods in which websites are targeted differ. For example, sites with large amounts of content such as airlines, consumer electronics, department stores, etc. might be routinely targeted by their competition just to stay abreast of pricing information.
Another type of scraper will pull snippets and text from websites that rank high for keywords they have targeted. This way they hope to rank highly in the
search engine results pages (SERPs), piggybacking on the original page's
page rank.
RSS
RSS ( RDF Site Summary or Really Simple Syndication) is a web feed that allows users and applications to access updates to websites in a standardized, computer-readable format. Subscribing to RSS feeds can allow a user to keep track of many di ...
feeds are vulnerable to scrapers.
Other scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Often a visitor will click on a
pay-per-click advertisement on such site because it is the only comprehensible text on the page. Operators of these scraper sites gain financially from these clicks. Advertising networks claim to be constantly working to remove these sites from their programs, although these networks benefit directly from the clicks generated at this kind of site. From the advertisers' point of view, the networks don't seem to be making enough effort to stop this problem.
Scrapers tend to be associated with
link farms and are sometimes perceived as the same thing, when multiple scrapers link to the same target site. A frequent target victim site might be accused of link-farm participation, due to the artificial pattern of incoming links to a victim website, linked from multiple scraper sites.
Domain hijacking
Some programmers who create scraper sites may purchase a recently expired
domain name
A domain name is a string that identifies a realm of administrative autonomy, authority or control within the Internet. Domain names are often used to identify services provided through the Internet, such as websites, email services and more. ...
to reuse its SEO power in Google. Whole businesses focus on understanding all expired domains and utilising them for their historical ranking ability exist. Doing so will allow SEOs to utilize the already-established
backlink
A backlink is a link from some other website (the referrer) to that web resource (the referent). A ''web resource'' may be (for example) a website, web page, or web directory.
A backlink is a reference comparable to a citation. The quantity, ...
s to the domain name. Some spammers may try to match the topic of the expired site or copy the existing content from the
Internet Archive
The Internet Archive is an American digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including websites, software applications/games, music ...
to maintain the authenticity of the site so that the backlinks don't drop. For example, an expired website about a photographer may be re-registered to create a site about photography tips or use the domain name in their
private blog network to power their own photography site.
Services at some expired domain name registration agents provide both the facility to find these expired domains and to gather the HTML that the domain name used to have on its web site.
See also
*
Scraping
*
Contact scraping
*
Domain parking
*
Web scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scrapin ...
*
Blog scraping
*
Multi-protocol messengers: can connect to several networks, yet require to have an account on all of these, so don't violate any terms of the networks
References
{{DEFAULTSORT:Scraper Site
Web scraping