A spider trap (or crawler trap) is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.

Common techniques used are:
* Creation of indefinitely deep directory structures like http://example.com/bar/foo/bar/foo/bar/foo/bar/...
* Dynamic pages that produce an unbounded number of documents for a web crawler to follow. Examples include calendars and algorithmically generated language poetry.
* Documents filled with many characters, crashing the lexical analyzer parsing the document.
* Documents with session IDs based on required cookies.

There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
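Crawlers commonly try to recognize such patterns before enqueueing a URL. The following is a minimal, illustrative sketch of such a heuristic filter, not a standard or complete algorithm: the function name looks_like_trap, the thresholds, and the list of session parameters are assumptions chosen for the example.

from urllib.parse import urlparse, parse_qs

MAX_PATH_DEPTH = 12          # assumed limit on directory nesting
MAX_SEGMENT_REPEATS = 3      # assumed limit on repeated path segments
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def looks_like_trap(url: str) -> bool:
    """Return True if the URL matches common spider-trap patterns."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]

    # Indefinitely deep directory structures, e.g. /bar/foo/bar/foo/...
    if len(segments) > MAX_PATH_DEPTH:
        return True

    # The same path segment repeating many times suggests a looping path.
    for segment in set(segments):
        if segments.count(segment) > MAX_SEGMENT_REPEATS:
            return True

    # Session identifiers in the query string produce endless "new" URLs.
    params = {k.lower() for k in parse_qs(parsed.query)}
    if params & SESSION_PARAMS:
        return True

    return False

On the example URL above, looks_like_trap("http://example.com/bar/foo/bar/foo/bar/foo/bar/") returns True, because the same path segments repeat beyond the assumed threshold.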


Politeness

A spider trap causes a web crawler to enter something like an infinite loop, which wastes the spider's resources, lowers its productivity, and, in the case of a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts and don't request documents from the same server more than once every several seconds, meaning that a "polite" web crawler is affected to a much lesser degree than an "impolite" crawler. In addition, sites with spider traps usually have a robots.txt file telling bots not to go to the trap, so a legitimate "polite" bot would not fall into the trap, whereas an "impolite" bot that disregards the robots.txt settings would be affected by it.
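As an illustration of this politeness convention, the sketch below checks robots.txt with Python's standard urllib.robotparser and enforces a fixed per-host delay before a URL is fetched. The ExampleBot user-agent string, the five-second delay, and the module-level caches are assumptions for the example, not part of any particular crawler.

import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleBot"   # placeholder user-agent string
CRAWL_DELAY = 5.0           # seconds between requests to the same host

_robots_cache = {}          # host -> RobotFileParser
_last_request = {}          # host -> timestamp of the last fetch

def allowed_and_polite(url: str) -> bool:
    """Return True only if robots.txt permits the URL and the per-host
    delay since the previous request to that host has elapsed."""
    host = urlparse(url).netloc

    # Fetch and cache robots.txt once per host.
    if host not in _robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()
        _robots_cache[host] = rp

    if not _robots_cache[host].can_fetch(USER_AGENT, url):
        return False   # the site has told bots to stay away (e.g. a trap)

    # Rate-limit: wait until CRAWL_DELAY seconds have passed for this host.
    elapsed = time.monotonic() - _last_request.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)

    _last_request[host] = time.monotonic()
    return True

A crawler that only fetches URLs for which allowed_and_polite returns True both skips trap directories the site has disallowed and spaces out its requests to each host.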


See also

* Robots exclusion standard
* Web crawler

