Focused crawler
A focused crawler is a web crawler that collects Web pages satisfying some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic, surface properties; for example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy, while minimizing resources spent fetching pages on other topics.

Crawl frontier management may not be the only device used by focused crawlers; they may also use a Web directory, a Web text index, backlinks, or any other Web artifact.

A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading it. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web.
Topical crawling was first introduced by Filippo Menczer. Chakrabarti et al. coined the term 'focused crawler' and used a text classifier to prioritize the crawl frontier. Andrew McCallum and co-authors also used reinforcement learning to focus crawlers. Diligenti et al. traced the context graph leading up to relevant pages, and used the text content of those pages, to train classifiers. A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train the classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer et al. show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. It has also been shown that spatial information is important for classifying Web documents.

Another type of focused crawler is the semantic focused crawler, which makes use of domain ontologies to represent topical maps and to link Web pages with relevant ontological concepts for selection and categorization purposes. The ontologies can also be updated automatically during the crawling process: Dong et al. introduced such an ontology-learning-based crawler, using a support vector machine to update the content of ontological concepts while crawling Web pages.

Crawlers have also been focused on page properties other than topics. Cho et al. study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Wiener show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving the detection of stale (poorly maintained) pages have been reported by Eiron et al.

A kind of semantic focused crawler making use of the idea of reinforcement learning has been introduced by Meusel et al.: online classification algorithms are combined with a bandit-based selection strategy to efficiently crawl pages carrying structured markup such as RDFa, Microformats, and Microdata. A toy sketch of the bandit idea appears below.
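The following toy epsilon-greedy sketch conveys the bandit-style selection idea only; Meusel et al.'s actual method differs in its details, and all names here are hypothetical. Each host is treated as an arm, the reward estimate is the observed fraction of its fetched pages that contained structured markup, and the crawler mostly exploits the best-looking host while occasionally exploring others.

    import random
    from collections import defaultdict

    class HostBandit:
        """Epsilon-greedy selection among hosts (sketch)."""

        def __init__(self, epsilon=0.1):
            self.epsilon = epsilon
            self.pulls = defaultdict(int)      # pages fetched per host
            self.successes = defaultdict(int)  # pages with markup per host

        def choose_host(self, hosts):
            """Pick the next host to crawl from a list of candidates."""
            if random.random() < self.epsilon:
                return random.choice(hosts)  # explore a random host
            # Exploit: highest observed success rate; unseen hosts get an
            # optimistic default of 1.0 so they are tried at least once.
            return max(hosts, key=lambda h:
                       self.successes[h] / self.pulls[h] if self.pulls[h] else 1.0)

        def record(self, host, had_markup):
            """Update the reward estimate after fetching a page from host."""
            self.pulls[host] += 1
            if had_markup:
                self.successes[host] += 1

The epsilon parameter trades exploitation of hosts known to be rich in structured data against exploration of hosts whose yield is still unknown.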
The performance of a focused crawler depends on the richness of links within the specific topic being searched, and focused crawling usually relies on a general Web search engine to provide starting points. Davison presented studies on Web links and text that explain why focused crawling succeeds on broad topics; similar studies were presented by Chakrabarti et al. Seed selection can be important for focused crawlers and can significantly influence crawling efficiency (Wu et al., 2012).
A whitelist strategy is to start the focused crawl from a list of high-quality seed URLs and to limit the crawling scope to the domains of those URLs. These high-quality seeds should be selected from a list of URL candidates accumulated over a sufficiently long period of general Web crawling, and the whitelist should be updated periodically after it is created. A minimal sketch of the domain-scoping step follows.
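This Python sketch restricts a crawl to the domains of its seed URLs; the helper names are hypothetical, and the domain extraction is deliberately naive (production crawlers typically consult the Public Suffix List to find the registrable domain).

    from urllib.parse import urlparse

    def domain_of(url):
        """Naive registrable-domain extraction: last two host labels."""
        host = urlparse(url).hostname or ""
        parts = host.split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else host

    def build_whitelist(seed_urls):
        """Collect the set of domains the crawl is allowed to visit."""
        return {domain_of(u) for u in seed_urls}

    def in_scope(url, whitelist):
        """Discovered links outside the whitelist are discarded."""
        return domain_of(url) in whitelist

    seeds = ["https://example.edu/papers/", "https://example.org/library/"]
    whitelist = build_whitelist(seeds)
    assert in_scope("https://www.example.edu/papers/p1.pdf", whitelist)
    assert not in_scope("https://other.example.net/", whitelist)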


References

Wu, Jian; Teregowda, Pradeep; Fernández Ramírez, Juan Pablo; Mitra, Prasenjit; Zheng, Shuyi; Giles, C. Lee (2012). "The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists". In Proceedings of the 3rd Annual ACM Web Science Conference, Evanston, IL, USA, June 2012, pp. 340–343.
