Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.


Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
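Nutch's production fetcher is far more elaborate (politeness delays, robots.txt handling, redirects, retries), but the fetch/parse/update loop at the heart of any crawler can be sketched as a plain breadth-first traversal. The following is a minimal, framework-free illustration, not Nutch code; the in-memory site map and function names are hypothetical:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page content, outgoing links).
SITE = {
    "http://a.example/":  ("home",   ["http://a.example/x", "http://a.example/y"]),
    "http://a.example/x": ("page x", ["http://a.example/y"]),
    "http://a.example/y": ("page y", []),
}

def crawl(seed, fetch):
    """Breadth-first fetch loop: keep a frontier of unseen URLs,
    fetch each page exactly once, parse out its links, and extend
    the frontier with links not yet seen."""
    frontier = deque([seed])
    seen = {seed}
    fetched = {}
    while frontier:
        url = frontier.popleft()
        content, links = fetch(url)     # "fetch" + "parse" in one step here
        fetched[url] = content
        for link in links:              # "update": enqueue newly discovered URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

pages = crawl("http://a.example/", SITE.__getitem__)
```

A real fetcher would replace `SITE.__getitem__` with an HTTP client plus an HTML link extractor; the loop structure stays the same.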


History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were spun out into their own subproject, called Hadoop. In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation. In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl. While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.
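The MapReduce model that Nutch implemented (and later spun out into Hadoop) pairs a map step, which emits key–value pairs from each input record, with a reduce step that aggregates all pairs sharing a key. A minimal single-process sketch of the idea, using word counting as the classic example (the names and data are illustrative, not the Nutch/Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group pairs by key and sum the counts per key.
    (A real framework shuffles pairs to reducers by key first.)"""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["nutch crawls the web", "hadoop grew out of nutch"]
word_counts = reduce_phase(map_phase(docs))
```

The value of the model is that both phases parallelize naturally: documents can be mapped on different machines, and pairs for different keys can be reduced on different machines, which is what made the multi-machine crawl and index tasks tractable.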



Scalability

IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a level of performance on a cluster of blades that was not achievable on any scale-up computer such as the POWER5. The ClueWeb09 dataset (used in, e.g., TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.


Related projects

* Hadoop – Java framework that supports distributed applications running on large clusters.


Search engines built with Nutch

* Common Crawl – publicly available internet-wide crawls; started using Nutch in 2014.
* Creative Commons Search – an implementation of Nutch, used in the period 2004–2006.
* DiscoverEd – open educational resources search prototype developed by Creative Commons.
* Krugle – uses Nutch to crawl web pages for code, archives and technically interesting content.
* mozDex (inactive)
* Wikia Search – launched 2008, closed down 2009.


See also

* Faceted search
* Information extraction
* Enterprise search


