Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.


Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
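Nutch's production fetcher is far more elaborate (politeness delays, robots.txt handling, redirects, retries), but the fetch/parse/update loop at the heart of any crawler can be sketched as a plain breadth-first traversal. The following is a minimal, framework-free illustration, not Nutch code; the in-memory site map and function names are hypothetical:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page content, outgoing links).
SITE = {
    "http://a.example/":  ("home",   ["http://a.example/x", "http://a.example/y"]),
    "http://a.example/x": ("page x", ["http://a.example/y"]),
    "http://a.example/y": ("page y", []),
}

def crawl(seed, fetch):
    """Breadth-first fetch loop: keep a frontier of unseen URLs,
    fetch each page exactly once, parse out its links, and extend
    the frontier with links not yet seen."""
    frontier = deque([seed])
    seen = {seed}
    fetched = {}
    while frontier:
        url = frontier.popleft()
        content, links = fetch(url)     # "fetch" + "parse" in one step here
        fetched[url] = content
        for link in links:              # "update": enqueue newly discovered URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

pages = crawl("http://a.example/", SITE.__getitem__)
```

A real fetcher would replace `SITE.__getitem__` with an HTTP client plus an HTML link extractor; the loop structure stays the same.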


History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were spun out into their own subproject, called Hadoop. In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation. In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl. While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.
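The MapReduce model that Nutch implemented (and later spun out into Hadoop) pairs a map step, which emits key–value pairs from each input record, with a reduce step that aggregates all pairs sharing a key. A minimal single-process sketch of the idea, using word counting as the classic example (the names and data are illustrative, not the Nutch/Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group pairs by key and sum the counts per key.
    (A real framework shuffles pairs to reducers by key first.)"""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["nutch crawls the web", "hadoop grew out of nutch"]
word_counts = reduce_phase(map_phase(docs))
```

The value of the model is that both phases parallelize naturally: documents can be mapped on different machines, and pairs for different keys can be reduced on different machines, which is what made the multi-machine crawl and index tasks tractable.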



Scalability

IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a level of performance on a cluster of blades that was not achievable on any scale-up computer such as the POWER5. The ClueWeb09 dataset (used in, e.g., TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.


Related projects

* Hadoop – Java framework that supports distributed applications running on large clusters.


Search engines built with Nutch

* Common Crawl – publicly available internet-wide crawls; started using Nutch in 2014.
* Creative Commons Search – an implementation of Nutch, used in the period 2004–2006.
* DiscoverEd – open educational resources search prototype developed by Creative Commons.
* Krugle – uses Nutch to crawl web pages for code, archives and technically interesting content.
* mozDex (inactive)
* Wikia Search – launched 2008, closed down 2009.


See also

* Faceted search
* Information extraction
* Enterprise search


