StormCrawler
   HOME

TheInfoList



OR:

StormCrawler is an
open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized so ...
collection of resources for building low-latency, scalable
web crawler A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web s ...
s on  Apache Storm. It is provided under Apache License and is written mostly in
Java (programming language) Java is a high-level, class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose programming language intended to let programmers ''write once, run anyw ...
. StormCrawler is modular and consists of a core module, which provides the basic building blocks of a web crawler such as fetching, parsing, URL filtering. Apart from the core components, the project also provides external resources, like for instance spout and bolts for
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual ...
and
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
or a ParserBolt which uses
Apache Tika Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java libra ...
to parse various document formats. The project is used in production by various companies.
Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, whi ...
published a Q&A in October 2016 with the author of StormCrawler. InfoQ ran one in December 2016. A comparative benchmark with
Apache Nutch Apache Nutch is a highly extensible and scalable open source web crawler software project. Features Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular archi ...
was published in January 2017 on dzone.com. Several research papers mentioned the use of StormCrawler, in particular: * Crawling the German Health Web: Exploratory Study and Graph Analysis. * The generation of a multi-million page corpus for the Persian language. * The SIREN - Security Information Retrieval and Extraction engine. The project WIKI contains a list of videos and slides available online. StormCrawler is used notably by
Common Crawl Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally ever ...
for generating a large and publicly available dataset of news.


See also

* Apache Storm *
Apache Nutch Apache Nutch is a highly extensible and scalable open source web crawler software project. Features Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular archi ...
*
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features a ...
*
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual ...


References

{{Reflist Web crawlers Free software programmed in Java (programming language) Software using the Apache license