Cascading (software)

Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. It is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc. Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc., which has since been rebranded as Driven. Cascading is actively developed by the community, and a number of add-on modules are available.


Architecture

To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, an integration API, a process planner and a process scheduler. Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map and reduce tasks (see the blog post by Etsy describing their use of Cascading with Hadoop).

Developers use Cascading to create a .jar file that describes the required processes. It follows a ‘source-pipe-sink’ paradigm: data is captured from sources and passes through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’. Pipes are created independently of the data they will process. Once pipes are tied to data sources and sinks, the result is called a ‘flow’. Flows can be grouped into a ‘cascade’, and the process scheduler ensures that a given flow does not execute until all of its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.

Developers write the code in a JVM-based language and do not need to learn MapReduce. The resulting program can be regression tested and integrated with external applications like any other Java application. Cascading is most often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, and extract, transform and load (ETL) applications.
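
The source-pipe-sink paradigm and the cascade scheduling rule described above can be illustrated with a minimal sketch in plain Java. This is not the Cascading API: the names `Pipe`, `Flow`, `schedule`, `runCascade` and the sample log records are all hypothetical, and the in-memory map merely stands in for the files a real cluster would read and write.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: a toy model of the paradigm, NOT the Cascading library.
class SourcePipeSinkSketch {

    /** A pipe is a reusable transformation, defined without reference to any data. */
    interface Pipe {
        List<String> apply(List<String> records);
    }

    /** A flow is a pipe tied to a named source and a named sink. */
    static final class Flow {
        final String name, source, sink;
        final Pipe pipe;
        Flow(String name, String source, Pipe pipe, String sink) {
            this.name = name; this.source = source; this.pipe = pipe; this.sink = sink;
        }
    }

    // Two reusable pipes (hypothetical record format: "<service> <LEVEL> <message>").
    static final Pipe ERRORS_ONLY = records -> records.stream()
            .filter(r -> r.contains(" ERROR "))
            .collect(Collectors.toList());

    static final Pipe COUNT_BY_SERVICE = records -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String r : records) counts.merge(r.split(" ")[0], 1, Integer::sum);
        List<String> out = new ArrayList<>();
        counts.forEach((service, n) -> out.add(service + "\t" + n));
        return out;
    };

    /** Orders flows so that none runs before the flow that produces its source. */
    static List<Flow> schedule(List<Flow> flows) {
        List<Flow> pending = new ArrayList<>(flows);
        List<Flow> ordered = new ArrayList<>();
        while (!pending.isEmpty()) {
            Set<String> stillToBeProduced = new HashSet<>();
            for (Flow f : pending) stillToBeProduced.add(f.sink);
            Flow ready = null;
            for (Flow f : pending) {
                if (!stillToBeProduced.contains(f.source)) { ready = f; break; }
            }
            if (ready == null) throw new IllegalStateException("cyclic dependency between flows");
            pending.remove(ready);
            ordered.add(ready);
        }
        return ordered;
    }

    /** Runs the flows as a cascade against an in-memory stand-in for storage. */
    static Map<String, List<String>> runCascade(List<Flow> flows, Map<String, List<String>> sources) {
        Map<String, List<String>> store = new HashMap<>(sources);
        for (Flow f : schedule(flows)) {
            store.put(f.sink, f.pipe.apply(store.get(f.source)));
        }
        return store;
    }

    /** Flows are deliberately listed out of order to exercise the scheduler. */
    static List<Flow> demoFlows() {
        return Arrays.asList(
                new Flow("count-errors", "errors", COUNT_BY_SERVICE, "counts"),
                new Flow("filter-errors", "logs", ERRORS_ONLY, "errors"));
    }

    static Map<String, List<String>> demo() {
        Map<String, List<String>> sources = new HashMap<>();
        sources.put("logs", Arrays.asList(
                "web ERROR timeout", "db INFO started",
                "web ERROR refused", "db ERROR disk full"));
        return runCascade(demoFlows(), sources);
    }

    public static void main(String[] args) {
        demo().get("counts").forEach(System.out::println);
    }
}
```

In this sketch the same pipe could be bound to a different source and sink simply by constructing another `Flow`, which is the reuse property the paragraph above describes. In Cascading itself the planner translates such flows into MapReduce jobs, a step this toy model omits entirely.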


Uses of Cascading

Cascading was cited by SD Times in 2011 as one of the top five most powerful Hadoop projects, has been described as a major open source project relevant to bioinformatics, and is covered in Hadoop: The Definitive Guide by Tom White. The project has also been cited in presentations, conference proceedings and Hadoop user group meetings as a useful tool for working with Hadoop and with Apache Spark.

* MultiTool on Amazon Web Services was developed using Cascading.
* LogAnalyzer for Amazon CloudFront was developed using Cascading.
* BackType - social analytics platform
* Etsy - marketplace
* FlightCaster - predicting flight delays
* Ion Flux - analyzing DNA sequence data
* RapLeaf - personalization and recommendation systems
* Razorfish - digital advertising


Domain-Specific Languages Built on Cascading

* PyCascading - by Twitter, available on GitHub
* Cascading.jruby - developed by Gregoire Marabout, available on GitHub
* Cascalog - authored by Nathan Marz, available on GitHub
* Scalding - a Scala API for Cascading that makes it easier to transition Cascading/Scalding code to Spark; by Twitter, available on GitHub


External links


Official website