Apache Spark is an

open-source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use and view the source code, design documents, or content of the product. The open source model is a decentrali ...

unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit

data parallelism Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like ...

and

fault tolerance Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems. Fault t ...

. Originally developed at the

University of California, Berkeley The University of California, Berkeley (UC Berkeley, Berkeley, Cal, or California), is a Public university, public Land-grant university, land-grant research university in Berkeley, California, United States. Founded in 1868 and named after t ...

's AMPLab starting in 2009, in 2013, the Spark

codebase In software development, a codebase (or code base) is a collection of source code used to build a particular software system, application, or software component. Typically, a codebase includes only human-written source code system files; thu ...

was donated to the

Apache Software Foundation The Apache Software Foundation ( ; ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open-source software projects. The ASF was formed from a group of developers of the ...

, which has maintained it since.

Overview

Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only

multiset In mathematics, a multiset (or bag, or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements. The number of instances given for each element is called the ''multiplicity'' of ...

of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. The Dataframe API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary

application programming interface An application programming interface (API) is a connection between computers or between computer programs. It is a type of software Interface (computing), interface, offering a service to other pieces of software. A document or standard that des ...

(API), but as of Spark 2.x use of the Dataset API is encouraged even though the RDD API is not deprecated. The RDD technology still underlies the Dataset API. Spark and its RDDs were developed in 2012 in response to limitations in the

MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel and distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filte ...

cluster computing

paradigm In science and philosophy, a paradigm ( ) is a distinct set of concepts or thought patterns, including theories, research methods, postulates, and standards for what constitute legitimate contributions to a field. The word ''paradigm'' is Ancient ...

, which forces a particular linear

dataflow In computing, dataflow is a broad concept, which has various meanings depending on the application and context. In the context of software architecture, data flow relates to stream processing or reactive programming. Software architecture Dat ...

structure on distributed programs: MapReduce programs read input data from disk,

map A map is a symbolic depiction of interrelationships, commonly spatial, between things within a space. A map may be annotated with text and graphics. Like any graphic, a map may be fixed to paper or other durable media, or may be displayed on ...

a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a

working set Working set is a concept in computer science which defines the amount of memory that a process (computing), process requires in a given time interval. Definition Peter_J._Denning, Peter Denning (1968) defines "the working set of information W(t ...

for distributed programs that offers a (deliberately) restricted form of distributed shared memory. Inside Apache Spark the workflow is managed as a

directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one ...

(DAG). Nodes represent RDDs while edges represent the operations on the RDDs. Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated

database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...

-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to

Apache Hadoop Apache Hadoop () is a collection of open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop wa ...

MapReduce implementation. Among the class of iterative algorithms are the training algorithms for

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

systems, which formed the initial impetus for developing Apache Spark. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark, Hadoop YARN, Apache Mesos or

Kubernetes Kubernetes (), also known as K8s is an open-source software, open-source OS-level virtualization, container orchestration (computing), orchestration system for automating software deployment, scaling, and management. Originally designed by Googl ...

. A standalone native Spark cluster can be launched manually or by the launch scripts provided by the install package. It is also possible to run the daemons on a single machine for testing. For distributed storage Spark can interface with a wide variety of distributed systems, including Alluxio, Hadoop Distributed File System (HDFS), MapR File System (MapR-FS),

Cassandra Cassandra or Kassandra (; , , sometimes referred to as Alexandra; ) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecy, prophecies but never to be believed. In modern usage her name is e ...

, OpenStack Swift, Amazon S3,

Kudu The kudus are two species of antelope of the genus '' Tragelaphus'': * Lesser kudu, ''Tragelaphus imberbis'', of eastern Africa * Greater kudu, ''Tragelaphus strepsiceros'', of eastern and southern Africa The two species look similar, th ...

, Lustre file system, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.

Spark Core

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for

Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...

, Python, Scala,

.NET The .NET platform (pronounced as "''dot net"'') is a free and open-source, managed code, managed computer software framework for Microsoft Windows, Windows, Linux, and macOS operating systems. The project is mainly developed by Microsoft emplo ...

and R) centered on the RDD

abstraction Abstraction is a process where general rules and concepts are derived from the use and classifying of specific examples, literal (reality, real or Abstract and concrete, concrete) signifiers, first principles, or other methods. "An abstraction" ...

(the Java API is available for other JVM languages, but is also usable for some other non-JVM languages that can connect to the JVM, such as Julia). This interface mirrors a functional/ higher-order model of programming: a "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, .NET, Java, or Scala objects. Besides the RDD-oriented functional style of programming, Spark provides two restricted forms of shared variables: ''broadcast variables'' reference read-only data that needs to be available on all nodes, while ''accumulators'' can be used to program reductions in an imperative style. A typical example of RDD-centric functional programming is the following Scala program that computes the frequencies of all words occurring in a set of text files and prints the most common ones. Each , (a variant of ) and takes an anonymous function that performs a simple operation on a single data item (or a pair of items), and applies its argument to transform an RDD into a new RDD. val conf = new SparkConf().setAppName("wiki_test") // create a spark config object val sc = new SparkContext(conf) // Create a spark context val data = sc.textFile("/path/to/somedir") // Read files from "somedir" into an RDD of (filename, content) pairs. val tokens = data.flatMap(_.split(" ")) // Split each file into a list of tokens (words). val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _) // Add a count of one to each token, then sum the counts per word type. wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // Get the top 10 words. Swap word and count to sort by count.

Spark SQL

Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a

domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...

(DSL) to manipulate DataFrames in Scala,

, Python or

. It also provides SQL language support, with

command-line interface A command-line interface (CLI) is a means of interacting with software via command (computing), commands each formatted as a line of text. Command-line interfaces emerged in the mid-1960s, on computer terminals, as an interactive and more user ...

s and

ODBC In computing, Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of database systems and operating systems. An ...

JDBC Java Database Connectivity (JDBC) is an application programming interface (API) for the Java (programming language), Java programming language which defines how a client may access a database. It is a Java-based data access technology used for Java ...

server. Although DataFrames lack the compile-time type-checking afforded by RDDs, as of Spark 2.0, the strongly typed DataSet is fully supported by Spark SQL as well. import org.apache.spark.sql.SparkSession val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword" // URL for your database server. val spark = SparkSession.builder().getOrCreate() // Create a Spark session object val df = spark .read .format("jdbc") .option("url", url) .option("dbtable", "people") .load() df.printSchema() // Looks at the schema of this DataFrame. val countsByAge = df.groupBy("age").count() // Counts people by age Or alternatively via SQL: df.createOrReplaceTempView("people") val countsByAge = spark.sql("SELECT age, count(*) FROM people GROUP BY age")

Spark Streaming

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture. However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines that process event by event rather than in mini-batches include

Storm A storm is any disturbed state of the natural environment or the atmosphere of an astronomical body. It may be marked by significant disruptions to normal conditions such as strong wind, tornadoes, hail, thunder and lightning (a thunderstor ...

and the streaming component of Flink. Spark Streaming has support built-in to consume from Kafka, Flume,

Twitter Twitter, officially known as X since 2023, is an American microblogging and social networking service. It is one of the world's largest social media platforms and one of the most-visited websites. Users can share short text messages, image ...

, ZeroMQ, Kinesis, and TCP/IP sockets. In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, that has a higher-level interface is also provided to support streaming. Spark can be deployed in a traditional on-premises

data center A data center is a building, a dedicated space within a building, or a group of buildings used to house computer systems and associated components, such as telecommunications and storage systems. Since IT operations are crucial for busines ...

as well as in the

cloud In meteorology, a cloud is an aerosol consisting of a visible mass of miniature liquid droplets, frozen crystals, or other particles, suspended in the atmosphere of a planetary body or similar space. Water or various other chemicals may ...

MLlib machine learning library

Spark MLlib is a distributed machine-learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and

scales Scale or scales may refer to: Mathematics * Scale (descriptive set theory), an object defined on a set of points * Scale (ratio), the ratio of a linear dimension of a model to the corresponding dimension of the original * Scale factor, a number ...

better than Vowpal Wabbit. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib which simplifies large scale machine learning pipelines, including: *

summary statistics In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible. Statisticians commonly try to describe the observations in * a measure of ...

, correlations, stratified sampling,

hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. T ...

, random data generation *

classification Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

and regression: support vector machines,

logistic regression In statistics, a logistic model (or logit model) is a statistical model that models the logit, log-odds of an event as a linear function (calculus), linear combination of one or more independent variables. In regression analysis, logistic regres ...

linear regression In statistics, linear regression is a statistical model, model that estimates the relationship between a Scalar (mathematics), scalar response (dependent variable) and one or more explanatory variables (regressor or independent variable). A mode ...

, naive Bayes classification,

Decision Tree A decision tree is a decision support system, decision support recursive partitioning structure that uses a Tree (graph theory), tree-like Causal model, model of decisions and their possible consequences, including probability, chance event ou ...

Random Forest Random forests or random decision forests is an ensemble learning method for statistical classification, classification, regression analysis, regression and other tasks that works by creating a multitude of decision tree learning, decision trees ...

, Gradient-Boosted Tree * collaborative filtering techniques including alternating least squares (ALS) * cluster analysis methods including k-means, and

latent Dirichlet allocation In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network (and, therefore, a generative statistical model) for modeling automatically extracted topics in textual corpora. The LDA is an example of a Bayesian topic ...

(LDA) * dimensionality reduction techniques such as

singular value decomposition In linear algebra, the singular value decomposition (SVD) is a Matrix decomposition, factorization of a real number, real or complex number, complex matrix (mathematics), matrix into a rotation, followed by a rescaling followed by another rota ...

(SVD), and

principal component analysis Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data is linearly transformed onto a new coordinate system such that th ...

(PCA) *

feature extraction Feature may refer to: Computing * Feature recognition, could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (machine learning), in statistics: individual measurable properties of the phenome ...

and transformation functions *

optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...

algorithms such as

stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an Iterative method, iterative method for optimizing an objective function with suitable smoothness properties (e.g. Differentiable function, differentiable or Subderivative, subdifferentiable ...

, limited-memory BFGS (L-BFGS)

GraphX

GraphX is a distributed graph-processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a

graph database A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph (or edge or relationship). The graph relates the dat ...

. GraphX provides two separate APIs for implementation of massively parallel algorithms (such as

PageRank PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. Accordin ...

): a Pregel abstraction, and a more general MapReduce-style API. Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support for property graphs (graphs where properties can be attached to edges and vertices). Like Apache Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.

Language support

Apache Spark has built-in support for Scala, Java, SQL, R, and Python with 3rd party support for the .NET CLR, Julia, and more.

History

Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a

BSD license BSD licenses are a family of permissive free software licenses, imposing minimal restrictions on the use and distribution of covered software. This is in contrast to copyleft licenses, which have share-alike requirements. The original BSD lic ...

. In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project. In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large scale sorting using Spark. Spark had in excess of 1000 contributors in 2015, making it one of the most active projects in the Apache Software Foundation and one of the most active open source

big data Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data processing, data-processing application software, software. Data with many entries (rows) offer greater statistical power, while data with ...

projects.

Scala version

Spark 3.5.2 is based on Scala 2.13 (and thus works with Scala 2.12 and 2.13 out-of-the-box), but it can also be made to work with Scala 3.

Developers

Apache Spark is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC).

Maintenance releases and EOL

Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes. The last minor release within a major a release will typically be maintained for longer as an “LTS” release. For example, 2.4.0 was released on November 2, 2018, and had been maintained for 31 months until 2.4.8 was released in May 2021. 2.4.8 is the last release and no more 2.4.x releases should be expected even for bug fixes.

Notes

References

External links

* {{DEFAULTSORT:Spark Spark Big data products Cluster computing Data mining and machine learning software Free software programmed in Scala Hadoop Java platform Software using the Apache license University of California, Berkeley Articles with example Scala code Open source projects