Resilient Distributed Dataset

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.


Overview

Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged even though the RDD API is not deprecated. The RDD technology still underlies the Dataset API.

Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster-computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.

Inside Apache Spark the workflow is managed as a directed acyclic graph (DAG). Nodes represent RDDs while edges represent the operations on the RDDs.

Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to the Apache Hadoop MapReduce implementation. Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.

Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode (a native Spark cluster, where a cluster can be launched either manually or with the launch scripts provided by the install package; it is also possible to run these daemons on a single machine for testing), Hadoop YARN, Apache Mesos or Kubernetes. For distributed storage, Spark can interface with a wide variety of systems, including Alluxio, Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu and the Lustre file system, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.
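A minimal sketch of the pseudo-distributed local mode described above (the application name and the toy dataset are illustrative): the master URL "local[*]" runs Spark on the current machine with one worker thread per CPU core, so no cluster manager or distributed storage is needed.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-mode-example") // illustrative application name
  .setMaster("local[*]")            // local mode: one worker thread per CPU core
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000) // small in-memory dataset, enough for testing
println(rdd.sum())                  // computed on local threads; no distributed storage required
sc.stop()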


Spark Core

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, .NET and R) centered on the RDD abstraction (the Java API is available for other JVM languages, but is also usable for some other non-JVM languages that can connect to the JVM, such as Julia). This interface mirrors a functional/higher-order model of programming: a "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, .NET, Java, or Scala objects.

Besides the RDD-oriented functional style of programming, Spark provides two restricted forms of shared variables: ''broadcast variables'' reference read-only data that needs to be available on all nodes, while ''accumulators'' can be used to program reductions in an imperative style.

A typical example of RDD-centric functional programming is the following Scala program that computes the frequencies of all words occurring in a set of text files and prints the most common ones. Each map, flatMap (a variant of map) and reduceByKey takes an anonymous function that performs a simple operation on a single data item (or a pair of items), and applies its argument to transform an RDD into a new RDD.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wiki_test")    // create a Spark config object
val sc = new SparkContext(conf)                       // create a Spark context
val data = sc.textFile("/path/to/somedir")            // read files from "somedir" into an RDD of lines
val tokens = data.flatMap(_.split(" "))               // split each line into a list of tokens (words)
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _)  // add a count of one to each token, then sum the counts per word
wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // swap word and count, then take the top 10 words by count
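As an illustrative sketch of the two shared-variable forms described above (the lookup table and variable names are hypothetical), a broadcast variable can ship a read-only map to every node once, while an accumulator collects a count imperatively as a side effect of a transformation:

// assumes an existing SparkContext `sc`, as in the example above
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France")) // read-only data made available on all nodes
val unknownCodes = sc.longAccumulator("unknown country codes")            // imperative-style reduction

val codes = sc.parallelize(Seq("DE", "FR", "XX", "DE"))
val resolved = codes.map { code =>
  if (countryNames.value.contains(code)) countryNames.value(code)
  else { unknownCodes.add(1); "unknown" }
}
resolved.collect().foreach(println)                 // the action triggers the lazy computation
println(s"Unresolved codes: ${unknownCodes.value}") // read the accumulator on the driver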


Spark SQL

Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, Python or .NET. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by RDDs, as of Spark 2.0, the strongly typed Dataset is fully supported by Spark SQL as well.

import org.apache.spark.sql.SparkSession

val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword" // URL for your database server
val spark = SparkSession.builder().getOrCreate() // create a Spark session object

val df = spark
  .read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "people")
  .load()

df.printSchema() // looks at the schema of this DataFrame
val countsByAge = df.groupBy("age").count() // counts people by age

// or alternatively via SQL:
// df.createOrReplaceTempView("people")
// val countsByAge = spark.sql("SELECT age, count(*) FROM people GROUP BY age")
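A brief sketch of the strongly typed Dataset mentioned above (the Person case class and its field types are hypothetical and must match the underlying table's schema): converting a DataFrame to a Dataset of a case class regains compile-time type checking for subsequent operations.

case class Person(name: String, age: Long) // hypothetical schema for the "people" table

import spark.implicits._                // brings encoders for case classes into scope
val people = df.as[Person]              // typed view over the same data as the DataFrame above
val adults = people.filter(_.age >= 18) // the lambda is type-checked at compile time
adults.show()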


Spark Streaming

Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture. However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines that process event by event rather than in mini-batches include Apache Storm and the streaming component of Apache Flink. Spark Streaming has built-in support to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.

In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, that has a higher-level interface is also provided to support streaming.

Spark can be deployed in a traditional on-premises data center as well as in the cloud.
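A minimal Structured Streaming sketch of the higher-level, Dataset-based interface mentioned above (the socket host and port are illustrative and assume a test data source such as netcat): a streaming word count over lines read from a TCP socket, printed to the console on every trigger.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")            // built-in socket source, intended for testing
  .option("host", "localhost") // illustrative host and port
  .option("port", 9999)
  .load()

val wordCounts = lines.as[String]
  .flatMap(_.split(" "))       // one row per word
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")      // emit the full updated counts table on each trigger
  .format("console")
  .start()

query.awaitTermination()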


MLlib Machine Learning Library

Spark MLlib is a distributed machine-learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines (a brief pipeline sketch is given after the list), including:

* summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
* classification and regression: support vector machines, logistic regression, linear regression, naive Bayes classification, decision trees, random forests, gradient-boosted trees
* collaborative filtering techniques including alternating least squares (ALS)
* cluster analysis methods including k-means and latent Dirichlet allocation (LDA)
* dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)
* feature extraction and transformation functions
* optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
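An illustrative sketch of how such algorithms are combined in MLlib's DataFrame-based Pipeline API (the column names and the tiny inline dataset are hypothetical): a feature assembler is chained with a logistic regression estimator and fit on a toy training set.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()
import spark.implicits._

// hypothetical toy data: two numeric features and a binary label
val training = Seq(
  (0.0, 1.1, 0.1),
  (1.0, 2.0, 1.0),
  (0.0, 1.3, -0.5),
  (1.0, 2.2, 1.5)
).toDF("label", "f1", "f2")

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")       // MLlib estimators expect a single vector column

val lr = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.transform(training).select("label", "prediction").show()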


GraphX

GraphX is a distributed graph-processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database. GraphX provides two separate APIs for implementation of massively parallel algorithms (such as PageRank): a Pregel abstraction, and a more general MapReduce-style API. Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support for property graphs (graphs where properties can be attached to edges and vertices). Like Apache Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.
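A brief sketch of running PageRank on a small property graph with GraphX (the vertex names and edges are hypothetical; an existing SparkContext `sc` is assumed):

import org.apache.spark.graphx.{Edge, Graph}

// hypothetical tiny property graph: string properties on vertices and edges
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

val ranks = graph.pageRank(0.0001).vertices // run PageRank until it converges within the given tolerance
ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
  println(s"$name: $rank")
}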


Language support

Apache Spark has built-in support for Scala, Java, R, and Python, with third-party support for the .NET CLR, Julia, and more.


History

Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license. In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project. In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large scale sorting using Spark. Spark had in excess of 1000 contributors in 2015, making it one of the most active projects in the Apache Software Foundation and one of the most active open source big data projects.


Scala Version

Spark 3.3.0 is based on Scala 2.13 (and thus works with Scala 2.12 and 2.13 out-of-the-box), but it can also be made to work with Scala 3.
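As a sketch of what this means for a Scala build (the version numbers below are illustrative and should match the Spark distribution in use), an sbt project selects the Spark artifacts cross-built for its Scala binary version:

// build.sbt sketch; version numbers are illustrative
ThisBuild / scalaVersion := "2.13.8"

libraryDependencies ++= Seq(
  // "%%" appends the Scala binary version (here _2.13) to the artifact name,
  // so the build resolves the Spark artifacts cross-built for Scala 2.13
  "org.apache.spark" %% "spark-core" % "3.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided"
)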


Developers

Apache Spark is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC).


See also

* List of concurrent and parallel programming APIs/Frameworks

