Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source big data project. He designed and led development of the GraphX, Project Tungsten, and Structured Streaming components and co-designed DataFrames, all of which are part of the core Apache Spark distribution. He also served as the release manager for Spark's 2.0 release.

Biography

UC Berkeley

Xin started his work on the Spark open source project while he was a PhD candidate at the UC Berkeley AMPLab. His first research project, Shark, created a system that could efficiently execute SQL and advanced analytics workloads at scale. Shark won the Best Demo Award at SIGMOD 2012. It was one of the first open-source interactive SQL-on-Hadoop systems and was claimed to be between 10 and 100 times faster than Apache Hive. Shark was used by technology companies such as Yahoo, although it was replaced by a newer system called Spark SQL in 2014.

His second research project, GraphX, created a graph processing system on top of Spark, a general data-parallel system. At the same time, GraphX challenged the notion that specialized systems are necessary for graph computation. GraphX was released as an open source project and merged into Spark in 2014 as its graph processing library.

Databricks

In 2013, along with Matei Zaharia and other key Spark contributors, Xin co-founded Databricks, a venture-backed company based in San Francisco that offers a Spark-based data platform as a service. In 2014, Xin led a team of Databricks engineers that competed in the Sort Benchmark and won the 2014 world record in Daytona GraySort using Spark, beating the previous record, held by Apache Hadoop, by a factor of 30. Xin claimed that Spark was the fastest open source engine for sorting a petabyte of data. While at Databricks, he also started the DataFrames project, Project Tungsten, and Structured Streaming. DataFrames has since become Spark's foundational API, and Tungsten its new execution engine.

