Presto (including PrestoDB, and PrestoSQL which was re-branded to
Trino
Trino ( pms, Trin) is a ''comune'' (municipality) in the Province of Vercelli in the Italian region Piedmont, located about northeast of Turin and about southwest of Vercelli, at the foot of the Montferrat hills.
Trino borders the following mun ...
) is a distributed query engine for
big data using the
SQL query language. Its architecture allows users to query data sources such as
Hadoop
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
,
Cassandra
Cassandra or Kassandra (; Ancient Greek: Κασσάνδρα, , also , and sometimes referred to as Alexandra) in Greek mythology was a Trojan priestess dedicated to the god Apollo and fated by him to utter true prophecies but never to be believe ...
,
Kafka
Franz Kafka (3 July 1883 – 3 June 1924) was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. It typ ...
,
AWS S3
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon (company), Amazon.com u ...
,
Alluxio
Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis,
advised by Professor Scott Shenker ...
,
MySQL
MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database ...
,
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
and
Teradata
Teradata Corporation is an American software company that provides cloud database and analytics-related software, products, and services. The company was formed in 1979 in Brentwood, California, as a collaboration between researchers at Caltech a ...
, and allows use of multiple data sources within a query. Presto is community-driven
open-source software
Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose. Ope ...
released under the
Apache License.
History
Presto was originally designed and developed at
Facebook, Inc. (later renamed Meta) for their data analysts to run interactive queries on its large
data warehouse
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integra ...
in
Apache Hadoop. The first four developers were Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang.
Before Presto, the data analysts at Facebook relied on
Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditi ...
for running SQL analytics on their multi-petabyte data warehouse.
Hive was deemed too slow for Facebook's scale and Presto was invented to fill the gap to run fast queries.
Original development started in 2012 and deployed at Facebook later that year. In November 2013, Facebook announced its open source release.
In 2014,
Netflix
Netflix, Inc. is an American subscription video on-demand over-the-top streaming service and production company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it offers a ...
disclosed they used Presto on 10
petabyte
The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit ...
s of data stored in the
Amazon Simple Storage Service
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its e- ...
(S3). In November, 2016, Amazon announced a service called
Athena
Athena or Athene, often given the epithet Pallas, is an ancient Greek goddess associated with wisdom, warfare, and handicraft who was later syncretized with the Roman goddess Minerva. Athena was regarded as the patron and protectress of ...
that was based on Presto. In 2017,
Teradata
Teradata Corporation is an American software company that provides cloud database and analytics-related software, products, and services. The company was formed in 1979 in Brentwood, California, as a collaboration between researchers at Caltech a ...
spun out a company called Starburst Data to commercially support Presto, which included staff acquired from Hadapt in 2014. Teradata's QueryGrid software allowed Presto to access a Teradata relational database.
In January 2019, the Presto Software Foundation was announced. The foundation is a not-for-profit organization for the advancement of the Presto open source distributed SQL query engine. At the same time, Presto development forked: PrestoDB maintained by Facebook, and PrestoSQL maintained by the Presto Software Foundation, with some cross pollination of code.
In September 2019, Facebook donated PrestoDB to the
Linux Foundation
The Linux Foundation (LF) is a non-profit technology consortium founded in 2000 as a merger between Open Source Development Labs and the Free Standards Group to standardize Linux, support its growth, and promote its commercial adoption. Addi ...
, establishing th
Presto Foundation Neither the creators of Presto, nor the top contributors and committers, were invited to join this foundation.
By 2020, all four of the original Presto developers had joined Starburst.
In December 2020, PrestoSQL was rebranded as Trino, since Facebook had obtained a trademark on the name "Presto" (also donated to the Linux Foundation).
Another company called Ahana was announced in 2020, with seed funding from
GV (formerly Google Ventures, an arm of
Alphabet, Inc.
Alphabet Inc. is an American multinational technology conglomerate holding company headquartered in Mountain View, California. It was created through a restructuring of Google on October 2, 2015, and became the parent company of Google and se ...
), to commercialize the PrestoDB fork as a cloud service, while also offering an open-source version.
A $20 million round of funding for Ahana was announced in August 2021.
Architecture

Presto's architecture is very similar to other
database management system
In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases span ...
s using
cluster computing
A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.
The comp ...
, sometimes called
massively parallel
Massively parallel is the term for using a large number of computer processors (or separate computers) to simultaneously perform a set of coordinated computations in parallel. GPUs are massively parallel architecture with tens of thousands of t ...
processing (MPP). One coordinator works in sync with multiple workers. Clients submit SQL statements that are parsed and planned, following which parallel tasks are scheduled to workers. Workers jointly process rows from the data sources and produce results that are returned to the client. Compared to the original
Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditi ...
execution model which used the Hadoop
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a ''map'' procedure, which performs filteri ...
mechanism on each query, Presto does not write intermediate results to disk, resulting in a significant speed improvement. Presto is written in
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
.
A Presto query can combine data from multiple sources. Presto offers connectors to data sources including files in
Alluxio
Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis,
advised by Professor Scott Shenker ...
,
Hadoop Distributed File System
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage an ...
(often called a
data lake
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transform ...
),
Amazon S3
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its ...
,
MySQL
MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database ...
,
PostgreSQL
PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
,
Microsoft SQL Server
Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which ...
,
Amazon Redshift
Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services. It is built on top of technology from the massive parallel processing (MPP) data warehouse company ParAccel (later acquire ...
,
Apache Kudu,
Apache Phoenix
Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store enabling users to c ...
,
Apache Kafka
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency pla ...
,
Apache Cassandra
Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cass ...
,
Apache Accumulo
Apache Accumulo is a highly scalable sorted, distributed key-value store based on Google's Bigtable. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels a ...
,
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
and
Redis
Redis (; Remote Dictionary Server) is an in-memory data structure store, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Redis supports different kinds of abstract data structures, suc ...
. Unlike other Hadoop distribution-specific tools, such as
Apache Impala
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its developmen ...
, Presto can work with any variant of Hadoop or without it. Presto supports separation of compute and storage and may be deployed on-premises or using
cloud computing
Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over m ...
.
See also
*
Apache Drill
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's D ...
*
Big data
*
Data-intensive computing
Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications ...
*
Trino (SQL query engine)
References
External links
*
*
*
* {{GitHub, trinodb/trino, Trino
SQL
Free system software
Hadoop
Cloud platforms
Facebook software
Linux Foundation projects