Apache Drill is an
open-source software framework that supports data-intensive
distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from
MapR
MapR was a business software company headquartered in Santa Clara, California. MapR software provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distr ...
, Drill is inspired by Google's
Dremel
Dremel ( ) is a multinational brand of power tools, focusing on home improvement and hobby applications. Dremel is known primarily for its rotary tools such as the Dremel 3000, 4000 and 8200 series which are similar to the pneumatic die grinder ...
system, also productized as
BigQuery
BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a ''Platform as a Service'' ( PaaS) that supports querying using ANSI SQL. It also has built-in machine learning capabilities. B ...
. Drill is an Apache top-level project.
Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.
Drill supports a variety of
NoSQL databases and file systems, including
Alluxio
Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis,
advised by Professor Scott Shenker & ...
,
HBase
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File Sys ...
,
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
,
MapR
MapR was a business software company headquartered in Santa Clara, California. MapR software provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distr ...
-DB,
HDFS
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage ...
,
MapR-FS,
Amazon S3
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its ...
,
Azure Blob Storage,
Google Cloud Storage
Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing ...
,
Swift
Swift or SWIFT most commonly refers to:
* SWIFT, an international organization facilitating transactions between banks
** SWIFT code
* Swift (programming language)
* Swift (bird), a family of birds
It may also refer to:
Organizations
* SWIFT, ...
,
NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
with a directory of event logs in
Hadoop
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage an ...
.
Drill's datastore-aware optimizer automatically restructures a query plan to leverage the datastore's internal processing capabilities. In addition, Drill supports
data locality
In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference localit ...
, if Drill and the datastore are on the same nodes.
Features
One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
* Schema-free JSON document model similar to
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
and
Elasticsearch
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual ...
, without requiring a formal schema to be declared
* Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
* Extremely user and developer friendly
* Pluggable architecture enables connectivity to multiple datastores
* Apache Drill 1.9 added dynamic
user defined functions.
* Apache Drill 1.11 added cryptographic-related functions and PCAP file format support.
Back-end Support
Drill is primarily focused on non-relational datastores, including
Apache Hadoop text files,
NoSQL, and cloud storage. A notable feature also includes in situ querying of local JSON and Apache Parquet files. Some additional datastores that it supports include:
* All Hadoop distributions (HDFS API 2.3+), including Apache Hadoop, MapR, CDH and Amazon EMR
* NoSQL:
MongoDB
MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
,
Apache HBase
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File Sys ...
,
Apache Cassandra
Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassand ...
* Online Analytical Processing:
Apache Kudu
Apache Kudu is a free and open source column-oriented data store of the Apache Hadoop ecosystem. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides completeness to Hadoop's storage layer to enable ...
,
Apache Druid, OpenTSDB
* Cloud storage:
Amazon S3
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its ...
,
Google Cloud Storage
Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing ...
, Azure Blob Storage, Swift,
IBM Cloud Object Storage
IBM Cloud Object Storage is a service offered by IBM for storing and accessing unstructured data. The object storage service can be deployed on-premise, as part of IBM Cloud Platform offerings, or in hybrid form. The offering can store any typ ...
* Diverse data formats, including
Apache Avro
Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Ap ...
,
Apache Parquet
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing ...
and
JSON
* RDBMs storage plugins (Using
JDBC
Java Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database. It is a Java-based data access technology used for Java database connectivity. I ...
to connect to
MySQL
MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database ...
,
PostgreSQL, and others)
A new datastore can be added by developing a storage plugin. Drill's "schema-free" JSON data model enables it to query non-relational datastores in-situ .
Front-end Support
Drill itself can be queried via
JDBC
Java Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database. It is a Java-based data access technology used for Java database connectivity. I ...
,
ODBC, or
REST
Rest or REST may refer to:
Relief from activity
* Sleep
** Bed rest
* Kneeling
* Lying (position)
* Sitting
* Squatting position
Structural support
* Structural support
** Rest (cue sports)
** Armrest
** Headrest
** Footrest
Arts and enter ...
through a variety of methods and languages including Python and Java. The default install includes a web interface allowing end-users to execute ANSI SQL directly and export data tables as
CSV files without any programming.
The dashboard library,
Apache Superset,
is particularly well suited for visualization of data queried with Drill.
See also
*
Cloud computing
Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mu ...
*
Big data
*
Data Intensive Computing
References
Papers
Some papers influenced the birth and design. Here is a partial list:
* 200
From Databases to Dataspaces: A New Abstraction for Information Management the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system's understanding of the data.
* 201
External links
*
*
ttp://www.zdnet.com/article/sql-and-hadoop-its-complicated/ SQL and Hadoop: It's complicated
{{Apache Software Foundation
Drill
SQL
Free system software
Cloud computing
Cloud infrastructure
Free software for cloud computing