Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
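As a rough illustration of what "massively parallel processing" means for a SQL aggregate, the toy sketch below splits rows into partitions, aggregates each partition independently, and merges the partial results. It is not Impala's execution engine; the data, the function names `partial_sum` and `parallel_group_sum`, and the use of threads in place of cluster nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows held by one node.
PARTITIONS = [
    [("us", 3), ("de", 1), ("us", 2)],
    [("de", 4), ("fr", 5)],
    [("us", 1), ("fr", 2), ("fr", 1)],
]


def partial_sum(rows):
    """Aggregate one partition locally, as a single worker would."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals


def parallel_group_sum(partitions):
    """Scatter the partitions to workers, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds counts instead of replacing
    return dict(merged)


if __name__ == "__main__":
    # Similar in spirit to: SELECT country, SUM(n) FROM t GROUP BY country
    print(parallel_group_sum(PARTITIONS))
```

The point of the two-phase shape is that only the small per-partition totals need to be brought together, not the underlying rows.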
Description
Apache Impala is a query engine that runs on Apache Hadoop.
The project was announced in October 2012 with a public beta test distribution and became generally available in May 2013.
Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems or proprietary formats simply to perform analysis.
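The kind of interactive SQL query an analyst might issue can be sketched with Python's DB-API 2.0 interface. This is a minimal sketch, not a documented Impala workflow: the table name `web_logs`, the column `country`, and the helper `top_countries` are hypothetical, and the standard-library `sqlite3` module is used only as a local stand-in so the example runs without a cluster. Against Impala, the cursor would instead come from a DB-API client such as the impyla package, which is an assumption here rather than something the article specifies.

```python
import sqlite3


def top_countries(cursor, table, limit=5):
    """Run an aggregate query through any DB-API 2.0 cursor and return rows."""
    # Table names cannot be bound as parameters, so validate before formatting.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    sql = (
        f"SELECT country, COUNT(*) AS visits FROM {table} "
        f"GROUP BY country ORDER BY visits DESC LIMIT {int(limit)}"
    )
    cursor.execute(sql)
    return cursor.fetchall()


def demo():
    """Exercise the helper against an in-memory SQLite stand-in."""
    # Assumption: with impyla installed, a cursor could instead be obtained
    # via impala.dbapi.connect(host=..., port=21050).cursor().
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE web_logs (country TEXT)")  # hypothetical table
    cur.executemany(
        "INSERT INTO web_logs VALUES (?)",
        [("us",), ("de",), ("us",), ("fr",), ("us",), ("de",)],
    )
    rows = top_countries(cur, "web_logs", limit=2)
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

Because the helper only depends on the cursor interface, the same function could be pointed at a different backend without changing the query logic.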
Features include:
*Supports HDFS, S3, Microsoft Azure Blob Storage, Apache HBase and Apache Kudu storage
*Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, Parquet and ORC
*Supports Hadoop security (Kerberos authentication, LDAP)
*Fine-grained, role-based authorization with Apache Ranger
*Uses metadata, ODBC driver, and SQL syntax from Apache Hive
In early 2013, a column-oriented file format called Parquet was announced for architectures including Impala.
In December 2013, Amazon Web Services announced support for Impala.
In early 2014, MapR added support for Impala.
In 2015, a columnar storage engine called Kudu was announced, which Cloudera proposed to donate to the Apache Software Foundation along with Impala.
Impala graduated to an Apache Top-Level Project (TLP) on 28 November 2017.
See also
*Apache Drill, a similar open source project inspired by Dremel
*Dremel, a similar tool from Google
*Trino, an open source SQL query engine created by the creators of Presto
*Presto, an open source SQL query engine created by Facebook and supported by Teradata
External links
*Apache Impala project website
*Impala GitHub project source code