Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.
While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
Features
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 and Alluxio. It provides an SQL-like query language called HiveQL with schema on read and transparently converts queries to MapReduce, Apache Tez, and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0.
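The execution engine used for a session can be chosen through a configuration property; a minimal sketch, assuming an installation where the Tez and Spark engines are available:
SET hive.execution.engine=tez;  -- accepted values include mr (MapReduce), tez, and spark
-- subsequent HiveQL statements in this session are compiled into Tez jobs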
Other features of Hive include:
* Different storage types such as plain text, RCFile, HBase, ORC, and others.
* Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution.
* Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
* Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions (see the sketch after this list).
* SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
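As referenced in the list above, a hedged sketch of calling built-in UDFs and of registering a user-supplied one; the jar path, class name, and table are hypothetical:
-- built-in UDFs applied to string and date columns of an illustrative table
SELECT upper(name), to_date(created_at) FROM customers LIMIT 10;
-- extending the UDF set with a custom Java class packaged in a jar
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_text AS 'com.example.hive.NormalizeTextUDF';
SELECT normalize_text(name) FROM customers LIMIT 10;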
By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
The first four file formats supported in Hive were plain text, sequence file, optimized row columnar (ORC) format and RCFile. Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.
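The storage format is selected per table with a STORED AS clause; a minimal sketch with illustrative table names:
CREATE TABLE logs_text (line STRING);                       -- plain text is the default format
CREATE TABLE logs_orc (line STRING) STORED AS ORC;          -- optimized row columnar
CREATE TABLE logs_parquet (line STRING) STORED AS PARQUET;  -- native Parquet support since 0.13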
Architecture
Major components of the Hive architecture are:
* Metastore: Stores metadata for each of the tables, such as their schema and location. It also includes the partition metadata, which helps the driver track the progress of the various data sets distributed over the cluster. The data is stored in a traditional RDBMS format. Because this metadata is crucial for the driver to keep track of the data, a backup server regularly replicates it so that it can be retrieved in case of data loss.
* Driver: Acts as a controller that receives the HiveQL statements. It starts the execution of the statement by creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of a HiveQL statement. The driver also acts as a collection point of data or query results obtained after the Reduce operation.
* Compiler: Performs compilation of the HiveQL query, converting the query into an execution plan. This plan contains the tasks and steps that Hadoop MapReduce needs to perform to produce the output of the query. The compiler converts the query to an abstract syntax tree (AST). After checking for compatibility and compile-time errors, it converts the AST to a directed acyclic graph (DAG). The DAG divides operators into MapReduce stages and tasks based on the input query and data (the resulting plan can be inspected with the EXPLAIN sketch shown after this list).
* Optimizer: Performs various transformations on the execution plan to get an optimized DAG. Transformations can be aggregated together, such as converting a pipeline of joins into a single join, for better performance. It can also split tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability. However, the transformation logic used for optimization can be modified or pipelined using another optimizer. An optimizer called YSmart is part of Apache Hive. It is a correlation-aware optimizer that merges correlated MapReduce jobs into a single MapReduce job, significantly reducing the execution time.
* Executor: After compilation and optimization, the executor executes the tasks. It interacts with the job tracker of Hadoop to schedule tasks to be run. It takes care of pipelining the tasks by making sure that a task with dependencies is executed only after all of its prerequisites have run.
* CLI, UI, and Thrift Server: A command-line interface (CLI) provides a user interface for an external user to interact with Hive by submitting queries and instructions and monitoring the process status. The Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
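The plan produced by the compiler and optimizer can be inspected from the CLI with the EXPLAIN statement; a minimal sketch, using the docs table from the word-count example below:
-- prints the stages, operator tree, and dependencies Hive would execute, without running the query
EXPLAIN
SELECT line, count(1) FROM docs GROUP BY line;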
HiveQL
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including ''multitable inserts'' and ''create table as select''.
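A multitable insert writes to several tables from a single scan of the source; a hedged sketch in which the source and target tables are illustrative (the word-count example below already demonstrates ''create table as select''):
-- one pass over docs populates two result tables
FROM docs
INSERT OVERWRITE TABLE short_lines SELECT line WHERE length(line) < 80
INSERT OVERWRITE TABLE long_lines SELECT line WHERE length(line) >= 80;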
HiveQL lacked support for transactions and materialized views, and offered only limited subquery support. Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example
The word count program counts the number of times each word occurs in the input. It can be written in HiveQL as follows:
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
A brief explanation of each of the statements is as follows:
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
Checks if table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
Loads the specified file or directory (in this case "input_file") into the table. OVERWRITE specifies that the target table into which the data is being loaded is to be rewritten; otherwise the data would be appended.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
The query creates a table called word_counts with two columns: word and count. It draws its input from the inner query, which serves to split the input words into different rows of a temporary table aliased as temp. The GROUP BY word clause groups the results based on their keys, so that the count column holds the number of occurrences for each word of the word column. The ORDER BY word clause sorts the words alphabetically.
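Once word_counts exists, it can be queried like any other table; a minimal follow-up sketch:
-- ten most frequent words; backticks quote the column alias defined in the statement above
SELECT word, `count` FROM word_counts ORDER BY `count` DESC LIMIT 10;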
Comparison with traditional databases
The storage and querying operations of Hive closely resemble those of traditional databases. While Hive uses an SQL dialect, there are a lot of differences in the structure and working of Hive in comparison to relational databases. The differences are mainly because Hive is built on top of the Hadoop ecosystem, and has to comply with the restrictions of Hadoop and MapReduce.
In traditional databases, a schema is applied to a table and is typically enforced when the data is loaded into the table. This enables the database to make sure that the data entered follows the representation of the table as specified by the table definition. This design is called ''schema on write''. In comparison, Hive does not verify the data against the table schema on write. Instead, it performs run-time checks when the data is read. This model is called ''schema on read''.
The two approaches have their own advantages and drawbacks. Checking data against the table schema during load time adds extra overhead, which is why traditional databases take longer to load data. Quality checks are performed against the data at load time to ensure that the data is not corrupt; early detection of corrupt data ensures early exception handling. Because the tables are forced to match the schema during or after the data load, they deliver better query-time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load, but with the drawback of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at load time, but is instead generated later dynamically.
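Schema on read is also what allows a Hive table to be declared over files that already reside in HDFS without rewriting them; a hedged sketch with a hypothetical path and layout:
-- nothing is validated or moved when this statement runs; mismatches only surface when the data is read
CREATE EXTERNAL TABLE page_views (user_id BIGINT, url STRING, view_ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';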
Transactions are key operations in traditional databases. Like any typical RDBMS, Hive supports all four properties of transactions (ACID): atomicity, consistency, isolation, and durability. Transactions in Hive were introduced in Hive 0.13 but were limited to the partition level. Hive 0.14 fully added these functions to support complete ACID properties. Hive 0.14 and later provides different row-level transactions such as ''INSERT, DELETE and UPDATE''. Enabling ''INSERT, UPDATE, DELETE'' transactions requires setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode.
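A hedged sketch of the session settings and table layout commonly used with ACID tables; the values and table definition are illustrative and depend on the Hive version and deployment (a transaction manager property is typically also required; see the Hive documentation):
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- transactional tables are typically bucketed ORC tables flagged in TBLPROPERTIES
CREATE TABLE user_events (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
UPDATE user_events SET payload = 'redacted' WHERE id = 42;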
Security
Hive v0.7.0 added integration with Hadoop security. Hadoop began using Kerberos authorization support to provide security. Kerberos allows for mutual authentication between client and server; in this system, the client's request for a ticket is passed along with the request. Previous versions of Hadoop had several issues, such as users being able to spoof their username by setting the hadoop.job.ugi property, and MapReduce operations being run under the same user: hadoop or mapred. With Hive v0.7.0's integration with Hadoop security, these issues have largely been fixed. TaskTracker jobs are run by the user who launched them, and the username can no longer be spoofed by setting the hadoop.job.ugi property. Permissions for newly created files in Hive are dictated by HDFS. The Hadoop distributed file system authorization model uses three entities (user, group, and others) with three permissions (read, write, and execute). The default permissions for newly created files can be set by changing the umask value for the Hive configuration variable hive.files.umask.value.
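A minimal sketch of adjusting that variable for a session; the octal value is illustrative:
SET hive.files.umask.value=0022;  -- files Hive creates will not be group- or world-writable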
See also
* Apache Pig
* Sqoop
* Apache Impala
* Apache Drill
* Apache Flume
* Apache HBase
* Trino (SQL query engine)