Apache Hive

Apache Hive
Original author(s) Facebook, Inc.
Developer(s) Contributors
Initial release October 1, 2010 [1]
Stable release
3.1.3 / April 8, 2022 [2]
Preview release
4.0.0-beta-1 / August 14, 2023 [2]
Repository github.com/apache/hive
Written in Java
Operating system Cross-platform
Available in SQL
Type Data warehouse
License Apache License 2.0
Website hive.apache.org

Apache Hive is a data warehouse software project, built on top of Apache Hadoop for providing data query and analysis. [3] [4] Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. [5] While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). [6] [7] Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services. [8]


Features

Apache Hive supports the analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL [9] with schema on read and transparently converts queries to MapReduce, Apache Tez [10] and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive originally provided indexes, but this feature was removed in version 3.0. [11] Other features of Hive include:

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used. [12]

The first four file formats supported in Hive were plain text, [13] sequence file, optimized row columnar (ORC) format [14] [15] and RCFile. [16] [17] Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting at 0.13. [18] [19]
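As an illustration, a table's storage format is chosen at creation time with the STORED AS clause. The following sketch (table names and columns are hypothetical) creates one table per format mentioned above:

-- Hypothetical tables, one per storage format discussed above.
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;      -- plain text (the default)
CREATE TABLE logs_seq  (line STRING) STORED AS SEQUENCEFILE;  -- Hadoop sequence file
CREATE TABLE logs_rc   (line STRING) STORED AS RCFILE;        -- RCFile
CREATE TABLE logs_orc  (line STRING) STORED AS ORC;           -- optimized row columnar (ORC)
CREATE TABLE logs_parq (line STRING) STORED AS PARQUET;       -- native Parquet (Hive 0.13 and later)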

Architecture

Major components of the Hive architecture are:

Metastore: Stores metadata for each of the tables, such as their schema and location. It also holds partition metadata, which helps the driver track the progress of data sets distributed over the cluster. The metadata is kept in a traditional RDBMS format.

Driver: Acts as a controller that receives HiveQL statements. It starts the execution of a statement by creating sessions and monitors the life cycle and progress of the execution.

Compiler: Compiles a HiveQL query, converting it into an execution plan expressed as a directed acyclic graph of stages.

Optimizer: Performs transformations on the execution plan, such as combining or splitting tasks, to improve performance.

Executor: Executes the tasks produced by the compiler and optimizer, interacting with Hadoop's resource manager to schedule them.

CLI, UI, and Thrift Server (HiveServer): Interfaces through which external users and applications submit queries and retrieve results.

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not found in SQL, including multi-table inserts and create-table-as-select (CTAS). HiveQL initially lacked support for transactions and materialized views, and offered only limited subquery support. [25] [26] Support for insert, update, and delete with full ACID functionality was made available with release 0.14. [27]
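A brief sketch of the two extensions just mentioned (the table and column names are hypothetical): a single scan of a source table can feed several target tables, and a table can be created directly from the result of a query.

-- Multi-table insert: one pass over src populates two tables.
FROM src
  INSERT OVERWRITE TABLE errors   SELECT ts, msg WHERE level = 'ERROR'
  INSERT OVERWRITE TABLE warnings SELECT ts, msg WHERE level = 'WARN';

-- Create-table-as-select (CTAS).
CREATE TABLE level_summary AS
  SELECT level, count(1) AS cnt FROM src GROUP BY level;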

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution. [28]
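The chosen engine and the generated plan can be inspected from a Hive session. The following is a minimal sketch (the table and query are hypothetical; hive.execution.engine is a standard Hive configuration property):

-- Select the execution engine (mr, tez or spark) for the session.
SET hive.execution.engine=tez;

-- EXPLAIN prints the plan the compiler produced: the DAG of stages
-- that will be submitted to Hadoop for execution.
EXPLAIN SELECT col, count(1) FROM some_table GROUP BY col;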

Example

The word count program counts the number of times each word occurs in the input. In HiveQL it can be written as: [5]

DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

A brief explanation of each of the statements is as follows:

DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);

Checks if table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.

LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (in this case "input_file") into the table. OVERWRITE specifies that the target table into which the data is loaded is to be rewritten; otherwise, the data would be appended.

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count. It draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp, which splits each line of the input into words and emits them as separate rows of a temporary result aliased as temp. GROUP BY word groups the results by word, so the count column holds the number of occurrences of each word. ORDER BY word sorts the words alphabetically.
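Once the job finishes, word_counts is an ordinary Hive table and can be queried further. For example, the following illustrative follow-up (not part of the cited example) lists the ten most frequent words; backticks quote the column name count, which is also the name of a built-in function:

SELECT word, `count`
FROM word_counts
ORDER BY `count` DESC
LIMIT 10;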

Comparison with traditional databases

The storage and querying operations of Hive closely resemble those of traditional databases. While Hive is a SQL dialect, there are many differences in the structure and workings of Hive in comparison to relational databases. The differences are mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.

A schema is applied to a table in traditional databases. In such traditional databases, the table typically enforces the schema when the data is loaded into the table. This enables the database to make sure that the data entered follows the representation of the table as specified by the table definition. This design is called schema on write. In comparison, Hive does not verify the data against the table schema on write. Instead, it performs run-time checks when the data is read. This model is called schema on read. [25] The two approaches have their own advantages and drawbacks. Checking data against the table schema during load time adds extra overhead, which is why traditional databases take longer to load data. Quality checks are performed against the data at load time to ensure that the data is not corrupt; early detection of corrupt data enables early exception handling. Because the tables are forced to match the schema during or after the data load, schema on write gives better query-time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load, at the cost of comparatively slower performance at query time. Hive has an advantage when the schema is not available at load time, but is instead generated later dynamically. [25]
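A minimal sketch of schema on read (file path, table and column names are hypothetical): the LOAD statement only moves the file into the table's directory, and the declared column types are applied when the rows are read, with non-conforming values returned as NULL rather than rejected at load time.

CREATE TABLE events (id INT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- No validation happens here; the file is simply placed under the table's directory.
LOAD DATA INPATH '/tmp/events.tsv' INTO TABLE events;

-- The schema is enforced only now, at read time: a first field that does not
-- parse as INT comes back as NULL instead of failing the earlier load.
SELECT id, payload FROM events;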

Transactions are key operations in traditional databases. Like any typical RDBMS, Hive supports all four ACID properties of transactions: atomicity, consistency, isolation, and durability. Transactions in Hive were introduced in Hive 0.13, but were limited to the partition level. [29] Hive 0.14 fully added these functions to support complete ACID properties. Hive 0.14 and later provides row-level transactions such as INSERT, DELETE and UPDATE. [30] Enabling INSERT, UPDATE, and DELETE transactions requires setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode. [31]
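A hedged sketch of enabling row-level transactions on a table (table name, columns, and bucket count are hypothetical, and the transaction manager setting is an assumption about a typical configuration; exact requirements depend on the Hive version). The session properties above are set, and the table must be bucketed, stored as ORC, and marked transactional:

SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be bucketed and stored as ORC.
CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
  CLUSTERED BY (id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

-- Row-level operations available from Hive 0.14 onward.
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;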

Security

Hive v0.7.0 added integration with Hadoop security. Hadoop began using Kerberos authentication support to provide security. Kerberos allows for mutual authentication between client and server; in this system, the client's request for a ticket is passed along with the request. Previous versions of Hadoop had several issues, such as users being able to spoof their username by setting the hadoop.job.ugi property, and MapReduce operations being run under the same user: hadoop or mapred. With Hive v0.7.0's integration with Hadoop security, these issues have largely been fixed. TaskTracker jobs are run by the user who launched them, and the username can no longer be spoofed by setting the hadoop.job.ugi property. Permissions for newly created files in Hive are dictated by HDFS. The Hadoop distributed file system authorization model uses three entities (user, group and others) and three permissions (read, write and execute). The default permissions for newly created files can be set by changing the umask value for the Hive configuration variable hive.files.umask.value. [5]
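For instance, the default mode of files Hive creates can be tightened by overriding that variable for a session (a minimal sketch; the octal value shown is only an example, and the variable can equally be set in the Hive configuration files):

-- Newly created files receive permissions derived from this umask.
SET hive.files.umask.value=0022;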

See also

Related Research Articles

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

An entity–attribute–value model (EAV) is a data model optimized for the space-efficient storage of sparse—or ad-hoc—property or data values, intended for situations where runtime usage patterns are arbitrary, subject to user variation, or otherwise unforeseeable using a fixed design. The use-case targets applications which offer a large or rich system of defined property types, which are in turn appropriate to a wide set of entities, but where typically only a small, specific selection of these are instantiated for a given entity. Therefore, this type of data model relates to the mathematical notion of a sparse matrix. EAV is also known as object–attribute–value model, vertical database model, and open schema.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.

Apache Pig: Open-source data analytics software

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size, commonly referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications deemed data-intensive require large volumes of data and devote most of their processing time to I/O and manipulation of data.

Within database management systems, the RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, data compression approach, and optimization techniques for data reading. It is able to meet all the four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns.

Apache Drill: Open-source software framework

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

Oracle NoSQL Database: Distributed database

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Apache Spark: Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Trafodion: Relational database management system for Apache Hadoop

Apache Trafodion is an open-source Top-Level Project at the Apache Software Foundation. It was originally developed by the information technology division of Hewlett-Packard Company and HP Labs to provide the SQL query language on Apache HBase targeting big data transactional or operational workloads. The project was named after the Welsh word for transactions. As of April 2021, it is no longer actively developed.

Apache Flink: Framework and distributed processing engine

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.

Presto is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows use of multiple data sources within a query. Presto is community-driven open-source software released under the Apache License.

Apache Kylin: Open-source distributed analytics engine

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Apache ORC: Column-oriented data storage format

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is used by most of the data processing frameworks Apache Spark, Apache Hive, Apache Flink and Apache Hadoop.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

References

  1. "Release release-1.0.0 · apache/Hive". GitHub .
  2. 1 2 "Apache Hive - Downloads" . Retrieved 21 November 2022.
  3. Venner, Jason (2009). Pro Hadoop. Apress. ISBN 978-1-4302-1942-2.
  4. Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang (2014). "Major Technical Advancements in Apache Hive". SIGMOD '14. pp. 1235–1246. doi:10.1145/2588555.2595630.
  5. Capriolo, Edward; Wampler, Dean; Rutherglen, Jason (2012). Programming Hive. O'Reilly Media.
  6. Use Case Study of Hive/Hadoop
  7. OSCON Data 2011, Adrian Cockcroft, "Data Flow at Netflix" on YouTube
  8. Amazon Elastic MapReduce Developer Guide
  9. HiveQL Language Manual
  10. Apache Tez
  11. Hive Language Manual
  12. Lam, Chuck (2010). Hadoop in Action. Manning Publications. ISBN   978-1-935182-19-1.
  13. "Optimising Hadoop and Big Data with Text and HiveOptimising Hadoop and Big Data with Text and Hive". Archived from the original on 2014-11-15. Retrieved 2014-11-16.
  14. "ORC Language Manual". Hive project wiki. Retrieved April 24, 2017.
  15. Yin Huai, Siyuan Ma, Rubao Lee, Owen O'Malley, and Xiaodong Zhang (2013). "Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters". VLDB '39. pp. 1750–1761. CiteSeerX 10.1.1.406.4342. doi:10.14778/2556549.2556559.
  16. 1 2 3 "Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop" (PDF). Archived from the original (PDF) on 2011-07-28. Retrieved 2011-09-09.
  17. Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, and Zhiwei Xu (2011). "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems". IEEE 27th International Conference on Data Engineering.
  18. "Parquet". 18 Dec 2014. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
  19. Massie, Matt (21 August 2013). "A Powerful Big Data Trio: Spark, Parquet and Avro". zenfractal.com. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
  20. 1 2 "Design - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
  21. "Abstract Syntax Tree". c2.com. Retrieved 2016-09-12.
  22. Dokeroglu, Tansel; Ozal, Serkan; Bayir, Murat Ali; Cinar, Muhammet Serkan; Cosar, Ahmet (2014-07-29). "Improving the performance of Hadoop Hive by sharing scan and computation tasks". Journal of Cloud Computing. 3 (1): 1–11. doi:10.1186/s13677-014-0012-6.
  23. Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, and Xiaodong Zhang (2011). "YSmart: Yet Another SQL-to-MapReduce Translator". 31st International Conference on Distributed Computing Systems. pp. 25–36.
  24. "HiveServer - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
  25. White, Tom (2010). Hadoop: The Definitive Guide. O'Reilly Media. ISBN 978-1-4493-8973-4.
  26. Hive Language Manual
  27. ACID and Transactions in Hive
  28. "Hive A Warehousing Solution Over a MapReduce Framework" (PDF). Archived from the original (PDF) on 2013-10-08. Retrieved 2011-09-03.
  29. "Introduction to Hive transactions". datametica.com. Archived from the original on 2016-09-03. Retrieved 2016-09-12.
  30. "Hive Transactions - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.
  31. "Configuration Properties - Apache Hive - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-09-12.