Apache HBase


Original author(s): Powerset
Developer(s): Apache Software Foundation
Initial release: 28 March 2008
Stable release: 2.4.x line: 2.4.14 (29 August 2022); 2.5.x line: 2.5.3 (5 February 2023) [1]
Preview release: 3.0.0-alpha-3 (27 June 2022) [1]
Repository: GitHub repository, Gitbox repository
Written in: Java
Operating system: Cross-platform
Type: Distributed database
License: Apache License 2.0
Website: hbase.apache.org

HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System) or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data: small amounts of information scattered within a large collection of empty or unimportant data. Typical tasks over such data include finding the 50 largest items in a group of 2 billion records, or finding the non-zero items that make up less than 0.1% of a huge collection.
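As a rough illustration of what storing sparse data means, the sketch below models a table in which only the cells that actually hold a value take up space. It is a toy written in Python with the standard library only, not the HBase API; the class `SparseTable`, its methods, and the row and column names such as `user#0001` and `info:name` are all assumptions made for this example.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory model of a sparse wide-column table (not the HBase API)."""

    def __init__(self):
        # row key -> {column name -> value}; cells that were never written
        # simply do not exist, so empty cells cost nothing.
        self._rows = defaultdict(dict)

    def put(self, row, column, value):
        self._rows[row][column] = value

    def get(self, row, column, default=None):
        # Use .get so that reading a missing row does not create it.
        return self._rows.get(row, {}).get(column, default)

    def scan(self, start=None, stop=None):
        """Yield (row, cells) in sorted row-key order within [start, stop)."""
        for row in sorted(self._rows):
            if start is not None and row < start:
                continue
            if stop is not None and row >= stop:
                break
            yield row, dict(self._rows[row])

    def cell_count(self):
        return sum(len(cells) for cells in self._rows.values())


# Hypothetical rows: each one fills in a different subset of columns.
table = SparseTable()
table.put("user#0001", "info:name", "Ada")
table.put("user#0001", "info:email", "ada@example.org")
table.put("user#0002", "info:name", "Grace")
table.put("user#0003", "stats:logins", 42)
```

Three rows and three distinct columns would occupy nine cells in a dense grid; here only the four written cells are stored, and a range scan walks the rows in key order.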


HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper. [2] Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs. HBase is a wide-column store and has been widely adopted because of its lineage with Hadoop and HDFS. It is well suited to fast read and write operations on large datasets, with high throughput and low input/output latency.
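To make the mention of Bloom filters concrete, here is a minimal sketch in Python using only the standard library. It is not HBase's implementation: the class name `BloomFilter`, the default sizes, and the SHA-256-based hashing scheme are assumptions chosen for illustration. The general idea is that a small bit array can answer "definitely absent" or "possibly present" for a key, so a read can skip stored data that cannot contain the key.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing are arbitrary choices."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


row_filter = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    row_filter.add(row_key)
```

A key that was added always reports `True`. A key that was not added usually reports `False`, but may occasionally report `True` (a false positive), which costs an unnecessary lookup rather than a wrong answer.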

HBase is not a direct replacement for a classic SQL database; however, the Apache Phoenix project provides a SQL layer for HBase as well as a JDBC driver that can be integrated with various analytics and business intelligence applications. The Apache Trafodion project provides a SQL query engine with ODBC and JDBC drivers and distributed ACID transaction protection across multiple statements, tables and rows, using HBase as a storage engine.

HBase serves several data-driven websites, [3] although Facebook's Messaging Platform migrated from HBase to MyRocks in 2018. [4] [5] Unlike relational and traditional databases, HBase does not support SQL scripting; instead, the equivalent is written in Java, in a style similar to a MapReduce application.
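The MapReduce style of query mentioned above can be sketched as follows. This is only an illustration of the pattern, written in Python rather than Java and using no HBase or Hadoop API; the function names `map_top_n` and `reduce_top_n` and the sample partitions are hypothetical. It computes a "largest n items" query by letting each partition report its own candidates and then merging them.

```python
import heapq


def map_top_n(partition, n):
    """Map step: each partition reports only its own n largest values."""
    return heapq.nlargest(n, partition)


def reduce_top_n(partial_results, n):
    """Reduce step: merge per-partition candidates into a global top n."""
    merged = (value for partial in partial_results for value in partial)
    return heapq.nlargest(n, merged)


# Hypothetical data split across three partitions.
partitions = [[5, 91, 12], [77, 3, 64], [8, 100, 42]]
partials = [map_top_n(p, 2) for p in partitions]
top = reduce_top_n(partials, 2)
print(top)  # [100, 91]
```

Because each partition forwards at most n values, the merge step handles far less data than the full collection, which is what makes the pattern attractive for very large tables.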

In the parlance of Eric Brewer's CAP Theorem, HBase is a CP type system.

History

Apache HBase began as a project by the company Powerset out of a need to process massive amounts of data for the purposes of natural-language search. It has been a top-level Apache project since 2010.

Facebook elected to implement its new messaging platform using HBase in November 2010, but migrated away from HBase in 2018. [4]

The 2.4.x series is the current stable release line; it supersedes earlier release lines.

Use cases & production deployments

Enterprises that use HBase

The following is a list of notable enterprises that have used or are using HBase:

See also


References

  1. "Apache HBase – Apache HBase Downloads". Retrieved 27 September 2022.
  2. Chang, et al. (2006). Bigtable: A Distributed Storage System for Structured Data
  3. "Apache HBase – Powered By Apache HBase". hbase.apache.org. Retrieved 8 April 2018.
  4. "Migrating Messenger storage to optimize performance". www.facebook.com. 26 June 2018. Retrieved 5 July 2018.
  5. "Facebook: Why our 'next-gen' comms ditched MySQL". Retrieved 17 December 2010.
  6. HBaseCon (2 August 2016). "Apache HBase at Airbnb". slideshare.net. Retrieved 8 April 2018.
  7. "Near Real Time Search Indexing".
  8. "Is data locality always out of the box in Hadoop?".
  9. "Why Imgur Dropped MySQL in Favor of HBase - DZone Database". dzone.com. Retrieved 8 April 2018.
  10. "Tech Tuesday: Imgur Notifications: From MySQL to HBase - The Imgur Blog". blog.imgur.com. Retrieved 8 April 2018.
  11. Doyung Yoon. "S2Graph : A Large-Scale Graph Database with HBase".
  12. Cheolsoo Park and Ashwin Shankar. "Netflix: Integrating Spark at Petabyte Scale".
  13. Pinterest Engineering (30 March 2018). "Improving HBase backup efficiency at Pinterest". Medium. Retrieved 14 April 2020.
  14. "Hbase at Salesforce.com".
  15. Josh Baer. "How Apache Drives Spotify's Music Recommendations".
  16. "Tuenti Group Chat: Simple, yet complex". Archived from the original on 24 November 2012. Retrieved 29 September 2015.
  17. "Tuenti Asyncthrift". GitHub. 6 November 2013.

Bibliography