Apache Accumulo

Developer(s) Apache Software Foundation
Stable release 2.0.1 (December 24, 2020) [1]
Repository Accumulo Repository
Written in Java
Operating system Cross-platform
License Apache License 2.0
Website accumulo.apache.org

Apache Accumulo is a highly scalable, sorted, distributed key-value store based on the design of Google's Bigtable. [2] It is built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo provides cell-level access labels and server-side programming mechanisms. According to the DB-Engines ranking, as of 2018 Accumulo was the third most popular NoSQL wide column store, behind Apache Cassandra and HBase, and the 67th most popular database engine of any type. [3]

History

Accumulo was created in 2008 by the US National Security Agency (NSA) and contributed to the Apache Software Foundation as an incubator project in September 2011. [4]

On March 21, 2012, Accumulo graduated from incubation at Apache, making it a top-level project. [5]

Controversy

In June 2012, the US Senate Armed Services Committee (SASC) released the draft 2012 Department of Defense (DoD) authorization bill, which included references to Apache Accumulo. In the draft bill, the SASC required the DoD to evaluate whether Apache Accumulo could achieve commercial viability before implementing it throughout the DoD. [6] Specific criteria were not included in the draft language, but the establishment of commercial entities supporting Apache Accumulo could be considered a success factor. [7]

Main features

Cell-level security

Apache Accumulo extends the Bigtable data model, adding a new element to the key called Column Visibility. This element stores a logical combination of security labels that must be satisfied at query time in order for the key and value to be returned as part of a user request. This allows data of varying security requirements to be stored in the same table, and allows users to see only those keys and values for which they are authorized. [4]
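
A minimal sketch of how this looks with the Accumulo 2.x Java client API; the instance name, ZooKeeper address, credentials, table name, and security labels are placeholders, not values from the article:

    import static java.nio.charset.StandardCharsets.UTF_8;

    import java.util.Map;
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class VisibilityExample {
        public static void main(String[] args) throws Exception {
            // Connection details and the "records" table are hypothetical.
            try (AccumuloClient client = Accumulo.newClient()
                    .to("myInstance", "zkhost:2181")
                    .as("user", "secret").build()) {

                // Write an entry whose Column Visibility requires either the
                // ADMIN label or both the AUDIT and GOV labels.
                try (BatchWriter writer = client.createBatchWriter("records")) {
                    Mutation m = new Mutation("row1");
                    m.put("attrs", "ssn", new ColumnVisibility("ADMIN|(AUDIT&GOV)"),
                            new Value("123-45-6789".getBytes(UTF_8)));
                    writer.addMutation(m);
                }

                // A scan returns only entries whose visibility expression is
                // satisfied by the authorizations passed in; everything else
                // is filtered out server-side.
                try (Scanner scan = client.createScanner("records",
                        new Authorizations("ADMIN"))) {
                    for (Map.Entry<Key, Value> e : scan) {
                        System.out.println(e.getKey() + " -> " + e.getValue());
                    }
                }
            }
        }
    }

Entries the scanning user is not authorized to see are simply absent from the results, so data with different security requirements can coexist in one table without separate access paths.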

Server-side programming

In addition to cell-level security, Apache Accumulo provides a server-side programming mechanism called Iterators, which allows users to perform additional processing on the tablet server. The operations that can be applied are equivalent to those that can be implemented within a MapReduce Combiner function, which produces an aggregate value for several key-value pairs.
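
As an illustration, the sketch below attaches Accumulo's built-in SummingCombiner iterator to a hypothetical "metrics" table, so that long values written under a "count" column are summed server-side into one aggregated entry per key; the priority, iterator name, and table name are arbitrary placeholders:

    import java.util.Collections;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.Combiner;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class CombinerExample {
        // Attach a SummingCombiner so that all values written under the
        // "count" column family are added together at scan and compaction
        // time, rather than by the client.
        static void attachSum(AccumuloClient client) throws Exception {
            IteratorSetting setting =
                    new IteratorSetting(10, "sum", SummingCombiner.class);
            LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
            Combiner.setColumns(setting, Collections.singletonList(
                    new IteratorSetting.Column("count")));
            client.tableOperations().attachIterator("metrics", setting);
        }
    }

Because the combiner runs server-side when data is read or compacted, writers can insert bare increments without performing read-modify-write round trips.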

User key ordering

Apache Accumulo keeps entries sorted by user key and exposes an iterator over a key range. This provides locality of reference not available in some other distributed stores, including Cassandra and Voldemort, which order entries by a hash of the user key.
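
For example, if row keys are date-formatted, a month of data can be read with one contiguous range scan rather than a full-table pass. A minimal sketch with the Accumulo 2.x Java client, using a hypothetical "events" table and row keys:

    import java.util.Map;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class RangeScanExample {
        // Because entries are stored sorted by key, restricting the scanner
        // to a row range touches only the tablets that hold that slice.
        static void scanApril(AccumuloClient client) throws Exception {
            try (Scanner scan = client.createScanner("events", Authorizations.EMPTY)) {
                scan.setRange(new Range("2018-04-01", "2018-04-30"));
                for (Map.Entry<Key, Value> e : scan) {
                    System.out.println(e.getKey().getRow() + " -> " + e.getValue());
                }
            }
        }
    }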


References

  1. "Apache Accumulo 2.0.1". Apache Accumulo. The Apache Software Foundation. 2020-12-24. Retrieved 2021-03-16.
  2. Apache Accumulo. Accumulo.apache.org. Retrieved on 2013-09-18.
  3. DB-Engines Ranking - popularity ranking of wide column stores. Db-engines.com. Retrieved on 2018-04-10. Archived 2018-04-10.
  4. 1 2 NSA Submits Open Source, Secure Database To Apache - Government. Informationweek.com (2011-09-06). Retrieved on 2013-09-18.
  5. Accumulo Incubation Status - Apache Incubator. Incubator.apache.org. Retrieved on 2013-09-18.
  6. Metz, Cade. (2012-12-19) NSA Mimics Google, Pisses Off Senate | Wired Enterprise. Wired.com. Retrieved on 2013-09-18.
  7. SASC Accumulo language pro-open source, say proponents Archived 2016-03-20 at the Wayback Machine . FierceGovernmentIT (2012-06-14). Retrieved on 2013-09-18.