Apache ZooKeeper

Last updated
Apache ZooKeeper
Developer(s) Apache Software Foundation
Stable release
3.8.1 / January 30, 2023;8 months ago (2023-01-30) [1]
Repository ZooKeeper Repository
Written in Java
Operating system Cross-platform
Type Distributed computing
License Apache License 2.0
Website zookeeper.apache.org

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. [2] It is a project of the Apache Software Foundation.

Contents

ZooKeeper is essentially a service for distributed systems offering a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems (see Use cases ). [3] ZooKeeper was a sub-project of Hadoop but is now a top-level Apache project in its own right.

Overview

ZooKeeper's architecture supports high availability through redundant services. The clients can thus ask another ZooKeeper leader if the first fails to answer. ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a tree data structure. Clients can read from and write to the nodes and in this way have a shared configuration service. ZooKeeper can be viewed as an atomic broadcast system, through which updates are totally ordered. The ZooKeeper Atomic Broadcast (ZAB) protocol is the core of the system. [4]

ZooKeeper is used by companies including Yelp, Rackspace, Yahoo!, [5] Odnoklassniki, Reddit, [6] NetApp SolidFire, [7] Meta, [8] Twitter [9] and eBay as well as open source enterprise search systems like Solr and distributed database systems like Apache Pinot. [10] [11]

ZooKeeper is modeled after Google's Chubby lock service [12] [13] and was originally developed at Yahoo! for streamlining the processes running on big-data clusters by storing the status in local log files on the ZooKeeper servers. These servers communicate with the client machines to provide them the information. ZooKeeper was developed in order to fix the bugs that occurred while deploying distributed big-data applications.

Some of the prime features of Apache ZooKeeper are:

Architecture

Some common terminologies regarding the ZooKeeper architecture:

The services in the cluster are replicated and stored on a set of servers (called an "ensemble"), each of which maintains an in-memory database containing the entire data tree of state as well as a transaction log and snapshots stored persistently. Multiple client applications can connect to a server, and each client maintains a TCP connection through which it sends requests and heartbeats and receives responses and watch events for monitoring. [14]

Use cases

Typical use cases for ZooKeeper are:

Client libraries

In addition to the client libraries included with the ZooKeeper distribution, a number of third-party libraries such as Apache Curator and Kazoo are available that make using ZooKeeper easier, add additional functionality, additional programming languages, etc.

Apache projects using ZooKeeper

See also

Related Research Articles

Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. Google file system was replaced by Colossus in 2010.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

Operating systems use lock managers to organise and serialise the access to resources. A distributed lock manager (DLM) runs in every machine in a cluster, with an identical copy of a cluster-wide lock database. In this way a DLM provides software applications which are distributed across a cluster on multiple machines with a means to synchronize their accesses to shared resources.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

<span class="mw-page-title-main">Apache Solr</span> Open-source enterprise-search platform

Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

<span class="mw-page-title-main">Vertica</span> Software company

Vertica is an analytic database management software company. Vertica was founded in 2005 by the database researcher Michael Stonebraker with Andrew Palmer as the founding CEO. Ralph Breslauer and Christopher P. Lynch served as CEOs later on.

Sector/Sphere is an open source software suite for high-performance distributed data storage and processing. It can be broadly compared to Google's GFS and MapReduce technology. Sector is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming architecture framework that supports in-storage parallel data processing for data stored in Sector. Sector/Sphere operates in a wide area network (WAN) setting.

<span class="mw-page-title-main">Apache Hama</span>

Apache Hama is a distributed computing framework based on bulk synchronous parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms. It was a Top Level Project under the Apache Software Foundation. Retired in April 2020, project resources are made available as part of the Apache Attic. It was created by Edward J. Yoon, who named it and was inspired by Google's Pregel large-scale graph computing framework described in 2010. Hama also means hippopotamus in Korean language (하마), following the trend of naming Apache projects after animals and zoology.

<span class="mw-page-title-main">Apache Hive</span> Database engine

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

HPCC, also known as DAS, is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

<span class="mw-page-title-main">Oracle NoSQL Database</span>

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

In computing, Hazelcast is a unified real-time data platform based on Java that combines a fast data store with stream processing. It is also the name of the company developing the product. The Hazelcast company is funded by venture capital and headquartered in Palo Alto, California.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. Originally designed by Google, the project is now maintained by the Cloud Native Computing Foundation.

LizardFS is an open source distributed file system that is POSIX-compliant and licensed under GPLv3. It was released in 2013 as fork of MooseFS. LizardFS is also offering a paid Technical Support with possibility of configurating and setting up the cluster and active cluster monitoring.

<span class="mw-page-title-main">Apache Ignite</span>

Apache Ignite is a distributed database management system for high-performance computing.

<span class="mw-page-title-main">Apache Pinot</span> Open-source distributed data store

Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. It is suited in contexts where fast analytics, such as aggregations, are needed on immutable data, possibly, with real-time data ingestion. The name Pinot comes from the Pinot grape vines that are pressed into liquid that is used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources.

References

  1. "Apache ZooKeeper - Releases" . Retrieved 12 February 2023.
  2. "Apache Zookeeper4" . Retrieved 31 January 2021.
  3. "Index - Apache ZooKeeper - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-08-26.
  4. "Zookeeper Overview".
  5. "ZooKeeper/Powered By". Archived from the original on 2013-12-09. Retrieved 2012-01-25.
  6. "Why Reddit was down on Aug 11". 16 August 2016.
  7. "5 Big DaaS Challenges and How to Overcome Them | NetApp Newsroom". NetApp Newsroom. 2016-06-20. Retrieved 2017-05-24.[ permanent dead link ]
  8. "Location-Aware Distribution: Configuring servers at scale". Facebook Code. 2018-07-19. Retrieved 2018-07-20.
  9. "ZooKeeper at Twitter". Twitter Engineering Blog. 2018-10-11. Retrieved 2018-12-08.
  10. "SolrCloud".
  11. "Apache Pinot: Architecture".
  12. Burrows, Mike (2006). "The Chubby lock service for loosely-coupled distributed systems". 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
  13. Chandra, Tushar Deepak; Griesemer, Robert; Redstone, Joshua (2007). "Paxos Made Live - An Engineering Perspective (2006 Invited Talk)". Google Research. Retrieved 2020-03-03.
  14. "Zookeeper".