Distributed Replicated Block Device

Original author(s): Philipp Reisner, Lars Ellenberg
Developer(s): LINBIT HA-Solutions GmbH, Vienna, and LINBIT USA LLC, Oregon
Stable release: 9.2.4 / 5 June 2023 [1]
Preview release: 10.0.0a4 / 16 October 2020 [2]
Written in: C
Operating system: Linux
Type: Distributed storage system
License: GNU General Public License v2
Website: linbit.com/drbd/

Distributed Replicated Block Device (DRBD) is a distributed replicated storage system for the Linux platform. It is implemented as a kernel driver, several userspace management applications, and some shell scripts. DRBD is traditionally used in high availability (HA) computer clusters, but beginning with DRBD version 9, it can also be used to create larger software-defined storage pools with a focus on cloud integration. [3]

A DRBD device is a block device provided by DRBD; it refers to (and is backed by) a lower-level logical block device, for example a logical volume.

The DRBD software is free software released under the terms of the GNU General Public License version 2.

DRBD is part of the Lisog open source stack initiative.

Mode of operation

DRBD layers logical block devices (conventionally named /dev/drbdX, where X is the device minor number) over existing local block devices on participating cluster nodes. Writes to the primary node are transferred to the lower-level block device and simultaneously propagated to the secondary node(s). The secondary node(s) then transfer the data to their corresponding lower-level block devices. All read I/O is performed locally unless read-balancing is configured. [4]
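
The pairing of a DRBD device with its backing disk and its peers is described in a resource configuration file. The following is a minimal sketch in the style of the DRBD user's guide; the resource name r0, the host names alice and bob, the backing device /dev/sda7, and the addresses and port are hypothetical and must match the actual cluster:

    resource r0 {
        on alice {
            device    /dev/drbd0;      # replicated block device exposed to applications
            disk      /dev/sda7;       # lower-level local block device backing it
            address   10.1.1.31:7789;  # endpoint of the replication link on this node
            meta-disk internal;        # keep DRBD metadata on the backing device itself
        }
        on bob {
            device    /dev/drbd0;
            disk      /dev/sda7;
            address   10.1.1.32:7789;
            meta-disk internal;
        }
    }

On each node, drbdadm create-md r0 initializes the metadata and drbdadm up r0 attaches the backing device and connects to the peer; promoting one node to primary then triggers the initial synchronization.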

Should the primary node fail, a cluster management process promotes the secondary node to a primary state. [5] This transition may require a subsequent verification of the integrity of the file system stacked on top of DRBD, by way of a filesystem check or a journal replay. When the failed ex-primary node returns, the system may (or may not) raise it to primary level again, after device data resynchronization. DRBD's synchronization algorithm is efficient in the sense that only those blocks that were changed during the outage must be resynchronized, rather than the device in its entirety.
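
Outside of a cluster resource manager, the same role change can be performed manually with the drbdadm utility. A rough sketch, assuming a hypothetical resource named r0 with a file system mounted at /mnt/data:

    # on the surviving node: take over the primary role and mount the file system
    drbdadm primary r0
    fsck -a /dev/drbd0                # or rely on the file system's journal replay
    mount /dev/drbd0 /mnt/data

    # when the failed node returns as secondary, only the blocks changed
    # during the outage are resynchronized; progress can be inspected with
    drbdadm status r0                 # DRBD 9 (DRBD 8 exposed this via /proc/drbd)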

DRBD is often deployed together with the Pacemaker or Heartbeat cluster resource managers, although it does integrate with other cluster management frameworks. It integrates with virtualization solutions such as Xen, and may be used both below and on top of the Linux LVM stack. [6]
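
With Pacemaker, DRBD resources are usually managed by the ocf:linbit:drbd resource agent and run as a promotable (master/slave) clone, so that the cluster decides which node holds the primary role. A sketch in crm shell syntax, loosely following the style of the cited user's guide; all resource names and intervals are hypothetical:

    # DRBD resource, monitored in both roles
    primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    # run it on two nodes, exactly one of which is promoted to Master (DRBD primary)
    ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true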

DRBD supports load-balancing configurations in which both nodes access a particular DRBD device in read/write mode with shared-storage semantics. [7] Such a multiple-primary (multiple read/write nodes) configuration requires the use of a distributed lock manager.
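
Dual-primary operation is switched on per resource in the DRBD configuration and only makes sense with a cluster file system (such as OCFS2 or GFS2) providing the necessary distributed lock manager on top. A minimal, hypothetical fragment:

    resource r0 {
        net {
            protocol C;                # synchronous replication, required for dual-primary
            allow-two-primaries yes;   # allow both nodes to hold the primary role
        }
        # ... per-host device, disk and address sections as before ...
    }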

Since 2018, DRBD has also been used in the block storage management software LINSTOR for replication between different nodes and for providing block storage devices to users and applications.
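
With LINSTOR, the DRBD resource files are generated by the LINSTOR controller rather than written by hand, and provisioning is driven by the linstor client. A rough sketch; node names, pool names, sizes, and exact subcommand syntax vary by LINSTOR version and are hypothetical here:

    # register a satellite node and an LVM-backed storage pool on it
    linstor node create alpha 10.1.1.41
    linstor storage-pool create lvm alpha pool_hdd drbd_vg

    # define a resource with one 20 GiB volume and let LINSTOR place two replicas
    linstor resource-definition create res0
    linstor volume-definition create res0 20G
    linstor resource create res0 --auto-place 2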

Shared cluster storage comparison

Conventional computer cluster systems typically use some sort of shared storage for data being used by cluster resources. This approach has a number of disadvantages, which DRBD may help offset: shared storage usually has to be reached over a storage network, which adds read latency; it tends to be expensive; and, unless it is itself redundant, it constitutes a single point of failure. DRBD, by contrast, serves reads locally and keeps a complete copy of the data on each node.

A disadvantage of DRBD, on the other hand, is write latency: writing directly to a shared storage device takes less time than also replicating the write over the network to the peer node.

Comparison to RAID-1

DRBD bears a superficial similarity to RAID-1 in that it involves a copy of data on two storage devices, such that if one fails, the data on the other can be used. However, it operates in a very different way from RAID and even network RAID.

In RAID, the redundancy exists in a layer transparent to the storage-using application. While there are two storage devices, there is only one instance of the application, and the application is not aware of multiple copies. When the application reads, the RAID layer chooses which storage device to read from. When a storage device fails, the RAID layer reads from the other, without the application instance knowing of the failure.

In contrast, with DRBD there are two instances of the application, and each can read only from one of the two storage devices. Should one storage device fail, the application instance tied to that device can no longer read the data. In that case, that application instance shuts down and the other instance, tied to the surviving copy of the data, takes over.

Conversely, in RAID, if the single application instance fails, the information on the two storage devices is effectively unusable, but in DRBD, the other application instance can take over.

Applications

Operating within the Linux kernel's block layer, DRBD is essentially workload agnostic. A DRBD device can be used as the basis of:

  - a conventional file system,
  - a shared disk file system such as GFS2 or OCFS2, [8] [9]
  - another logical block device (as used in LVM, for instance),
  - any application requiring direct access to a block device.

DRBD-based clusters are often employed for adding synchronous replication and high availability to file servers, relational databases (such as MySQL), and many other workloads.
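
In the common single-primary case, once a node holds the primary role the DRBD device is used like any other block device; a brief hypothetical example for a database workload:

    drbdadm primary r0                  # take the primary role on this node
    mkfs.ext4 /dev/drbd0                # create the file system once, during initial setup
    mount /dev/drbd0 /var/lib/mysql     # the database now writes to replicated storage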

Inclusion in Linux kernel

DRBD's authors originally submitted the software to the Linux kernel community in July 2007, for possible inclusion in the canonical kernel.org version of the Linux kernel. [10] After a lengthy review and several discussions, Linus Torvalds agreed to have DRBD as part of the official Linux kernel. DRBD was merged on 8 December 2009 during the "merge window" for Linux kernel version 2.6.33.

Related Research Articles

In computer storage, logical volume management or LVM provides a method of allocating space on mass-storage devices that is more flexible than conventional partitioning schemes to store volumes. In particular, a volume manager can concatenate, stripe together or otherwise combine partitions into larger virtual partitions that administrators can re-size or move, potentially without interrupting system use.

Sistina Software was a US company that focused on storage solutions designed around a Linux platform. It originated in the University of Minnesota.

In computing, the Global File System 2 or GFS2 is a shared-disk file system for Linux computer clusters. GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage, in contrast to distributed file systems which distribute data throughout the cluster. GFS2 can also be used as a local file system on a single computer.

In Linux, Logical Volume Manager (LVM) is a device mapper framework that provides logical volume management for the Linux kernel. Most modern Linux distributions are LVM-aware to the point of being able to have their root file systems on a logical volume.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

The device mapper is a framework provided by the Linux kernel for mapping physical block devices onto higher-level virtual block devices. It forms the foundation of the logical volume manager (LVM), software RAIDs and dm-crypt disk encryption, and offers additional features such as file system snapshots.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

In data storage, disk mirroring is the replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability. It is most commonly used in RAID 1. A mirrored volume is a complete logical representation of separate volume copies.

In Linux systems, initrd is a scheme for loading a temporary root file system into memory, to be used as part of the Linux startup process. initrd and initramfs refer to two different methods of achieving this. Both are commonly used to make preparations before the real root file system can be mounted.

GPFS is high-performance clustered file system software developed by IBM. It can be deployed in shared-disk or shared-nothing distributed parallel modes, or a combination of these. It is used by many of the world's largest commercial companies, as well as some of the supercomputers on the Top 500 List. For example, it is the file system of Summit at Oak Ridge National Laboratory, which was the #1 fastest supercomputer in the world in the November 2019 TOP500 list of supercomputers. Summit is a 200 petaflops system composed of more than 9,000 POWER9 processors and 27,000 NVIDIA Volta GPUs. The storage file system, called Alpine, has 250 PB of storage using Spectrum Scale on IBM ESS storage hardware, capable of approximately 2.5 TB/s of sequential I/O and 2.2 TB/s of random I/O.

The IBM SAN Volume Controller (SVC) is a block storage virtualization appliance that belongs to the IBM System Storage product family. SVC implements an indirection, or "virtualization", layer in a Fibre Channel storage area network (SAN).

In computer science, storage virtualization is "the process of presenting a logical view of the physical storage resources to" a host computer system, "treating all storage media in the enterprise as a single pool of storage."

Btrfs is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager, developed together. It was founded by Chris Mason in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel.

Ceph is a free and open-source software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation. Ceph provides completely distributed operation without a single point of failure and scalability to the exabyte level, and is freely available. Since version 12 (Luminous), Ceph does not rely on any other conventional filesystem and directly manages HDDs and SSDs with its own storage backend BlueStore and can expose a POSIX filesystem.

The most widespread standard for configuring multiple hard disk drives is RAID, which comes in a number of standard and non-standard configurations. Non-RAID drive architectures also exist, and are referred to by acronyms with tongue-in-cheek similarity to RAID.

bcache is a cache in the Linux kernel's block layer, which is used for accessing secondary storage devices. It allows one or more fast storage devices, such as flash-based solid-state drives (SSDs), to act as a cache for one or more slower storage devices, such as hard disk drives (HDDs); this effectively creates hybrid volumes and provides performance improvements.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

Dell Technologies PowerFlex is a commercial software-defined storage product from Dell Technologies that creates a server-based storage area network (SAN) from local server storage using x86 servers. It converts this direct-attached storage into shared block storage that runs over an IP-based network.

ONTAP, Data ONTAP, Clustered Data ONTAP (cDOT), and Data ONTAP 7-Mode are names of NetApp's proprietary operating system used in storage disk arrays such as NetApp FAS and AFF, ONTAP Select, and Cloud Volumes ONTAP. With the release of version 9.0, NetApp simplified the name by dropping the word "Data" and removed the 7-Mode image; ONTAP 9 is therefore the successor of Clustered Data ONTAP 8.

References

  1. "DRBD 9.x Linux Kernel Driver". linbit.com. Retrieved 2023-07-19.
  2. "Releases · LINBIT/drbd". github.com. Retrieved 2021-04-04.
  3. "DRBD for your Cloud". www.drbd.org. Retrieved 2016-12-05.
  4. "18.4. Achieving better Read Performance via increased Redundancy - DRBD Users Guide (9.0)". www.drbd.org. Retrieved 2016-12-05.
  5. "Chapter 8. Integrating DRBD with Pacemaker clusters - DRBD Users Guide (9.0)". www.drbd.org. Retrieved 2016-12-05.
  6. LINBIT. "The DRBD User's Guide". Retrieved 2011-11-28.
  7. Reisner, Philipp (2005-10-11). "DRBD v8 - Replicated Storage with Shared Disk Semantics" (PDF). Proceedings of the 12th International Linux System Technology Conference. Hamburg, Germany.
  8. "User Guides » LINBIT".
  9. "Active-active DRBD with OCFS2 - Gentoo Linux Wiki". Archived from the original on 2013-03-08. Retrieved 2013-03-21.
  10. Ellenberg, Lars (2007-07-21). "DRBD wants to go mainline". linux-kernel (Mailing list). Retrieved 2007-08-03.