Hierarchical storage management

Hierarchical storage management (HSM), also known as tiered storage, [1] is a data storage and data management technique that automatically moves data between high-cost and low-cost storage media. HSM systems exist because high-speed storage devices, such as solid-state drive arrays, are more expensive (per byte stored) than slower devices, such as hard disk drives, optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. The HSM system monitors how data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices.

HSM may also be used where more robust storage is available for long-term archiving, but this is slow to access. This may be as simple as an off-site backup, for protection against a building fire.

HSM is a long-established concept, dating back to the beginnings of commercial data processing. The techniques used, though, have changed significantly as new technology has become available, both for storage and for long-distance communication of large data sets. The scale of measures such as 'size' and 'access time' has changed dramatically. Despite this, many of the underlying concepts keep returning to favour years later, although at much larger or faster scales. [1]

Implementation

In a typical HSM scenario, data that is frequently used is stored on a warm storage device, such as a solid-state disk (SSD). Data that is infrequently accessed is, after some time, migrated to a slower, high-capacity cold storage tier. If a user does access data that is on the cold storage tier, it is automatically moved back to warm storage. The advantage is that the total amount of stored data can be much larger than the capacity of the warm storage device, but since only rarely used files are on cold storage, most users will usually not notice any slowdown.
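The migrate-and-recall cycle described above can be illustrated with a minimal Python sketch. It is not taken from any HSM product: the class name `TwoTierStore`, the `idle_limit` parameter and the use of in-memory dictionaries as "tiers" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TwoTierStore:
    """Toy HSM: a small warm tier backed by a large cold tier (hypothetical)."""
    idle_limit: float                       # assumed policy: seconds idle before migration
    warm: dict = field(default_factory=dict)
    cold: dict = field(default_factory=dict)
    last_access: dict = field(default_factory=dict)

    def write(self, name: str, data: bytes, now: float) -> None:
        self.cold.pop(name, None)           # drop any stale cold copy
        self.warm[name] = data
        self.last_access[name] = now

    def migrate_idle(self, now: float) -> list[str]:
        """Move files idle for longer than idle_limit to the cold tier."""
        idle = [n for n in self.warm if now - self.last_access[n] > self.idle_limit]
        for name in idle:
            self.cold[name] = self.warm.pop(name)
        return idle

    def read(self, name: str, now: float) -> bytes:
        """Return file data, recalling it from the cold tier if necessary."""
        if name not in self.warm:
            self.warm[name] = self.cold.pop(name)   # raises KeyError for unknown files
        self.last_access[name] = now
        return self.warm[name]


store = TwoTierStore(idle_limit=30.0)
store.write("report.txt", b"q3 figures", now=0.0)
store.write("scratch.tmp", b"temp", now=25.0)
migrated = store.migrate_idle(now=40.0)     # only report.txt has been idle for over 30 s
data = store.read("report.txt", now=41.0)   # transparent recall to the warm tier
```

The caller of `read` never states which tier holds the file; the only visible difference in a real system would be the extra latency of the recall.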

Conceptually, HSM is analogous to the cache found in most computer CPUs, where small amounts of expensive SRAM running at very high speeds are used to store frequently used data, while the least recently used data is evicted to the slower but much larger main DRAM when new data has to be loaded.

In practice, HSM is typically performed by dedicated software, such as IBM Tivoli Storage Manager, or Oracle's SAM-QFS.

The deletion of files from a higher level of the hierarchy (e.g. magnetic disk) after they have been moved to a lower level (e.g. optical media) is sometimes called file grooming. [2]

History

Hierarchical Storage Manager (HSM, then DFHSM and finally DFSMShsm) was first [citation needed] implemented by IBM on March 31, 1978 for MVS to reduce the cost of data storage, and to simplify the retrieval of data from slower media. The user would not need to know where the data was stored and how to get it back; the computer would retrieve the data automatically. The only difference to the user was the speed at which data was returned. HSM could originally migrate datasets only to disk volumes and virtual volumes on an IBM 3850 Mass Storage Facility, but a later release supported magnetic tape volumes for migration level 2 (ML2).

Later, IBM ported HSM to its AIX operating system, and then to other Unix-like operating systems such as Solaris, HP-UX and Linux.

CSIRO Australia's Division of Computing Research implemented an HSM in its DAD (Drums and Display) operating system with its Document Region in the 1960s, with copies of documents being written to 7-track tape and automatic retrieval upon access to the documents.

HSM was also implemented on the DEC VAX/VMS systems and the Alpha/VMS systems. The first implementation date should be readily determined from the VMS System Implementation Manuals or the VMS Product Description Brochures.

More recently, the development of Serial ATA (SATA) disks has created a significant market for three-stage HSM: files are migrated from high-performance Fibre Channel storage area network devices to somewhat slower but much cheaper SATA disk arrays totaling several terabytes or more, and then eventually from the SATA disks to tape.


Use cases

HSM is often used for deep archival storage of data to be held long term at low cost. Automated tape robots can silo large quantities of data efficiently with low power consumption.

Some HSM software products allow the user to place portions of data files on high-speed disk cache and the rest on tape. This is used in applications that stream video over the internet—the initial portion of a video is delivered immediately from disk while a robot finds, mounts and streams the rest of the file to the end user. Such a system greatly reduces disk cost for large content provision systems.
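As a rough sketch of this partial-staging idea, the Python fragment below serves the disk-resident head of a file first and then continues with chunks recalled from slower storage. The functions `stream_file` and `read_from_tape` are hypothetical stand-ins, not part of any HSM product; a real tape recall would block while the cartridge is mounted.

```python
from typing import Iterator


def stream_file(head: bytes, tape_chunks: Iterator[bytes]) -> Iterator[bytes]:
    """Yield the disk-resident head first, then the remainder from slow storage."""
    yield head                  # available immediately from the disk cache
    yield from tape_chunks      # consumed lazily, after the head has been delivered


def read_from_tape(data: bytes, offset: int, chunk_size: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a tape recall starting at the given offset."""
    for start in range(offset, len(data), chunk_size):
        yield data[start:start + chunk_size]


video = bytes(range(10))                      # placeholder file contents
head_size = 4
head = video[:head_size]                      # portion kept on disk
tail = read_from_tape(video, head_size, 3)    # portion held only on tape
chunks = list(stream_file(head, tail))
```

Because `read_from_tape` is a generator, no tape data is read until the head has been yielded, which mirrors the way the disk portion covers the mount delay.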

Today, HSM software is also used for tiering between hard disk drives and flash memory: flash memory is over 30 times faster than magnetic disks, but disks are considerably cheaper.

Algorithms

The key factor behind HSM is a data migration policy that controls the file transfers in the system. More precisely, the policy decides which tier a file should be stored in, so that the entire storage system is well organized and responds to requests as quickly as possible. Several algorithms implement this process, such as Least Recently Used replacement (LRU), [3] Size-Temperature Replacement (STP) and Heuristic Threshold (STEP). [4] Recent research has also proposed intelligent policies based on machine learning. [5]
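A migration policy of the least-recently-used kind can be sketched as follows. This is a simplified illustration, not the LRU-K, STP or STEP algorithms from the cited papers: the `FileInfo` record, the function name and the high/low watermark values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    name: str
    size: int            # bytes
    last_access: float   # timestamp of the most recent access


def select_for_migration(files, capacity, high=0.9, low=0.7):
    """Pick least recently used files until usage falls to the low watermark.

    Nothing is selected unless current usage exceeds the high watermark.
    """
    used = sum(f.size for f in files)
    if used <= high * capacity:
        return []
    victims = []
    for f in sorted(files, key=lambda f: f.last_access):   # oldest access first
        if used <= low * capacity:
            break
        victims.append(f.name)
        used -= f.size
    return victims


files = [
    FileInfo("a.log", 40, last_access=100.0),
    FileInfo("b.db", 25, last_access=500.0),
    FileInfo("c.iso", 30, last_access=50.0),
]
victims = select_for_migration(files, capacity=100)
```

Using two watermarks rather than a single threshold is a common design choice: it avoids migrating one file at a time whenever the fast tier hovers around its limit.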

Tiering vs. Caching

While tiering solutions and caching may look the same on the surface, the fundamental differences lie in the way the faster storage is utilized and the algorithms used to detect and accelerate frequently accessed data. [6]

Caching operates by making a copy of frequently accessed blocks of data, storing the copy on the faster storage device, and using this copy instead of the original data source on the slower, high-capacity backend storage. Every time a storage read occurs, the caching software checks whether a copy of the data already exists in the cache and uses that copy, if available. Otherwise, the data is read from the slower, high-capacity storage. [6]
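The read path described above might look like the following Python sketch. The class `ReadCache` is hypothetical, and least-recently-used eviction is an assumed policy; the text does not specify how a cache chooses which copies to discard.

```python
from collections import OrderedDict


class ReadCache:
    """Read-through block cache with LRU eviction (assumed policy)."""

    def __init__(self, backend: dict, capacity: int):
        self.backend = backend          # slow, authoritative storage
        self.capacity = capacity        # maximum number of cached blocks
        self.blocks = OrderedDict()     # fast storage holding copies
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int) -> bytes:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)    # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.backend[block_id]            # the original stays on the backend
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict the least recently used copy
        return data


backend = {i: f"block-{i}".encode() for i in range(4)}
cache = ReadCache(backend, capacity=2)
for block_id in (0, 1, 0, 2, 1):
    cache.read(block_id)
```

Note that eviction only discards a copy: the backend still holds every block, so losing the cache loses no data.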

Tiering, on the other hand, operates very differently. Rather than copying frequently accessed data into fast storage, tiering moves data across tiers, for example by relocating cold data to low-cost, high-capacity nearline storage devices. [7] [6] The basic idea is that mission-critical and highly accessed or "hot" data is stored on an expensive medium such as SSD to take advantage of high I/O performance, while nearline or rarely accessed or "cold" data is stored on inexpensive nearline storage media such as HDD and tape. [8] Thus, the "data temperature", or activity level, determines the primary storage hierarchy. [9]
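To contrast with caching, the sketch below relocates items between two tiers according to a simple temperature measure. The function `retier`, the tier names and the access-count threshold are illustrative assumptions, not a description of any particular product.

```python
def retier(tiers: dict, access_counts: dict, hot_threshold: int) -> None:
    """Move each item to the 'hot' or 'cold' tier based on its access count.

    Each item lives in exactly one tier, unlike a cache, which keeps a copy.
    """
    for name, count in access_counts.items():
        target = "hot" if count >= hot_threshold else "cold"
        source = "cold" if target == "hot" else "hot"
        if name in tiers[source]:
            tiers[target][name] = tiers[source].pop(name)   # relocate, do not copy


tiers = {
    "hot": {"orders.db": b"...", "old_logs.tar": b"..."},
    "cold": {"catalog.idx": b"..."},
}
access_counts = {"orders.db": 120, "old_logs.tar": 1, "catalog.idx": 45}
retier(tiers, access_counts, hot_threshold=10)
```

After the call, every item still exists exactly once, so the usable capacity is the sum of the tiers rather than the size of the slow tier alone.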

References

    1. Larry Freeman. "What's Old Is New Again - Storage Tiering" (PDF).
    2. Patrick M. Dillon; David C. Leonard (1998). Multimedia and the Web from A to Z. ABC-CLIO. p. 116. ISBN 978-1-57356-132-7.
    3. O'Neil, Elizabeth J.; O'Neil, Patrick E.; Weikum, Gerhard (1993-06-01). "The LRU-K page replacement algorithm for database disk buffering". ACM SIGMOD Record. 22 (2): 297–306. doi:10.1145/170036.170081. ISSN 0163-5808. S2CID 207177617.
    4. Verma, A.; Pease, D.; Sharma, U.; Kaplan, M.; Rubas, J.; Jain, R.; Devarakonda, M.; Beigi, M. (2005). "An Architecture for Lifecycle Management in Very Large File Systems". 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST'05). Monterey, CA, USA: IEEE. pp. 160–168. doi:10.1109/MSST.2005.4. ISBN 978-0-7695-2318-7. S2CID 7082285.
    5. Zhang, Tianru; Hellander, Andreas; Toor, Salman (2022). "Efficient Hierarchical Storage Management Empowered by Reinforcement Learning". IEEE Transactions on Knowledge and Data Engineering: 1–1. doi:10.1109/TKDE.2022.3176753. ISSN 1041-4347.
    6. Brand, Aron (June 20, 2022). "Hot Storage vs Cold Storage: Choosing the Right Tier for Your Data". Medium.com. Retrieved June 20, 2022.
    7. Posey, Brien (November 8, 2016). "Differences between SSD caching and tiering technologies". TechTarget. Retrieved Jun 21, 2022.
    8. Winnard & Biondo 2016, p. 5.
    9. Winnard & Biondo 2016, p. 6.
    10. IBM Corporation. "Abstract for DFSMS/VM Planning Guide". ibm.com. Retrieved Sep 16, 2021.
    11. z/OS 2.5 DFSMShsm Storage Administration (PDF). IBM. 2022. SC23-6871-50. Retrieved February 24, 2022.
    12. SAM/QFS at OpenSolaris.org
    13. Rand Morimoto; Michael Noel; Omar Droubi; Ross Mistry; Chris Amaris (2008). Windows Server 2008 Unleashed. Sams Publishing. p. 938. ISBN 978-0-13-271563-8.
    14. "ITPro Today: IT News, How-Tos, Trends, Case Studies, Career Tips, More".