Five-minute rule

In computer science, the five-minute rule is a rule of thumb for deciding whether a data item should be kept in memory or stored on disk and read back into memory when required. It was first formulated by Jim Gray and Gianfranco Putzolu in 1985, [1] [2] and subsequently revised in 1997 [3] and 2007 [4] to reflect changes in the relative cost and performance of memory and persistent storage.

The rule is as follows:

The 5-minute random rule: cache randomly accessed disk pages that are re-used every 5 minutes or less.

Gray also proposed a counterpart one-minute rule for sequential access: [5]

The 1-minute rule: cache sequentially accessed disk pages that are re-used every 1 minute or less.
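
Taken together, the two rules amount to a simple admission test for a cache or buffer pool: estimate how often a page is re-used and compare that interval against a threshold chosen by access pattern. The following Python sketch is purely illustrative; the function and constant names are invented here and do not come from the original papers.

    # Illustrative sketch only: the two quoted thresholds used as a cache-admission test.
    RANDOM_THRESHOLD_S = 5 * 60   # five-minute rule, randomly accessed pages
    SEQUENTIAL_THRESHOLD_S = 60   # one-minute rule, sequentially accessed pages

    def should_cache(reuse_interval_s: float, sequential: bool) -> bool:
        """Keep a disk page in memory if it is re-used at least this often."""
        threshold = SEQUENTIAL_THRESHOLD_S if sequential else RANDOM_THRESHOLD_S
        return reuse_interval_s <= threshold

    print(should_cache(180, sequential=False))  # True: random page re-used every 3 minutes
    print(should_cache(180, sequential=True))   # False: too infrequent for the sequential rule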

Although the 5-minute rule was invented in the realm of databases, it has also been applied elsewhere, for example, in Network File System cache capacity planning. [6]

The original 5-minute rule was derived from the following cost-benefit computation: [4]

BreakEvenIntervalInSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM)

Applying it to 2007 data yields a break-even interval of approximately 90 minutes for magnetic-disk-to-DRAM caching, 15 minutes for SSD-to-DRAM caching, and 2¼ hours for disk-to-SSD caching. The disk-to-DRAM interval thus fell somewhat short of the "five-hour rule" that Gray and Putzolu had anticipated in 1987 for RAM and disks in 2007. [4]
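
As a concrete illustration, the formula can be evaluated directly. The input values below are rough, assumed 2007-era figures (4 KB pages, a commodity SATA drive, then-current DRAM pricing) chosen only to land in the vicinity of the 90-minute interval quoted above; they are not taken from the cited papers.

    # Break-even interval from the cost-benefit formula above.
    def break_even_interval_s(pages_per_mb_of_ram: float,
                              accesses_per_second_per_disk: float,
                              price_per_disk_drive: float,
                              price_per_mb_of_ram: float) -> float:
        return (pages_per_mb_of_ram / accesses_per_second_per_disk) * (
            price_per_disk_drive / price_per_mb_of_ram
        )

    # Assumed, illustrative 2007-era values (not from the cited sources):
    # 4 KB pages -> 256 pages per MB of RAM, ~80 random IOPS per drive,
    # ~$80 per disk drive, ~$0.05 per MB of DRAM.
    seconds = break_even_interval_s(256, 80, 80.0, 0.05)
    print(f"{seconds / 60:.0f} minutes")  # prints "85 minutes", i.e. on the order of 90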

According to calculations by NetApp engineer David Dale as reported in The Register, the figures for disk-to-DRAM caching in 2008 were as follows: "The 50KB page break-even was five minutes, the 4KB one was one hour and the 1KB one was five hours. There needed to be a 50-fold increase in page size to cache for break-even at five minutes." Regarding disk-to-SSD caching in 2010, the same source reported that "A 250KB page break even with SLC was five minutes, but five hours with a 4KB page size. It was five minutes with a 625KB page size with MLC flash and 13 hours with a 4KB MLC page size." [7]
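
The strong dependence on page size in these figures follows from the formula itself: PagesPerMBofRAM falls in inverse proportion to the page size, so with the other terms held fixed the break-even interval shrinks roughly in proportion as pages grow (in practice AccessesPerSecondPerDisk also drops for larger pages because of transfer time). The sketch below uses deliberately hypothetical drive and memory prices just to show that scaling; it does not reproduce the figures reported above.

    # Hypothetical device parameters, chosen only to illustrate how the
    # break-even interval scales with page size (not 2008/2010 market data).
    ACCESSES_PER_SECOND_PER_DISK = 100.0
    PRICE_PER_DISK_DRIVE = 100.0   # dollars
    PRICE_PER_MB_OF_RAM = 0.02     # dollars

    def break_even_minutes(page_size_kb: float) -> float:
        pages_per_mb_of_ram = 1024.0 / page_size_kb
        seconds = (pages_per_mb_of_ram / ACCESSES_PER_SECOND_PER_DISK) * (
            PRICE_PER_DISK_DRIVE / PRICE_PER_MB_OF_RAM
        )
        return seconds / 60.0

    for size_kb in (1, 4, 50):
        print(f"{size_kb:>3} KB pages: break-even ~{break_even_minutes(size_kb):.0f} min")
    # Larger pages -> fewer pages per MB of RAM -> a shorter break-even interval.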

In 2000, Gray and Shenoy applied a similar calculation for web page caching and concluded that a browser should "cache web pages if there is any chance they will be re-referenced within their lifetime." [8]

References

  1. Gray, Jim; Putzolu, Franco (May 1985), The 5 Minute Rule for Trading Memory for Disc Accesses and the 5 Byte Rule for Trading Memory for CPU Time (PDF)
  2. Gray, Jim; Putzolu, Gianfranco R. (1987), "The 5 Minute Rule for Trading Memory for Disk Accesses and The 10 Byte Rule for Trading Memory for CPU Time", Proceedings of the ACM SIGMOD Conference, pp. 395–398, CiteSeerX 10.1.1.624.3312, doi:10.1145/38713.38755, ISBN 978-0897912365, S2CID 10770251
  3. Gray, Jim; Graefe, Goetz (1997), "The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb", ACM SIGMOD Record, 26 (4): 63–68, arXiv:cs/9809005, doi:10.1145/271074.271094, S2CID 21524661
  4. Graefe, Goetz (2007), "The five-minute rule twenty years later, and how flash memory changes the rules", DaMoN '07: Proceedings of the 3rd International Workshop on Data Management on New Hardware, pp. 1–9, doi:10.1145/1363189.1363198, ISBN 9781595937728, S2CID 14991801. Free version in ACM Queue, September 2008.
  5. Chevance, René J. (2004), Server Architectures: Multiprocessors, Clusters, Parallel Systems, Web Servers, Storage Solutions, Digital Press, p. 542, ISBN 978-0-08-049229-2
  6. Musumeci, Gian-Paolo D.; Loukides, Mike (2002), System Performance Tuning, O'Reilly Media, p. 263, ISBN 978-0-596-55204-6
  7. The Register (19 May 2010), https://www.theregister.co.uk/2010/05/19/flash_5_minute_rule/?page=2
  8. Gray, Jim; Shenoy, Prashant (1999), "Rules of Thumb in Data Engineering", Microsoft Research Technical Report MSR-TR-99-100