External memory algorithm

Last updated

In computing, external memory algorithms or out-of-core algorithms are algorithms that are designed to process data that are too large to fit into a computer's main memory at once. Such algorithms must be optimized to efficiently fetch and access data stored in slow bulk memory (auxiliary memory) such as hard drives or tape drives, or when memory is on a computer network. [1] [2] External memory algorithms are analyzed in the external memory model.

Contents

Model

The cache on the left holds
M
B
{\displaystyle {\tfrac {M}{B}}}
blocks of size
B
{\displaystyle B}
each, for a total of M objects. The external memory on the right is unbounded. External memory.svg
The cache on the left holds blocks of size each, for a total of M objects. The external memory on the right is unbounded.

External memory algorithms are analyzed in an idealized model of computation called the external memory model (or I/O model, or disk access model). The external memory model is an abstract machine similar to the RAM machine model, but with a cache in addition to main memory. The model captures the fact that read and write operations are much faster in a cache than in main memory, and that reading long contiguous blocks is faster than reading randomly using a disk read-and-write head. The running time of an algorithm in the external memory model is defined by the number of reads and writes to memory required. [3] The model was introduced by Alok Aggarwal and Jeffrey Vitter in 1988. [4] The external memory model is related to the cache-oblivious model, but algorithms in the external memory model may know both the block size and the cache size. For this reason, the model is sometimes referred to as the cache-aware model. [5]

The model consists of a processor with an internal memory or cache of size M, connected to an unbounded external memory. Both the internal and external memory are divided into blocks of size B. One input/output or memory transfer operation consists of moving a block of B contiguous elements from external to internal memory, and the running time of an algorithm is determined by the number of these input/output operations. [4]

Algorithms

Algorithms in the external memory model take advantage of the fact that retrieving one object from external memory retrieves an entire block of size . This property is sometimes referred to as locality.

Searching for an element among objects is possible in the external memory model using a B-tree with branching factor . Using a B-tree, searching, insertion, and deletion can be achieved in time (in Big O notation). Information theoretically, this is the minimum running time possible for these operations, so using a B-tree is asymptotically optimal. [4]

External sorting is sorting in an external memory setting. External sorting can be done via distribution sort, which is similar to quicksort, or via a -way merge sort. Both variants achieve the asymptotically optimal runtime of to sort N objects. This bound also applies to the fast Fourier transform in the external memory model. [2]

The permutation problem is to rearrange elements into a specific permutation. This can either be done either by sorting, which requires the above sorting runtime, or inserting each element in order and ignoring the benefit of locality. Thus, permutation can be done in time.

Applications

The external memory model captures the memory hierarchy, which is not modeled in other common models used in analyzing data structures, such as the random-access machine, and is useful for proving lower bounds for data structures. The model is also useful for analyzing algorithms that work on datasets too big to fit in internal memory. [4]

A typical example is geographic information systems, especially digital elevation models, where the full data set easily exceeds several gigabytes or even terabytes of data.

This methodology extends beyond general purpose CPUs and also includes GPU computing as well as classical digital signal processing. In general-purpose computing on graphics processing units (GPGPU), powerful graphics cards (GPUs) with little memory (compared with the more familiar system memory, which is most often referred to simply as RAM) are utilized with relatively slow CPU-to-GPU memory transfer (when compared with computation bandwidth).

History

An early use of the term "out-of-core" as an adjective is in 1962 in reference to devices that are other than the core memory of an IBM 360. [6] An early use of the term "out-of-core" with respect to algorithms appears in 1971. [7]

See also

Related Research Articles

<span class="mw-page-title-main">Merge sort</span> Divide and conquer-based sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

In computer science, algorithmic efficiency is a property of an algorithm which relates to the amount of computational resources used by the algorithm. An algorithm must be analyzed to determine its resource usage, and the efficiency of an algorithm can be measured based on the usage of different resources. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process.

In computer science, divide and conquer is an algorithm design paradigm. A divide-and-conquer algorithm recursively breaks down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device and instead they must reside in the slower external memory, usually a disk drive. Thus, external sorting algorithms are external memory algorithms and thus applicable in the external memory model of computation.

In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes of the input sequence:

In-place matrix transposition, also called in-situ matrix transposition, is the problem of transposing an N×M matrix in-place in computer memory, ideally with O(1) (bounded) additional storage, or at most with additional storage much less than NM. Typically, the matrix is assumed to be stored in row-major or column-major order.

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In computer science, a compressed suffix array is a compressed data structure for pattern matching. Compressed suffix arrays are a general class of data structure that improve on the suffix array. These data structures enable quick search for an arbitrary string with a comparatively small index.

In computer science, a fractal tree index is a tree data structure that keeps data sorted and allows searches and sequential access in the same time as a B-tree but with insertions and deletions that are asymptotically faster than a B-tree. Like a B-tree, a fractal tree index is a generalization of a binary search tree in that a node can have more than two children. Furthermore, unlike a B-tree, a fractal tree index has buffers at each node, which allow insertions, deletions and other changes to be stored in intermediate locations. The goal of the buffers is to schedule disk writes so that each write performs a large amount of useful work, thereby avoiding the worst-case performance of B-trees, in which each disk write may change a small amount of data on disk. Like a B-tree, fractal tree indexes are optimized for systems that read and write large blocks of data. The fractal tree index has been commercialized in databases by Tokutek. Originally, it was implemented as a cache-oblivious lookahead array, but the current implementation is an extension of the Bε tree. The Bε is related to the Buffered Repository Tree. The Buffered Repository Tree has degree 2, whereas the Bε tree has degree Bε. The fractal tree index has also been used in a prototype filesystem. An open source implementation of the fractal tree index is available, which demonstrates the implementation details outlined below.

Funnelsort is a comparison-based sorting algorithm. It is similar to mergesort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in a cache where operations are done. It was introduced by Matteo Frigo, Charles Leiserson, Harald Prokop, and Sridhar Ramachandran in 1999 in the context of the cache oblivious model.

The cache-oblivious distribution sort is a comparison-based sorting algorithm. It is similar to quicksort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in a cache where operations are done. In the external memory model, the number of memory transfers it needs to perform a sort of items on a machine with cache of size and cache lines of length is , under the tall cache assumption that . This number of memory transfers has been shown to be asymptotically optimal for comparison sorts. This distribution sort also achieves the asymptotically optimal runtime complexity of .

Communication-avoiding algorithms minimize movement of data within a memory hierarchy for improving its running-time and energy consumption. These minimize the total of two costs : arithmetic and communication. Communication, in this context refers to moving data, either between levels of memory or between multiple processors over a network. It is much more expensive than arithmetic.

An oblivious RAM (ORAM) simulator is a compiler that transforms algorithms in such a way that the resulting algorithms preserve the input-output behavior of the original algorithm but the distribution of the memory access pattern of the transformed algorithm is independent of the memory access pattern of the original algorithm.

<span class="mw-page-title-main">Parallel external memory</span>

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

External memory graph traversal is a type of graph traversal optimized for accessing externally stored memory.

References

  1. Vitter, J. S. (2001). "External Memory Algorithms and Data Structures: Dealing with MASSIVE DATA". ACM Computing Surveys. 33 (2): 209–271. CiteSeerX   10.1.1.42.7064 . doi:10.1145/384192.384193. S2CID   2155038.
  2. 1 2 Vitter, J. S. (2008). Algorithms and Data Structures for External Memory (PDF). pp. 305–474. CiteSeerX   10.1.1.140.3731 . doi:10.1561/0400000014. ISBN   978-1-60198-106-6.{{cite book}}: |journal= ignored (help)
  3. Zhang, Donghui; Tsotras, Vassilis J.; Levialdi, Stefano; Grinstein, Georges; Berry, Damon Andrew; Gouet-Brunet, Valerie; Kosch, Harald; Döller, Mario; Döller, Mario; Kosch, Harald; Maier, Paul; Bhattacharya, Arnab; Ljosa, Vebjorn; Nack, Frank; Bartolini, Ilaria; Gouet-Brunet, Valerie; Mei, Tao; Rui, Yong; Crucianu, Michel; Shih, Frank Y.; Fan, Wenfei; Ullman-Cullere, Mollie; Clark, Eugene; Aronson, Samuel; Mellin, Jonas; Berndtsson, Mikael; Grahne, Gösta; Bertossi, Leopoldo; Dong, Guozhu; et al. (2009). "I/O Model of Computation". Encyclopedia of Database Systems. Springer Science+Business Media. pp. 1333–1334. doi:10.1007/978-0-387-39940-9_752. ISBN   978-0-387-35544-3.
  4. 1 2 3 4 Aggarwal, Alok; Vitter, Jeffrey (1988). "The input/output complexity of sorting and related problems". Communications of the ACM . 31 (9): 1116–1127. doi: 10.1145/48529.48535 . S2CID   6264984.
  5. Demaine, Erik (2002). Cache-Oblivious Algorithms and Data Structures (PDF). Lecture Notes from the EEF Summer School on Massive Data Sets. Aarhus: BRICS.
  6. NASA SP. NASA. 1962. p. 276.
  7. Computers in Crisis. ACM. 1971. p. 296.