Database storage structures

Last updated October 26, 2022

Database tables and indexes may be stored on disk in one of a number of forms, including ordered/unordered flat files, ISAM, heap files, hash buckets, or B+ trees. Each form has its own particular advantages and disadvantages. The most commonly used forms are B-trees and ISAM. Such forms or structures are one aspect of the overall schema used by a database engine to store information.

Unordered

Unordered storage typically stores the records in the order they are inserted. Such storage offers good insertion efficiency ( $O\left(1\right)$ ), but inefficient retrieval times ( $O\left(n\right)$ ). Typically these retrieval times are better, however, as most databases use indexes on the primary keys, resulting in retrieval times of $O\left(\log n\right)$ or $O\left(1\right)$ for keys that are the same as the database row offsets within the storage system.^{[ citation needed ]}

Ordered

Ordered storage typically stores the records in order and may have to rearrange or increase the file size when a new record is inserted, resulting in lower insertion efficiency. However, ordered storage provides more efficient retrieval as the records are pre-sorted, resulting in a complexity of $O\left(\log n\right)$ .^{[ citation needed ]}

Structured files

Heap files

Heap files are lists of unordered records of variable size. Although sharing a similar name, heap files are widely different from in-memory heaps. In-memory heaps are ordered, as opposed to heap files.

Simplest and most basic method
- insert efficient, with new records added at the end of the file, providing chronological order
- retrieval efficient when the handle to the memory is the address of the memory
- search inefficient, as searching has to be linear
- deletion is accomplished by marking selected records as "deleted"
- requires periodic reorganization if file is very volatile (changed frequently)
Advantages
- efficient for bulk loading data
- efficient for relatively small relations as indexing overheads are avoided
- efficient when retrievals involve large proportion of stored records
Disadvantages
- not efficient for selective retrieval using key values, especially if large
- sorting may be time-consuming
- not suitable for volatile tables

Hash buckets

Hash functions calculate the address of the page in which the record is to be stored based on one or more fields in the record
- hashing functions chosen to ensure that addresses are spread evenly across the address space
- ‘occupancy’ is generally 40% to 60% of the total file size
- unique address not guaranteed so collision detection and collision resolution mechanisms are required
Open addressing
Chained/unchained overflow
Pros and cons
- efficient for exact matches on key field
- not suitable for range retrieval, which requires sequential storage
- calculates where the record is stored based on fields in the record
- hash functions ensure even spread of data
- collisions are possible, so collision detection and restoration is required

B+ trees

These are the most commonly used in practice.

Time taken to access any record is the same because the same number of nodes is searched
Index is a full index so data file does not have to be ordered
Pros and cons
- versatile data structure – sequential as well as random access
- access is fast
- supports exact, range, part key and pattern matches efficiently.
- volatile files are handled efficiently because index is dynamic – expands and contracts as table grows and shrinks
- less well suited to relatively stable files – in this case, ISAM is more efficient

Data orientation

Most conventional relational databases use "row-oriented" storage, meaning that all data associated with a given row is stored together. By contrast, column-oriented DBMS store all data from a given column together in order to more quickly serve data warehouse-style queries. Correlation databases are similar to row-based databases, but apply a layer of indirection to map multiple instances of the same value to the same numerical identifier.

Related Research Articles

In computer science, an array is a data structure consisting of a collection of elements, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of pairs, such that each possible key appears at most once in the collection. In mathematical terms an associative array is a function with finite domain. It supports 'lookup', 'remove', and 'insert' operations.

A database engine is the underlying software component that a database management system (DBMS) uses to create, read, update and delete (CRUD) data from a database. Most database management systems include their own application programming interface (API) that allows the user to interact with their underlying engine without going through the user interface of the DBMS.

ISAM is a method for creating, maintaining, and manipulating computer files of data so that records can be retrieved sequentially or randomly by one or more keys. Indexes of key fields are maintained to achieve fast retrieval of required file records in Indexed files. IBM originally developed ISAM for mainframe computers, but implementations are available for most computer systems.

In computer science, a perfect hash function $h$ for a set $S$ is a hash function that maps distinct elements in $S$ to a set of $m$ integers, with no collisions. In mathematical terms, it is an injective function.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

In computer science, a radix tree is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at most the radix $r$ of the radix tree, where $r$ is a positive integer and a power $x$ of 2, having $x$ ≥ 1. Unlike regular trees, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets and for sets of strings that share long prefixes.

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

Extensible Storage Engine (ESE), also known as JET Blue, is an ISAM data storage technology from Microsoft. ESE is the core of Microsoft Exchange Server, Active Directory, and Windows Search. It's also used by a number of Windows components including Windows Update client and Help and Support Center. Its purpose is to allow applications to store and retrieve data via indexed and sequential access.

Linear hashing (LH) is a dynamic data structure which implements a hash table and grows or shrinks one bucket at a time. It was invented by Witold Litwin in 1980. It has been analyzed by Baeza-Yates and Soza-Pollman. It is the first in a number of schemes known as dynamic hashing such as Larson's Linear Hashing with Partial Extensions, Linear Hashing with Priority Splitting, Linear Hashing with Partial Expansions and Priority Splitting, or Recursive Linear Hashing.

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as brood parasitism; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

In computer science, a search data structure is any data structure that allows the efficient retrieval of specific items from a set of items, such as a specific record from a database.

This is a comparison between notable database engines for the MySQL database management system (DBMS). A database engine is the underlying software component that a DBMS uses to create, read, update and delete (CRUD) data from a database.

A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set. There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and "optionally delete". The more elements are added to the set, the larger the probability of false positives.

In computer science, a fractal tree index is a tree data structure that keeps data sorted and allows searches and sequential access in the same time as a B-tree but with insertions and deletions that are asymptotically faster than a B-tree. Like a B-tree, a fractal tree index is a generalization of a binary search tree in that a node can have more than two children. Furthermore, unlike a B-tree, a fractal tree index has buffers at each node, which allow insertions, deletions and other changes to be stored in intermediate locations. The goal of the buffers is to schedule disk writes so that each write performs a large amount of useful work, thereby avoiding the worst-case performance of B-trees, in which each disk write may change a small amount of data on disk. Like a B-tree, fractal tree indexes are optimized for systems that read and write large blocks of data. The fractal tree index has been commercialized in databases by Tokutek. Originally, it was implemented as a cache-oblivious lookahead array, but the current implementation is an extension of the B^ε tree. The B^ε is related to the Buffered Repository Tree. The Buffered Repository Tree has degree 2, whereas the B^ε tree has degree B^ε. The fractal tree index has also been used in a prototype filesystem. An open source implementation of the fractal tree index is available, which demonstrates the implementation details outlined below.

In computer science, a retrieval data structure, also known as static function, is a space-efficient dictionary-like data type composed of a collection of pairs that allows the following operations:

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

v t e Database management systems
Types	Object-oriented comparison Relational list comparison Key–value Column-oriented list Document-oriented Wide-column store Graph NoSQL NewSQL In-memory list Multi-model comparison Cloud Blockchain-based database
Concepts	Database ACID Armstrong's axioms Codd's 12 rules CAP theorem CRUD Null Candidate key Foreign key Superkey Surrogate key Unique key
Objects	Relation table column row View Transaction Transaction log Trigger Index Stored procedure Cursor Partition
Components	Concurrency control Data dictionary JDBC XQJ ODBC Query language Query optimizer Query rewriting system Query plan
Functions	Administration Query optimization Replication Sharding
Related topics	Database models Database normalization Database storage Distributed database Federated database system Referential integrity Relational algebra Relational calculus Relational model Object–relational database Transaction processing
Category Outline WikiProject