Search data structure

Last updated

In computer science, a search data structure[ citation needed ] is any data structure that allows the efficient retrieval of specific items from a set of items, such as a specific record from a database.

Contents

The simplest, most general, and least efficient search structure is merely an unordered sequential list of all the items. Locating the desired item in such a list, by the linear search method, inevitably requires a number of operations proportional to the number n of items, in the worst case as well as in the average case. Useful search data structures allow faster retrieval; however, they are limited to queries of some specific kind. Moreover, since the cost of building such structures is at least proportional to n, they only pay off if several queries are to be performed on the same database (or on a database that changes little between queries).

Static search structures are designed for answering many queries on a fixed database; dynamic structures also allow insertion, deletion, or modification of items between successive queries. In the dynamic case, one must also consider the cost of fixing the search structure to account for the changes in the database.

Classification

The simplest kind of query is to locate a record that has a specific field (the key) equal to a specified value v. Other common kinds of query are "find the item with smallest (or largest) key value", "find the item with largest key value not exceeding v", "find all items with key values between specified bounds vmin and vmax".

In certain databases the key values may be points in some multi-dimensional space. For example, the key may be a geographic position (latitude and longitude) on the Earth. In that case, common kinds of queries are "find the record with a key closest to a given point v", or "find all items whose key lies at a given distance from v", or "find all items within a specified region R of the space".

A common special case of the latter are simultaneous range queries on two or more simple keys, such as "find all employee records with salary between 50,000 and 100,000 and hired between 1995 and 2007".

Single ordered keys

Finding the smallest element

Asymptotic worst-case analysis

In this table, the asymptotic notation O(f(n)) means "not exceeding some fixed multiple of f(n) in the worst case."

Data StructureInsertDeleteBalanceGet at indexSearchFind minimumFind maximumSpace usage
Unsorted array O(1)
(see note)
O(1)
(see note)
N/AO(1)O(n)O(n)O(n)O(n)
Sorted array O(n)O(n)N/AO(1)O(log n)O(1)O(1)O(n)
Stack O(1)O(1)O(n)O(n)
Queue O(1)O(1)O(n)O(n)
Unsorted linked list O(1)O(1) [1] N/AO(n)O(n)O(n)O(n)O(n)
Sorted linked listO(n)O(1) [1] N/AO(n)O(n)O(1)O(1)O(n)
Skip list
Self-balancing binary search tree O(log n)O(log n)O(log n)N/AO(log n)O(log n)O(log n)O(n)
Heap O(log n)O(log n)O(log n)N/AO(n)O(1) for a min-heap
O(n) for a max-heap [2]
O(1) for a max-heap
O(n) for a min-heap [2]
O(n)
Hash table O(1)O(1)O(n)N/AO(1)O(n)O(n)O(n)
Trie (k = average length of key)O(k)O(k)N/AO(k)O(k)O(k)O(k)O(kn)
Cartesian tree
B-tree O(log n)O(log n)O(log n)N/AO(log n)O(log n)O(log n)O(n)
Red–black tree O(log n)O(log n)O(log n)O(n)
Splay tree
AVL tree O(log n)
k-d tree

Note: Insert on an unsorted array is sometimes quoted as being O(n) due to the assumption that the element to be inserted must be inserted at one particular location of the array, which would require shifting all the subsequent elements by one position. However, in a classic array, the array is used to store arbitrary unsorted elements, and hence the exact position of any given element is of no consequence, and insert is carried out by increasing the array size by 1 and storing the element at the end of the array, which is a O(1) operation. [3] [4] Likewise, the deletion operation is sometimes quoted as being O(n) due to the assumption that subsequent elements must be shifted, but in a classic unsorted array the order is unimportant (though elements are implicitly ordered by insert-time), so deletion can be carried out by swapping the element to be deleted with the last element in the array and then decrementing the array size by 1, which is a O(1) operation. [5]

This table is only an approximate summary; for each data structure there are special situations and variants that may lead to different costs. Also two or more data structures can be combined to obtain lower costs.

Footnotes

  1. 1 2 Thomas H. Cormen; Charles E. Leiserson; Ronald L. Rivest (1990). Introduction to Algorithms . The College of Information Sciences and Technology at Penn State. ISBN   978-0-262-53091-0. LIST-DELETE runs in O(1) time, but if to delete an element with a given key, Θ(n) time is required in the worst case because we must first call LIST-SEARCH.
  2. 1 2 Thomas H. Cormen; Charles E. Leiserson; Ronald L. Rivest (1990). Introduction to Algorithms. The College of Information Sciences and Technology at Penn State. ISBN   978-0-262-53091-0. There are two kinds of binary heaps: max-heaps and min-heaps. In both kinds, the values in the nodes satisfy a heap property... the largest element in a max-heap is stored at the root... The smallest element in a min-heap is at the root... The operation HEAP-MAXIMUM returns the maximum heap element in Θ(1) time by simply returning the value A[1] in the heap.
  3. Allen Sherrod (2007). Data Structures and Algorithms for Game Developers. Cengage Learning. ISBN   978-1-58450-663-8. The insertion of an item into an unordered array does not depend on anything other than placing the new item at the end of the list. This gives the insertion into an unordered array of O(1).
  4. Thomas H. Cormen; Charles E. Leiserson; Ronald L. Rivest (1990). Introduction to Algorithms . The College of Information Sciences and Technology at Penn State. ISBN   978-0-262-53091-0.
  5. "Algorithm - the time complexity of deletion in a unsorted array". Finding the element with a given value is linear. Since the array isn't sorted anyway, you can do the deletion itself in constant time. First swap the element you want to delete to the end of the array, then reduce the array size by one element.

See also

Related Research Articles

<span class="mw-page-title-main">Binary search tree</span> Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree. The time complexity of operations on the binary search tree is linear with respect to the height of the tree.

<span class="mw-page-title-main">Hash table</span> Associative array for storing key-value pairs

In computing, a hash table, also known as a hash map, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

<span class="mw-page-title-main">Heap (data structure)</span> Computer science data structure

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: In a max heap, for any given node C, if P is a parent node of C, then the key of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap is called the root node.

<span class="mw-page-title-main">Insertion sort</span> Sorting algorithm

Insertion sort is a simple sorting algorithm that builds the final sorted array (or list) one item at a time by comparisons. It is much less efficient on large lists than more advanced algorithms such as quicksort, heapsort, or merge sort. However, insertion sort provides several advantages:

In computer science, a priority queue is an abstract data-type similar to a regular queue or stack data structure. Each element in a priority queue has an associated priority. In a priority queue, elements with high priority are served before elements with low priority. In some implementations, if two elements have the same priority, they are served in the same order in which they were enqueued. In other implementations, the order of elements with the same priority is undefined.

<span class="mw-page-title-main">Sorting algorithm</span> Algorithm that arranges lists in order

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important for optimizing the efficiency of other algorithms that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output.

<span class="mw-page-title-main">Binary heap</span> Variant of heap data structure

A binary heap is a heap data structure that takes the form of a binary tree. Binary heaps are a common way of implementing priority queues. The binary heap was introduced by J. W. J. Williams in 1964, as a data structure for heapsort.

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of pairs, such that each possible key appears at most once in the collection. In mathematical terms, an associative array is a function with finite domain. It supports 'lookup', 'remove', and 'insert' operations.

<span class="mw-page-title-main">Bucket sort</span> Sorting algorithm

Bucket sort, or bin sort, is a sorting algorithm that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, a generalization of pigeonhole sort that allows multiple keys per bucket, and is a cousin of radix sort in the most-to-least significant digit flavor. Bucket sort can be implemented with comparisons and therefore can also be considered a comparison sort algorithm. The computational complexity depends on the algorithm used to sort each bucket, the number of buckets to use, and whether the input is uniformly distributed.

In computer science, a set is an abstract data type that can store unique values, without any particular order. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

<span class="mw-page-title-main">Treap</span>

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

A van Emde Boas tree, also known as a vEB tree or van Emde Boas priority queue, is a tree data structure which implements an associative array with m-bit integer keys. It was invented by a team led by Dutch computer scientist Peter van Emde Boas in 1975. It performs all operations in O(log m) time, or equivalently in O(log log M) time, where M = 2m is the largest element that can be stored in the tree. The parameter M is not to be confused with the actual number of elements stored in the tree, by which the performance of other tree data-structures is often measured.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

A sorted array is an array data structure in which each element is sorted in numerical, alphabetical, or some other order, and placed at equally spaced addresses in computer memory. It is typically used in computer science to implement static lookup tables to hold multiple values which have the same data type. Sorting an array is useful in organising data in ordered form and recovering them rapidly.

In computer science, the Brodal queue is a heap/priority queue structure with very low worst case time bounds: for insertion, find-minimum, meld and decrease-key and for delete-minimum and general deletion. They are the first heap variant to achieve these bounds without resorting to amortization of operational costs. Brodal queues are named after their inventor Gerth Stølting Brodal.

In computer science, an order statistic tree is a variant of the binary search tree (or more generally, a B-tree) that supports two additional operations beyond insertion, lookup and deletion:

In computer science, a mergeable heap is an abstract data type, which is a heap supporting a merge operation.

In computer science, a fractal tree index is a tree data structure that keeps data sorted and allows searches and sequential access in the same time as a B-tree but with insertions and deletions that are asymptotically faster than a B-tree. Like a B-tree, a fractal tree index is a generalization of a binary search tree in that a node can have more than two children. Furthermore, unlike a B-tree, a fractal tree index has buffers at each node, which allow insertions, deletions and other changes to be stored in intermediate locations. The goal of the buffers is to schedule disk writes so that each write performs a large amount of useful work, thereby avoiding the worst-case performance of B-trees, in which each disk write may change a small amount of data on disk. Like a B-tree, fractal tree indexes are optimized for systems that read and write large blocks of data. The fractal tree index has been commercialized in databases by Tokutek. Originally, it was implemented as a cache-oblivious lookahead array, but the current implementation is an extension of the Bε tree. The Bε is related to the Buffered Repository Tree. The Buffered Repository Tree has degree 2, whereas the Bε tree has degree Bε. The fractal tree index has also been used in a prototype filesystem. An open source implementation of the fractal tree index is available, which demonstrates the implementation details outlined below.

In computer science, a shadow heap is a mergeable heap data structure which supports efficient heap merging in the amortized sense. More specifically, shadow heaps make use of the shadow merge algorithm to achieve insertion in O(f(n)) amortized time and deletion in O((log n log log n)/f(n)) amortized time, for any choice of 1 ≤ f(n) ≤ log log n.

This is a comparison of the performance of notable data structures, as measured by the complexity of their logical operations. For a more comprehensive listing of data structures, see List of data structures.