Interpolation search


Interpolation search is an algorithm for searching for a key in an array that has been ordered by numerical values assigned to the keys (key values). It was first described by W. W. Peterson in 1957. [1] Interpolation search resembles the method by which people search a telephone directory for a name (the key value by which the book's entries are ordered): in each step the algorithm calculates where in the remaining search space the sought item might be, based on the key values at the bounds of the search space and the value of the sought key, usually via a linear interpolation. The key value actually found at this estimated position is then compared to the key value being sought. If it is not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. This method will only work if calculations on the size of differences between key values are sensible.


Interpolation search
Visualization of the interpolation search algorithm in which 24 is the target value.
Class: Search algorithm
Data structure: Array
Worst-case performance: O(n)
Best-case performance: O(1)
Average performance: O(log log n) [2]
Worst-case space complexity: O(1)
Optimal: Yes

By comparison, binary search always chooses the middle of the remaining search space, discarding one half or the other depending on how the key found there compares with the key sought; it does not require numerical values for the keys, just a total order on them. Linear search uses equality only, comparing elements one by one from the start and ignoring any sort order.

On average the interpolation search makes about log(log(n)) comparisons (if the elements are uniformly distributed), where n is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to O(n) comparisons.

In interpolation-sequential search, interpolation is used to find an item near the one being searched for, then linear search is used to find the exact item.
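
A minimal sketch of this idea follows, assuming an ascending array of integers; the function name and the single-probe-then-scan structure are illustrative rather than a reference implementation.

#include <vector>

// Illustrative sketch of interpolation-sequential search: one interpolation
// probe estimates a starting index, then a linear scan moves to the exact
// element. Assumes arr is sorted in ascending order and non-empty.
// Returns the index of key, or -1 if key is absent.
int interpolation_sequential_search(const std::vector<int>& arr, int key) {
    int low = 0, high = static_cast<int>(arr.size()) - 1;
    if (key < arr[low] || key > arr[high]) return -1;
    // Single interpolation probe (guarding against a zero denominator).
    int probe = (arr[high] == arr[low])
        ? low
        : low + static_cast<int>(
              (static_cast<long long>(key - arr[low]) * (high - low)) /
              (arr[high] - arr[low]));
    // Linear scan from the probe toward the key.
    while (probe < high && arr[probe] < key) ++probe;
    while (probe > low && arr[probe] > key) --probe;
    return (arr[probe] == key) ? probe : -1;
}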

Performance

Using big-O notation, the worst-case performance of interpolation search on a data set of size n is O(n); however, under the assumption of a uniform distribution of the data on the linear scale used for interpolation, the performance can be shown to be O(log log n). [3] [4] [5]

Dynamic interpolation search extends the o(log log n) bound to other distributions and also supports O(log n) insertion and deletion. [6] [7]

Practical performance of interpolation search depends on whether the reduced number of probes is outweighed by the more complicated calculations needed for each probe. It can be useful for locating a record in a large sorted file on disk, where each probe involves a disk seek and is much slower than the interpolation arithmetic.

Index structures like B-trees also reduce the number of disk accesses, and are more often used to index on-disk data in part because they can index many types of data and can be updated online. Still, interpolation search may be useful when one is forced to search certain sorted but unindexed on-disk datasets.

Adaptation to different datasets

When sort keys for a dataset are uniformly distributed numbers, linear interpolation is straightforward to implement and will find an index very near the sought value.
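
Concretely, the linear-interpolation probe position for numeric keys can be computed as in the sketch below; the function name and the 64-bit intermediate used to guard against overflow are illustrative assumptions, not part of any reference implementation.

#include <cstdint>

// Linear-interpolation probe: estimate where key lies between arr[low] and
// arr[high], assuming values grow roughly uniformly across that index range.
// The caller must ensure arr[high] != arr[low].
inline int interpolation_probe(const int* arr, int low, int high, int key) {
    return low + static_cast<int>(
        (static_cast<std::int64_t>(key - arr[low]) * (high - low)) /
        (arr[high] - arr[low]));
}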

On the other hand, for a phone book sorted by name, the straightforward approach to interpolation search does not apply. The same high-level principles can still apply, though: one can estimate a name's position in the phone book using the relative frequencies of letters in names and use that as a probe location.
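
As a rough illustration of that idea (an assumption-laden sketch, not a method from the sources cited here), one can fold the leading letters of a name into a fraction in [0, 1) and use it as the interpolation key; a practical implementation would replace the uniform per-letter weighting below with observed letter frequencies.

#include <cctype>
#include <string>

// Hypothetical helper: interpret the first few letters of a name as base-26
// "digits" of a fraction in [0, 1). Letters are treated as equally likely,
// which is exactly the part a real phone-book search would replace with
// observed letter-frequency data.
double name_to_fraction(const std::string& name, int letters = 3) {
    double fraction = 0.0, scale = 1.0 / 26.0;
    for (int i = 0; i < letters && i < static_cast<int>(name.size()); ++i) {
        int c = std::tolower(static_cast<unsigned char>(name[i]));
        if (c < 'a' || c > 'z') break;    // stop at non-alphabetic characters
        fraction += (c - 'a') * scale;
        scale /= 26.0;
    }
    return fraction;   // e.g. a name starting with "M" maps to about 12/26, roughly 0.46
}

The estimated probe index for a book of n entries is then approximately fraction * (n - 1), after which the search can proceed as in the numeric case.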

Some interpolation search implementations may not work as expected when a run of equal key values exists. The simplest implementation of interpolation search won't necessarily select the first (or last) element of such a run.
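
One common remedy, shown as a sketch below (it is not part of the sample implementation given later), is to post-process a successful probe so that the first element of the run is reported.

// After any search has located some index hit with arr[hit] == key, walk
// left to report the first element of the run of equal keys. A linear walk
// is shown for brevity; it costs O(length of the run).
int leftmost_of_run(const int* arr, int low, int hit) {
    while (hit > low && arr[hit - 1] == arr[hit]) --hit;
    return hit;
}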

Book-based searching

The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, etc.), and furthermore it is well known that some names are much more common than others (Smith, Jones). Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even of cutting into the side of the pages to show markers for each letter so that at a glance a segmented interpolation can be performed.

Sample implementation

The following C++ code example is a simple implementation. At each stage it computes a probe position and then, as with binary search, moves either the upper or lower bound to define a smaller interval containing the sought value. Unlike binary search, which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce or increase the interval by only one element at a time, resulting in a worst-case efficiency of O(n).

#include <cassert>

using Rank = int;   // index type used by the search routines below

/*
arr[low, high) is sorted; search for "key" in this array.
If "key" is found, return a corresponding index (NOT necessarily the highest possible index);
if "key" is not found, return low - 1.

How to verify that the algorithm is correct?
Proof:
(finiteness: after each loop iteration, the width of [low, high] strictly decreases)
First, high <-- high - 1.
scenario 1: low = high
scenario 2: low < high, arr[low] = arr[high]
scenario 3: low < high, arr[low] < arr[high], key < arr[low] or key > arr[high]
scenario 4: low < high, arr[low] < arr[high], arr[low] <= key <= arr[high]

Analysis of scenario 4:
On entering the "while" loop, low <= middle <= high.
After one iteration (if we do not return):
  case a1: the "low" branch was executed, so arr[middle] < key <= arr[high],
           hence middle < high and afterwards low <= high.
  case a2: the "high" branch was executed, so arr[low] <= key < arr[middle],
           hence low < middle and afterwards low <= high.
So after every iteration (if we do not return) we still have low <= high.

On exiting the "while" loop:
  case b1: arr[low] >= arr[high]
    If the "low" branch ran last:  arr[low - 1] < key <= arr[low] = arr[high].
    If the "high" branch ran last: arr[low] = arr[high] <= key < arr[high + 1].
  case b2: arr[low] < arr[high] and arr[low] > key
    "low" must have changed in the last iteration, so arr[low - 1] < key < arr[low].
  case b3: arr[low] < arr[high] and key > arr[high]
    "high" must have changed in the last iteration, so arr[low] < arr[high] < key < arr[high + 1].
*/
template <typename T>
static Rank interpolation_search_v001(T* arr, const T& key, Rank low, Rank high)
{
    high -= 1;                      // switch to the inclusive range [low, high]
    Rank middle;
    Rank initial_low = low;
    while ((arr[low] < arr[high]) && (arr[low] <= key) && (key <= arr[high])) {
        middle = low + ((key - arr[low]) * (high - low)) / (arr[high] - arr[low]);
        assert((low <= middle) && (middle <= high));
        if (arr[middle] < key)
            low = middle + 1;
        else if (key < arr[middle])
            high = middle - 1;
        else
            return middle;
    }
    if (key == arr[low])
        return low;
    else
        return initial_low - 1;
}

/*
Search for "key" in the sorted array arr[low, high).
Return the highest index i such that arr[i] <= key.

How to verify that the algorithm is correct?
Proof:
finiteness: after each loop iteration, the width of [low, high] strictly decreases.
First, high <-- high - 1.
scenario 1: low = high
scenario 2: low < high, key < arr[low] or arr[high] <= key
scenario 3: low < high, arr[low] <= key < arr[high]

Analysis of scenario 3:
On entering the "while" loop, low <= middle < high.
On exiting the "while" loop:
  case a1: key < arr[low]
    "low" changed in the last iteration, so arr[low - 1] <= key < arr[low].
  case a2: arr[high] <= key
    "high" changed in the last iteration, so key < arr[high]: impossible.
Conclusion: return low - 1.
*/
template <typename T>
static Rank interpolation_search_v002(T* arr, const T& key, Rank low, Rank high)
{
    high -= 1;                      // switch to the inclusive range [low, high]
    assert(low <= high);
    Rank middle;
    if (key < arr[low]) {
        return low - 1;
    }
    if (arr[high] <= key) {
        return high;
    }
    // now low < high and arr[low] <= key < arr[high]
    while ((arr[low] <= key) && (key < arr[high])) {
        middle = low + ((high - low) * (key - arr[low])) / (arr[high] - arr[low]);
        assert((low <= middle) && (middle < high));
        if (key < arr[middle]) {
            high = middle;
        } else {
            low = middle + 1;
        }
    }
    return low - 1;
}
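
The routines above can be exercised with a small driver such as the following sketch; the sample array and the expected outputs in the comments are illustrative, and the driver assumes it is compiled in the same translation unit as the definitions above.

#include <iostream>

int main()
{
    int data[] = {2, 4, 4, 7, 11, 16, 22, 29, 37, 46};
    const Rank n = 10;                                // searching data[0, n)

    std::cout << interpolation_search_v001(data, 22, Rank(0), n) << '\n';  // 6
    std::cout << interpolation_search_v001(data, 5,  Rank(0), n) << '\n';  // -1 (absent)

    std::cout << interpolation_search_v002(data, 4,  Rank(0), n) << '\n';  // 2  (highest index i with data[i] <= 4)
    std::cout << interpolation_search_v002(data, 1,  Rank(0), n) << '\n';  // -1 (key below data[0])
    return 0;
}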

Notice that having probed the list at index middle, for reasons of loop-control administration, the search routines above set either high or low not to middle but to an adjacent index, whose value then bounds the interpolation in the next iteration. Since an adjacent entry's value will not be much different, the interpolation calculation is not much improved by this one-step adjustment, at the cost of an additional reference to distant memory such as disk.

Each iteration of the above code requires between five and six comparisons (the extra comparisons arise from having to distinguish the three outcomes <, >, and = using binary comparisons in the absence of a three-way comparison), plus some messy arithmetic, while the binary search algorithm can be written with one comparison per iteration and uses only trivial integer arithmetic. Binary search would thereby search an array of a million elements with no more than twenty comparisons (each involving an access to slow memory where the array elements are stored); to beat that, the interpolation search, as written above, would be allowed no more than three iterations.
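
For comparison, the sketch below shows one way to write a binary search with a single key comparison per iteration; returning the leftmost matching index is an illustrative choice, not a requirement of the algorithm.

// Binary search with one key comparison per iteration: the loop narrows the
// half-open range [low, high) to the leftmost position at which key could be
// inserted, and equality is checked once at the end. For n = 1,000,000 this
// takes at most about 20 probes of the array.
template <typename T>
int binary_search_one_compare(const T* arr, int n, const T& key) {
    int low = 0, high = n;
    while (low < high) {
        int middle = low + (high - low) / 2;
        if (arr[middle] < key)            // the only key comparison in the loop
            low = middle + 1;
        else
            high = middle;
    }
    return (low < n && !(key < arr[low])) ? low : -1;   // -1 if key is absent
}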

See also

Analysis of algorithms
Binary search
B-tree
Hash table
Heapsort
Heap (data structure)
Insertion sort
Linear search
Merge sort
Merge algorithm
Sorting algorithm
Treap
Interval tree
Quickselect
Quicksort
Recursion (computer science)
Flashsort
Integer sorting
Exponential search

References

  1. W. W. Peterson (1957). "Addressing for Random-Access Storage". IBM J. Res. Dev. 1 (2): 130–146. doi:10.1147/rd.12.0130.
  2. Simon Yuan. "Understanding The Complexity Of Interpolation Search, Seminar Advanced Algorithms and Data Structures" (PDF).
  3. Weiss, Mark Allen (2006). Data Structures and Problem Solving Using Java. Pearson Addison Wesley.
  4. Armenakis, A. C.; Garey, L. E.; Gupta, R. D. (December 1985). "An adaptation of a root finding method to searching ordered disk files". BIT Numerical Mathematics. 25 (4).
  5. Sedgewick, Robert (1990). Algorithms in C. Addison-Wesley.
  6. Mehlhorn, Kurt; Tsakalidis, Athanasios (1993). "Dynamic interpolation search". Journal of the ACM. 40 (3): 621–634. doi:10.1145/174130.174139. ISSN 0004-5411.
  7. Andersson, Arne; Mattsson, Christer (1993). "Dynamic interpolation search in o(log log n) time". Automata, Languages and Programming. Vol. 700. Berlin, Heidelberg: Springer. pp. 15–27. doi:10.1007/3-540-56939-1_58. ISBN 978-3-540-56939-8.
  8. Mohammed, Adnan Saher; Amrahov, Şahin Emrah; Çelebi, Fatih V. (1 October 2021). "Interpolated binary search: An efficient hybrid search algorithm on ordered datasets". Engineering Science and Technology. 24 (5): 1072–1079. doi:10.1016/j.jestch.2021.02.009. ISSN 2215-0986.