Interpolation search

Interpolation search is an algorithm for searching for a key in an array that has been ordered by numerical values assigned to the keys (key values). It was first described by W. W. Peterson in 1957. [1] Interpolation search resembles the method by which people search a telephone directory for a name (the key value by which the book's entries are ordered): in each step the algorithm calculates where in the remaining search space the sought item might be, based on the key values at the bounds of the search space and the value of the sought key, usually via a linear interpolation. The key value actually found at this estimated position is then compared to the key value being sought. If it is not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. This method will only work if calculations on the size of differences between key values are sensible.
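In code, with integer keys, the standard linear-interpolation probe can be written as the following minimal sketch (the function name is illustrative; a full implementation appears in the sample implementation section below):

// Estimate where key would fall between arr[low] and arr[high],
// assuming key values grow roughly linearly across the interval.
// The caller must ensure arr[high] != arr[low] to avoid division by zero.
int probe_position(const int* arr, int key, int low, int high) {
    return low + ((key - arr[low]) * (high - low)) / (arr[high] - arr[low]);
}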

[Figure: Visualization of the interpolation search algorithm in which 24 is the target value.]

Class: Search algorithm
Data structure: Array
Worst-case performance: O(n)
Best-case performance: O(1)
Average performance: O(log log n) [2]
Worst-case space complexity: O(1)
Optimal: Yes

By comparison, binary search always chooses the middle of the remaining search space, discarding the half before or after the middle depending on the comparison between the key found there and the key sought; it does not require numerical values for the keys, just a total order on them. Linear search uses equality only, comparing elements one by one from the start and ignoring any sorting.

On average the interpolation search makes about log(log(n)) comparisons (if the elements are uniformly distributed), where n is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to O(n) comparisons.

In interpolation-sequential search, interpolation is used to find an item near the one being searched for, then linear search is used to find the exact item.
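A minimal sketch of this hybrid, assuming ascending integer keys (the function name is illustrative): one interpolation probe lands near the target, then a linear scan walks to the exact element.

// Returns an index of key in the sorted array arr[0, n), or -1 if absent.
int interpolation_sequential_search(const int* arr, int n, int key) {
    if (n == 0 || key < arr[0] || key > arr[n - 1]) return -1;
    if (arr[0] == arr[n - 1]) return (key == arr[0]) ? 0 : -1;
    // One interpolation probe to get near the target.
    int i = (int)((long long)(key - arr[0]) * (n - 1) / (arr[n - 1] - arr[0]));
    // Linear scan from the probe position toward the target.
    while (i > 0 && arr[i] > key) --i;
    while (i < n - 1 && arr[i] < key) ++i;
    return (arr[i] == key) ? i : -1;
}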

Performance

Using big-O notation, the performance of the interpolation algorithm on a data set of size n is O(n); however, under the assumption of a uniform distribution of the data on the linear scale used for interpolation, the performance can be shown to be O(log log n). [3] [4] [5] Dynamic Interpolation Search goes further, achieving o(log log n) time with the help of a novel data structure. [6]

Practical performance of interpolation search depends on whether the reduced number of probes is outweighed by the more complicated calculations needed for each probe. It can be useful for locating a record in a large sorted file on disk, where each probe involves a disk seek and is much slower than the interpolation arithmetic.

Index structures like B-trees also reduce the number of disk accesses, and are more often used to index on-disk data in part because they can index many types of data and can be updated online. Still, interpolation search may be useful when one is forced to search certain sorted but unindexed on-disk datasets.

Adaptation to different datasets

When sort keys for a dataset are uniformly distributed numbers, linear interpolation is straightforward to implement and will find an index very near the sought value.

On the other hand, for a phone book sorted by name, the straightforward approach to interpolation search does not apply. The same high-level principles can still apply, though: one can estimate a name's position in the phone book using the relative frequencies of letters in names and use that as a probe location.
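One crude way to make names interpolable, sketched below under the assumption of lowercase ASCII names (the helper name is illustrative), is to map a fixed-length prefix to an integer by treating letters as base-27 digits; a table of empirical letter frequencies, as suggested above, would give better probe estimates than this uniform-alphabet assumption.

#include <string>

// Map the first prefix_len letters of a lowercase name to an integer by
// treating them as base-27 digits ('a' = 1, ..., 'z' = 26, absent = 0).
// This assumes letters are uniformly distributed, which real name data
// is not, so probes based on it will be biased.
long long name_to_number(const std::string& name, int prefix_len = 6) {
    long long value = 0;
    for (int i = 0; i < prefix_len; ++i) {
        int digit = (i < (int)name.size()) ? (name[i] - 'a' + 1) : 0;
        value = value * 27 + digit;
    }
    return value;
}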

Some interpolation search implementations may not work as expected when a run of equal key values exists. The simplest implementation of interpolation search won't necessarily select the first (or last) element of such a run.
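If the first element of a run is required, one option, sketched below for integer keys, is to walk left from whatever index the search returned; bisecting toward the run's left edge would reduce the extra cost from O(r) to O(log r) for a run of length r.

// Given any index "found" with arr[found] == key, step left to the
// first element of the run of equal keys.
int first_of_run(const int* arr, int found) {
    while (found > 0 && arr[found - 1] == arr[found])
        --found;
    return found;
}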

Book-based searching

The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and numbering them name #1, name #2, and so on); further, it is well known that some names are much more common than others (Smith, Jones). Similarly with dictionaries, where many more words start with some letters than others. Some publishers go to the effort of preparing marginal annotations or even of cutting into the side of the pages to show markers for each letter so that, at a glance, a segmented interpolation can be performed.

Sample implementation

The following C++ code example is a simple implementation. At each stage it computes a probe position, then, as with binary search, moves either the upper or lower bound in to define a smaller interval containing the sought value. Unlike binary search, which guarantees a halving of the interval's size with each stage, a misled interpolation may shrink the interval by only one element per stage, reducing efficiency to a worst case of O(n).

#include <cassert>

/*
T must implement the operators -, !=, ==, >=, <= and <
such that >=, <=, !=, == and < define a total order on T.

arr[low, high) is sorted; search for "key" in this array.
If "key" is found, return a corresponding index (NOT necessarily the
highest possible index); if "key" is not found, return low - 1.

How to verify that the algorithm is correct? Proof sketch:
First, high <- high - 1.
Scenario 1: low = high.
Scenario 2: low < high, arr[low] = arr[high].
Scenario 3: low < high, arr[low] < arr[high], key < arr[low] or key > arr[high].
Scenario 4: low < high, arr[low] < arr[high], arr[low] <= key <= arr[high].

Analysis of scenario 4: once inside the "while" loop, low <= middle <= high.
(Finiteness: after each iteration, "low" increases strictly or "high"
decreases strictly.) At the end of an iteration, if we do not return:
1. if the "low" and "high" branches have both executed at least once,
   arr[0, low) < key < arr(high, n);
2. if the "low" branch has NOT executed but the "high" branch has,
   arr[low] <= key < arr(high, n);
3. symmetrically, arr[0, low) < key <= arr[high].

When we exit the "while" loop:
Case a1: low < high, arr[low] = arr[high].
Case a2: low < high, arr[low] < arr[high], key < arr[low] or key > arr[high].
Case a3: low < high, arr[low] < arr[high], arr[low] <= key <= arr[high].

Analysis of case a3:
1. middle = high - 1: arr[high - 1] < key <= arr[high], low <- high.
2. middle = high: arr[high] < key, impossible.
3. middle = low + 1: arr[low] <= key < arr[low + 1], high <- low.
4. middle = low: key < arr[low], impossible.
*/
template <typename T>
int interpolation_search_v001(T* arr, const T& key, int low, int high)
{
    high -= 1;
    int middle;
    int initial_low = low;
    while ((arr[high] != arr[low]) && (key >= arr[low]) && (key <= arr[high])) {
        middle = low + (((key - arr[low]) * (high - low)) / (arr[high] - arr[low]));
        if (arr[middle] < key)
            low = middle + 1;
        else if (key < arr[middle])
            high = middle - 1;
        else
            return middle;
    }
    if (key == arr[low])
        return low;
    else
        return initial_low - 1;
}

/*
Search for "key" in the sorted array arr[low, high);
return the highest index i such that arr[i] <= key.

How to verify that the algorithm is correct? Proof sketch:
First, high <- high - 1.
Scenario 1: low = high.
Scenario 2: low < high, key < arr[low] or arr[high] <= key.
Scenario 3: low < high, arr[low] <= key < arr[high].

Analysis of scenario 3: once inside the "while" loop, low <= middle < high.
(Finiteness: after each iteration, "low" increases strictly or "high"
decreases strictly.) At the end of an iteration:
1. if the "low" and "high" branches have both executed at least once,
   arr[0, low - 1] <= key < arr[high, n);
2. if the "low" branch has NOT executed but the "high" branch has,
   arr[low] <= key < arr[high, n);
3. symmetrically, arr[0, low - 1] <= key < arr[high].

When we exit the "while" loop:
Case a1: key < arr[low]. Then "low" was changed in the last iteration,
         so arr[low - 1] <= key < arr[low].
Case a2: arr[high] <= key. Then "high" was changed in the last iteration,
         so key < arr[high], impossible.
*/
template <typename T>
int interpolation_search_v002(T* arr, const T& key, int low, int high)
{
    high -= 1;
    assert(low <= high);
    int middle;
    if (key < arr[low]) {
        return low - 1;
    }
    if (arr[high] <= key) {
        return high;
    }
    // Now low < high and arr[low] <= key < arr[high].
    while ((arr[low] <= key) && (key < arr[high])) {
        middle = low + ((high - low) * (key - arr[low])) / (arr[high] - arr[low]);
        assert((low <= middle) && (middle < high));
        if (key < arr[middle]) {
            high = middle;
        } else {
            low = middle + 1;
        }
    }
    return low - 1;
}
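As a quick check, a minimal usage sketch (assuming the two functions above are in scope; the array and values are illustrative):

#include <cstdio>

int main() {
    int arr[] = {2, 4, 8, 16, 32, 64, 128};
    // v001 returns an index of the key, or low - 1 when it is absent.
    printf("%d\n", interpolation_search_v001(arr, 32, 0, 7));  // prints 4
    printf("%d\n", interpolation_search_v001(arr, 33, 0, 7));  // prints -1
    // v002 returns the highest index i such that arr[i] <= key.
    printf("%d\n", interpolation_search_v002(arr, 33, 0, 7));  // prints 4
    return 0;
}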

Notice that having probed the list at index middle, for reasons of loop control administration, this code sets either high or low not to middle but to an adjacent index, which is then probed during the next iteration. Since an adjacent entry's value will not differ by much, the interpolation calculation gains little from this one-step adjustment, at the cost of an additional reference to distant memory such as disk.

Each iteration of the above code requires between five and six comparisons (the extra is due to the repetitions needed to distinguish the three states of <, >, and = via binary comparisons in the absence of a three-way comparison) plus some messy arithmetic, while the binary search algorithm can be written with one comparison per iteration and uses only trivial integer arithmetic. It would thereby search an array of a million elements with no more than twenty comparisons (involving accesses to slow memory where the array elements are stored); to beat that, the interpolation search, as written above, would be allowed no more than three iterations.
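For comparison, a sketch of a binary search written with one comparison per iteration, as described above: the equality test is deferred until the interval has shrunk to a single candidate.

// Binary search using a single "<" comparison per iteration: narrow
// [low, high] to one candidate, then test equality once at the end.
// Returns an index of key in the sorted array arr[0, n), or -1 if absent.
int binary_search_one_comparison(const int* arr, int n, int key) {
    if (n == 0) return -1;
    int low = 0, high = n - 1;
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (arr[mid] < key)
            low = mid + 1;  // key, if present, lies to the right of mid
        else
            high = mid;     // key, if present, lies at mid or to its left
    }
    return (arr[low] == key) ? low : -1;
}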

Related Research Articles

Binary search algorithm

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

Heapsort

In computer science, heapsort is a comparison-based sorting algorithm which can be thought of as "an implementation of selection sort using the right data structure." Like selection sort, heapsort divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element from it and inserting it into the sorted region. Unlike selection sort, heapsort does not waste time with a linear-time scan of the unsorted region; rather, heapsort maintains the unsorted region in a heap data structure to efficiently find the largest element in each step.

Heap (data structure)

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: In a max heap, for any given node C, if P is a parent node of C, then the key of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap is called the root node.

Insertion sort

Insertion sort is a simple sorting algorithm that builds the final sorted array (or list) one item at a time by comparisons. It is much less efficient on large lists than more advanced algorithms such as quicksort, heapsort, or merge sort. However, insertion sort provides several advantages: it is simple to implement, efficient for small or nearly sorted lists, adaptive, stable, in-place, and online.

In computer science, linear search or sequential search is a method for finding an element within a list. It sequentially checks each element of the list until a match is found or the whole list has been searched.

Merge sort

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

Merge algorithms are a family of algorithms that take multiple sorted lists as input and produce a single list as output, containing all the elements of the input lists in sorted order. These algorithms are used as subroutines in various sorting algorithms, most famously merge sort.

Sorting algorithm

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important for optimizing the efficiency of other algorithms that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output.

In computer science, selection sort is an in-place comparison sorting algorithm. It has an O(n²) time complexity, which makes it inefficient on large lists, and generally performs worse than the similar insertion sort. Selection sort is noted for its simplicity and has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

Treap

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

In computer science, a lookup table (LUT) is an array that replaces runtime computation with a simpler array indexing operation, in a process termed direct addressing. The savings in processing time can be significant, because retrieving a value from memory is often faster than carrying out an "expensive" computation or input/output operation. The tables may be precalculated and stored in static program storage, calculated as part of a program's initialization phase (memoization), or even stored in hardware in application-specific platforms. Lookup tables are also used extensively to validate input values by matching against a list of valid items in an array and, in some programming languages, may include pointer functions to process the matching input. FPGAs also make extensive use of reconfigurable, hardware-implemented lookup tables to provide programmable hardware functionality. LUTs differ from hash tables in that, to retrieve a value v with key k, a hash table stores v in the slot h(k), where h is a hash function, i.e. k is used to compute the slot, while in the case of a LUT, v is stored in slot k, so it is directly addressable.

The sort-merge join is a join algorithm and is used in the implementation of a relational database management system.

k-d tree

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space, that is, a space with exactly k orthogonal axes. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest-neighbor searches).

Z-order curve

In mathematical analysis and computer science, functions known as Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It is named in France after Henri Lebesgue, who studied it in 1904, and named in the United States after Guy Macdonald Morton, who first applied the order to file sequencing in 1966. The z-value of a point in multidimensions is simply calculated by interleaving the binary representations of its coordinate values. Once the data are sorted into this ordering, any one-dimensional data structure can be used, such as simple one-dimensional arrays, binary search trees, B-trees, skip lists or hash tables. The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree.

Quicksort

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

Recursion (computer science)

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

In computer science, integer sorting is the algorithmic problem of sorting a collection of data values by integer keys. Algorithms designed for integer sorting may also often be applied to sorting problems in which the keys are floating point numbers, rational numbers, or text strings. The ability to perform integer arithmetic on the keys allows integer sorting algorithms to be faster than comparison sorting algorithms in many cases, depending on the details of which operations are allowed in the model of computing and how large the integers to be sorted are.

In computer science, an exponential search is an algorithm, created by Jon Bentley and Andrew Chi-Chih Yao in 1976, for searching sorted, unbounded/infinite lists. There are numerous ways to implement this, with the most common being to determine a range that the search key resides in and performing a binary search within that range. This takes O(log i) where i is the position of the search key in the list, if the search key is in the list, or the position where the search key should be, if the search key is not in the list.

References

  1. Peterson, W. W. (1957). "Addressing for Random-Access Storage". IBM J. Res. Dev. 1 (2): 130–146. doi:10.1147/rd.12.0130.
  2. Yuan, Simon. "Understanding The Complexity Of Interpolation Search" (PDF). Seminar Advanced Algorithms and Data Structures.
  3. Weiss, Mark Allen (2006). Data Structures and Problem Solving Using Java. Pearson Addison Wesley.
  4. Armenakis, A. C.; Garey, L. E.; Gupta, R. D. (December 1985). "An adaptation of a root finding method to searching ordered disk files". BIT Numerical Mathematics. 25 (4).
  5. Sedgewick, Robert (1990). Algorithms in C. Addison-Wesley.
  6. Andersson, Arne; Mattsson, Christer (1993). "Dynamic Interpolation Search in o(log log n) Time". In Lingas, Andrzej; Karlsson, Rolf; Carlsson, Svante (eds.). Automata, Languages and Programming. Lecture Notes in Computer Science. Vol. 700. Springer Berlin / Heidelberg. pp. 15–27. doi:10.1007/3-540-56939-1_58.
  7. Mohammed, Adnan Saher; Amrahov, Şahin Emrah; Çelebi, Fatih V. (October 2021). "Interpolated binary search: An efficient hybrid search algorithm on ordered datasets". Engineering Science and Technology. 24 (5): 1072–1079. doi:10.1016/j.jestch.2021.02.009. ISSN 2215-0986.