Proxmap sort

[Figure: Insertion sorting into buckets during proxmap.]
[Figure: Example of insertion sort sorting a list of random numbers.]

Class: Sorting algorithm
Data structure: Array
Worst-case performance: O(n²)
Best-case performance: O(n)
Average performance: O(n)
Worst-case space complexity: O(n)

[Figure: Elements are distributed among bins.]
[Figure: Unlike bucket sort, which sorts after all the buckets are filled, the elements are insertion sorted as they are inserted.]

ProxmapSort, or Proxmap sort, is a sorting algorithm that works by partitioning an array of data items, or keys, into a number of "subarrays" (termed buckets, in similar sorts). The name is short for computing a "proximity map," which indicates for each key K the beginning of a subarray where K will reside in the final sorted order. Keys are placed into each subarray using insertion sort. If keys are "well distributed" among the subarrays, sorting occurs in linear time. The computational complexity estimates involve the number of subarrays and the proximity mapping function, the "map key," used. It is a form of bucket and radix sort.

Once a ProxmapSort is complete, ProxmapSearch can be used to find keys in the sorted array in O(1) time if the keys were well distributed during the sort.

Both algorithms were invented in the late 1980s by Prof. Thomas A. Standish at the University of California, Irvine.

Overview

Basic strategy

Simplified version: Given an array A with n keys:

  1. Initialize: Create and initialize two arrays of size n: hitCount and proxMap, and two arrays of size A.length: location and A2.
  2. Partition: Using a carefully chosen mapKey function, divide A2 into subarrays using the keys in A.
  3. Disperse: Read over A, dropping each key into its bucket in A2, insertion sorting as needed.
  4. Collect: Visit the subarrays in order and put all the elements back into the original array, or simply use A2.

Note: "keys" may also contain other data, for instance an array of Student objects that contain the key plus a student ID and name. This makes ProxMapSort suitable for organizing groups of objects, not just keys themselves.

Example

Consider a full array A[0 to n−1] with n keys. Let i be an index of A. Sort A's keys into an array A2 of equal size.

The map key function is defined as mapKey(K) = floor(K). For example, mapKey(6.7) = 6, so the key 6.7 belongs to subarray 6.

Array table (H and P are indexed by bucket number 0–11; A, L, and A2 by key position 0–12):

i      0     1     2     3     4     5     6     7     8     9     10    11    12
A[i]   6.7   5.9   8.4   1.2   7.3   3.7   11.5  1.1   4.8   0.4   10.5  6.1   1.8
H[i]   1     3     0     1     1     1     2     1     1     0     1     1
P[i]   0     1     -9    4     5     6     7     9     10    -9    11    12
L[i]   7     6     10    1     9     4     12    1     5     0     11    7     1
A2[i]  0.4   1.1   1.2   1.8   3.7   4.8   5.9   6.1   6.7   7.3   8.4   10.5  11.5

[Animation: ProxmapSort demonstration.]

Pseudocode

// compute hit counts
for i = 0 to 11 // 12 buckets, indexed 0 to 11
{
    H[i] = 0;
}
for i = 0 to 12 // 13 keys, indexed 0 to A.length − 1 = 12
{
    pos = MapKey(A[i]);
    H[pos] = H[pos] + 1;
}

// compute prox map – location of start of each subarray
runningTotal = 0;
for i = 0 to 11
{
    if H[i] == 0
        P[i] = -9; // -9 marks an empty bucket
    else
    {
        P[i] = runningTotal;
        runningTotal = runningTotal + H[i];
    }
}

// compute location – subarray in A2 – into which each item in A is to be placed
for i = 0 to 12
    L[i] = P[MapKey(A[i])];

// clear A2
for i = 0 to 12
    A2[i] = <empty>;

// insert each item into the subarray beginning at start, preserving order
for i = 0 to 12
{
    start = L[i]; // subarray for this item begins at this location
    insertionMade = false;
    for j = start until <the end of A2 is reached or insertion made>
    {
        if A2[j] == <empty> // empty slot: put the item in this position of the subarray
        {
            A2[j] = A[i];
            insertionMade = true;
        }
        else if A[i] < A2[j] // key belongs at A2[j]
        {
            end = j + 1; // find end of used part of subarray – where first <empty> is
            while A2[end] != <empty>
                end = end + 1;
            for k = end − 1 downto j // move larger keys to the right 1 cell
                A2[k + 1] = A2[k];
            A2[j] = A[i]; // add in new key
            insertionMade = true;
        }
    }
}

Here A is the array to be sorted and the mapKey function determines the number of subarrays to use. For example, floor(K) will simply assign as many subarrays as there are integers in the data in A. Dividing the key by a constant reduces the number of subarrays; different functions can be used to translate the range of elements in A to subarrays, such as converting the letters A–Z to 0–25 or returning the first character (0–255) for sorting strings; a few variants are sketched below. Subarrays are sorted as the data comes in, not after all data has been placed into the subarray, as is typical in bucket sorting.
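
The following JavaScript sketches of possible mapKey variants are illustrative only; the helper names and the constant c are assumptions, not from the original article:

// One bucket per integer value in the data (the floor(K) map used above).
function mapKeyFloor(key) { return Math.floor(key); }

// Dividing by a constant c yields fewer, wider buckets.
function mapKeyScaled(key, c) { return Math.floor(key / c); }

// Letters A–Z mapped to buckets 0–25 (assumes uppercase input).
function mapKeyAZ(word) { return word.charCodeAt(0) - "A".charCodeAt(0); }

// First character code (0–255) as the bucket, for sorting strings.
function mapKeyFirstChar(s) { return s.charCodeAt(0) & 0xFF; }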

Proxmap searching

ProxmapSearch uses the proxMap array generated by a previously completed ProxmapSort to find keys in the sorted array A2 in O(1) expected time, provided the keys were well distributed by the sort.

Basic strategy

Given a key, mapKey(key) identifies its bucket, and proxMap[mapKey(key)] gives the index in A2 where that bucket's subarray begins; the search then scans sequentially from that index until the key is found or the end of the subarray is passed.

Pseudocode

function mapKey(key) is
    return floor(key)

proxMap ← previously generated proxmap array of size n
A2 ← previously sorted array of size n

function proxmap-search(key) is
    for i = proxMap[mapKey(key)] to length(A2) − 1 do
        if A2[i].key == key then
            return A2[i]
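
For concreteness, a minimal JavaScript sketch of the search over plain numeric keys (assuming the proxMap produced by the sort above, with -9 marking empty buckets; the names are illustrative):

function proxmapSearch(key, proxMap, A2) {
    var start = proxMap[Math.floor(key)]; // start of the bucket that would hold key
    if (start === -9 || start == null) {
        return -1; // empty or nonexistent bucket: key cannot be present
    }
    // Keys are already sorted, so stop as soon as we pass where key would sit.
    for (var i = start; i < A2.length && A2[i] <= key; i++) {
        if (A2[i] === key) {
            return i; // found: return its position in A2
        }
    }
    return -1; // not present
}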

Analysis

Performance

Computing H, P, and L all take O(n) time. Each is computed with one pass through an array, with constant time spent at each array location.

Having a good MapKey function is imperative for avoiding the worst case. We must know something about the distribution of the data to come up with a good key.

Optimizations

  1. Save time: Save the MapKey(i) values so they don't have to be recomputed (as they are in the code above).
  2. Save space: The proxMaps can be stored in the hitCount array, as the hit counts are not needed once the proxmap is computed; the data can be sorted back into A instead of using A2, if one takes care to note which A values have been sorted so far and which have not.

A JavaScript implementation:

Array.prototype.ProxmapSort = function () {
    //-- Edit date: 2019/11/13 Taiwan --//
    var start = 0;
    var end = this.length;
    var A2 = new Array(end);
    var MapKey = new Array(end);
    var hitCount = new Array(end);

    for (var i = start; i < end; i++) {
        hitCount[i] = 0;
    }

    // Find the minimum and maximum keys.
    var min = this[start];
    var max = this[start];
    for (var i = start + 1; i < end; i++) {
        if (this[i] < min) {
            min = this[i];
        } else if (this[i] > max) {
            max = this[i];
        }
    }

    // Optimization 1: save the MapKey[i] values so they need not be recomputed.
    for (var i = start; i < end; i++) {
        MapKey[i] = Math.floor(((this[i] - min) / (max - min)) * (end - 1));
        hitCount[MapKey[i]]++;
    }

    // Optimization 2: the proxmaps are stored in the hitCount array.
    hitCount[end - 1] = end - hitCount[end - 1];
    for (var i = end - 1; i > start; i--) {
        hitCount[i - 1] = hitCount[i] - hitCount[i - 1];
    }

    // Insert each item this[i] into its correct position in A2.
    var insertIndex = 0;
    var insertStart = 0;
    for (var i = start; i < end; i++) {
        insertIndex = hitCount[MapKey[i]];
        insertStart = insertIndex;
        while (A2[insertIndex] != null) {
            insertIndex++;
        }
        while (insertIndex > insertStart && this[i] < A2[insertIndex - 1]) {
            A2[insertIndex] = A2[insertIndex - 1];
            insertIndex--;
        }
        A2[insertIndex] = this[i];
    }

    for (var i = start; i < end; i++) {
        this[i] = A2[i];
    }
};
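
Usage, with the keys from the example table above (result shown as a comment):

var a = [6.7, 5.9, 8.4, 1.2, 7.3, 3.7, 11.5, 1.1, 4.8, 0.4, 10.5, 6.1, 1.8];
a.ProxmapSort();
console.log(a); // 0.4, 1.1, 1.2, 1.8, 3.7, 4.8, 5.9, 6.1, 6.7, 7.3, 8.4, 10.5, 11.5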

Comparison with other sorting algorithms

Since ProxmapSort is not a comparison sort, the Ω(n log n) lower bound is inapplicable.[citation needed] Its speed can be attributed to its not being comparison-based and to using arrays rather than dynamically allocated objects and pointers that must be followed, as when using a binary search tree.

ProxmapSort allows for the use of ProxmapSearch. Despite the O(n) build time, ProxmapSearch makes up for it with its O(1) expected access time, making it very appealing for large databases. If the data does not need to be updated often, the access time may make this approach more favorable than other non-comparison-based sorts.

Like ProxmapSort, bucket sort generally operates on a list of n numeric inputs between zero and some maximum key or value M and divides the value range into n buckets each of size M/n. If each bucket is sorted using insertion sort, ProxmapSort and bucket sort can be shown to run in predicted linear time.[1][original research?] However, the performance of this sort degrades with clustering (or too few buckets with too many keys); if many values occur close together, they will all fall into a single bucket and performance will be severely diminished. This behavior also holds for ProxmapSort: if the buckets are too large, its performance will degrade severely.
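
To see that degradation concretely, a hypothetical sketch (not from the article), using the prototype method above: one extreme outlier stretches the key range, so the min–max scaling in the implementation squeezes almost every key into bucket 0 and the per-bucket insertion step degenerates toward quadratic time:

var skewed = [1000000]; // one extreme outlier stretches the key range
for (var i = 0; i < 999; i++) {
    skewed.push(Math.random()); // 999 keys clustered near zero
}
// Nearly all keys now map to bucket 0, so ProxmapSort behaves like
// plain insertion sort on that single hot bucket: roughly O(n^2).
skewed.ProxmapSort();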

References

  1. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 8.4: Bucket sort, pp. 174–177.