Proxmap sort

[Figure: Insertion sorting into buckets during proxmap.]
[Figure: Example of insertion sort sorting a list of random numbers.]

Class: Sorting algorithm
Data structure: Array
Worst-case performance: O(n²)
Best-case performance: O(n)
Average performance: O(n)
Worst-case space complexity: O(n)

[Figure: Elements are distributed among bins.]
[Figure: Unlike bucket sort, which sorts after all the buckets are filled, the elements are insertion sorted as they are inserted.]

ProxmapSort, or Proxmap sort, is a sorting algorithm that works by partitioning an array of data items, or keys, into a number of "subarrays" (termed buckets, in similar sorts). The name is short for computing a "proximity map," which indicates for each key K the beginning of a subarray where K will reside in the final sorted order. Keys are placed into each subarray using insertion sort. If keys are "well distributed" among the subarrays, sorting occurs in linear time. The computational complexity estimates involve the number of subarrays and the proximity mapping function, the "map key," used. It is a form of bucket and radix sort.

Once a ProxmapSort is complete, ProxmapSearch can be used to find keys in the sorted array in O(1) time if the keys were well distributed during the sort.

Both algorithms were invented in the late 1980s by Prof. Thomas A. Standish at the University of California, Irvine.

Overview

Basic strategy

Simplified version: Given an array A with n keys:

  1. Initialize: Create and initialize two arrays of size n: hitCount and proxMap, and two arrays of size A.length: location and A2.
  2. Partition: Using a carefully chosen mapKey function, divide A2 into subarrays using the keys in A.
  3. Disperse: Read over A, dropping each key into its bucket in A2, insertion sorting as needed.
  4. Collect: Visit the subarrays in order and put all the elements back into the original array, or simply use A2.

Note: "keys" may also contain other data, for instance an array of Student objects that contain the key plus a student ID and name. This makes ProxMapSort suitable for organizing groups of objects, not just keys themselves.

Example

Consider a full array A[0 to n−1] with n keys. Let i be an index of A. Sort A's keys into an array A2 of equal size.

The map key function is defined as mapKey(K) = floor(K). For example, mapKey(6.7) = 6, so the key 6.7 belongs to subarray 6.

Array table (H and P are indexed by bucket number 0–11; A, L, and A2 by key position 0–12):

i      0     1     2     3     4     5     6     7     8     9     10    11    12
A[i]   6.7   5.9   8.4   1.2   7.3   3.7   11.5  1.1   4.8   0.4   10.5  6.1   1.8
H[i]   1     3     0     1     1     1     2     1     1     0     1     1
P[i]   0     1     -9    4     5     6     7     9     10    -9    11    12
L[i]   7     6     10    1     9     4     12    1     5     0     11    7     1
A2[i]  0.4   1.1   1.2   1.8   3.7   4.8   5.9   6.1   6.7   7.3   8.4   10.5  11.5

[Animation: ProxmapSort demonstration.]

Pseudocode

// compute hit counts
for i = 0 to 11 // 12 buckets, indexed 0 to 11
{
    H[i] = 0;
}
for i = 0 to 12 // 13 keys, indexed 0 to A.length − 1 = 12
{
    pos = MapKey(A[i]);
    H[pos] = H[pos] + 1;
}

// compute prox map – location of start of each subarray
runningTotal = 0;
for i = 0 to 11
{
    if H[i] == 0
        P[i] = -9; // -9 marks an empty bucket
    else
    {
        P[i] = runningTotal;
        runningTotal = runningTotal + H[i];
    }
}

// compute location – subarray in A2 – into which each item in A is to be placed
for i = 0 to 12
    L[i] = P[MapKey(A[i])];

// clear A2
for i = 0 to 12
    A2[i] = <empty>;

// insert each item into the subarray beginning at start, preserving order
for i = 0 to 12
{
    start = L[i]; // subarray for this item begins at this location
    insertionMade = false;
    for j = start until <the end of A2 is reached or insertion made>
    {
        if A2[j] == <empty> // empty slot: put the item in this position of the subarray
        {
            A2[j] = A[i];
            insertionMade = true;
        }
        else if A[i] < A2[j] // key belongs at A2[j]
        {
            end = j + 1; // find end of used part of subarray – where first <empty> is
            while A2[end] != <empty>
                end = end + 1;
            for k = end − 1 downto j // move larger keys to the right 1 cell
                A2[k + 1] = A2[k];
            A2[j] = A[i]; // add in new key
            insertionMade = true;
        }
    }
}

Here A is the array to be sorted and the mapKey function determines the number of subarrays to use. For example, floor(K) will simply assign as many subarrays as there are integers in the data in A. Dividing the key by a constant reduces the number of subarrays; different functions can be used to translate the range of elements in A to subarrays, such as converting the letters A–Z to 0–25 or returning the first character (0–255) for sorting strings; a few variants are sketched below. Subarrays are sorted as the data comes in, not after all data has been placed into the subarray, as is typical in bucket sorting.
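
The following JavaScript sketches of possible mapKey variants are illustrative only; the helper names and the constant c are assumptions, not from the original article:

// One bucket per integer value in the data (the floor(K) map used above).
function mapKeyFloor(key) { return Math.floor(key); }

// Dividing by a constant c yields fewer, wider buckets.
function mapKeyScaled(key, c) { return Math.floor(key / c); }

// Letters A–Z mapped to buckets 0–25 (assumes uppercase input).
function mapKeyAZ(word) { return word.charCodeAt(0) - "A".charCodeAt(0); }

// First character code (0–255) as the bucket, for sorting strings.
function mapKeyFirstChar(s) { return s.charCodeAt(0) & 0xFF; }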

Proxmap searching

ProxmapSearch uses the proxMap array generated by a previously completed ProxmapSort to find keys in the sorted array A2 in O(1) expected time, provided the keys were well distributed by the sort.

Basic strategy

Given a key, mapKey(key) identifies its bucket, and proxMap[mapKey(key)] gives the index in A2 where that bucket's subarray begins; the search then scans sequentially from that index until the key is found or the end of the subarray is passed.

Pseudocode

function mapKey(key) is
    return floor(key)

proxMap ← previously generated proxmap array of size n
A2 ← previously sorted array of size n

function proxmap-search(key) is
    for i = proxMap[mapKey(key)] to length(A2) − 1 do
        if A2[i].key == key then
            return A2[i]
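
For concreteness, a minimal JavaScript sketch of the search over plain numeric keys (assuming the proxMap produced by the sort above, with -9 marking empty buckets; the names are illustrative):

function proxmapSearch(key, proxMap, A2) {
    var start = proxMap[Math.floor(key)]; // start of the bucket that would hold key
    if (start === -9 || start == null) {
        return -1; // empty or nonexistent bucket: key cannot be present
    }
    // Keys are already sorted, so stop as soon as we pass where key would sit.
    for (var i = start; i < A2.length && A2[i] <= key; i++) {
        if (A2[i] === key) {
            return i; // found: return its position in A2
        }
    }
    return -1; // not present
}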

Analysis

Performance

Computing H, P, and L all take O(n) time. Each is computed with one pass through an array, with constant time spent at each array location.

Having a good MapKey function is imperative for avoiding the worst case. We must know something about the distribution of the data to come up with a good key.

Optimizations

  1. Save time: Save the MapKey(i) values so they don't have to be recomputed (as they are in the code above).
  2. Save space: The proxMaps can be stored in the hitCount array, as the hit counts are not needed once the proxmap is computed; the data can be sorted back into A instead of using A2, if one takes care to note which A values have been sorted so far and which have not.

A JavaScript implementation:

Array.prototype.ProxmapSort = function () {
    //-- Edit date: 2019/11/13 Taiwan --//
    var start = 0;
    var end = this.length;
    var A2 = new Array(end);
    var MapKey = new Array(end);
    var hitCount = new Array(end);

    for (var i = start; i < end; i++) {
        hitCount[i] = 0;
    }

    // Find the minimum and maximum keys.
    var min = this[start];
    var max = this[start];
    for (var i = start + 1; i < end; i++) {
        if (this[i] < min) {
            min = this[i];
        } else if (this[i] > max) {
            max = this[i];
        }
    }

    // Optimization 1: save the MapKey[i] values so they need not be recomputed.
    for (var i = start; i < end; i++) {
        MapKey[i] = Math.floor(((this[i] - min) / (max - min)) * (end - 1));
        hitCount[MapKey[i]]++;
    }

    // Optimization 2: the proxmaps are stored in the hitCount array.
    hitCount[end - 1] = end - hitCount[end - 1];
    for (var i = end - 1; i > start; i--) {
        hitCount[i - 1] = hitCount[i] - hitCount[i - 1];
    }

    // Insert each item this[i] into its correct position in A2.
    var insertIndex = 0;
    var insertStart = 0;
    for (var i = start; i < end; i++) {
        insertIndex = hitCount[MapKey[i]];
        insertStart = insertIndex;
        while (A2[insertIndex] != null) {
            insertIndex++;
        }
        while (insertIndex > insertStart && this[i] < A2[insertIndex - 1]) {
            A2[insertIndex] = A2[insertIndex - 1];
            insertIndex--;
        }
        A2[insertIndex] = this[i];
    }

    for (var i = start; i < end; i++) {
        this[i] = A2[i];
    }
};
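
Usage, with the keys from the example table above (result shown as a comment):

var a = [6.7, 5.9, 8.4, 1.2, 7.3, 3.7, 11.5, 1.1, 4.8, 0.4, 10.5, 6.1, 1.8];
a.ProxmapSort();
console.log(a); // 0.4, 1.1, 1.2, 1.8, 3.7, 4.8, 5.9, 6.1, 6.7, 7.3, 8.4, 10.5, 11.5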

Comparison with other sorting algorithms

Since ProxmapSort is not a comparison sort, the Ω(n log n) lower bound is inapplicable.[citation needed] Its speed can be attributed to its not being comparison-based and to using arrays rather than dynamically allocated objects and pointers that must be followed, as when using a binary search tree.

ProxmapSort allows for the use of ProxmapSearch. Despite the O(n) build time, ProxmapSearch makes up for it with its O(1) expected access time, making it very appealing for large databases. If the data does not need to be updated often, the access time may make this approach more favorable than other non-comparison-based sorts.

Like ProxmapSort, bucket sort generally operates on a list of n numeric inputs between zero and some maximum key or value M and divides the value range into n buckets each of size M/n. If each bucket is sorted using insertion sort, ProxmapSort and bucket sort can be shown to run in predicted linear time.[1][original research?] However, the performance of this sort degrades with clustering (or too few buckets with too many keys); if many values occur close together, they will all fall into a single bucket and performance will be severely diminished. This behavior also holds for ProxmapSort: if the buckets are too large, its performance will degrade severely.
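
To see that degradation concretely, a hypothetical sketch (not from the article), using the prototype method above: one extreme outlier stretches the key range, so the min–max scaling in the implementation squeezes almost every key into bucket 0 and the per-bucket insertion step degenerates toward quadratic time:

var skewed = [1000000]; // one extreme outlier stretches the key range
for (var i = 0; i < 999; i++) {
    skewed.push(Math.random()); // 999 keys clustered near zero
}
// Nearly all keys now map to bucket 0, so ProxmapSort behaves like
// plain insertion sort on that single hot bucket: roughly O(n^2).
skewed.ProxmapSort();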

References

  1. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 8.4: Bucket sort, pp. 174–177.