K-sorted sequence

Last updated

In computer science, a nearly-sorted sequence, also known as roughly-sorted sequence and as -sorted sequence is a sequence which is almost ordered. By almost ordered, it is meant that no element of the sequence is very far away from where it would be if the sequence were perfectly ordered. It is still possible that no element of the sequence is at the place where it should be if the sequence were perfectly ordered.

Contents

The nearly-sorted sequences are particularly useful when the exact order of element has little importance. For example Twitter [1] nearly sort tweets, up to the second, as there is no need for more precision. Actually, given the impossibility to exactly synchronize all computers, an exact sorting of all tweets according to the time at which they are posted is impossible. This idea led to the creation of Snowflake IDs.

-sorting is the operation of reordering the elements of a sequence so that it becomes -sorted. -sorting is generally more efficient than sorting. Similarly, sorting a sequence is easier if it is known that the sequence is -sorted. So if a program needs only to consider -sorted sequences as input or output, considering -sorted sequences may save time.

The radius of a sequence is a measure of presortedness, that is, its value indicate how much the elements in the list has to be moved to get a totally sorted value. In the above example of tweets which are sorted up to the second, the radius is bounded by the number of tweets in a second.

Definition

Given a positive number , a sequence is said to be -sorted if for each and for each , . That is, the sequence has to be ordered only for pairs of elements whose distance is at least .

The radius of the sequence , denoted [2] or [3] is the smallest such that the sequence is -sorted. The radius is a measure of presortedness.

A sequence is said to be nearly-sorted or roughly-sorted if its radius is small compared to its length.

Equivalent definition

A sequence is -sorted if and only if each range of length , is -sorted.

Properties

All sequences of length are -sorted, that is, . A sequence is -sorted if and only if it is sorted. A -sorted sequence is automatically -sorted but not necessarily -sorted.

Relation with sorted sequences

Given a sequence a -sorted sequence and its sorted permutation , is at most .

Algorithms

Deciding whether a sequence is -sorted

Deciding whether a sequence is -sorted can be done in linear time and constant space by a streaming algorithm. It suffices, for each , to keep track of and to check that .

Computing the radius of a sequence

Computing the radius of a sequence can be computed in linear time and space. This follows from the fact that it can be defined as .

Halving the radius of a sequence

Given a -sorted sequence , it is possible to compute a -sorted permutation of in linear time and constant space.

First, given a sequence , lets say that this sequence is partitioned if the -smaller elements are in and the -greater elements are in . Lets call partitioning the action of reordering the sequence into a partitioned permutation. This can be done in linear time by first finding the median of and then moving elements to the first or second half depending on whether they are smaller or greater than the median.

The sequence can be obtained by partitioning the blocks of elements at position , then by partitioning the blocks of elements at position , and then again the elements at position for each number such that those sequences are defined.

Using processors, with no shared read nor write access to memory, the same algorithm can be applied in time, since each partition of a sequence can occur in parallel.

Merging two -sorted sequences

Merging two -sorted sequences and can be done in linear time and constant space.

First, using the preceding algorithm, both sequences should be transformed into -sorted sequences.

Let's construct iteratively an output sequence by removing content from both and adding it in .

If both 's are empty, then it suffices to return . Otherwise, let us assume that is empty and not , it suffices to remove the content of and append it to . The case where is empty and not is similar by symmetry.

Let us consider the case where neither is empty. Let us call and , they are the greatest of the -firsts elements of each list. Let us assume that , the other case is similar by symmetry. Remove from and remove from and add them to .

Sorting a -sorted sequence

A -sorted sequence can be sorted by applying the halving algorithm given above times. This takes time on a sequential machine, or time using processors.

-sorting a sequence

Since each sequence is necessarily -sorted, it suffices to apply the halving algorithm -times. Thus, -sorting can be done in -time. This algorithm is Par-optimal, that is, there exists no sequential algorithm with a better worst-case complexity.

Related Research Articles

In number theory, an additive function is an arithmetic function f(n) of the positive integer variable n such that whenever a and b are coprime, the function applied to the product ab is the sum of the values of the function applied to a and b:

The Liouville lambda function, denoted by λ(n) and named after Joseph Liouville, is an important arithmetic function. Its value is +1 if n is the product of an even number of prime numbers, and −1 if it is the product of an odd number of primes.

<span class="mw-page-title-main">Lagrange's four-square theorem</span> Every natural number can be represented as the sum of four integer squares

Lagrange's four-square theorem, also known as Bachet's conjecture, states that every natural number can be represented as a sum of four non-negative integer squares. That is, the squares form an additive basis of order four.

In algebraic geometry and the theory of complex manifolds, a logarithmic differential form is a differential form with poles of a certain kind. The concept was introduced by Pierre Deligne. In short, logarithmic differentials have the mildest possible singularities needed in order to give information about an open submanifold.

In mathematics, a real or complex-valued function f on d-dimensional Euclidean space satisfies a Hölder condition, or is Hölder continuous, when there are real constants C ≥ 0, α > 0, such that

In computational complexity theory, PostBQP is a complexity class consisting of all of the computational problems solvable in polynomial time on a quantum Turing machine with postselection and bounded error.

<span class="mw-page-title-main">Sunflower (mathematics)</span> Collection of sets in which every two sets have the same intersection

In the mathematical fields of set theory and extremal combinatorics, a sunflower or -system is a collection of sets in which all possible distinct pairs of sets share the same intersection. This common intersection is called the kernel of the sunflower.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function

A self-concordant function is a function satisfying a certain differential inequality, which makes it particularly easy for optimization using Newton's method A self-concordant barrier is a particular self-concordant function, that is also a barrier function for a particular convex set. Self-concordant barriers are important ingredients in interior point methods for optimization.

<span class="mw-page-title-main">Anatoly Karatsuba</span> Russian mathematician

Anatoly Alexeyevich Karatsuba was a Russian mathematician working in the field of analytic number theory, p-adic numbers and Dirichlet series.

In set theory, an extender is a system of ultrafilters which represents an elementary embedding witnessing large cardinal properties. A nonprincipal ultrafilter is the most basic case of an extender.

In theoretical computer science, the computational complexity of matrix multiplication dictates how quickly the operation of matrix multiplication can be performed. Matrix multiplication algorithms are a central subroutine in theoretical and numerical algorithms for numerical linear algebra and optimization, so finding the fastest algorithm for matrix multiplication is of major practical relevance.

In mathematics, a submodular set function is a set function that, informally, describes the relationship between a set of inputs and an output, where adding more of one input has a decreasing additional benefit. The natural diminishing returns property which makes them suitable for many applications, including approximation algorithms, game theory and electrical networks. Recently, submodular functions have also found utility in several real world problems in machine learning and artificial intelligence, including automatic summarization, multi-document summarization, feature selection, active learning, sensor placement, image collection summarization and many other domains.

In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array. For example, a common task, known as range minimum query, is finding the smallest value inside a given range within a list of numbers.

In the mathematical theory of probability, the drift-plus-penalty method is used for optimization of queueing networks and other stochastic systems.

Funnelsort is a comparison-based sorting algorithm. It is similar to mergesort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in a cache where operations are done. It was introduced by Matteo Frigo, Charles Leiserson, Harald Prokop, and Sridhar Ramachandran in 1999 in the context of the cache oblivious model.

<span class="mw-page-title-main">Suffix automaton</span> Deterministic finite automaton accepting set of all suffixes of particular string

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

Buchholz's psi-functions are a hierarchy of single-argument ordinal functions introduced by German mathematician Wilfried Buchholz in 1986. These functions are a simplified version of the -functions, but nevertheless have the same strength as those. Later on this approach was extended by Jäger and Schütte.

In number theory, the prime omega functions and count the number of prime factors of a natural number Thereby counts each distinct prime factor, whereas the related function counts the total number of prime factors of honoring their multiplicity. That is, if we have a prime factorization of of the form for distinct primes , then the respective prime omega functions are given by and . These prime factor counting functions have many important number theoretic relations.

<span class="mw-page-title-main">Parallel external memory</span>

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

References

  1. @rk. "Announcing Snowflake". Twitter blog. Retrieved 26 April 2021.
  2. Altman, Tom; Igarashi, Yoshihide (1989-08-25). "Roughly Sorting: Sequential and Parallel Approach". Journal of Information Processing. 12 (2): 154–158.
  3. Estivill-Castro, Vladimir; Wood, Derick (October 1989). "A new measure of presortdness". Information and Computation. 83 (1): 111–119. doi: 10.1016/0890-5401(89)90050-3 .