Maximum subarray problem

Visualization of how sub-arrays change based on start and end positions of a sample. Each possible contiguous sub-array is represented by a point on a colored line. That point's y-coordinate represents the sum of the sample. Its x-coordinate represents the end of the sample, and the leftmost point on that colored line represents the start of the sample. In this case, the array from which samples are taken is [2, 3, -1, -20, 5, 10].

In computer science, the maximum sum subarray problem, also known as the maximum segment sum problem, is the task of finding a contiguous subarray with the largest sum, within a given one-dimensional array A[1...n] of numbers. It can be solved in O(n) time and O(1) space.


Formally, the task is to find indices i and j with 1 ≤ i ≤ j ≤ n, such that the sum

    A[i] + A[i+1] + ... + A[j]

is as large as possible. (Some formulations of the problem also allow the empty subarray to be considered; by convention, the sum of all values of the empty subarray is zero.) Each number in the input array A could be positive, negative, or zero. [1]

For example, for the array of values [−2, 1, −3, 4, −1, 2, 1, −5, 4], the contiguous subarray with the largest sum is [4, −1, 2, 1], with sum 6.

Some properties of this problem are:

  1. If the array contains all non-negative numbers, then the problem is trivial; a maximum subarray is the entire array.
  2. If the array contains all non-positive numbers, then a solution is any subarray of size 1 containing the maximal value of the array (or the empty subarray, if it is permitted).
  3. Several different sub-arrays may have the same maximum sum.

Although this problem can be solved using several different algorithmic techniques, including brute force, [2] divide and conquer, [3] dynamic programming, [4] and reduction to shortest paths, a simple single-pass algorithm known as Kadane's algorithm solves it efficiently.
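
For comparison with Kadane's algorithm below, a brute-force solution can be brought from O(n^3) down to O(n^2) time by precomputing a table of cumulative sums, so that each subarray sum is available in constant time (cf. note 1). The following Python sketch is illustrative only; the function name and structure are assumptions, not taken from the cited sources.

    def max_subarray_bruteforce(numbers):
        """O(n^2) brute force: try every start index i and end index j,
        using prefix sums so each subarray sum costs O(1) to evaluate."""
        prefix = [0]                      # prefix[k] holds sum(numbers[0:k])
        for x in numbers:
            prefix.append(prefix[-1] + x)
        best = float("-inf")
        for i in range(len(numbers)):
            for j in range(i, len(numbers)):
                best = max(best, prefix[j + 1] - prefix[i])
        return best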

History

The maximum subarray problem was proposed by Ulf Grenander in 1977 as a simplified model for maximum likelihood estimation of patterns in digitized images. [5]

Grenander was looking to find a rectangular subarray with maximum sum, in a two-dimensional array of real numbers. A brute-force algorithm for the two-dimensional problem runs in O(n^6) time; because this was prohibitively slow, Grenander proposed the one-dimensional problem to gain insight into its structure. Grenander derived an algorithm that solves the one-dimensional problem in O(n^2) time, [note 1] improving the brute force running time of O(n^3). When Michael Shamos heard about the problem, he devised an O(n log n) divide-and-conquer algorithm for it overnight. Soon after, Shamos described the one-dimensional problem and its history at a Carnegie Mellon University seminar attended by Jay Kadane, who designed within a minute an O(n)-time algorithm, [5] [6] [7] which is as fast as possible. [note 2] In 1982, David Gries obtained the same O(n)-time algorithm by applying Dijkstra's "standard strategy"; [8] in 1989, Richard Bird derived it by purely algebraic manipulation of the brute-force algorithm using the Bird–Meertens formalism. [9]

Grenander's two-dimensional generalization can be solved in O(n^3) time either by using Kadane's algorithm as a subroutine, or through a divide-and-conquer approach. Slightly faster algorithms based on distance matrix multiplication have been proposed by Tamaki & Tokuyama (1998) and by Takaoka (2002). There is some evidence that no significantly faster algorithm exists; an algorithm that solves the two-dimensional maximum subarray problem in O(n^(3−ε)) time, for any ε > 0, would imply a similarly fast algorithm for the all-pairs shortest paths problem. [10]
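
To illustrate the Kadane-as-subroutine approach, here is a sketch under the assumption of an n × n array of numbers (not code from the cited papers): fix a pair of top and bottom rows, collapse each column within that band to a single sum, and run the one-dimensional algorithm (described below) on the resulting array; iterating over all O(n^2) row pairs gives O(n^3) total time.

    def max_subarray_2d(matrix):
        """O(n^3) maximum-sum rectangle in an n x n matrix, using a
        1-D Kadane scan (described below) on compressed column sums."""

        def kadane(a):
            # 1-D maximum subarray sum of a nonempty list
            best = current = a[0]
            for x in a[1:]:
                current = max(x, current + x)
                best = max(best, current)
            return best

        n = len(matrix)
        best = float("-inf")
        for top in range(n):
            col_sums = [0] * n            # column sums over rows top..bottom
            for bottom in range(top, n):
                for c in range(n):
                    col_sums[c] += matrix[bottom][c]
                best = max(best, kadane(col_sums))
        return best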

Applications

Maximum subarray problems arise in many fields, such as genomic sequence analysis and computer vision.

Genomic sequence analysis employs maximum subarray algorithms to identify important biological segments of protein sequences that have unusual properties, by assigning scores to points within the sequence that are positive when a motif to be recognized is present, and negative when it is not, and then seeking the maximum subarray among these scores. These problems include conserved segments, GC-rich regions, tandem repeats, low-complexity filter, DNA binding domains, and regions of high charge. [11]

In computer vision, bitmap images generally consist only of positive values, for which the maximum subarray problem is trivial: the result is always the whole array. However, after subtracting a threshold value (such as the average pixel value) from each pixel, so that above-average pixels will be positive and below-average pixels will be negative, the maximum subarray problem can be applied to the modified image to detect bright areas within it. [12]
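
As a sketch of this preprocessing for a single row of pixel intensities (the function name and the use of the mean as the threshold are assumptions chosen for illustration, not taken from the cited papers):

    def brightest_segment_score(row):
        """Score of the brightest contiguous stretch of a nonempty pixel
        row, measured relative to the row's average intensity."""
        threshold = sum(row) / len(row)         # e.g., the average pixel value
        scores = [p - threshold for p in row]   # above average -> positive
        best = current = scores[0]              # 1-D Kadane on shifted scores
        for x in scores[1:]:
            current = max(x, current + x)
            best = max(best, current)
        return best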

Kadane's algorithm

No empty subarrays admitted

Kadane's algorithm scans the given array from left to right. In the j-th step, it computes the subarray with the largest sum ending at position j; this sum is maintained in variable current_sum. [note 3] Moreover, it computes the subarray with the largest sum anywhere in A[1...j], maintained in variable best_sum, [note 4] and easily obtained as the maximum of all values of current_sum seen so far, cf. line 7 of the algorithm.

As a loop invariant, in the j-th step, the old value of current_sum holds the maximum over all i in {1, ..., j−1} of the sum A[i] + ... + A[j−1]. Therefore, current_sum + A[j] [note 5] is the maximum over all i in {1, ..., j−1} of the sum A[i] + ... + A[j]. To extend the latter maximum to cover also the case i = j, it is sufficient to consider also the singleton subarray A[j...j]. This is done in line 6 by assigning max(A[j], current_sum + A[j]) as the new value of current_sum, which after that holds the maximum over all i in {1, ..., j} of the sum A[i] + ... + A[j].

Thus, the problem can be solved with the following code, [13] expressed in Python.

    def max_subarray(numbers):
        """Find the largest sum of any contiguous subarray."""
        best_sum = float("-inf")
        current_sum = 0
        for x in numbers:
            current_sum = max(x, current_sum + x)
            best_sum = max(best_sum, current_sum)
        return best_sum
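
For instance, applied to the example array from above (assuming the definition of max_subarray just given):

    >>> max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4])
    6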

If the input contains no positive element, the returned value is that of the largest element (i.e., the value closest to 0), or negative infinity if the input was empty. For correctness, an exception should be raised when the input array is empty, since an empty array has no maximum nonempty subarray. If the array is known to be nonempty, its first element could be used to initialize best_sum in place of negative infinity.

The algorithm can be adapted to the variant that allows empty subarrays, or to keep track of the starting and ending indices of the maximum subarray; both adaptations are described below.

This algorithm calculates the maximum subarray ending at each position from the maximum subarray ending at the previous position, so it can be viewed as a trivial case of dynamic programming.

Empty subarrays admitted

Kadane's original algorithm solves the problem variant when empty subarrays are admitted. [4] [7] This variant will return 0 if the input contains no positive elements (including when the input is empty). It is obtained by two changes in code: in line 3, best_sum should be initialized to 0 to account for the empty subarray

    best_sum = 0

and in line 6, inside the for loop, current_sum should be updated as max(0, current_sum + x). [note 6]

    current_sum = max(0, current_sum + x)

As a loop invariant, in the j-th step, the old value of current_sum holds the maximum over all i in {1, ..., j} of the sum A[i] + ... + A[j−1]. [note 7] Therefore, current_sum + A[j] is the maximum over all i in {1, ..., j} of the sum A[i] + ... + A[j]. To extend the latter maximum to cover also the case i = j + 1, it is sufficient to consider also the empty subarray A[j+1 ... j]. This is done in line 6 by assigning max(0, current_sum + A[j]) as the new value of current_sum, which after that holds the maximum over all i in {1, ..., j+1} of the sum A[i] + ... + A[j]. Machine-verified C / Frama-C code of both variants is available online.
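
Putting both changes together, the variant admitting empty subarrays reads as follows (the function name is chosen here for illustration; only lines 3 and 6 differ from the version above):

    def max_subarray_allow_empty(numbers):
        """Find the largest sum of any contiguous subarray, allowing the empty subarray (sum 0)."""
        best_sum = 0
        current_sum = 0
        for x in numbers:
            current_sum = max(0, current_sum + x)
            best_sum = max(best_sum, current_sum)
        return best_sum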

Computing the best subarray's position

The algorithm can be modified to keep track of the starting and ending indices of the maximum subarray as well.
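
A possible sketch, based on the no-empty-subarrays version above (the function and variable names are illustrative, not from the cited sources): whenever the running sum restarts at the current element, the candidate subarray begins at the current index.

    def max_subarray_with_position(numbers):
        """Return (best_sum, start, end) where numbers[start:end + 1]
        is a maximum-sum contiguous subarray (input must be nonempty)."""
        best_sum = float("-inf")
        best_start = best_end = 0
        current_sum = 0
        current_start = 0
        for i, x in enumerate(numbers):
            if current_sum + x < x:
                current_sum = x
                current_start = i      # restart: the subarray now begins at i
            else:
                current_sum += x
            if current_sum > best_sum:
                best_sum = current_sum
                best_start, best_end = current_start, i
        return best_sum, best_start, best_end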

Because of the way this algorithm uses optimal substructures (the maximum subarray ending at each position is calculated in a simple way from the related, smaller, overlapping subproblem of the maximum subarray ending at the previous position), it can be viewed as a simple example of dynamic programming.

Complexity

The runtime complexity of Kadane's algorithm is O(n) and its space complexity is O(1). [4] [7]

Generalizations

Similar problems may be posed for higher-dimensional arrays, but their solutions are more complicated; see, e.g., Takaoka (2002). Brodal & Jørgensen (2007) showed how to find the k largest subarray sums in a one-dimensional array, in the optimal time bound O(n + k).

The maximum sum of k disjoint subarrays can also be computed in the optimal time bound O(n + k). [14]

Notes

  1. By using a precomputed table of cumulative sums to compute the subarray sum in constant time
  2. since every algorithm must scan the array at least once, which already takes O(n) time
  3. named MaxEndingHere in Bentley (1989), and c in Gries (1982)
  4. named MaxSoFar in Bentley (1989), and s in Gries (1982)
  5. In the Python code below, A[j] is expressed as x, with the index j left implicit.
  6. While Bentley (1989) does not mention this difference, using x instead of 0 in the above version without empty subarrays achieves maintaining the loop invariant, with current_sum holding the maximum over all i in {1, ..., j−1} of the sum A[i] + ... + A[j−1] at the beginning of the j-th step.
  7. This sum is 0 when i = j, corresponding to the empty subarray A[j ... j−1].

References

  1. Bentley 1989, p. 69.
  2. Bentley 1989, p. 70.
  3. Bentley 1989, p. 73.
  4. Bentley 1989, p. 74.
  5. Bentley 1984, pp. 868–869.
  6. Bentley 1989, pp. 76–77.
  7. Gries 1982, p. 211.
  8. Gries 1982, pp. 209–211.
  9. Bird 1989, Sect. 8, p. 126.
  10. Backurs, Dikkala & Tzamos 2016.
  11. Ruzzo & Tompa 1999; Alves, Cáceres & Song 2004.
  12. Bae & Takaoka 2006; Weddell et al. 2013.
  13. Bentley 1989, pp. 78, 171. Bentley, like Gries, first introduces the variant admitting empty subarrays, see below, and describes only the changes.
  14. Bengtsson & Chen 2007.

Related Research Articles

<span class="mw-page-title-main">Knapsack problem</span> Problem in combinatorial optimization

The knapsack problem is the following problem in combinatorial optimization:

The subset sum problem (SSP) is a decision problem in computer science. In its most general formulation, there is a multiset of integers and a target-sum , and the question is to decide whether any subset of the integers sum to precisely . The problem is known to be NP-complete. Moreover, some restricted variants of it are NP-complete too, for example:

<span class="mw-page-title-main">Bucket sort</span> Sorting algorithm

Bucket sort, or bin sort, is a sorting algorithm that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sorting algorithm. It is a distribution sort, a generalization of pigeonhole sort that allows multiple keys per bucket, and is a cousin of radix sort in the most-to-least significant digit flavor. Bucket sort can be implemented with comparisons and therefore can also be considered a comparison sort algorithm. The computational complexity depends on the algorithm used to sort each bucket, the number of buckets to use, and whether the input is uniformly distributed.

In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small positive integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum key value and the minimum key value, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items. It is often used as a subroutine in radix sort, another sorting algorithm, which can handle larger keys more efficiently.

<span class="mw-page-title-main">Dynamic programming</span> Problem optimization method

Dynamic programming is both a mathematical optimization method and an algorithmic paradigm. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.

A van Emde Boas tree, also known as a vEB tree or van Emde Boas priority queue, is a tree data structure which implements an associative array with m-bit integer keys. It was invented by a team led by Dutch computer scientist Peter van Emde Boas in 1975. It performs all operations in O(log m) time, or equivalently in time, where is the largest element that can be stored in the tree. The parameter is not to be confused with the actual number of elements stored in the tree, by which the performance of other tree data-structures is often measured.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. If the primal is a minimization problem then the dual is a maximization problem. Any feasible solution to the primal (minimization) problem is at least as large as any feasible solution to the dual (maximization) problem. Therefore, the solution to the primal is an upper bound to the solution of the dual, and the solution of the dual is a lower bound to the solution of the primal. This fact is called weak duality.

In mathematical optimization, constrained optimization is the process of optimizing an objective function with respect to some variables in the presence of constraints on those variables. The objective function is either a cost function or energy function, which is to be minimized, or a reward function or utility function, which is to be maximized. Constraints can be either hard constraints, which set conditions for the variables that are required to be satisfied, or soft constraints, which have some variable values that are penalized in the objective function if, and based on the extent that, the conditions on the variables are not satisfied.

Semidefinite programming (SDP) is a subfield of mathematical programming concerned with the optimization of a linear objective function over the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron.

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes of the input sequence:

The study of facility location problems (FLP), also known as location analysis, is a branch of operations research and computational geometry concerned with the optimal placement of facilities to minimize transportation costs while considering factors like avoiding placing hazardous materials near housing, and competitors' facilities. The techniques also apply to cluster analysis.

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

<span class="mw-page-title-main">Proxmap sort</span>

ProxmapSort, or Proxmap sort, is a sorting algorithm that works by partitioning an array of data items, or keys, into a number of "subarrays". The name is short for computing a "proximity map," which indicates for each key K the beginning of a subarray where K will reside in the final sorted order. Keys are placed into each subarray using insertion sort. If keys are "well distributed" among the subarrays, sorting occurs in linear time. The computational complexity estimates involve the number of subarrays and the proximity mapping function, the "map key," used. It is a form of bucket and radix sort.

In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array. For example, a common task, known as range minimum query, is finding the smallest value inside a given range within a list of numbers.

The cache-oblivious distribution sort is a comparison-based sorting algorithm. It is similar to quicksort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in a cache where operations are done. In the external memory model, the number of memory transfers it needs to perform a sort of items on a machine with cache of size and cache lines of length is , under the tall cache assumption that . This number of memory transfers has been shown to be asymptotically optimal for comparison sorts. This distribution sort also achieves the asymptotically optimal runtime complexity of .

In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

<span class="mw-page-title-main">Spiral optimization algorithm</span> Optimization algorithm

In mathematics, the spiral optimization (SPO) algorithm is a metaheuristic inspired by spiral phenomena in nature.

The Ruzzo–Tompa algorithm or the RT algorithm is a linear-time algorithm for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers. The Ruzzo–Tompa algorithm was proposed by Walter L. Ruzzo and Martin Tompa. This algorithm is an improvement over previously known quadratic time algorithms. The maximum scoring subsequence from the set produced by the algorithm is also a solution to the maximum subarray problem.

Interpolation sort is a sorting algorithm that is a kind of bucket sort. It uses an interpolation formula to assign data to the bucket. A general interpolation formula is:
