Marzullo's algorithm

Marzullo's algorithm, invented by Keith Marzullo for his Ph.D. dissertation in 1984, is an agreement algorithm used to select sources for estimating accurate time from a number of noisy time sources. A refined version of it, renamed the "intersection algorithm", forms part of the modern Network Time Protocol. Marzullo's algorithm is also used to compute the relaxed intersection of n boxes (or more generally n subsets of R^n), as required by several robust set estimation methods.

Purpose

Marzullo's algorithm efficiently produces an optimal value from a set of estimates with confidence intervals, in which the actual value may lie outside the confidence interval of some sources. In this case the best estimate is taken to be the smallest interval consistent with the largest number of sources.

If we have the estimates 10 ± 2, 12 ± 1 and 11 ± 1 then these intervals are [8,12], [11,13] and [10,12] which intersect to form [11,12] or 11.5 ± 0.5 as consistent with all three values.

[Figure: Marzullo's algorithm, example 1]
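This intersection is simply the maximum of the lower bounds and the minimum of the upper bounds, as the following minimal Python check illustrates (purely illustrative; the variable names are ad hoc and not part of the algorithm):

```python
# Estimates from the example above, as (centre, radius) pairs.
estimates = [(10, 2), (12, 1), (11, 1)]
intervals = [(c - r, c + r) for c, r in estimates]   # [8,12], [11,13], [10,12]

low = max(lo for lo, hi in intervals)    # 11
high = min(hi for lo, hi in intervals)   # 12
print([low, high])                       # [11, 12], i.e. 11.5 ± 0.5
```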

If instead the ranges are [8,12], [11,13] and [14,15] then there is no interval consistent with all these values, but [11,12] is consistent with the largest number of sources, namely two of them.

[Figure: Marzullo's algorithm, example 2]

Finally, if the ranges are [8,9], [8,12] and [10,12] then both the intervals [8,9] and [10,12] are consistent with the largest number of sources.

[Figure: Marzullo's algorithm, example 3]

This procedure determines an interval. If the desired result is a best value from that interval then a naive approach would be to take the center of the interval as the value, which is what was specified in the original Marzullo algorithm. A more sophisticated approach would recognize that this could be throwing away useful information from the confidence intervals of the sources and that a probabilistic model of the sources could return a value other than the center.

Note that the computed value is probably better described as "optimistic" rather than "optimal". For example, consider three intervals [10,12], [11, 13] and [11.99,13]. The algorithm described below computes [11.99, 12] or 11.995 ± 0.005 which is a very precise value. If we suspect that one of the estimates might be incorrect, then at least two of the estimates must be correct. Under this condition, the best estimate is [11,13] since this is the largest interval that always intersects at least two estimates. The algorithm described below is easily parameterized with the maximum number of incorrect estimates.

Method

Marzullo's algorithm begins by preparing a table of the sources, sorting it and then searching (efficiently) for the intersections of intervals. For each source there is a range [c−r, c+r] defined by c ± r. For each range the table will have two tuples of the form ⟨offset, type⟩. One tuple will represent the beginning of the range, marked with type −1 as ⟨c−r, −1⟩, and the other will represent the end with type +1 as ⟨c+r, +1⟩.
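As a rough Python sketch of this preparation step (assuming the sources are given as (c, r) pairs; the function name build_table is illustrative, not from the original description):

```python
def build_table(sources):
    """Build the sorted table of <offset, type> tuples from (c, r) estimates.

    Each source c ± r contributes <c - r, -1> for the start of its interval
    and <c + r, +1> for its end.
    """
    table = []
    for c, r in sources:
        table.append((c - r, -1))
        table.append((c + r, +1))
    # Sorting plain tuples orders by offset first; on equal offsets the
    # type -1 sorts before +1, so zero-length overlaps are still detected.
    table.sort()
    return table

print(build_table([(10, 2), (12, 1), (11, 1)]))
# [(8, -1), (10, -1), (11, -1), (12, 1), (12, 1), (13, 1)]
```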

The description of the algorithm uses the following variables: best (largest number of overlapping intervals found), cnt (current number of overlapping intervals), beststart and bestend (the beginning and end of best interval found so far), i (an index), and the table of tuples.

  1. Build the table of tuples.
  2. Sort the table by the offset. (If two tuples with the same offset but opposite types exist, indicating that one interval ends just as another begins, then a method of deciding which comes first is necessary. Such an occurrence can be considered an overlap with no duration, which can be found by the algorithm by putting type −1 before type +1. If such pathological overlaps are considered objectionable they can be avoided by putting type +1 before −1 in this case.)
  3. [initialize] best=0, cnt=0
  4. [loop] go through each tuple in the table in ascending order
     1. [current number of overlapping intervals] cnt=cnt−type[i]
     2. if cnt>best then best=cnt, beststart=offset[i], bestend=offset[i+1]
        commentary: the next tuple, at [i+1], will either be the end of an interval (type=+1), in which case it closes this best interval, or it will be the beginning of an interval (type=−1), in which case best will be replaced at the next step.
        ambiguity: unspecified is what to do if best=cnt, i.e. a tie for the greatest overlap. The decision can either be to take the smaller of bestend−beststart and offset[i+1]−offset[i], or simply to keep an arbitrary one of the two equally good intervals. This decision is relevant only when type[i+1]=+1.
  5. [end loop] return [beststart,bestend] as the optimal interval. The number of false sources (ones which do not overlap the optimal interval returned) is the number of sources minus the value of best.
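The following Python transcription of these steps is a sketch only (variable names follow the description above; ties for the greatest overlap simply keep the first interval found):

```python
def marzullo(sources):
    """Return (beststart, bestend, false_count) for estimates given as (c, r) pairs."""
    # Steps 1-2: build the table of <offset, type> tuples and sort it;
    # tuple comparison places type -1 before +1 at equal offsets.
    table = []
    for c, r in sources:
        table.append((c - r, -1))
        table.append((c + r, +1))
    table.sort()

    # Step 3: initialize.
    best = cnt = 0
    beststart = bestend = None

    # Step 4: sweep the tuples in ascending order.
    for i, (offset, type_) in enumerate(table):
        cnt -= type_                       # cnt = cnt - type[i]
        if cnt > best:
            # The last tuple always closes an interval, so cnt cannot exceed
            # best there and table[i + 1] is safe to read here.
            best = cnt
            beststart = offset             # offset[i]
            bestend = table[i + 1][0]      # offset[i+1]

    # Step 5: false sources are those not overlapping the returned interval.
    return beststart, bestend, len(sources) - best
```

Run against the three examples from the Purpose section, this sketch returns the intervals described there:

```python
print(marzullo([(10, 2), (12, 1), (11, 1)]))       # (11, 12, 0)  example 1
print(marzullo([(10, 2), (12, 1), (14.5, 0.5)]))   # (11, 12, 1)  example 2
print(marzullo([(8.5, 0.5), (10, 2), (11, 1)]))    # (8, 9, 1)    example 3, tie kept arbitrarily
```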

Efficiency

Marzullo's algorithm is efficient in both space and time. The asymptotic space usage is O(n), where n is the number of sources. In terms of asymptotic time, the algorithm consists of building the table, sorting it and searching it. Sorting can be done in O(n log n) time, and this dominates the building and searching phases, which can be performed in linear time. Therefore, the time efficiency of Marzullo's algorithm is O(n log n).

Once the table has been built and sorted it is possible to update the interval for one source (when new information is received) in linear time. Therefore, updating data for one source and finding the best interval can be done in O(n) time.
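A sketch of such an update follows (the helper name and the use of Python's bisect module are illustrative choices, not part of the original description):

```python
import bisect

def update_source(table, old, new):
    """Replace one source's two tuples in an already-sorted table in O(n) time.

    `table` is the sorted list of (offset, type) tuples; `old` and `new`
    are the source's previous and updated (c, r) estimates.
    """
    for t in ((old[0] - old[1], -1), (old[0] + old[1], +1)):
        table.remove(t)              # linear scan and shift: O(n)
    for t in ((new[0] - new[1], -1), (new[0] + new[1], +1)):
        bisect.insort(table, t)      # binary search plus shift: O(n)
    return table
```

Re-running the linear sweep over the updated table then yields the new best interval, so the whole update stays within O(n).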

