Exponential backoff

Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate. These algorithms find usage in a wide range of systems and processes, with radio networks and computer networks being particularly notable.

Exponential backoff algorithm

An exponential backoff algorithm is a form of closed-loop control system that reduces the rate of a controlled process in response to adverse events. For example, if a smartphone app fails to connect to its server, it might try again 1 second later, then if it fails again, 2 seconds later, then 4, etc. Each time the pause is multiplied by a fixed amount (in this case 2). In this case, the adverse event is failing to connect to the server. Other examples of adverse events include collisions of network traffic, an error response from a service, or an explicit request to reduce the rate (i.e. back off).
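
As an illustration, a minimal sketch of such a retry loop in Python follows; the connect() callable, its exception type, and the parameter defaults are assumptions for the example, not part of any particular API.

  import time

  def connect_with_backoff(connect, base=2, initial_delay=1.0, max_attempts=6):
      # Retry `connect` with exponentially growing pauses between failures.
      # `connect` is a placeholder callable that raises ConnectionError on failure.
      delay = initial_delay
      for attempt in range(max_attempts):
          try:
              return connect()
          except ConnectionError:
              if attempt == max_attempts - 1:
                  raise              # give up after the final attempt
              time.sleep(delay)      # pause 1 s, 2 s, 4 s, ... for base = 2
              delay *= base          # multiply the pause by the fixed factor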

The rate reduction can be modelled as an exponential function:

  t = b^c

or

  f = 1 / b^c

Here, t is the time delay applied between actions, b is the multiplicative factor or base, c is the number of adverse events observed, and f is the frequency (or rate) of the process (i.e. number of actions per unit of time). The value of c is incremented each time an adverse event is observed, leading to an exponential rise in delay and, therefore, an inversely proportionate rate. An exponential backoff algorithm where b = 2 is referred to as a binary exponential backoff algorithm.

When the rate has been reduced in response to an adverse event, it usually does not remain at that reduced level forever. If no adverse events are observed for some period of time, often referred to as the recovery time or cooling-off period, the rate may be increased again. The time period that must elapse before attempting to increase the rate again may, itself, be determined by an exponential backoff algorithm. Typically, recovery of the rate occurs more slowly than reduction of the rate due to backoff, and often requires careful tuning to avoid oscillation of the rate. [1] The exact recovery behaviour is implementation-specific and may be informed by any number of environmental factors.
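
One possible recovery policy is sketched below; the class name, the 30-second cooling-off period, and the one-step decrement are illustrative assumptions rather than a prescribed scheme.

  import time

  class BackoffState:
      # Tracks adverse events and slowly recovers after a quiet period.
      def __init__(self, base=2, cooloff=30.0):
          self.base = base          # multiplicative factor b
          self.cooloff = cooloff    # assumed cooling-off period, in seconds
          self.events = 0           # c: adverse events currently counted
          self.last_event = time.monotonic()

      def record_adverse_event(self):
          self.events += 1
          self.last_event = time.monotonic()

      def current_delay(self):
          # If no adverse event has been seen for a full cooling-off period,
          # step the counter back down one level rather than resetting it,
          # so recovery is slower than the backoff itself.
          if self.events > 0 and time.monotonic() - self.last_event > self.cooloff:
              self.events -= 1
              self.last_event = time.monotonic()
          return self.base ** self.events   # delay in arbitrary units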

The mechanism by which rate reduction is practically achieved in a system may be more complex than a simple time delay. In some cases the value t may refer to an upper bound to the time delay, rather than a specific time delay value. The name exponential backoff refers to the exponential growth characteristic of the backoff, rather than an exact numeric relationship between adverse event counts and delay times.

Rate limiting

Exponential backoff is commonly utilised as part of rate limiting mechanisms in computer systems such as web services, to help enforce fair distribution of access to resources and prevent network congestion. Each time a service informs a client that it is sending requests too frequently, the client reduces its rate by some predetermined factor, until the client's request rate reaches an acceptable equilibrium. The service may enforce rate limiting by refusing to respond to requests when the client is sending them too frequently, so that misbehaving clients are not allowed to exceed their allotted resources.

A benefit of utilising an exponential backoff algorithm, over a fixed rate limit, is that rate limits can be achieved dynamically without providing any prior information to the client. In the event that resources are unexpectedly constrained, e.g. due to heavy load or a service disruption, backoff requests and error responses from the service can automatically decrease the request rate from clients. This can help maintain some level of availability rather than overloading the service. In addition, quality of service can be prioritised for certain clients based on their individual importance, e.g. by reducing the backoff for emergency calls on a telephone network during periods of high load.
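
The following sketch shows how a client might combine backoff with such rate limiting; the send_request() callable and the treatment of HTTP status 429 ("Too Many Requests") as the back-off signal are assumptions for the example.

  import time

  def send_with_backoff(send_request, base=2, max_retries=5):
      # `send_request` is a placeholder callable returning an object with a
      # `status_code` attribute; 429 is assumed to be the back-off signal.
      delay = 1.0
      for _ in range(max_retries):
          response = send_request()
          if response.status_code != 429:
              return response        # an acceptable request rate was reached
          time.sleep(delay)          # reduce the request rate ...
          delay *= base              # ... by the predetermined factor
      raise RuntimeError("still rate limited after retries")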

In a simple version of the algorithm, messages are delayed by predetermined (non-random) times. For example, in the SIP protocol over an unreliable transport (such as UDP), the client retransmits requests at an interval that starts at T1 seconds (usually 500 ms, an estimate of the round-trip time) and doubles after every retransmission until it reaches T2 seconds (which defaults to 4 s). This results in retransmission intervals of 500 ms, 1 s, 2 s, 4 s, 4 s, 4 s, etc. [2]
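
A short sketch reproducing this non-random schedule (the function name and the count argument are illustrative; T1 and T2 follow the defaults quoted above):

  def sip_retransmission_intervals(t1=0.5, t2=4.0, count=7):
      # Start at T1 and double after every retransmission,
      # capping the interval at T2.
      interval, schedule = t1, []
      for _ in range(count):
          schedule.append(interval)
          interval = min(interval * 2, t2)
      return schedule

  print(sip_retransmission_intervals())   # [0.5, 1.0, 2.0, 4.0, 4.0, 4.0, 4.0]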

Collision avoidance

Exponential backoff algorithms can be used to avoid network collisions. In a point-to-multipoint or multiplexed network, multiple senders communicate over a single shared channel. If two senders attempt to transmit a message at the same time, or talk over each other, a collision occurs and the messages are damaged or lost. Each sender can then back off before attempting to retransmit the same message again.

A deterministic exponential backoff algorithm is unsuitable for this use case, since each sender would back off for the same time period, leading them to retransmit simultaneously and cause another collision. Instead, for purposes of collision avoidance, the time between retransmissions is randomized and the exponential backoff algorithm sets the range of delay values that are possible. The time delay is usually measured in slots, which are fixed-length periods (or slices) of time on the network. In a binary exponential backoff algorithm (i.e. one where b = 2), after c collisions, each retransmission is delayed by a random number of slot times between 0 and 2^c − 1. After the first collision, each sender will wait 0 or 1 slot times. After the second collision, the senders will wait anywhere from 0 to 3 slot times (inclusive). After the third collision, the senders will wait anywhere from 0 to 7 slot times (inclusive), and so forth. As the number of retransmission attempts increases, the number of possibilities for delay increases exponentially. This decreases the probability of a collision, but increases the average latency.
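
In code, the randomized delay can be sketched as follows (random.randint includes both endpoints):

  import random

  def backoff_delay_slots(collisions):
      # Binary exponential backoff (b = 2): after c collisions, wait a
      # uniformly random whole number of slot times in 0 .. 2**c - 1.
      return random.randint(0, 2 ** collisions - 1)

  # After 1 collision: 0 or 1 slots; after 2: 0..3; after 3: 0..7, and so on.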

Exponential backoff is utilised during retransmission of frames in carrier-sense multiple access with collision avoidance (CSMA/CA) and carrier-sense multiple access with collision detection (CSMA/CD) networks, where this algorithm is part of the channel access method used to send data on these networks. In Ethernet networks, the algorithm is commonly used to schedule retransmissions after collisions. The retransmission is delayed by an amount of time derived from the slot time (for example, the time it takes to send 512 bits; i.e., 512 bit-times) and the number of attempts to retransmit.

Example

This example is from the Ethernet protocol, [3] where a sending host can detect, while it is sending a frame, that a collision has occurred (that is, another host has tried to transmit at the same time). If both hosts attempted to re-transmit as soon as the collision occurred, there would be yet another collision, and the pattern would continue forever. The hosts must each choose a random value within an acceptable range to ensure that this situation doesn't happen. An exponential backoff algorithm is therefore used. The value 51.2 μs is used as an example here because it is the slot time for a 10 Mbit/s Ethernet line; in practice, it could be replaced by any positive value.

  1. When a collision first occurs, send a jamming signal to prevent further data from being sent.
  2. Resend a frame after either 0 seconds or 51.2 μs, chosen at random.
  3. If that fails, resend the frame after either 0 s, 51.2 μs, 102.4 μs, or 153.6 μs.
  4. If that still fails, resend the frame after k · 51.2 μs, where k is a random integer between 0 and 2^3 − 1.
  5. For further failures, after the c-th failed attempt, resend the frame after k · 51.2 μs, where k is a random integer between 0 and 2^c − 1 (a rough sketch of this procedure follows the list).
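
A rough sketch of the procedure above, assuming a hypothetical transmit() callable that reports collisions and an arbitrary cap on the number of attempts:

  import random
  import time

  SLOT_TIME = 51.2e-6   # slot time for 10 Mbit/s Ethernet, in seconds

  def send_frame_with_backoff(transmit, max_attempts=10):
      # `transmit` is a placeholder returning True on success and False when a
      # collision is detected (real hardware would also send the jamming signal).
      for attempt in range(1, max_attempts + 1):
          if transmit():
              return True
          # After the c-th failed attempt, wait k * 51.2 us, k in 0 .. 2**c - 1.
          k = random.randint(0, 2 ** attempt - 1)
          time.sleep(k * SLOT_TIME)
      return False   # caller decides what to do after repeated failures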

Truncated exponential backoff

The 'truncated' variant of the algorithm introduces a limit on c. This simply means that after a certain number of increases, the exponentiation stops. Without a limit on c, the delay between transmissions may become undesirably long if a sender repeatedly observes adverse events, e.g. due to a degradation in network service. In a randomized system this may occur by chance, leading to unpredictable latency; longer delays due to unbounded increases in c are exponentially less probable, but they are effectively inevitable on a busy network due to the law of truly large numbers. Limiting c helps to reduce the possibility of unexpectedly long transmission latencies and improve recovery times after a transient outage.

For example, if the ceiling on c is set at 10 in a truncated binary exponential backoff algorithm (as it is in the IEEE 802.3 CSMA/CD standard [4] ), then the maximum delay is 1023 slot times, i.e. 2^10 − 1.
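
A sketch of the truncated variant, using the ceiling of 10 from the IEEE 802.3 example; the function name is illustrative:

  import random

  def truncated_backoff_slots(collisions, ceiling=10):
      # The exponent stops growing once it reaches the ceiling, so the delay
      # never exceeds 2**10 - 1 = 1023 slot times in this configuration.
      exponent = min(collisions, ceiling)
      return random.randint(0, 2 ** exponent - 1)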

Selecting an appropriate backoff limit for a system involves striking a balance between collision probability and latency. By increasing the ceiling there is an exponential reduction in probability of collision on each transmission attempt. At the same time, increasing the limit also exponentially increases the range of possible latency times for a transmission, leading to less deterministic performance and an increase in the average latency. The optimal limit value for a system is specific to both the implementation and environment. [5]

Expected backoff

Given a uniform distribution of backoff times, the expected backoff time is the mean of the possibilities. After c collisions in a binary exponential backoff algorithm, the delay is randomly chosen from [0, 1, ..., N] slots, where N = 2^c − 1, and the expected backoff time (in slots) is

  E(c) = (0 + 1 + ⋯ + N) / (N + 1) = N / 2 = (2^c − 1) / 2.

For example, to find the expected backoff time after the third (c = 3) collision, one first calculates the maximum backoff time, N:

  N = 2^3 − 1 = 7,

and then the mean of the backoff time possibilities:

  E(3) = (0 + 1 + 2 + 3 + 4 + 5 + 6 + 7) / 8 = 3.5,

which is, for the example, E(3) = 3.5 slots.
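
The formula can be checked with a short calculation (the helper name is assumed for illustration):

  def expected_backoff_slots(c):
      # Mean of the uniform choices 0 .. 2**c - 1, i.e. (2**c - 1) / 2.
      return (2 ** c - 1) / 2

  print(expected_backoff_slots(3))   # 3.5 slots, matching the worked example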


References

  1. Tanenbaum & Wetherall 2010, p. 395
  2. Rosenberg et al. RFC 3261 – SIP: Session Initiation Protocol. The Internet Society. 2002.
  3. Peterson, Larry L.; Davie, Bruce S. (2022). "Chapter 2: Direct Links". Computer Networks: A Systems Approach (Sixth ed.). Morgan Kaufmann Publishers. p. 120. ISBN 978-0-12-818200-0.
  4. "IEEE Standard 802.3-2015". IEEE. Retrieved 20 March 2022.
  5. Tanenbaum & Wetherall 2010, p. 285

Bibliography

This article incorporates public domain material from Federal Standard 1037C. General Services Administration. Archived from the original on 22 January 2022.