Exponential backoff

Last updated

Exponential backoff is an algorithm that uses feedback to multiplicatively decrease the rate of some process, in order to gradually find an acceptable rate. These algorithms find usage in a wide range of systems and processes, with radio networks and computer networks being particularly notable.

Contents

Exponential backoff algorithm

An exponential backoff algorithm is a form of closed-loop control system that reduces the rate of a controlled process in response to adverse events. For example, if a smartphone app fails to connect to its server, it might try again 1 second later, then if it fails again, 2 seconds later, then 4, etc. Each time the pause is multiplied by a fixed amount (in this case 2). In this case, the adverse event is failing to connect to the server. Other examples of adverse events include collisions of network traffic, an error response from a service, or an explicit request to reduce the rate (i.e. back off).

The rate reduction can be modelled as an exponential function:

or

Here, t is the time delay applied between actions, b is the multiplicative factor or base, c is the number of adverse events observed, and f is the frequency (or rate) of the process (i.e. number of actions per unit of time). The value of c is incremented each time an adverse event is observed, leading to an exponential rise in delay and, therefore, an inversely proportionate rate. An exponential backoff algorithm where b = 2 is referred to as a binary exponential backoff algorithm.

When the rate has been reduced in response to an adverse event, it usually does not remain at that reduced level forever. If no adverse events are observed for some period of time, often referred to as the recovery time or cooling-off period, the rate may be increased again. The time period that must elapse before attempting to increase the rate again may, itself, be determined by an exponential backoff algorithm. Typically, recovery of the rate occurs more slowly than reduction of the rate due to backoff, and often requires careful tuning to avoid oscillation of the rate. [1] The exact recovery behaviour is implementation-specific and may be informed by any number of environmental factors.

The mechanism by which rate reduction is practically achieved in a system may be more complex than a simple time delay. In some cases the value t may refer to an upper bound to the time delay, rather than a specific time delay value. The name exponential backoff refers to the exponential growth characteristic of the backoff, rather than an exact numeric relationship between adverse event counts and delay times.

Rate limiting

Exponential backoff is commonly utilised as part of rate limiting mechanisms in computer systems such as web services, to help enforce fair distribution of access to resources and prevent network congestion. Each time a service informs a client that it is sending requests too frequently, the client reduces its rate by some predetermined factor, until the client's request rate reaches an acceptable equilibrium. The service may enforce rate limiting by refusing to respond to requests when the client is sending them too frequently, so that misbehaving clients are not allowed to exceed their allotted resources.

A benefit of utilising an exponential backoff algorithm, over of a fixed rate limit, is that rate limits can be achieved dynamically without providing any prior information to the client. In the event that resources are unexpectedly constrained, e.g. due to heavy load or a service disruption, backoff requests and error responses from the service can automatically decrease the request rate from clients. This can help maintain some level of availability rather than overloading the service. In addition, quality of service can be prioritised to certain clients based on their individual importance, e.g. by reducing the backoff for emergency calls on a telephone network during periods of high load.

In a simple version of the algorithm, messages are delayed by predetermined (non-random) time. For example, in SIP protocol over unreliable transport (such as UDP) the client retransmits requests at an interval that starts at T1 seconds (usually, 500 ms, which is the estimate of the round-trip time) and doubles after every retransmission until it reaches T2 seconds (which defaults to 4 s). This results in retransmission intervals of 500 ms, 1 s, 2 s, 4 s, 4 s, 4 s, etc. [2]

Collision avoidance

Exponential backoff algorithms can be used to avoid network collisions. In a point-to-multipoint or multiplexed network, multiple senders communicate over a single shared channel. If two senders attempt to transmit a message at the same time, or talk over each other, a collision occurs and the messages are damaged or lost. Each sender can then back off before attempting to retransmit the same message again.

A deterministic exponential backoff algorithm is unsuitable for this use case, since each sender would back off for the same time period, leading them to retransmit simultaneously and cause another collision. Instead, for purposes of collision avoidance, the time between retransmissions is randomized and the exponential backoff algorithm sets the range of delay values that are possible. The time delay is usually measured in slots, which are fixed-length periods (or slices) of time on the network. In a binary exponential backoff algorithm (i.e. one where b = 2), after c collisions, each retransmission is delayed by a random number of slot times between 0 and 2c 1. After the first collision, each sender will wait 0 or 1 slot times. After the second collision, the senders will wait anywhere from 0 to 3 slot times (inclusive). After the third collision, the senders will wait anywhere from 0 to 7 slot times (inclusive), and so forth. As the number of retransmission attempts increases, the number of possibilities for delay increases exponentially. This decreases the probability of a collision, but increases the average latency.

Exponential backoff is utilised during retransmission of frames in carrier-sense multiple access with collision avoidance (CSMA/CA) and carrier-sense multiple access with collision detection (CSMA/CD) networks, where this algorithm is part of the channel access method used to send data on these networks. In Ethernet networks, the algorithm is commonly used to schedule retransmissions after collisions. The retransmission is delayed by an amount of time derived from the slot time (for example, the time it takes to send 512 bits; i.e., 512 bit-times) and the number of attempts to retransmit.

Example

This example is from the Ethernet protocol, [3] where a sending host is able to know when a collision has occurred (that is, another host has tried to transmit), when it is sending a frame. If both hosts attempted to re-transmit as soon as a collision occurred, there would be yet another collision and the pattern would continue forever. The hosts must choose a random value within an acceptable range to ensure that this situation doesn't happen. An exponential backoff algorithm is therefore used. The value 51.2 μs is used as an example here because it is the slot time for a 10 Mbit/s Ethernet line. However, 51.2 μs could be replaced by any positive value, in practice.

  1. When a collision first occurs, send a jamming signal to prevent further data from being sent.
  2. Resend a frame after either 0 seconds or 51.2 μs, chosen at random.
  3. If that fails, resend the frame after either 0 s, 51.2 μs, 102.4 μs, or 153.6 μs.
  4. If that still fails, resend the frame after k · 51.2 μs, where k is a random integer between 0 and 23 1.
  5. For further failures, after the cth failed attempt, resend the frame after k · 51.2 μs, where k is a random integer between 0 and 2c 1.

History and theory

In a seminal paper published in AFIPS 1970, [4] Norman Abramson presented the idea of multiple “users,” on different islands, sharing a single radio channel (i.e., a single frequency) to access the main computer at the University of Hawaii without any time synchronization. Packet collisions at the receiver of the main computer are treated by senders after a timeout as detected errors. Each sender not receiving a positive acknowledgment from the main computer would retransmit its “lost” packet. Abramson assumed that the sequence of packets transmitted into the shared channel is a Poisson process at rate G, which is the sum of the rate S of new packet arrivals to senders and the rate of retransmitted packets into the channel. Assuming steady state, he showed that the channel throughput rate is with a maximum value of 1/(2e) = 0.184 in theory.

Larry Roberts considered a time slotted ALOHA channel with each time slot long enough for a packet transmission time. (A satellite channel using the TDMA protocol is time slotted.) Using the same Poisson process and steady state assumptions as Abramson, Larry Roberts showed that the maximum throughput rate is 1/e = 0.368 in theory. [5] . Roberts was the program manager of the ARPANET research project. Inspired by the slotted ALOHA idea, Roberts initiated a new ARPANET Satellite System (ASS) project to include satellite links in the ARPANET.

Simulation results by Abramson, his colleagues, and others showed that an ALOHA channel, slotted or not, is unstable and would sometimes go into congestion collapse. How much time until congestion collapse depended on the arrival rate of new packets as well other unknown factors. In 1971, Larry Roberts asked Professor Leonard Kleinrock and his Ph.D. student, Simon Lam, at UCLA to join the Satellite System project of ARPANET. Simon Lam would work on the stability, performance evaluation, and adaptive control of slotted ALOHA for his Ph.D. dissertation research. The first paper he co-authored with Kleinrock was ARPANET Satellite System (ASS) Note 12 disseminated to the ASS group in August 1972. [6] In this paper, a slot chosen randomly over an interval of K slots was used for retransmission. A new result from the model is that increasing K increases channel throughput which converges to 1/e as K increases to infinity. This model retained the assumptions of Poisson arrivals and steady state, and was not intended for understanding statistical behavior and congestion collapse.

Stability and adaptive backoff

To understand stability, Lam created a discrete-time Markov chain model for analyzing the statistical behavior of slotted ALOHA in chapter 5 of his dissertation [7] . The model has three parameters: N, s, and p. N is the total number of users. At any time, each user may be idle or blocked. Each user has at most one packet to transmit in the next time slot. An idle user generates a new packet with probability s and transmits it in the next time slot immediately. A blocked user transmits its backlogged packet with probability p, where 1/p = (K+1)/2 to keep the average retransmission interval the same. The throughput-delay results of the two retransmission methods were compared by extensive simulations and found to be essentially the same (see Fig. 5-1 on page 100, Chapter 5, in Lam’s dissertation).

Lam’s model provides mathematically rigorous answers to the stability questions of slotted ALOHA, as well as an efficient algorithm for computing the throughput-delay performance for any stable system. There are 3 key results, shown below, from Lam’s Markov chain model in Chapter 5 of his dissertation (also published jointly with Professor Len Kleinrock, in IEEE Transactions on Communications [8] .)

  1. Slotted ALOHA with Poisson arrivals (i.e., infinite N) is inherently unstable, because a stationary probability distribution does not exist. (Reaching steady state was a key assumption used in the models of Abramson and Roberts.)
  2. For slotted ALOHA with a finite N and a finite K, the Markov chain model can be used to determine whether the system is stable or unstable for a given input rate (N×s) and, if it is stable, compute its average packet delay and channel throughput rate.
  3. Increasing K increases the maximum number of users that can be accommodated by a stable slotted ALOHA channel (see Fig. 5-9 on page 114 in Chapter 5 of Lam's dissertation, or Fig. 10 on page 418 in the 1975 Kleinrock-Lam paper.)

Corollary: For a finite (Nxs), an unstable channel for the current K value can be made stable by increasing K to a sufficiently large value, to be referred to as its K(N,s). (See Fig 5-10 on page 116 in Chapter 5 of Lam’s dissertation, or Figure 11 on page 418 in the 1975 Kleinrock-Lam paper.)

Heuristic RCP for adaptive backoff

Lam used Markov decision theory and developed optimal control policies for slotted ALOHA but these policies require all blocked users to know the current state (number of blocked users) of the Markov chain. In 1973, Lam decided that instead of using a complex protocol for users to estimate the system state, he would create a simple algorithm for each user to use its own local information, i.e., the number of collisions its backlogged packet has encountered. (See Algorithm 4 on pages 901-902 in the Lam-Kleinrock paper [9] or subsection 6.7.2, on pages 209-210 in Chapter 6 of Lam’s dissertation.) Applying the above Corollary, Lam invented the following class of adaptive backoff algorithms (named Heuristic RCP).

A Heuristic RCP algorithm consists of the following steps: (1) Let m denote the number of previous collisions incurred by a packet at a user as the feedback information in its control loop. For a new packet, K(0) is initialized to 1. (2) The packet’s retransmission interval K(m) increases as m increases (until the channel becomes stable as implied by the above Corollary). For implementation, with K(0)=1, as m increases, K(m) can be increased by multiplication (or by addition).

Observation: Binary Exponential Backoff (BEB) used in Ethernet several years later is a special case of Heuristic RCP with K(m)= .

BEB is very easy to implement. It is however not optimal for many applications because BEB uses 2 as the only multiplier which provides no flexibility for optimization. In particular, for a system with a large number of users, BEB increases K(m) too slowly. On the other hand, for a system with a small number of users, a fairly small K is sufficient for the system to be stable, and backoff would not be necessary.

To illustrate an example of a multiplicative RCP that uses several multipliers, see the bottom row in Table 6.3 on page 214 in Chapter 6 of Lam’s dissertation, or bottom row in Table III on page 902 in the Lam-Kleinrock paper. In this example:

  1. A new packet is transmitted immediately, m=0, K(0)=1
  2. For a packet with 1 previous collision, K(1)=K(0)x10=10 (The multiplier jumps up directly to K*= 10 which was found to be the optimum K value at steady state for this particular system (slotted ALOHA for a satellite channel).
  3. For a packet with 2 previous collisions, K(2)=K(1)x10=100 (one more collision, K jumps up 10 times).
  4. K(3) =K(2) ×2 =200
  5. K(m)=K(m-1) for m≥4

For this example, K=200 is sufficient for a stable slotted ALOHA system with N equal to about 400, which follows from result 3 above Corollary. There is no need to increase K any further.

Truncated exponential backoff

The 'truncated' variant of the algorithm introduces a limit on c. This simply means that after a certain number of increases, the exponentiation stops. Without a limit on c, the delay between transmissions may become undesirably long if a sender repeatedly observes adverse events, e.g. due to a degradation in network service. In a randomized system this may occur by chance, leading to unpredictable latency; longer delays due to unbounded increases in c are exponentially less probable, but they are effectively inevitable on a busy network due to the law of truly large numbers. Limiting c helps to reduce the possibility of unexpectedly long transmission latencies and improve recovery times after a transient outage.

For example, if the ceiling is set at i = 10 in a truncated binary exponential backoff algorithm, (as it is in the IEEE 802.3 CSMA/CD standard [10] ), then the maximum delay is 1023 slot times, i.e. 210 1.

Selecting an appropriate backoff limit for a system involves striking a balance between collision probability and latency. By increasing the ceiling there is an exponential reduction in probability of collision on each transmission attempt. At the same time, increasing the limit also exponentially increases the range of possible latency times for a transmission, leading to less deterministic performance and an increase in the average latency. The optimal limit value for a system is specific to both the implementation and environment. [11]

Expected backoff

Given a uniform distribution of backoff times, the expected backoff time is the mean of the possibilities. After c collisions in a binary exponential backoff algorithm, the delay is randomly chosen from [0, 1, ..., N] slots, where N = 2c − 1, and the expected backoff time (in slots) is

For example, the expected backoff time for the third (c = 3) collision, one could first calculate the maximum backoff time, N:

and then calculate the mean of the backoff time possibilities:

.

which is, for the example, E(3) = 3.5 slots.

See also

Related Research Articles

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, and file transfer rely on TCP, which is part of the Transport layer of the TCP/IP suite. SSL/TLS often runs on top of TCP.

Network throughput refers to the rate of message delivery over a communication channel, such as Ethernet or packet radio, in a communication network. The data that these messages contain may be delivered over physical or logical links, or through network nodes. Throughput is usually measured in bits per second, and sometimes in data packets per second or data packets per time slot.

<span class="mw-page-title-main">Carrier-sense multiple access with collision avoidance</span> Computer network multiple access method

Carrier-sense multiple access with collision avoidance (CSMA/CA) in computer networking, is a network multiple access method in which carrier sensing is used, but nodes attempt to avoid collisions by beginning transmission only after the channel is sensed to be "idle". When they do transmit, nodes transmit their packet data in its entirety.

Carrier-sense multiple access with collision detection (CSMA/CD) is a medium access control (MAC) method used most notably in early Ethernet technology for local area networking. It uses carrier-sensing to defer transmissions until no other stations are transmitting. This is used in combination with collision detection in which a transmitting station detects collisions by sensing transmissions from other stations while it is transmitting a frame. When this collision condition is detected, the station stops transmitting that frame, transmits a jam signal, and then waits for a random time interval before trying to resend the frame.

Carrier-sense multiple access (CSMA) is a medium access control (MAC) protocol in which a node verifies the absence of other traffic before transmitting on a shared transmission medium, such as an electrical bus or a band of the electromagnetic spectrum.

<span class="mw-page-title-main">Queueing theory</span> Mathematical study of waiting lines, or queues

Queueing theory is the mathematical study of waiting lines, or queues. A queueing model is constructed so that queue lengths and waiting time can be predicted. Queueing theory is generally considered a branch of operations research because the results are often used when making business decisions about the resources needed to provide a service.

In telecommunications and computer networks, a channel access method or multiple access method allows more than two terminals connected to the same transmission medium to transmit over it and to share its capacity. Examples of shared physical media are wireless networks, bus networks, ring networks and point-to-point links operating in half-duplex mode.

ALOHAnet, also known as the ALOHA System, or simply ALOHA, was a pioneering computer networking system developed at the University of Hawaii.

<span class="mw-page-title-main">Medium access control</span> Service layer in IEEE 802 network standards

In IEEE 802 LAN/MAN standards, the medium access control (MAC), also called media access control, is the layer that controls the hardware responsible for interaction with the wired or wireless transmission medium. The MAC sublayer and the logical link control (LLC) sublayer together make up the data link layer. The LLC provides flow control and multiplexing for the logical link, while the MAC provides flow control and multiplexing for the transmission medium.

Distributed coordination function (DCF) is the fundamental medium access control (MAC) technique of the IEEE 802.11-based WLAN standard. DCF employs a carrier-sense multiple access with collision avoidance (CSMA/CA) with the binary exponential backoff algorithm.

Network congestion in data networking and queueing theory is the reduced quality of service that occurs when a network node or link is carrying more data than it can handle. Typical effects include queueing delay, packet loss or the blocking of new connections. A consequence of congestion is that an incremental increase in offered load leads either only to a small increase or even a decrease in network throughput.

Transmission Control Protocol (TCP) uses a congestion control algorithm that includes various aspects of an additive increase/multiplicative decrease (AIMD) scheme, along with other schemes including slow start and congestion window (CWND), to achieve congestion avoidance. The TCP congestion-avoidance algorithm is the primary basis for congestion control in the Internet. Per the end-to-end principle, congestion control is largely a function of internet hosts, not the network itself. There are several variations and versions of the algorithm implemented in protocol stacks of operating systems of computers that connect to the Internet.

TCP tuning techniques adjust the network congestion avoidance parameters of Transmission Control Protocol (TCP) connections over high-bandwidth, high-latency networks. Well-tuned networks can perform up to 10 times faster in some cases. However, blindly following instructions without understanding their real consequences can hurt performance as well.

Retransmission, essentially identical with automatic repeat request (ARQ), is the resending of packets which have been either damaged or lost. Retransmission is one of the basic mechanisms used by protocols operating over a packet switched computer network to provide reliable communication.

Packet loss occurs when one or more packets of data travelling across a computer network fail to reach their destination. Packet loss is either caused by errors in data transmission, typically across wireless networks, or network congestion. Packet loss is measured as a percentage of packets lost with respect to packets sent.

In statistical time division multiplexing, contention is a media access method that is used to share a broadcast medium. In contention, any computer in the network can transmit data at any time.

Karn's algorithm addresses the problem of getting accurate estimates of the round-trip time for messages when using the Transmission Control Protocol (TCP) in computer networking. The algorithm, also sometimes termed as the Karn-Partridge algorithm was proposed in a paper by Phil Karn and Craig Partridge in 1987.

Zeta-TCP refers to a set of proprietary Transmission Control Protocol (TCP) algorithms aiming at improving the end-to-end performance of TCP, regardless of whether the peer is Zeta-TCP or any other TCP protocol stack, in other words, to be compatible with the existing TCP algorithms. It was designed and implemented by AppEx Networks Corporation.

In mathematics and telecommunications, stochastic geometry models of wireless networks refer to mathematical models based on stochastic geometry that are designed to represent aspects of wireless networks. The related research consists of analyzing these models with the aim of better understanding wireless communication networks in order to predict and control various network performance metrics. The models require using techniques from stochastic geometry and related fields including point processes, spatial statistics, geometric probability, percolation theory, as well as methods from more general mathematical disciplines such as geometry, probability theory, stochastic processes, queueing theory, information theory, and Fourier analysis.

<span class="mw-page-title-main">Fast and Secure Protocol</span> Terminal command scheme used to transfer data

The Fast Adaptive and Secure Protocol (FASP) is a proprietary data transfer protocol. FASP is a network-optimized network protocol created by Michelle C. Munson and Serban Simu, productized by Aspera, and now owned by IBM subsequent to its acquisition of Aspera. The associated client/server software packages are also commonly called Aspera. The technology is patented under US Patent #8085781, Bulk Data Transfer, #20090063698, Method and system for aggregate bandwidth control. and others.

References

  1. Tanenbaum & Wetherall 2010 , p. 395
  2. Rosenberg et al. RFC3261 – SIP: Session Initiation Protocol. The Internet Society. 2002.
  3. Peterson, Larry L.; Davie, Bruce S. (2022). "Chapter 2: Direct Links". Computer Networks: A Systems Approach (Sixth ed.). Morgan Kaufmann Publishers. p. 120. ISBN   978-0-12-818200-0.
  4. Abramson, Norman (1970). The ALOHA System - Another Alternative for Computer Communications (PDF). Proc. 1970 Fall Joint Computer Conference. AFIPS Press.
  5. Roberts, Lawrence G. (April 1975). "ALOHA Packet System With and Without Slots and Capture". Computer Communications Review. 5 (2): 28–42. doi:10.1145/1024916.1024920.
  6. Kleinrock, Leonard; Simon S. Lam (August 1972). Analytic Results for the ARPANET Satellite System Model Including the Effects of the Retransmission Delay Distribution (PDF) (Technical report). ARPA Network Information Center, Stanford Research Institute, Menlo Park, California. ASS Note 12 (NIC 11294).
  7. Lam, Simon S. (March 1974). Packet Switching in a Multi-Access Broadcast Channel with Application to Satellite Communication in a Computer Network, Ph.D. dissertation, 306 pages (Thesis). UCLA-ENG-7429 (ARPA), UCLA School of Engineering and Applied Science.
  8. Kleinrock, Leonard; Lam S., Simon (April 1975). "Packet-Switching in a Multi-Access Broadcast Channel: Performance Evaluation" (PDF). IEEE Transactions on Communications. COM-23 (4): 410–423. Retrieved 16 February 2023.
  9. Lam, Simon S.; Kleinrock, Leonard (September 1975). "Packet-Switching in a Multi-Access Broadcast Channel: Dynamic Control Procedures" (PDF). IEEE Transactions on Communications. COM-23 (9): 891–904. Retrieved 16 July 2023.
  10. "IEEE Standard 802.3-2015". IEEE. Retrieved 20 March 2022. (purchase)
  11. Tanenbaum & Wetherall 2010 , p. 285

Bibliography

PD-icon.svg This article incorporates public domain material from Federal Standard 1037C. General Services Administration. Archived from the original on 22 January 2022.