Nagle's algorithm

Last updated

Nagle's algorithm is a means of improving the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. It was defined by John Nagle while working for Ford Aerospace. It was published in 1984 as a Request for Comments (RFC) with title Congestion Control in IP/TCP Internetworks in RFC   896.

Contents

The RFC describes what Nagle calls the "small-packet problem", where an application repeatedly emits data in small chunks, frequently only 1 byte in size. Since TCP packets have a 40-byte header (20 bytes for TCP, 20 bytes for IPv4), this results in a 41-byte packet for 1 byte of useful information, a huge overhead. This situation often occurs in Telnet sessions, where most keypresses generate a single byte of data that is transmitted immediately. Worse, over slow links, many such packets can be in transit at the same time, potentially leading to congestion collapse.

Nagle's algorithm works by combining a number of small outgoing messages and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet's worth of output, thus allowing output to be sent all at once.

Algorithm

The RFC defines the algorithm as

inhibit the sending of new TCP segments when new outgoing data arrives from the user if any previously transmitted data on the connection remains unacknowledged.

Where MSS is the maximum segment size, the largest segment that can be sent on this connection, and the window size is the currently acceptable window of unacknowledged data, this can be written in pseudocode as[ citation needed ]

if there is new data to send thenif the window size ≥ MSS and available data is ≥ MSS then         send complete MSS segment now     elseif there is unconfirmed data still in the pipe then             enqueue data in the buffer until an acknowledge is received         else             send data immediately         end ifend ifend if

Interaction with delayed ACK

This algorithm interacts badly with TCP delayed acknowledgments (delayed ACK), a feature introduced into TCP at roughly the same time in the early 1980s, but by a different group. With both algorithms enabled, applications that do two successive writes to a TCP connection, followed by a read that will not be fulfilled until after the data from the second write has reached the destination, experience a constant delay of up to 500 milliseconds, the "ACK delay". It is recommended to disable either, although traditionally it's easier to disable Nagle, since such a switch already exists for real-time applications.

A solution recommended by Nagle, that prevents the algorithm sending premature packets, is by buffering up application writes then flushing the buffer: [1]

The user-level solution is to avoid write–write–read sequences on sockets. Write–read–write–read is fine. Write–write–write is fine. But write–write–read is a killer. So, if you can, buffer up your little writes to TCP and send them all at once. Using the standard UNIX I/O package and flushing write before each read usually works.

Nagle considers delayed ACKs a "bad idea" since the application layer does not usually respond within the delay window (which would allow the ACK to be combined with the response packet). [2] For typical (non-realtime) use cases, he recommends disabling delayed ACK instead of disabling his algorithm, as "quick" ACKs do not incur as much overhead as many small packets do for the same improvement in round-trip time. [3]

Disabling either Nagle or delayed ACK

TCP implementations usually provide applications with an interface to disable the Nagle algorithm. This is typically called the TCP_NODELAY option. On Microsoft Windows the TcpNoDelay registry switch decides the default. TCP_NODELAY is present since the TCP/IP stack in 4.2BSD of 1983, a stack with many descendants. [4]

The interface for disabling delayed ACK is not consistent among systems. The TCP_QUICKACK flag is available on Linux since 2001 (2.4.4) and potentially on Windows, where the official interface is SIO_TCP_SET_ACK_FREQUENCY. [5]

Setting TcpAckFrequency to 1 in the Windows registry turns off delayed ACK by default. [6] On FreeBSD, the sysctl entry net.inet.tcp.delayed_ack controls the default behavior. [4] No such switch is present in Linux. [7]

Large-write case

The interaction between delayed ACK and Nagle also extends to larger writes. If the data in a single write spans 2n packets, where there are 2n-1 full-sized TCP segments followed by a partial TCP segment, the original Nagle algorithm would withhold the last packet, waiting for either more data to send (to fill the packet), or the ACK for the previous packet (indicating that all the previous packets have left the network). A delayed ACK would, again, add a maximum of 500 ms before the last packet is sent. [8] This behavior limits performance for non-pipelined stop-and-wait request-response application protocol such as HTTP with persistent connection. [9]

Minshall's modification to Nagle's algorithm makes it such that the algorithm always sends if the last packet is full-sized, only waiting for an acknowledgement when the last packet is partial. The goal was to weaken the incentive for disabling Nagle by taking care of this large-write penalty. [10] Again, disabling delayed ACK on the receiving end would remove the issue completely.

Interactions with real-time systems

Applications that expect real-time responses and low latency can react poorly with Nagle's algorithm. Applications such as networked multiplayer video games or the movement of the mouse in a remotely controlled operating system, expect that actions are sent immediately, while the algorithm purposefully delays transmission, increasing bandwidth efficiency at the expense of one-way latency. [3] For this reason applications with low-bandwidth time-sensitive transmissions typically use TCP_NODELAY to bypass the Nagle-delayed ACK delay. [11]

Another option is to use UDP instead.

Operating systems implementation

Most modern operating systems implement Nagle's algorithms. In AIX, [12] Linux, and Windows [13] it is enabled by default and can be disabled on a per-socket basis using the TCP_NODELAY option.

Related Research Articles

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, and file transfer rely on TCP, which is part of the Transport layer of the TCP/IP suite. SSL/TLS often runs on top of TCP.

In computer networking, the User Datagram Protocol (UDP) is one of the core communication protocols of the Internet protocol suite used to send messages to other hosts on an Internet Protocol (IP) network. Within an IP network, UDP does not require prior communication to set up communication channels or data paths.

Trivial File Transfer Protocol (TFTP) is a simple lockstep File Transfer Protocol which allows a client to get a file from or put a file onto a remote host. One of its primary uses is in the early stages of nodes booting from a local area network. TFTP has been used for this application because it is very simple to implement.

Berkeley sockets is an application programming interface (API) for Internet sockets and Unix domain sockets, used for inter-process communication (IPC). It is commonly implemented as a library of linkable modules. It originated with the 4.2BSD Unix operating system, which was released in 1983.

Explicit Congestion Notification (ECN) is an extension to the Internet Protocol and to the Transmission Control Protocol and is defined in RFC 3168 (2001). ECN allows end-to-end notification of network congestion without dropping packets. ECN is an optional feature that may be used between two ECN-enabled endpoints when the underlying network infrastructure also supports it.

TCP offload engine (TOE) is a technology used in some network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant. TOEs are often used as a way to reduce the overhead associated with Internet Protocol (IP) storage protocols such as iSCSI and Network File System (NFS).

Transmission Control Protocol (TCP) uses a congestion control algorithm that includes various aspects of an additive increase/multiplicative decrease (AIMD) scheme, along with other schemes including slow start and congestion window (CWND), to achieve congestion avoidance. The TCP congestion-avoidance algorithm is the primary basis for congestion control in the Internet. Per the end-to-end principle, congestion control is largely a function of internet hosts, not the network itself. There are several variations and versions of the algorithm implemented in protocol stacks of operating systems of computers that connect to the Internet.

Silly window syndrome (SWS) is a problem in computer networking caused by poorly implemented TCP flow control. A serious problem can arise in the sliding window operation when the sending application program creates data slowly, the receiving application program consumes data slowly, or both. If a server with this problem is unable to process all incoming data, it requests that its clients reduce the amount of data they send at a time. If the server continues to be unable to process all incoming data, the window becomes smaller and smaller, sometimes to the point that the data transmitted is smaller than the packet header, making data transmission extremely inefficient. The name of this problem is due to the window size shrinking to a "silly" value.

TCP tuning techniques adjust the network congestion avoidance parameters of Transmission Control Protocol (TCP) connections over high-bandwidth, high-latency networks. Well-tuned networks can perform up to 10 times faster in some cases. However, blindly following instructions without understanding their real consequences can hurt performance as well.

Retransmission, essentially identical with automatic repeat request (ARQ), is the resending of packets which have been either damaged or lost. Retransmission is one of the basic mechanisms used by protocols operating over a packet switched computer network to provide reliable communication.

Protocol spoofing is used in data communications to improve performance in situations where an existing protocol is inadequate, for example due to long delays or high error rates.

A network socket is a software structure within a network node of a computer network that serves as an endpoint for sending and receiving data across the network. The structure and properties of a socket are defined by an application programming interface (API) for the networking architecture. Sockets are created only during the lifetime of a process of an application running in the node.

In computer networks, goodput is the application-level throughput of a communication; i.e. the number of useful information bits delivered by the network to a certain destination per unit of time. The amount of data considered excludes protocol overhead bits as well as retransmitted data packets. This is related to the amount of time from the first bit of the first packet sent until the last bit of the last packet is delivered.

In computing, Microsoft's Windows Vista and Windows Server 2008 introduced in 2007/2008 a new networking stack named Next Generation TCP/IP stack, to improve on the previous stack in several ways. The stack includes native implementation of IPv6, as well as a complete overhaul of IPv4. The new TCP/IP stack uses a new method to store configuration settings that enables more dynamic control and does not require a computer restart after a change in settings. The new stack, implemented as a dual-stack model, depends on a strong host-model and features an infrastructure to enable more modular components that one can dynamically insert and remove.

A sliding window protocol is a feature of packet-based data transmission protocols. Sliding window protocols are used where reliable in-order delivery of packets is required, such as in the data link layer as well as in the Transmission Control Protocol (TCP). They are also used to improve efficiency when the channel may include high latency.

Sockstress is a method of attacking servers and other devices that accept TCP connections on the Internet and other TCP-based networks. This method depletes local resources in order to crash a service or an entire machine, essentially functioning as a denial-of-service attack.

TCP delayed acknowledgment is a technique used by some implementations of the Transmission Control Protocol in an effort to improve network performance. In essence, several ACK responses may be combined into a single response, reducing protocol overhead. However, in some circumstances, the technique can reduce application performance.

In data networking, telecommunications, and computer buses, an acknowledgment (ACK) is a signal that is passed between communicating processes, computers, or devices to signify acknowledgment, or receipt of message, as part of a communications protocol. The negative-acknowledgement is a signal that is sent to reject a previously received message or to indicate some kind of error. Acknowledgments and negative acknowledgments inform a sender of the receiver's state so that it can adjust its own state accordingly.

The Stream Control Transmission Protocol (SCTP) is a computer networking communications protocol in the transport layer of the Internet protocol suite. Originally intended for Signaling System 7 (SS7) message transport in telecommunication, the protocol provides the message-oriented feature of the User Datagram Protocol (UDP), while ensuring reliable, in-sequence transport of messages with congestion control like the Transmission Control Protocol (TCP). Unlike UDP and TCP, the protocol supports multihoming and redundant paths to increase resilience and reliability.

NACK-Oriented Reliable Multicast (NORM) is a transport layer Internet protocol designed to provide reliable transport in multicast groups in data networks. It is formally defined by the Internet Engineering Task Force (IETF) in Request for Comments (RFC) 5740, which was published in November 2009.

References

  1. John Nagle (January 19, 2006), Boosting Socket Performance on Linux, Slashdot
  2. Nagle, John. "Sigh. If you're doing bulk file transfers, you never hit that problem. (reply 9048947)". Hacker News. Retrieved 9 May 2018.
  3. 1 2 Nagle, John. "That fixed 200ms ACK delay timer was a horrible mistake. Why 200ms? Human reaction time. (reply 9050645)". Hacker News. Retrieved 9 May 2018. [...] One of the few legit cases for turning off the Nagle algorithm is for a FPS game running over the net. There, one-way latency matters; getting your shots and moves to the server before the other players affects gameplay.
  4. 1 2 tcp(4)    FreeBSD Kernel Interfaces Manual
  5. "sockets - C++ Disable Delayed Ack on Windows". Stack Overflow.
  6. "New registry entry for controlling the TCP Acknowledgment (ACK) behavior in Windows XP and in Windows Server 2003". 23 February 2023.
  7. tcp(7)    Linux Programmer's Manual – Overview, Conventions and Miscellanea
  8. "TCP Performance problems caused by interaction between Nagle's Algorithm and Delayed ACK". Stuartcheshire.org. Retrieved November 14, 2012.
  9. Heidemann, John (April 1997). "Performance Interactions Between P-HTTP and TCP Implementations". Computer Communications Review. 27 (2). ACM: 65–73. doi: 10.1145/263876.263886 . S2CID   6992265.
  10. A Proposed Modification to Nagle’s Algorithm. 1999. I-D draft-minshall-nagle.
  11. Bug 17868 Some Java applications are slow on remote X connections.
  12. "IBM Knowledge Center". www.ibm.com.
  13. "How would one disable Nagle's algorithm in Linux?". Stack Overflow.