TCP window scale option

Last updated

The TCP window scale option is an option to increase the receive window size allowed in Transmission Control Protocol above its former maximum value of 65,535 bytes. This TCP option, along with several others, is defined in RFC   7323 which deals with long fat networks (LFNs).

Contents

TCP windows

The throughput of a TCP communication is limited by two windows: the congestion window and the receive window. The congestion window tries not to exceed the capacity of the network (congestion control); the receive window tries not to exceed the capacity of the receiver to process data (flow control). The receiver may be overwhelmed by data if for example it is very busy (such as a Web server). Each TCP segment contains the current value of the receive window. If, for example, a sender receives an ACK which acknowledges byte 4000 and specifies a receive window of 10000 (bytes), the sender will not send packets after byte 14000, even if the congestion window allows it.

Theory

TCP window scale option is needed for efficient transfer of data when the bandwidth-delay product (BDP) is greater than 64 KB [1] . For instance, if a T1 transmission line of 1.5 Mbit/s was used over a satellite link with a 513 millisecond round-trip time (RTT), the bandwidth-delay product is  bits or about 96,187 bytes. Using a maximum buffer size of 64 KB [1] only allows the buffer to be filled to (65,535 / 96,187) = 68% of the theoretical maximum speed of 1.5 Mbit/s, or 1.02 Mbit/s.

By using the window scale option, the receive window size may be increased up to a maximum value of  bytes, or about 1 GiB. [2] This is done by specifying a two byte shift count in the header options field. The true receive window size is left shifted by the value in shift count. A maximum value of 14 may be used for the shift count value. This would allow a single TCP connection to transfer data over the example satellite link at 1.5 Mbit/s utilizing all of the available bandwidth.

Essentially, not more than one full transmission window can be transferred within one round-trip time period. The window scale option enables a single TCP connection to fully utilize an LFN with a BDP of up to 1 GB, e.g. a 10 Gbit/s link with round-trip time of 800 ms.

Possible side effects

Because some firewalls do not properly implement TCP Window Scaling, it can cause a user's Internet connection to malfunction intermittently for a few minutes, then appear to start working again for no reason. There is also an issue if a firewall doesn't support the TCP extensions. [3]

Configuration of operating systems

Windows

TCP Window Scaling is implemented in Windows since Windows 2000. [4] [5] It is enabled by default in Windows Vista / Server 2008 and newer, but can be turned off manually if required. [6] Windows Vista and Windows 7 have a fixed default TCP receive buffer of 64 kB, scaling up to 16 MB through "autotuning", limiting manual TCP tuning over long fat networks. [7]

Linux

Linux kernels (from 2.6.8, August 2004) have enabled TCP Window Scaling by default. The configuration parameters are found in the /proc filesystem, see pseudo-file /proc/sys/net/ipv4/tcp_window_scaling and its companions /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem (more information: man tcp, section sysctl). [8]

Scaling can be turned off by issuing the following command.

$ sudosysctl-w"net.ipv4.tcp_window_scaling=0"

To maintain the changes after a restart, include the line "net.ipv4.tcp_window_scaling=0" in /etc/sysctl.conf (or /etc/sysctl.d/99-sysctl.conf as of systemd 207).

FreeBSD, OpenBSD, NetBSD and Mac OS X

Default setting for FreeBSD, OpenBSD, NetBSD and Mac OS X is to have window scaling (and other features related to RFC 1323) enabled.
To verify their status, a user can check the value of the "net.inet.tcp.rfc1323" variable via the sysctl command:

$ sysctlnet.inet.tcp.rfc1323 

A value of 1 (output "net.inet.tcp.rfc1323=1") means scaling is enabled, 0 means "disabled". If enabled it can be turned off by issuing the command:

$ sudosysctl-wnet.inet.tcp.rfc1323=0

This setting is lost across a system restart. To ensure that it is set at boot time, add the following line to /etc/sysctl.conf: net.inet.tcp.rfc1323=0

However, on macOS 10.14 this command provides an error

sysctl: unknown oid 'net.inet.tcp.rfc1323'

Sources

  1. 1 2 Here, K, M, G, or T refer to the binary prefixes based on powers of 1024.
  2. Borman, D., Braden, B., Jacobson, V., & Scheffenegger, R. (2014). TCP extensions for high performance (No. rfc7323).
  3. "Network connectivity may fail when you try to use Windows Vista behind a firewall device". Support.microsoft.com. Retrieved July 11, 2019.
  4. "Description of Windows 2000 and Windows Server 2003 TCP Features". Support.microsoft.com. Retrieved July 11, 2019.
  5. "TCP Receive Window Size and Window Scaling". Archived from the original on January 1, 2008.
  6. "Network connectivity fails when you try to use Windows Vista behind a firewall device". Microsoft. July 8, 2009.
  7. "MS Windows". Fasterdata.es.net. Retrieved July 11, 2019.
  8. "/proc/sys/net/ipv4/* Variables".

Related Research Articles

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, and file transfer rely on TCP, which is part of the Transport layer of the TCP/IP suite. SSL/TLS often runs on top of TCP.

In computing, traceroute and tracert are diagnostic command-line interface commands for displaying possible routes (paths) and transit delays of packets across an Internet Protocol (IP) network.

In computer networking, the User Datagram Protocol (UDP) is one of the core communication protocols of the Internet protocol suite used to send messages to other hosts on an Internet Protocol (IP) network. Within an IP network, UDP does not require prior communication to set up communication channels or data paths.

A Berkeley (BSD) socket is an application programming interface (API) for Internet domain sockets and Unix domain sockets, used for inter-process communication (IPC). It is commonly implemented as a library of linkable modules. It originated with the 4.2BSD Unix operating system, which was released in 1983.

SOCKS is an Internet protocol that exchanges network packets between a client and server through a proxy server. SOCKS5 optionally provides authentication so only authorized users may access a server. Practically, a SOCKS server proxies TCP connections to an arbitrary IP address, and provides a means for UDP packets to be forwarded. A SOCKS server accepts incoming client connection on TCP port 1080, as defined in RFC 1928.

Explicit Congestion Notification (ECN) is an extension to the Internet Protocol and to the Transmission Control Protocol and is defined in RFC 3168 (2001). ECN allows end-to-end notification of network congestion without dropping packets. ECN is an optional feature that may be used between two ECN-enabled endpoints when the underlying network infrastructure also supports it.

Throughput of a network can be measured using various tools available on different platforms. This page explains the theory behind what these tools set out to measure and the issues regarding these measurements.

Transmission Control Protocol (TCP) uses a congestion control algorithm that includes various aspects of an additive increase/multiplicative decrease (AIMD) scheme, along with other schemes including slow start and a congestion window (CWND), to achieve congestion avoidance. The TCP congestion-avoidance algorithm is the primary basis for congestion control in the Internet. Per the end-to-end principle, congestion control is largely a function of internet hosts, not the network itself. There are several variations and versions of the algorithm implemented in protocol stacks of operating systems of computers that connect to the Internet.

Nagle's algorithm is a means of improving the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. It was defined by John Nagle while working for Ford Aerospace. It was published in 1984 as a Request for Comments (RFC) with title Congestion Control in IP/TCP Internetworks in RFC 896.

The proc filesystem (procfs) is a special filesystem in Unix-like operating systems that presents information about processes and other system information in a hierarchical file-like structure, providing a more convenient and standardized method for dynamically accessing process data held in the kernel than traditional tracing methods or direct access to kernel memory. Typically, it is mapped to a mount point named /proc at boot time. The proc file system acts as an interface to internal data structures about running processes in the kernel. In Linux, it can also be used to obtain information about the kernel and to change certain kernel parameters at runtime (sysctl).

TCP tuning techniques adjust the network congestion avoidance parameters of Transmission Control Protocol (TCP) connections over high-bandwidth, high-latency networks. Well-tuned networks can perform up to 10 times faster in some cases. However, blindly following instructions without understanding their real consequences can hurt performance as well.

sysctl Unix-like software that manages kernel attributes

sysctl is a software mechanism in some Unix-like operating systems that reads and modifies the attributes of the system kernel such as its version number, maximum limits, and security settings. It is available both as a system call for compiled programs, and an administrator command for interactive use and scripting. Linux additionally exposes sysctl as a virtual file system.

Maximum segment lifetime or MSL is the time a TCP segment can exist in the internetwork system. It was defined in 1981 to be 2 minutes.

For this specification the MSL is taken to be 2 minutes. This is an engineering choice, and may be changed if experience indicates it is desirable to do so.

In computer networking, a host model is an option of designing the TCP/IP stack of a networking operating system like Microsoft Windows or Linux. When a unicast packet arrives at a host, IP must determine whether the packet is locally destined. If the IP stack is implemented with a weak host model, it accepts any locally destined packet regardless of the network interface on which the packet was received. If the IP stack is implemented with a strong host model, it only accepts locally destined packets if the destination IP address in the packet matches an IP address assigned to the network interface on which the packet was received.

In data communications, the bandwidth-delay product is the product of a data link's capacity and its round-trip delay time. The result, an amount of data measured in bits, is equivalent to the maximum amount of data on the network circuit at any given time, i.e., data that has been transmitted but not yet acknowledged. The bandwidth-delay product was originally proposed as a rule of thumb for sizing router buffers in conjunction with congestion avoidance algorithm random early detection (RED).

In computer networks, goodput is the application-level throughput of a communication; i.e. the number of useful information bits delivered by the network to a certain destination per unit of time. The amount of data considered excludes protocol overhead bits as well as retransmitted data packets. This is related to the amount of time from the first bit of the first packet sent until the last bit of the last packet is delivered.

In computing, Microsoft's Windows Vista and Windows Server 2008 introduced in 2007/2008 a new networking stack named Next Generation TCP/IP stack, to improve on the previous stack in several ways. The stack includes native implementation of IPv6, as well as a complete overhaul of IPv4. The new TCP/IP stack uses a new method to store configuration settings that enables more dynamic control and does not require a computer restart after a change in settings. The new stack, implemented as a dual-stack model, depends on a strong host-model and features an infrastructure to enable more modular components that one can dynamically insert and remove.

NPF is a BSD licensed stateful packet filter, a central piece of software for firewalling. It is comparable to iptables, ipfw, ipfilter and PF. NPF is developed on NetBSD.

The Stream Control Transmission Protocol (SCTP) is a computer networking communications protocol in the transport layer of the Internet protocol suite. Originally intended for Signaling System 7 (SS7) message transport in telecommunication, the protocol provides the message-oriented feature of the User Datagram Protocol (UDP), while ensuring reliable, in-sequence transport of messages with congestion control like the Transmission Control Protocol (TCP). Unlike UDP and TCP, the protocol supports multihoming and redundant paths to increase resilience and reliability.

WireGuard is a communication protocol and free and open-source software that implements encrypted virtual private networks (VPNs). It aims to be lighter and better performing than IPsec and OpenVPN, two common tunneling protocols. The WireGuard protocol passes traffic over UDP.