HTTP ETag

Last updated

The ETag or entity tag is part of HTTP, the protocol for the World Wide Web. It is one of several mechanisms that HTTP provides for Web cache validation, which allows a client to make conditional requests. This mechanism allows caches to be more efficient and saves bandwidth, as a Web server does not need to send a full response if the content has not changed. ETags can also be used for optimistic concurrency control [1] to help prevent simultaneous updates of a resource from overwriting each other.

Contents

An ETag is an opaque identifier assigned by a Web server to a specific version of a resource found at a URL. [2] If the resource representation at that URL ever changes, a new and different ETag is assigned. Used in this manner, ETags are similar to fingerprints and can quickly be compared to determine whether two representations of a resource are the same.

ETag generation

The use of ETags in the HTTP header is optional (not mandatory as with some other fields of the HTTP 1.1 header). The method by which ETags are generated has never been specified in the HTTP specification.

Common methods of ETag generation include using a collision-resistant hash function of the resource's content, a hash of the last modification timestamp, or even just a revision number.

In order to avoid the use of stale cache data, methods used to generate ETags should guarantee (as much as is practical) that each ETag is unique. However, an ETag-generation function could be judged to be "usable", if it can be proven (mathematically) that duplication of ETags would be "acceptably rare", even if it could or would occur.

RFC-7232 explicitly states that ETags should be content-coding aware, e.g.

ETag: "123-a" – for no Content-Encoding ETag: "123-b" – for Content-Encoding: gzip

Some earlier checksum functions that were weaker than CRC32 or CRC64 are known to suffer from hash collision problems. Thus they were not good candidates for use in ETag generation.

Strong and weak validation

The ETag mechanism supports both strong validation and weak validation. They are distinguished by the presence of an initial "W/" in the ETag identifier, as:

"123456789"   – A strong ETag validator W/"123456789" – A weak ETag validator

A strongly validating ETag match indicates that the content of the two resource representations is byte-for-byte identical and that all other entity fields (such as Content-Language) are also unchanged. Strong ETags permit the caching and reassembly of partial responses, as with byte-range requests.

A weakly validating ETag match only indicates that the two representations are semantically equivalent, meaning that for practical purposes they are interchangeable and that cached copies can be used. However, the resource representations are not necessarily byte-for-byte identical, and thus weak ETags are not suitable for byte-range requests. Weak ETags may be useful for cases in which strong ETags are impractical for a Web server to generate, such as with dynamically generated content.

Typical usage

In typical usage, when a URL is retrieved, the Web server will return the resource's current representation along with its corresponding ETag value, which is placed in an HTTP response header "ETag" field:

ETag: "686897696a7c876b7e"

The client may then decide to cache the representation, along with its ETag. Later, if the client wants to retrieve the same URL resource again, it will first determine whether the locally cached version of the URL has expired (through the Cache-Control and the Expire headers). If the URL has not expired, it will retrieve the locally cached resource. If it is determined that the URL has expired (is stale), the client will send a request to the server that includes its previously saved copy of the ETag in the "If-None-Match" field. [2]

If-None-Match: "686897696a7c876b7e"

On this subsequent request, the server may now compare the client's ETag with the ETag for the current version of the resource. If the ETag values match, meaning that the resource has not changed, the server may send back a very short response with a HTTP 304 Not Modified status. The 304 status tells the client that its cached version is still good and that it should use that.

However, if the ETag values do not match, meaning the resource has likely changed, a full response including the resource's content is returned, just as if ETags were not being used. In this case, the client may decide to replace its previously cached version with the newly returned representation of the resource and the new ETag.

ETag values can be used in Web page monitoring systems. Efficient Web page monitoring is hindered by the fact that most websites do not set the ETag headers for Web pages. When a Web monitor has no hints whether Web content has been changed, all content has to be retrieved and analyzed using computing resources for both the publisher and subscriber.

Mismatched ETag detection

A buggy website can at times fail to update the ETag after its semantic resource has been updated. As of 2019, an example of a prominent such site is export.arxiv.org. [3] As a result, the incorrectly returned response is status 304, and the client fails to retrieve the updated resource. To detect such a buggy website:

Tracking using ETags

ETags can be used to track unique users, [4] as HTTP cookies are increasingly being deleted by privacy-aware users. In July 2011, Ashkan Soltani and a team of researchers at UC Berkeley reported that a number of websites, including Hulu, were using ETags for tracking purposes. [5] Hulu and KISSmetrics have both ceased "respawning" as of 29 July 2011, [6] as KISSmetrics and over 20 of its clients are facing a class-action lawsuit over the use of "undeletable" tracking cookies partially involving the use of ETags. [7]

Because ETags are cached by the browser and returned with subsequent requests for the same resource, a tracking server can simply repeat any ETag received from the browser to ensure an assigned ETag persists indefinitely (in a similar way to persistent cookies). Additional caching headers can also enhance the preservation of ETag data. [8]

ETags may be flushable by clearing the browser cache (implementations vary).

Related Research Articles

<span class="mw-page-title-main">HTTP</span> Application protocol for distributed, collaborative, hypermedia information systems

HTTP is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, where hypertext documents include hyperlinks to other resources that the user can easily access, for example by a mouse click or by tapping the screen in a web browser.

Time to live (TTL) or hop limit is a mechanism which limits the lifespan or lifetime of data in a computer or network. TTL may be implemented as a counter or timestamp attached to or embedded in the data. Once the prescribed event count or timespan has elapsed, data is discarded or revalidated. In computer networking, TTL prevents a data packet from circulating indefinitely. In computing applications, TTL is commonly used to improve the performance and manage the caching of data.

<span class="mw-page-title-main">Web server</span> Computer software that distributes web pages

A web server is computer software and underlying hardware that accepts requests via HTTP or its secure variant HTTPS. A user agent, commonly a web browser or web crawler, initiates communication by making a request for a web page or other resource using HTTP, and the server responds with the content of that resource or an error message. A web server can also accept and store resources sent from the user agent if configured to do so.

<span class="mw-page-title-main">Proxy server</span> Computer server that makes and receives requests on behalf of a user

In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource. It improves privacy, security, and possibly performance in the process.

A Web cache is a system for optimizing the World Wide Web. It is implemented both client-side and server-side. The caching of multimedia and other files can result in less overall delay when browsing the Web.

URL redirection, also called URL forwarding, is a World Wide Web technique for making a web page available under more than one URL address. When a web browser attempts to open a URL that has been redirected, a page with a different URL is opened. Similarly, domain redirection or domain forwarding is when all pages in a URL domain are redirected to a different domain, as when wikipedia.com and wikipedia.net are automatically redirected to wikipedia.org.

The Simple Common Gateway Interface (SCGI) is a protocol for applications to interface with HTTP servers, as an alternative to the CGI protocol. It is similar to FastCGI but is designed to be easier to parse. Unlike CGI, it permits a long-running service process to continue serving requests, thus avoiding delays in responding to requests due to setup overhead.

Real-Time Messaging Protocol (RTMP) is a communication protocol for streaming audio, video, and data over the Internet. Originally developed as a proprietary protocol by Macromedia for streaming between Flash Player and the Flash Communication Server, Adobe has released an incomplete version of the specification of the protocol for public use.

<span class="mw-page-title-main">HTTP cookie</span> Small pieces of data stored by a web browser while on a website

HTTP cookies are small blocks of data created by a web server while a user is browsing a website and placed on the user's computer or other device by the user's web browser. Cookies are placed on the device used to access a website, and more than one cookie may be placed on a user's device during a session.

<span class="mw-page-title-main">HTTP 301</span> HTTP response status code

On the World Wide Web, HTTP 301 is the HTTP response status code for 301 Moved Permanently. It is used for permanent redirecting, meaning that links or records returning this response should be updated. The new URL should be provided in the Location field, included with the response. The 301 redirect is considered a best practice for upgrading users from HTTP to HTTPS.

<span class="mw-page-title-main">Byte serving</span> HTTP facility to fetch a specific part of a file

Byte serving is the process introduced in HTTP protocol 1.1 of sending only a portion of a message from a server to a client. Byte serving begins when an HTTP server advertises its willingness to serve partial requests using the Accept-Ranges response header. A client then requests a specific part of a file from the server using the Range request header. If the range is valid, the server sends it to the client with a 206 Partial Content status code and a Content-Range header listing the range sent. If the range is invalid, the server responds with a 416 Requested Range Not Satisfiable status code.

<span class="mw-page-title-main">HTTP message body</span>

HTTP Message Body is the data bytes transmitted in an HTTP transaction message immediately following the headers if there are any.

Web tracking is the practice by which operators of websites and third parties collect, store and share information about visitors' activities on the World Wide Web. Analysis of a user's behaviour may be used to provide content that enables the operator to infer their preferences and may be of interest to various parties, such as advertisers. Web tracking can be part of visitor management.

Cross-site request forgery, also known as one-click attack or session riding and abbreviated as CSRF or XSRF, is a type of malicious exploit of a website or web application where unauthorized commands are submitted from a user that the web application trusts. There are many ways in which a malicious website can transmit such commands; specially-crafted image tags, hidden forms, and JavaScript fetch or XMLHttpRequests, for example, can all work without the user's interaction or even knowledge. Unlike cross-site scripting (XSS), which exploits the trust a user has for a particular site, CSRF exploits the trust that a site has in a user's browser. In a CSRF attack, an innocent end user is tricked by an attacker into submitting a web request that they did not intend. This may cause actions to be performed on the website that can include inadvertent client or server data leakage, change of session state, or manipulation of an end user's account.

Constrained Application Protocol (CoAP) is a specialized UDP-based Internet application protocol for constrained devices, as defined in RFC 7252. It enables those constrained devices called "nodes" to communicate with the wider Internet using similar protocols. CoAP is designed for use between devices on the same constrained network, between devices and general nodes on the Internet, and between devices on different constrained networks both joined by an internet. CoAP is also being used via other mechanisms, such as SMS on mobile communication networks.

<span class="mw-page-title-main">Ashkan Soltani</span> American computer scientist

Ashkan Soltani is the executive director of the California Privacy Protection Agency. He has previously been the Chief Technologist of the Federal Trade Commission and an independent privacy and security researcher based in Washington, DC.

Dynamic Site Acceleration (DSA) is a group of technologies which make the delivery of dynamic websites more efficient. Manufacturers of application delivery controllers and content delivery networks (CDNs) use a host of techniques to accelerate dynamic sites, including:

<span class="mw-page-title-main">PATCH (HTTP)</span> Request method in the HTTP protocol

In computing, the PATCH method is a request method in HTTP for making partial changes to an existing resource. The PATCH method provides an entity containing a list of changes to be applied to the resource requested using the HTTP Uniform Resource Identifier (URI). The list of changes are supplied in the form of a PATCH document. If the requested resource does not exist then the server may create the resource depending on the PATCH document media type and permissions. The changes described in the PATCH document must be semantically well defined but can have a different media type than the resource being patched. Languages such as XML or JSON can be used in describing the changes in the PATCH document.

References

  1. "Editing the Web – Detecting the Lost Update Problem Using Unreserved Checkout". W3C Note. 10 May 1999.
  2. 1 2 "ETag – HTTP | MDN". developer.mozilla.org. Retrieved 10 October 2021.
  3. "Mismatched export.arxiv.org ETag". Gist.
  4. "tracking without cookies". 17 February 2003.
  5. Ayenson, Mika D.; Wambach, Dietrich James; Soltani, Ashkan; Good, Nathan; Hoofnagle, Chris Jay (29 July 2011). "Flash Cookies and Privacy II: Now with HTML5 and ETag Respawning". SSRN   1898390.
  6. Soltani, Ashkan (11 August 2011). "Flash Cookies and Privacy II". askhansoltani.org. Retrieved 27 June 2023.
  7. Anthony, Sebastian (4 August 2011). "AOL, Spotify, GigaOm, Etsy, KISSmetrics sued over undeletable tracking cookies". ExtremeTech . Retrieved 27 June 2023.
  8. "Cookieless cookies". GitHub lucb1e. 25 August 2013. Retrieved 27 June 2023.