URI normalization

Last updated April 18, 2023

URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.

Search engines employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached. Web servers may also perform normalization for many reasons (i.e. to be able to more easily intercept security risks coming from client requests, to use only one absolute file name for each resource stored in their caches, named in log files, etc.).

Normalization process

There are several types of normalization that may be performed. Some of them are always semantics preserving and some may not be.

Normalizations that preserve semantics

The following normalizations are described in RFC 3986 ^[1] to result in equivalent URIs:

Converting percent-encoded triplets to uppercase. The hexadecimal digits within a percent-encoding triplet of the URI (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F.^[2] Example:

http://example.com/foo%2a → http://example.com/foo%2A

Converting the scheme and host to lowercase. The scheme and host components of the URI are case-insensitive and therefore should be normalized to lowercase.^[3] Example:

HTTP://User@Example.COM/Foo → http://User@example.com/Foo

Decoding percent-encoded triplets of unreserved characters. Percent-encoded triplets of the URI in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) do not require percent-encoding and should be decoded to their corresponding unreserved characters.^[4] Example:

http://example.com/%7Efoo → http://example.com/~foo

Removing dot-segments. Dot-segments . and .. in the path component of the URI should be removed by applying the remove_dot_segments algorithm^[5] to the path described in RFC 3986.^[6] Example:

http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux

Converting an empty path to a "/" path. In presence of an authority component, an empty path component should be normalized to a path component of "/".^[7] Example:

http://example.com → http://example.com/

Removing the default port. An empty or default port component of the URI (port 80 for the http scheme) with its ":" delimiter should be removed.^[8] Example:

http://example.com:80/ → http://example.com/

Normalizations that usually preserve semantics

For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but are not guaranteed to by the standards:

Adding a trailing "/" to a non-empty path. Directories (folders) are indicated with a trailing slash and should be included in URIs. Example:

http://example.com/foo → http://example.com/foo/

However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.

Normalizations that change semantics

Applying the following normalizations result in a semantically different URI although it may refer to the same resource:

Removing directory index. Default directory indexes are generally not needed in URIs. Examples:

http://example.com/a/index.html → http://example.com/a/

http://example.com/default.asp → http://example.com/

Removing the fragment. The fragment component of a URI is never seen by the server and can sometimes be removed. Example:

http://example.com/bar.html#section1 → http://example.com/bar.html

However, AJAX applications frequently use the value in the fragment.

Replacing IP with domain name. Check if the IP address maps to a domain name. Example:

http://208.77.188.166/ → http://example.com/

The reverse replacement is rarely safe due to virtual web servers.

Limiting protocols. Limiting different application layer protocols. For example, the “https” scheme could be replaced with “http”. Example:

https://example.com/ → http://example.com/

Removing duplicate slashes Paths which include two adjacent slashes could be converted to one. Example:

http://example.com/foo//bar.html → http://example.com/foo/bar.html

Removing or adding “www” as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a naked domain. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example:

http://www.example.com/ → http://example.com/

Sorting the query parameters. Some web pages use more than one query parameter in the URI. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URI. Example:

http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en

However, the order of parameters in a URI may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.^[9]

Removing unused query variables. A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:

http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123

Note that a parameter without a value is not necessarily an unused parameter.

Removing default query parameters. A default value in the query string may render identically whether it is there or not. Example:

http://example.com/display?id=&sort=ascending → http://example.com/display

Removing the "?" when the query is empty. When the query is empty, there may be no need for the "?". Example:

http://example.com/display? → http://example.com/display

Normalization based on URI lists

Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI

http://example.com/story?id=xyz

appears in a crawl log several times along with

http://example.com/story_xyz

we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.

Related Research Articles

In computing, Common Gateway Interface (CGI) is an interface specification that enables web servers to execute an external program, typically to process user requests.

A Uniform Resource Identifier (URI) is a unique sequence of characters that identifies a logical or physical resource used by web technologies. URIs may be used to identify anything, including real-world objects, such as people and places, concepts, or information resources such as web pages and books. Some URIs provide a means of locating and retrieving information resources on a network ; these are Uniform Resource Locators (URLs). A URL provides the location of the resource. A URI identifies the resource by name at the specified location or URL. Other URIs provide only a unique name, without a means of locating or retrieving the resource or information about it, these are Uniform Resource Names (URNs). The web technologies that use URIs are not limited to web browsers. URIs are used to identify anything described using the Resource Description Framework (RDF), for example, concepts that are part of an ontology defined using the Web Ontology Language (OWL), and people who are described using the Friend of a Friend vocabulary would each have an individual URI.

A Uniform Resource Name (URN) is a Uniform Resource Identifier (URI) that uses the urn scheme. URNs are globally unique persistent identifiers assigned within defined namespaces so they will be available for a long period of time, even after the resource which they identify ceases to exist or becomes unavailable. URNs cannot be used to directly locate an item and need not be resolvable, as they are simply templates that another parser may use to find an item.

URL redirection, also called URL forwarding, is a World Wide Web technique for making a web page available under more than one URL address. When a web browser attempts to open a URL that has been redirected, a page with a different URL is opened. Similarly, domain redirection or domain forwarding is when all pages in a URL domain are redirected to a different domain, as when wikipedia.com and wikipedia.net are automatically redirected to wikipedia.org.

A query string is a part of a uniform resource locator (URL) that assigns values to specified parameters. A query string commonly includes fields added to a base URL by a Web browser or other client application, for example as part of an HTML document, choosing the appearance of a page, or jumping to positions in multimedia content.

In computer science, canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

The Internationalized Resource Identifier (IRI) is an internet protocol standard which builds on the Uniform Resource Identifier (URI) protocol by greatly expanding the set of permitted characters. It was defined by the Internet Engineering Task Force (IETF) in 2005 in RFC 3987. While URIs are limited to a subset of the US-ASCII character set, IRIs may additionally contain most characters from the Universal Character Set, including Chinese, Japanese, Korean, and Cyrillic characters.

The data URI scheme is a uniform resource identifier (URI) scheme that provides a way to include data in-line in Web pages as if they were external resources. It is a form of file literal or here document. This technique allows normally separate elements such as images and style sheets to be fetched in a single Hypertext Transfer Protocol (HTTP) request, which may be more efficient than multiple HTTP requests, and used by several browser extensions to package images as well as other multimedia contents in a single HTML file for page saving. As of 2022, data URIs are fully supported by most major browsers, and partially supported in Internet Explorer.

In computer hypertext, a URI fragment is a string of characters that refers to a resource that is subordinate to another, primary resource. The primary resource is identified by a Uniform Resource Identifier (URI), and the fragment identifier points to the subordinate resource.

URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI) using only the limited US-ASCII characters legal within a URI. Although it is known as URL encoding, it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests.

The feed URI scheme was a suggested uniform resource identifier (URI) scheme designed to facilitate subscription to web feeds; specifically, it was intended that a news aggregator be launched whenever a hyperlink to a feed URI was clicked in a web browser. The scheme was intended to flag a document in a syndication format such as Atom or RSS. The document would be typically served over HTTP.

HTTP cookies are small blocks of data created by a web server while a user is browsing a website and placed on the user's computer or other device by the user's web browser. Cookies are placed on the device used to access a website, and more than one cookie may be placed on a user's device during a session.

Clean URLs are web addresses or Uniform Resource Locator (URLs) intended to improve the usability and accessibility of a website, web application, or web service by being immediately and intuitively meaningful to non-expert users. Such URL schemes tend to reflect the conceptual structure of a collection of information and decouple the user interface from a server's internal representation of information. Other reasons for using clean URLs include search engine optimization (SEO), conforming to the representational state transfer (REST) style of software architecture, and ensuring that individual web resources remain consistently at the same URL. This makes the World Wide Web a more stable and useful system, and allows more durable and reliable bookmarking of web resources.

The file URI Scheme is a URI scheme defined in RFC 8089, typically used to retrieve files from within one's own computer.

In computer networking, the Message Session Relay Protocol (MSRP) is a protocol for transmitting a series of related instant messages in the context of a communications session. An application instantiates the session with the Session Description Protocol (SDP) over Session Initiation Protocol (SIP) or other rendezvous methods.

In computing, POST is a request method supported by HTTP used by the World Wide Web. By design, the POST request method requests that a web server accept the data enclosed in the body of the request message, most likely for storing it. It is often used when uploading a file or when submitting a completed web form.

A URI Template is a way to specify a URI that includes parameters that must be substituted before the URI is resolved. It was standardized by RFC 6570 in March 2012.

The HTTP Location header field is returned in responses from an HTTP server under two circumstances:

To ask a web browser to load a different web page. In this circumstance, the Location header should be sent with an HTTP status code of 3xx. It is passed as part of the response by a web server when the requested URI has:
To provide information about the location of a newly created resource. In this circumstance, the Location header should be sent with an HTTP status code of 201 or 202.

The geo URI scheme is a Uniform Resource Identifier (URI) scheme defined by the Internet Engineering Task Force's RFC 5870 as:

a Uniform Resource Identifier (URI) for geographic locations using the 'geo' scheme name. A 'geo' URI identifies a physical location in a two- or three-dimensional coordinate reference system in a compact, simple, human-readable, and protocol-independent way.

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI), although many people use the two terms interchangeably. URLs occur most commonly to reference web pages (HTTP) but are also used for file transfer (FTP), email (mailto), database access (JDBC), and many other applications.

References

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
Sang Ho Lee; Sung Jin Kim & Seok Hoo Hong (2005). On URL normalization (PDF). Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005). pp. 1076–1085. Archived from the original (PDF) on September 18, 2006.
Uri Schonfeld; Ziv Bar-Yossef & Idit Keidar (2006). Do not crawl in the dust: different URLs with similar text. Proceedings of the 15th international conference on World Wide Web. pp. 1015–1016.
Uri Schonfeld; Ziv Bar-Yossef & Idit Keidar (2007). Do not crawl in the dust: different URLs with similar text. Proceedings of the 16th international conference on World Wide Web. pp. 111–120.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] RFC 3986, Section 6. Normalization and Comparison

[2] RFC 3986, Section 6.2.2.1. Case Normalization

[3] RFC 3986, Section 6.2.2.1. Case Normalization

[4] RFC 3986, Section 6.2.2.3. Path Segment Normalization

[5] RFC 3986, 5.2.4. Remove Dot Segments

[6] RFC 3986, 6.2.2.3. Path Segment Normalization

[7] RFC 3986, Section 6.2.3. Scheme-Based Normalization

[8] RFC 3986, Section 6.2.3. Scheme-Based Normalization

[9] "jQuery 1.4 $.param demystified". Ben Alman. December 20, 2009. Retrieved August 24, 2013.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]