Canonicalization

In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

Usage cases

Filenames

Files in file systems may in most cases be accessed through multiple filenames. For instance, in Unix-like systems, the string "/./" can be replaced by "/". On POSIX systems, the C library function realpath() performs this task. Other operations performed by this function to canonicalize filenames are the handling of /.. components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of symbolic links.
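
A minimal sketch in Python, using os.path.realpath(), which performs a canonicalization analogous to the C realpath() function; the exact output depends on the symbolic links present on the system running it:

  import os.path

  # Several spellings that can refer to the same file: a redundant "."
  # component, doubled slashes, and a ".." component that climbs back down.
  paths = [
      "/usr/bin/./python3",
      "/usr//bin/python3",
      "/usr/bin/../bin/python3",
  ]

  # realpath() collapses ".", "..", and repeated slashes and resolves
  # symbolic links, yielding one canonical spelling for the file.
  for p in paths:
      print(p, "->", os.path.realpath(p))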

Canonicalization of filenames is important for computer security. For example, a web server may have a restriction that only files under the cgi directory C:\inetpub\wwwroot\cgi-bin may be executed. This rule is enforced by checking that the path starts with C:\inetpub\wwwroot\cgi-bin\ and only then executing it. While the file C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe initially appears to be in the cgi directory, it exploits the .. path specifier to traverse back up the directory hierarchy in an attempt to execute a file outside of cgi-bin. Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to the simplest representation, C:\Windows\System32\cmd.exe, and is called a directory traversal vulnerability. With the path canonicalized, it is clear the file should not be executed.
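
A short sketch in Python illustrates the failed check described above; the paths are the article's hypothetical ones, and ntpath.normpath() stands in for whatever canonicalization routine a real server would use (ntpath applies Windows path rules and is available on any platform):

  import ntpath  # Windows path semantics, usable on any platform

  ALLOWED_PREFIX = r"C:\inetpub\wwwroot\cgi-bin" + "\\"
  requested = r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"

  # Naive check: passes, because the raw string really does start with
  # the allowed prefix.
  print(requested.startswith(ALLOWED_PREFIX))   # True

  # Canonicalize first: normpath() collapses the ".." components and
  # reveals the real target of the request.
  canonical = ntpath.normpath(requested)
  print(canonical)                              # C:\Windows\System32\cmd.exe
  print(canonical.startswith(ALLOWED_PREFIX))   # False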

Unicode

In Unicode, many accented letters can be represented in more than one way. For example, é can be represented in Unicode as the Unicode character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.
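
A minimal sketch in Python, using the standard unicodedata module to bring both representations of é into Normalization Form C (NFC) before comparing them:

  import unicodedata

  decomposed = "e\u0301"    # U+0065 followed by U+0301 (combining acute accent)
  precomposed = "\u00e9"    # U+00E9, the precomposed character

  # The two strings render identically but are different code point sequences.
  print(decomposed == precomposed)   # False

  # Normalizing both to the same canonical form makes the comparison succeed.
  print(unicodedata.normalize("NFC", decomposed) ==
        unicodedata.normalize("NFC", precomposed))   # True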

Variable-width encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations. Namely, by the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, [1] but some byte sequences are invalid, i.e., they cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations may accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If one uses such a decoder, some Unicode characters effectively have more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply some filter (e.g., a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.
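
For example, the two-byte sequence 0xC0 0xAF is an invalid "overlong" encoding of the slash character, whose only valid UTF-8 encoding is the single byte 0x2F. A sketch in Python, whose built-in UTF-8 codec rejects invalid byte sequences by default:

  overlong = b"\xc0\xaf"   # overlong (invalid) encoding of "/" (U+002F)

  try:
      overlong.decode("utf-8")          # strict decoding raises an error
  except UnicodeDecodeError as err:
      print("rejected:", err)

  print(b"/".decode("utf-8"))           # the single valid encoding decodes normally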

URL

A canonical URL is the URL that defines the single source of truth for duplicate content.

Use by Google

A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site. For example, if the same page is reachable through several URLs (such as https://example.com/?dress=1234 and https://example.com/dresses/1234), Google chooses one of them as canonical. Note that the pages do not need to be absolutely identical; minor changes in sorting or filtering of list pages do not make the page unique (for example, sorting by price or filtering by item color).

The canonical URL can be in a different domain than a duplicate. [2]

Internet

With the help of canonical URLs, a search engine knows which link should be provided in a query result.

A canonical link element, that is, a link tag with rel="canonical" in a page's HTML head, can be used to define the canonical URL.
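
A minimal sketch in Python's standard html.parser module showing how such an element can be located in a page; the page content and URL are hypothetical, modelled on the example above:

  from html.parser import HTMLParser

  class CanonicalFinder(HTMLParser):
      """Records the href of the first <link rel="canonical"> element seen."""

      def __init__(self):
          super().__init__()
          self.canonical = None

      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "link" and attrs.get("rel") == "canonical" and self.canonical is None:
              self.canonical = attrs.get("href")

  page = """<html><head>
  <link rel="canonical" href="https://example.com/dresses/1234">
  </head><body>...</body></html>"""

  finder = CanonicalFinder()
  finder.feed(page)
  print(finder.canonical)   # https://example.com/dresses/1234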

Intranet

In intranets, information is often found by manual searching rather than through a search engine. In this case, canonical URLs can also be defined in a non-machine-readable form, for example in a guideline document.

Misc

Canonical URLs are usually the URLs that are used for the share action.

Since the canonical URL is the one shown in search engine results, it is in most cases a landing page.

Search engines and SEO

In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results. [3] Most search engines support the canonical link element as a hint to which URL should be treated as the true version. As indicated by John Mueller of Google, having other directives in a page, such as the robots noindex element, can give search engines conflicting signals about how to handle canonicalization. [4]

Example:

  http://www.wikipedia.org
  https://www.wikipedia.org
  https://www.wikipedia.org/

All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.
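
The precise rules differ between search engines; the sketch below, in Python's urllib.parse, shows the general idea using only a few illustrative rules: lower-casing the scheme and host, dropping a default port, and adding a trailing slash to an empty path.

  from urllib.parse import urlsplit, urlunsplit

  def simple_canonical(url):
      """Apply a few illustrative URL normalization rules."""
      parts = urlsplit(url)
      scheme = parts.scheme.lower()
      host = parts.hostname.lower() if parts.hostname else ""
      default_port = {"http": 80, "https": 443}.get(scheme)
      if parts.port and parts.port != default_port:
          host = "%s:%d" % (host, parts.port)   # keep only non-default ports
      path = parts.path or "/"
      return urlunsplit((scheme, host, path, parts.query, parts.fragment))

  for url in ("HTTP://www.Wikipedia.org",
              "http://www.wikipedia.org:80/",
              "http://www.wikipedia.org"):
      print(simple_canonical(url))   # http://www.wikipedia.org/ in each case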

XML

A Canonical XML document is by definition an XML document that is in XML Canonical form, defined by the Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.

A simple example would be the following two snippets of XML:

  1. <node1 x='1' a="2">Data</node1    > <node2>Data</node2>
  2. <node1 a="2" x="1">Data</node1> <node2>Data</node2>

The first example contains extra spaces in the closing tag of the first node, gives its attributes in unsorted order, and delimits one attribute value with single quotes. In the canonicalized second example, the spaces have been removed, the attributes have been sorted, and every attribute value is delimited by double quotes. Note that only the spaces within the tags are removed under W3C canonicalization, not those between tags.
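
Python 3.8 and later ship a Canonical XML (C14N 2.0) implementation in the standard xml.etree.ElementTree module; a minimal sketch applies it to the first node of the example above (a single element, since the input to canonicalization must be a well-formed document):

  import xml.etree.ElementTree as ET

  # The un-canonicalized first node from the example above.
  raw = """<node1 x='1' a="2">Data</node1    >"""

  print(ET.canonicalize(xml_data=raw))
  # Expected output: <node1 a="2" x="1">Data</node1>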

A full summary of the canonicalization changes is given in the Canonical XML specification.

Computational linguistics

In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran, and running are forms of the same lexeme, so one of them, such as run, can be selected to represent all the forms. Lexical databases such as Unitex use this kind of representation.

Lemmatisation is the process of converting a word to its canonical form.
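
A minimal sketch in Python, assuming the third-party NLTK library is installed and its WordNet data has been downloaded (for example via nltk.download("wordnet")):

  from nltk.stem import WordNetLemmatizer

  lemmatizer = WordNetLemmatizer()

  # Treated as verbs, all the inflected forms map to the lemma "run".
  for word in ("run", "runs", "ran", "running"):
      print(word, "->", lemmatizer.lemmatize(word, pos="v"))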

See also

Canonical S-expression
Canonical XML
Clean URL
Filename
Unicode equivalence
UTF-8
XML Signature

References

  1. RFC 2279: UTF-8, a transformation format of ISO 10646
  2. "Consolidate Duplicate URLs with Canonicals | Google Search Central".
  3. Cutts, Matt (4 January 2006). "SEO advice: url canonicalization". Matt Cutts: Gadgets, Google, and SEO. Retrieved 3 September 2013.
  4. "Canonicalized URL is noindex, nofollow" . Retrieved 20 April 2020.