In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
Files in file systems may in most cases be accessed through multiple filenames. For instance, in Unix-like systems, the string "/./" can be replaced by "/". In the C standard library, the function realpath() performs this task. Other operations performed by this function to canonicalize filenames are the handling of /.. components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of symbolic links.
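For illustration, a minimal sketch of this kind of canonicalization using Python's os.path.realpath (an analogue of the C realpath() function); the path shown is a hypothetical example and the exact result depends on the host file system:

import os.path

# "." components, ".." components, repeated slashes and trailing slashes are
# collapsed, and symbolic links are resolved along the way.
print(os.path.realpath("/usr/./bin//../lib/"))   # typically prints "/usr/lib"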
Canonicalization of filenames is important for computer security. For example, a web server may have a restriction that only files under the cgi directory C:\inetpub\wwwroot\cgi-bin may be executed. This rule is enforced by checking that the path starts with C:\inetpub\wwwroot\cgi-bin\ and only then executing it. While the file C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe initially appears to be in the cgi directory, it exploits the .. path specifier to traverse back up the directory hierarchy in an attempt to execute a file outside of cgi-bin. Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to its simplest representation, C:\Windows\System32\cmd.exe, and is called a directory traversal vulnerability. With the path canonicalized, it is clear the file should not be executed.
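A minimal sketch of such a check in Python, assuming a Windows host so that the backslash paths from the example are interpreted as shown; the helper name is_allowed and the program name search.exe are hypothetical:

import os.path

ALLOWED_DIR = r"C:\inetpub\wwwroot\cgi-bin"   # permitted directory from the example above

def is_allowed(requested_path):
    # Canonicalize first, then compare against the permitted prefix.
    canonical = os.path.realpath(requested_path)
    return canonical.startswith(ALLOWED_DIR + os.sep)

print(is_allowed(r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"))  # False
print(is_allowed(r"C:\inetpub\wwwroot\cgi-bin\search.exe"))                         # True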
In Unicode, many accented letters can be represented in more than one way. For example, é can be represented in Unicode as the Unicode character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.
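A brief illustration using Python's unicodedata module, which implements Unicode normalization:

import unicodedata

decomposed = "e\u0301"    # U+0065 followed by U+0301 (COMBINING ACUTE ACCENT)
precomposed = "\u00e9"    # U+00E9 (LATIN SMALL LETTER E WITH ACUTE)

print(decomposed == precomposed)                                # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: equal after canonicalization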
Variable-width encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations. Namely, by the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, [1] but some byte sequences are invalid, i.e., they cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations may accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If one uses such a decoder, some Unicode characters effectively have more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply some filter (e.g., a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.
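As an illustration, Python's strict UTF-8 decoder is one example of the rejection approach; the byte pair 0xC0 0xAF is the classic overlong (and therefore invalid) encoding of "/":

# A lenient decoder that accepted this overlong sequence would yield "/",
# potentially bypassing filters that only look for the valid encoding of "/".
overlong_slash = b"\xc0\xaf"

try:
    overlong_slash.decode("utf-8")      # strict decoding rejects the invalid sequence
except UnicodeDecodeError as error:
    print("rejected:", error)

print("/".encode("utf-8"))              # b'/': the single valid UTF-8 encoding of U+002F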
A canonical URL is the URL chosen as the single source of truth for content that is duplicated across several URLs.
A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site. For example, if you have URLs for the same page, such as https://example.com/?dress=1234 and https://example.com/dresses/1234, Google chooses one as canonical. Note that the pages do not need to be absolutely identical; minor changes in sorting or filtering of list pages do not make the page unique (for example, sorting by price or filtering by item color).
The canonical URL can be in a different domain than a duplicate. [2]
With the help of canonical URLs, a search engine knows which link should be provided in a query result.
A canonical link element can be used to declare the canonical URL.
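For example, a duplicate page can point to the canonical URL from the example above with the following element in its head section:

<link rel="canonical" href="https://example.com/dresses/1234" />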
In intranets, where information is often found by manual searching, canonical URLs can also be defined in a non-machine-readable form, for example in a guideline.
Canonical URLs are usually the URLs that are used for the share action.
Since the canonical URL is used in search engine results, it is in most cases a landing page.
In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results. [3] Most search engines support the canonical link element as a hint to which URL should be treated as the true version. As indicated by John Mueller of Google, having other directives in a page, like the robots noindex element, can give search engines conflicting signals about how to handle canonicalization. [4]
Example:
http://wikipedia.com
http://www.wikipedia.com
http://www.wikipedia.com/
http://www.wikipedia.com/?source=asdf
All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.
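The general idea can be sketched in Python; the specific rules below (lower-casing the host, forcing the www form, adding a trailing slash to an empty path, and dropping a source tracking parameter) are illustrative assumptions, not the heuristics any particular search engine actually uses:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"source"}   # parameters assumed not to change the page content

def canonicalize_url(url):
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host                   # pick one host form as canonical
    path = parts.path or "/"                   # treat the empty path and "/" as the same page
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, host, path, urlencode(query), ""))

for url in ("http://wikipedia.com",
            "http://www.wikipedia.com",
            "http://www.wikipedia.com/",
            "http://www.wikipedia.com/?source=asdf"):
    print(canonicalize_url(url))   # all four print http://www.wikipedia.com/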
A Canonical XML document is by definition an XML document that is in XML canonical form, defined by the Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.
A simple example would be the following two snippets of XML:
<node1 x='1' a="2">Data</node1 > <node2>Data</node2>
<node1 a="2" x="1">Data</node1> <node2>Data</node2>
The first example contains extra spaces in the closing tag of the first node, uses single quotes for one attribute value, and lists the attributes in an arbitrary order. The second example, which has been canonicalized, has had the spaces removed, the attribute value delimiters normalized to double quotes, and the attributes sorted into lexicographic order. Note that only the spaces within the tags are removed under W3C canonicalization, not those between tags.
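A minimal sketch of this transformation using Python's xml.etree.ElementTree.canonicalize (available since Python 3.8, implementing the C14N 2.0 specification); the two snippets are wrapped in a single root element here so that they form a well-formed document:

from xml.etree.ElementTree import canonicalize

raw = """<root><node1 x='1' a="2">Data</node1 ><node2>Data</node2></root>"""
print(canonicalize(raw))
# <root><node1 a="2" x="1">Data</node1><node2>Data</node2></root>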
A full summary of canonicalization changes is listed below:
The document is encoded in UTF-8.
Line breaks are normalized to #xA.
Attribute values are normalized, as if by a validating processor.
Character and parsed entity references are replaced.
CDATA sections are replaced with their character content.
The XML declaration and document type declaration are removed.
Empty elements are converted to start-end tag pairs.
Whitespace outside of the document element and within start and end tags is normalized.
All whitespace in character content is retained (excluding characters removed during line-feed normalization).
Attribute value delimiters are set to quotation marks (double quotes).
Special characters in attribute values and character content are replaced by character references.
Superfluous namespace declarations are removed from each element.
Default attributes are added to each element.
Fixup of xml:base attributes is performed.
Lexicographic order is imposed on the namespace declarations and attributes of each element.
In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran, and running are forms of the same lexeme, so one of them, for example run, can be selected to represent all the forms. Lexical databases such as Unitex use this kind of representation.
Lemmatisation is the process of converting a word to its canonical form.
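A toy sketch in Python; the lookup table below covers only the example inflections of run, whereas a real lemmatiser would consult a full lexical database and morphological rules:

# Hypothetical lookup table for illustration only.
LEMMAS = {"run": "run", "runs": "run", "ran": "run", "running": "run"}

def lemmatise(word):
    return LEMMAS.get(word.lower(), word)   # fall back to the word itself if unknown

print([lemmatise(w) for w in ["Running", "ran", "runs"]])   # ['run', 'run', 'run']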