Canonicalization

In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.


Usage cases


Filenames

Files in file systems may in most cases be accessed through multiple filenames. For instance, in Unix-like systems, the string "/./" can be replaced by "/". The POSIX function realpath(), declared in <stdlib.h>, performs this task; it is not part of the ISO C standard library. Other operations performed by this function to canonicalize filenames are the handling of /.. components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of symbolic links.

Canonicalization of filenames is important for computer security. For example, a web server may have a restriction that only files under the cgi directory C:\inetpub\wwwroot\cgi-bin may be executed. This rule is enforced by checking that the path starts with C:\inetpub\wwwroot\cgi-bin\ and only then executing it. While the file C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe initially appears to be in the cgi directory, it exploits the .. path specifier to traverse back up the directory hierarchy in an attempt to execute a file outside of cgi-bin. Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to its simplest representation, C:\Windows\System32\cmd.exe; this kind of error is called a directory traversal vulnerability. With the path canonicalized, it is clear that the file should not be executed.
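
The check described above can be sketched in Python. This is a minimal illustration rather than a hardened implementation: the function name is_under_root and the constant CGI_ROOT are hypothetical, and ntpath.normpath canonicalizes purely lexically (it collapses "." and ".." components but does not resolve symbolic links, which a function such as realpath() would also do).

```python
import ntpath  # Windows path rules; importable on any platform

# Directory taken from the example in the text.
CGI_ROOT = r"C:\inetpub\wwwroot\cgi-bin"


def is_under_root(requested: str, root: str = CGI_ROOT) -> bool:
    """Return True only if the canonical form of `requested` lies under `root`."""
    # Canonicalize first: collapse "." and ".." and redundant separators.
    canonical = ntpath.normpath(requested)
    root = ntpath.normpath(root)
    try:
        common = ntpath.commonpath([root, canonical])
    except ValueError:
        # Raised for different drives or a mix of absolute and relative paths.
        return False
    # Windows paths are case-insensitive, so compare case-folded forms.
    return ntpath.normcase(common) == ntpath.normcase(root)


traversal = r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"
print(ntpath.normpath(traversal))
print(is_under_root(traversal))
print(is_under_root(r"C:\inetpub\wwwroot\cgi-bin\hello.cgi"))
```

Comparing whole path components with commonpath, rather than testing a string prefix, also avoids accepting a sibling directory whose name merely begins with "cgi-bin".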


Unicode

In Unicode, many accented letters can be represented in more than one way. For example, é can be represented in Unicode as the Unicode character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.
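
As a brief illustration of the two representations of é, the following Python sketch uses the standard unicodedata module. The helper name canonically_equal is hypothetical, and the choice of NFC as the common form is an assumption; normalizing both sides to NFD would work equally well for comparison.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301
precomposed = "\u00e9"   # U+00E9


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == precomposed)                   # raw code point comparison
print(canonically_equal(decomposed, precomposed))  # comparison after normalization
```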

Variable-width encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations. Namely, by the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, but some byte sequences are invalid, i.e. cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations may accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If one uses such a decoder, some Unicode characters have effectively more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply some filter (e.g. a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.
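
The rejection alternative can be sketched with Python's built-in strict UTF-8 decoder. The helper name decode_utf8_or_reject is hypothetical, and the sample bytes are an assumption chosen for illustration: 0xC0 0xAF is an invalid ("overlong") two-byte sequence that a lenient decoder might map to "/", whose only valid UTF-8 encoding is the single byte 0x2F.

```python
from typing import Optional

# Invalid two-byte sequence that a lenient decoder might treat as "/".
OVERLONG_SLASH = b"\xc0\xaf"


def decode_utf8_or_reject(raw: bytes) -> Optional[str]:
    """Decode strictly; return None instead of guessing at invalid input."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


print(decode_utf8_or_reject(b"/"))             # valid encoding of "/"
print(decode_utf8_or_reject(OVERLONG_SLASH))   # rejected by the strict decoder
```

Running any filter on the decoded string, after strict decoding, means the filter and the later consumer see the same characters.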


URL

A canonical URL is a URL for defining the single source of truth for duplicate content.


Use by Google

A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on a site. For example, if several URLs lead to the same page (such as https://example.com/?dress=1234 and https://example.com/dresses/1234), Google chooses one as canonical. The pages do not need to be absolutely identical; minor changes in sorting or filtering of list pages (for example, sorting by price or filtering by item color) do not make a page unique. The canonical can be in a different domain than a duplicate.


Internet

With the help of canonical URLs, a search engine knows which link should be provided in a query result. A canonical link element can be used to define a canonical URL.
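
A canonical link element is a link element in the HTML head whose rel attribute is "canonical" and whose href attribute gives the preferred URL. The sketch below shows how a crawler might read it using Python's standard html.parser module. The class and function names and the sample page are hypothetical, and real search engines apply considerably more logic than this.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens; match case-insensitively.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical_url(html: str) -> Optional[str]:
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page served at https://example.com/?dress=1234
PAGE = """<html><head>
<title>Dress 1234</title>
<link rel="canonical" href="https://example.com/dresses/1234">
</head><body></body></html>"""

print(find_canonical_url(PAGE))
```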


Intranet

In intranets, manual searching for information is predominant. In this case, canonical URLs can also be defined in a non-machine-readable form, for example in a guideline.


Misc

Canonical URLs are usually the URLs that are used for the share action. Since the canonical URL is the one used in search engine results, it is in most cases a landing page.


Search engines and SEO

In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results. Most search engines support the canonical link element as a hint to which URL should be treated as the true version. As indicated by John Mueller of Google, having other directives in a page, such as the robots noindex element, can give search engines conflicting signals about how to handle canonicalization.

Example:

* http://wikipedia.com
* http://www.wikipedia.com
* http://www.wikipedia.com/
* http://www.wikipedia.com/?source=asdf

All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.
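
How the example URLs might be reduced to a single form can be sketched with Python's urllib.parse. The rules here are assumptions made for illustration, not how any particular search engine works: the site is assumed to prefer the "www" host, the "source" parameter is assumed not to affect page content, and URLs are assumed to carry no port or credentials.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PREFERRED_HOST = "www.wikipedia.com"   # assumption: the site prefers this host
HOST_ALIASES = {"wikipedia.com"}       # assumption: serves the same content
IGNORED_PARAMS = {"source"}            # assumption: tracking only


def canonical_url(url: str) -> str:
    """Map equivalent URLs to one form (ports and credentials not handled)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    path = parts.path or "/"           # an empty path and "/" are equivalent
    # Drop ignored parameters and sort the rest for a stable order.
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS
    )
    # The fragment is dropped, since it is not sent to the server.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(params), ""))


urls = [
    "http://wikipedia.com",
    "http://www.wikipedia.com",
    "http://www.wikipedia.com/",
    "http://www.wikipedia.com/?source=asdf",
]
print({canonical_url(u) for u in urls})
```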


XML

A Canonical XML document is by definition an XML document that is in XML Canonical form, defined by the Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.

A simple example would be the following two snippets of XML:

1. <node1>Data</node1 > <node2>Data</node2>
2. <node1>Data</node1> <node2>Data</node2>

The first example contains extra spaces in the closing tag of the first node. The second example, which has been canonicalized, has had these spaces removed. Note that only the spaces within the tags are removed under W3C canonicalization, not those between tags.

A full summary of canonicalization changes is listed below:

* The document is encoded in UTF-8
* Line breaks are normalized to #xA on input, before parsing
* Attribute values are normalized, as if by a validating processor
* Character and parsed entity references are replaced
* CDATA sections are replaced with their character content
* The XML declaration and document type declaration are removed
* Empty elements are converted to start-end tag pairs
* Whitespace outside of the document element and within start and end tags is normalized
* All whitespace in character content is retained (excluding characters removed during line feed normalization)
* Attribute value delimiters are set to quotation marks (double quotes)
* Special characters in attribute values and character content are replaced by character references
* Superfluous namespace declarations are removed from each element
* Default attributes are added to each element
* Fixup of xml:base attributes is performed
* Lexicographic order is imposed on the namespace declarations and attributes of each element
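
Several of these changes can be observed with the canonicalize() function in Python's standard xml.etree.ElementTree module (Python 3.8 or later). Note the assumption involved: that function implements Canonical XML 2.0 (C14N 2.0), not version 1.0, so details may differ from the summary above. The sample documents and the wrapper name canonical are hypothetical.

```python
from xml.etree.ElementTree import canonicalize


def canonical(xml_text: str) -> str:
    """Return the C14N 2.0 form of an XML document given as a string."""
    return canonicalize(xml_text)


# Two spellings of the same document: an XML declaration, a space inside a
# closing tag, single-quoted and unsorted attributes, and an empty-element tag.
doc_a = "<?xml version=\"1.0\"?><doc><node1>Data</node1 > <node2 b='2' a='1'/></doc>"
doc_b = '<doc><node1>Data</node1> <node2 a="1" b="2"></node2></doc>'

print(canonical(doc_a))
print(canonical(doc_a) == canonical(doc_b))
```

Comparing the canonical strings, rather than the raw text, is what allows the two documents to be treated as equivalent.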


Computational linguistics

In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran, and running are forms of the same lexeme, so we can select one of them, e.g. run, to represent all the forms. Lexical databases such as Unitex use this kind of representation. Lemmatisation is the process of converting a word to its canonical form.
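
As a toy illustration of lemmatisation, the sketch below maps the forms from the example to their lemma with a lookup table. The table LEMMAS and the function lemmatize are hypothetical, and real lemmatisers rely on full lexical databases and morphological analysis rather than a hand-written dictionary.

```python
# Hypothetical lookup table covering only the example from the text.
LEMMAS = {
    "run": "run",
    "runs": "run",
    "ran": "run",
    "running": "run",
}


def lemmatize(word: str) -> str:
    """Return the lemma of a known form; unknown words are returned lower-cased."""
    key = word.lower()
    return LEMMAS.get(key, key)


print({lemmatize(w) for w in ["run", "runs", "ran", "Running"]})
```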


See also

* Canonical form
* Graph canonization
* Lemmatisation
* Text normalization
* Type species


External links


* Canonical XML Version 1.0, W3C Recommendation
* OWASP Security Reference for Canonicalization