HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, canonicalization (sometimes standardization or
normalization Normalization or normalisation refers to a process that makes something more normal or regular. Most commonly it refers to: * Normalization (sociology) or social normalization, the process through which ideas and behaviors that may fall outside of ...
) is a process for converting
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
that has more than one possible representation into a "standard", "normal", or
canonical form In mathematics and computer science, a canonical, normal, or standard form of a mathematical object is a standard way of presenting that object as a mathematical expression. Often, it is one which provides the simplest representation of an obje ...
. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
s by eliminating repeated calculations, or to make it possible to impose a meaningful
sorting Sorting refers to ordering data in an increasing or decreasing manner according to some linear relationship among the data items. # ordering: arranging items in a sequence ordered by some criterion; # categorizing: grouping items with similar pro ...
order.


Usage cases


Filenames

Files in
file system In computing, file system or filesystem (often abbreviated to fs) is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one larg ...
s may in most cases be accessed through multiple
filename A filename or file name is a name used to uniquely identify a computer file in a directory structure. Different file systems impose different restrictions on filename lengths. A filename may (depending on the file system) include: * name &ndas ...
s. For instance in
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and ot ...
-like systems, the string "/./" can be replaced by "/". In the
C standard library The C standard library or libc is the standard library for the C programming language, as specified in the ISO C standard.ISO/IEC (2018). '' ISO/IEC 9899:2018(E): Programming Languages - C §7'' Starting from the original ANSI C standard, it wa ...
, the function realpath() performs this task. Other operations performed by this function to canonicalize filenames are the handling of /.. components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of
symbolic link In computing, a symbolic link (also symlink or soft link) is a file whose purpose is to point to a file or directory (called the "target") by specifying a path thereto. Symbolic links are supported by POSIX and by most Unix-like operating system ...
s. Canonicalization of filenames is important for computer security. For example, a web server may have a restriction that only files under the cgi directory C:\inetpub\wwwroot\cgi-bin may be executed. This rule is enforced by checking that the path starts with C:\inetpub\wwwroot\cgi-bin\ and only then executing it. While the file C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe initially appears to be in the cgi directory, it exploits the .. path specifier to traverse back up the directory hierarchy in an attempt to execute a file outside of cgi-bin. Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to the simplest representation, C:\Windows\System32\cmd.exe, and is called a directory traversal vulnerability. With the path canonicalized, it is clear the file should not be executed.


Unicode

In
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
, many accented letters can be represented in more than one way. For example, ''é'' can be represented in Unicode as the Unicode character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of
canonical equivalence Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting st ...
. In this context, canonicalization is
Unicode normalization Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting s ...
.
Variable-width encoding A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in a computer. Most common variable-width encodings are ...
s in the Unicode standard, in particular
UTF-8 UTF-8 is a variable-width encoding, variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit'' ...
, may cause an additional need for canonicalization in some situations. Namely, by the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, but some byte sequences are invalid, i. e. cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations may accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If one uses such a decoder, some Unicode characters have effectively more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply some filter (e. g. a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.


URL

A canonical URL is a URL for defining the
single source of truth In information science and information technology, single source of truth (SSOT) architecture, or single point of truth (SPOT) architecture, for information systems is the practice of structuring information models and associated data schemas su ...
for
duplicate content Duplicate content is a term used in the field of search engine optimization to describe content that appears on more than one web page. The duplicate content can be substantial parts of the content within or across domains and can be either exactly ...
.


Use by Google

A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site. For example, if you have URLs for the same page (for example https://example.com/?dress=1234 and https://example.com/dresses/1234), Google chooses one as canonical. Note that the pages do not need to be absolutely identical; minor changes in sorting or filtering of list pages do not make the page unique (for example, sorting by price or filtering by item color). The canonical can be in a different domain than a duplicate.


Internet

With the help of canonical URLs, a search engine knows which link should be provided in a query result. A
canonical link element A canonical link element is an HTML element that helps webmasters prevent duplicate content issues in search engine optimization by specifying the "canonical" or "preferred" version of a web page. It is described in RFC 6596, which went live in Apr ...
can get used to define a canonical URL.


Intranet

In
intranets An intranet is a computer network for sharing information, easier communication, collaboration tools, operational systems, and other computing services within an organization, usually to the exclusion of access by outsiders. The term is used in c ...
, manual searching for information is predominant. In this case, canonical URLs can be defined in a non-machine-readable form, too. For example in a
guideline A guideline is a statement by which to determine a course of action. A guideline aims to streamline particular processes according to a set routine or sound practice. Guidelines may be issued by and used by any organization (governmental or pri ...
.


Misc

Canonical URLs are usually the URLs that get used for the share action. Since the Canonical URL gets used in the search result of search engines, they are in most cases a
landing page In online marketing, a landing page, sometimes known as a "lead capture page", "single property page", "static page", "squeeze page" or a "destination page", is a single web page that appears in response to clicking on a search engine optimized se ...
.


Search engines and SEO

In web search and
search engine optimization Search engine optimization (SEO) is the process of improving the quality and quantity of Web traffic, website traffic to a website or a web page from web search engine, search engines. SEO targets unpaid traffic (known as "natural" or "Organ ...
(SEO),
URL canonicalization A canonical link element is an HTML element that helps webmasters prevent duplicate content issues in search engine optimization by specifying the "canonical" or "preferred" version of a web page. It is described in RFC 6596, which went live in Apr ...
deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines - specifically in determining which URL should be shown in search results. Most search engines support the
Canonical link element A canonical link element is an HTML element that helps webmasters prevent duplicate content issues in search engine optimization by specifying the "canonical" or "preferred" version of a web page. It is described in RFC 6596, which went live in Apr ...
as a hint to which URL should be treated as the true version. As indicated by John Mueller of Google, having other directives in a page, like the robots noindex element can give search engines conflicting signals about how to handle canonicalization Example: * http://wikipedia.com * http://www.wikipedia.com * http://www.wikipedia.com/ * http://www.wikipedia.com/?source=asdf All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.


XML

A
Canonical XML Canonical XML is a normal form of XML, intended to allow relatively simple comparison of pairs of XML documents for equivalence; for this purpose, the Canonical XML transformation removes non-meaningful differences between the documents. Any XML do ...
document is by definition an XML document that is in XML Canonical form, defined b
The Canonical XML specification
Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs. A simple example would be the following two snippets of XML: # Data Data # Data Data The first example contains extra spaces in the closing tag of the first node. The second example, which has been canonicalized, has had these spaces removed. Note that only the spaces within the tags are removed under W3C canonicalization, not those between tags. A full summary of canonicalization changes is listed below: * The document is encoded in UTF-8 * Line breaks normalized to #xA on input, before parsing * Attribute values are normalized, as if by a validating processor * Character and parsed entity references are replaced * CDATA sections are replaced with their character content * The XML declaration and document type declaration are removed * Empty elements are converted to start-end tag pairs * Whitespace outside of the document element and within start and end tags is normalized * All whitespace in character content is retained (excluding characters removed during line feed normalization) * Attribute value delimiters are set to quotation marks (double quotes) * Special characters in attribute values and character content are replaced by character references * Superfluous namespace declarations are removed from each element * Default attributes are added to each element * Fixup of xml:base attributes is performed * Lexicographic order is imposed on the namespace declarations and attributes of each element


Computational linguistics

In
morphology Morphology, from the Greek and meaning "study of shape", may refer to: Disciplines * Morphology (archaeology), study of the shapes or forms of artifacts * Morphology (astronomy), study of the shape of astronomical objects such as nebulae, galaxies ...
and
lexicography Lexicography is the study of lexicons, and is divided into two separate academic disciplines. It is the art of compiling dictionaries. * Practical lexicography is the art or craft of compiling, writing and editing dictionaries. * Theoretica ...
, a lemma is the ''canonical form'' of a set of
word A word is a basic element of language that carries an semantics, objective or pragmatics, practical semantics, meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of w ...
s. In
English English usually refers to: * English language * English people English may also refer to: Peoples, culture, and language * ''English'', an adjective for something of, from, or related to England ** English national ide ...
, for example, ''run'', ''runs'', ''ran'', and ''running'' are forms of the same
lexeme A lexeme () is a unit of lexical meaning that underlies a set of words that are related through inflection. It is a basic abstract unit of meaning, a unit of morphological analysis in linguistics that roughly corresponds to a set of forms taken ...
, so we can select one of them; ex. ''run'', to represent all the forms.
Lexical database In digital lexicography, natural language processing, and digital humanities, a lexical resource is a language resource consisting of data regarding the lexemes of the lexicon of one or more languages e.g., in the form of a database. Character ...
s such a
Unitex
use this kind of representation.
Lemmatisation Lemmatisation ( or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemma ...
is the process of converting a word to its ''canonical form''.


See also

*
Canonical form In mathematics and computer science, a canonical, normal, or standard form of a mathematical object is a standard way of presenting that object as a mathematical expression. Often, it is one which provides the simplest representation of an obje ...
*
Graph canonization In graph theory, a branch of mathematics, graph canonization is the problem of finding a canonical form of a given graph ''G''. A canonical form is a labeled graph Canon(''G'') that is isomorphic to ''G'', such that every graph that is isomorphic t ...
*
Lemmatisation Lemmatisation ( or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemma ...
*
Text normalization Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consisten ...
*
Type species In zoological nomenclature, a type species (''species typica'') is the species name with which the name of a genus or subgenus is considered to be permanently taxonomically associated, i.e., the species that contains the biological type specimen ...


References

{{reflist


External links


Canonical XML Version 1.0, W3C Recommendation

OWASP Security Reference for Canonicalization
Computing terminology