URL normalization

URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.
Search engines employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached. Web servers may also perform normalization for many reasons (e.g. to more easily intercept security risks coming from client requests, or to use only one absolute file name for each resource stored in their caches or named in log files).


Normalization process

There are several types of normalization that may be performed. Some of them are always semantics preserving and some may not be.


Normalizations that preserve semantics

The following normalizations are described in RFC 3986 to result in equivalent URIs:
* Converting percent-encoded triplets to uppercase. The hexadecimal digits within a percent-encoding triplet of the URI (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F. Example:
:http://example.com/foo%2a → http://example.com/foo%2A
* Converting the scheme and host to lowercase. The scheme and host components of the URI are case-insensitive and therefore should be normalized to lowercase. Example:
:HTTP://User@Example.COM/Foo → http://User@example.com/Foo
* Decoding percent-encoded triplets of unreserved characters. Percent-encoded triplets of the URI in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) do not require percent-encoding and should be decoded to their corresponding unreserved characters. Example:
:http://example.com/%7Efoo → http://example.com/~foo
* Removing dot-segments. The dot-segments "." and ".." in the path component of the URI should be removed by applying the remove_dot_segments algorithm described in RFC 3986 to the path. Example:
:http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux
* Converting an empty path to a "/" path. In the presence of an authority component, an empty path component should be normalized to a path component of "/". Example:
:http://example.com → http://example.com/
* Removing the default port. An empty or default port component of the URI (port 80 for the http scheme), together with its ":" delimiter, should be removed. Example:
:http://example.com:80/ → http://example.com/
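The six rules above can be combined into one function. The following is a minimal sketch rather than a reference implementation: the names `normalize`, `normalize_percent_encoding` and `DEFAULT_PORTS` are made up for this example, the port table covers only http and https (the https value 443 is added here and is not taken from the list above), and percent-encoding inside the host is left untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986: ALPHA, DIGIT, "-", ".", "_", "~".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumption: only these two schemes are handled in this sketch.
DEFAULT_PORTS = {"http": "80", "https": "443"}
TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_triplet(match):
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char                       # decode unreserved characters
    return "%" + match.group(1).upper()   # otherwise uppercase the hex digits


def normalize_percent_encoding(text):
    return TRIPLET.sub(_fix_triplet, text)


def remove_dot_segments(path):
    """Follows the step-by-step algorithm of RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def normalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Split the authority by hand so that the userinfo keeps its case.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if not colon or "]" in port:          # no port, or colon inside an IPv6 literal
        host, port = hostport, ""
    if port == DEFAULT_PORTS.get(scheme):
        port = ""                         # drop the default port
    netloc = userinfo + at + host.lower() + (":" + port if port else "")
    path = remove_dot_segments(normalize_percent_encoding(parts.path))
    if netloc and not path:
        path = "/"                        # empty path with an authority
    query = normalize_percent_encoding(parts.query)
    fragment = normalize_percent_encoding(parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))
```

For instance, `normalize("HTTP://User@Example.COM/Foo")` gives `http://User@example.com/Foo`, matching the example in the list.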


Normalizations that usually preserve semantics

For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but are not guaranteed to by the standards:
* Adding a trailing "/" to a non-empty path. Directories (folders) are indicated with a trailing slash and should be included in URIs. Example:
:http://example.com/foo → http://example.com/foo/
:However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.
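One way to act on the redirect hint is to add the slash only when such a redirect has actually been seen. The sketch below assumes a hypothetical `observed_redirects` mapping (source URI to redirect target) collected elsewhere, for example during a crawl; the function names are illustrative only.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(uri):
    """Return uri with "/" appended to its path, unless already present."""
    parts = urlsplit(uri)
    if parts.path.endswith("/"):
        return uri
    return urlunsplit(parts._replace(path=parts.path + "/"))


def normalize_directory(uri, observed_redirects):
    """Use the slash form only if a redirect to it was observed."""
    candidate = with_trailing_slash(uri)
    if observed_redirects.get(uri) == candidate:
        return candidate
    return uri  # no evidence that the path names a directory
```

Without an observed redirect the URI is returned unchanged, which keeps the normalization on the safe side.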


Normalizations that change semantics

Applying the following normalizations results in a semantically different URI, although it may refer to the same resource:
* Removing directory index. Default directory indexes are generally not needed in URIs. Examples:
:http://example.com/default.asp → http://example.com/
:http://example.com/a/index.html → http://example.com/a/
* Removing the fragment. The fragment component of a URI is never seen by the server and can sometimes be removed. Example:
:http://example.com/bar.html#section1 → http://example.com/bar.html
:However, AJAX applications frequently use the value in the fragment.
* Replacing IP with domain name. Check if the IP address maps to a domain name. Example:
:http://208.77.188.166/ → http://example.com/
:The reverse replacement is rarely safe due to virtual web servers.
* Limiting protocols. Different application layer protocols can be limited to one. For example, the “https” scheme could be replaced with “http”. Example:
:https://example.com/ → http://example.com/
* Removing duplicate slashes. Paths which include two adjacent slashes could be converted to one. Example:
:http://example.com/foo//bar.html → http://example.com/foo/bar.html
* Removing or adding “www” as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a naked domain. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example:
:http://www.example.com/ → http://example.com/
* Sorting the query parameters. Some web pages use more than one query parameter in the URI. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URI. Example:
:http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en
:However, the order of parameters in a URI may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.
* Removing unused query variables. A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:
:http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123
:Note that a parameter without a value is not necessarily an unused parameter.
* Removing default query parameters. A default value in the query string may render identically whether it is there or not. Example:
:http://example.com/display?id=&sort=ascending → http://example.com/display
* Removing the "?" when the query is empty. When the query is empty, there may be no need for the "?". Example:
:http://example.com/display? → http://example.com/display
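Several of the purely syntactic rules in this list can be sketched as small independent functions, so that a normalizer can enable only those that are safe for a given site. All function names below are hypothetical, and `INDEX_NAMES` is an assumed list built from the two examples above. A real server may use other index file names.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: index file names taken from the examples in the text.
INDEX_NAMES = {"index.html", "default.asp"}


def remove_directory_index(uri):
    parts = urlsplit(uri)
    head, slash, last = parts.path.rpartition("/")
    if last in INDEX_NAMES:
        parts = parts._replace(path=head + slash)
    return urlunsplit(parts)


def remove_fragment(uri):
    return urlunsplit(urlsplit(uri)._replace(fragment=""))


def collapse_slashes(uri):
    parts = urlsplit(uri)
    path = re.sub(r"/{2,}", "/", parts.path)  # only the path is touched
    return urlunsplit(parts._replace(path=path))


def sort_query(uri):
    """Sort parameters by name; repeated names keep their relative order."""
    parts = urlsplit(uri)
    items = parts.query.split("&")
    items.sort(key=lambda item: item.split("=", 1)[0])  # stable sort
    return urlunsplit(parts._replace(query="&".join(items)))


def keep_query_params(uri, allowed):
    """Drop every parameter whose name is not in the allowed set."""
    parts = urlsplit(uri)
    kept = [item for item in parts.query.split("&")
            if item.split("=", 1)[0] in allowed]
    return urlunsplit(parts._replace(query="&".join(kept)))
```

The query helpers work on the raw `name=value` strings instead of decoding and re-encoding them, so the percent-encoding of the values is left exactly as it was. Which parameters belong in `allowed` is site-specific knowledge that the caller has to supply.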


Normalization based on URI lists

Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI
:http://example.com/story?id=xyz
appears in a crawl log several times along with
:http://example.com/story_xyz
we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.
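Once a rule of this kind is known, applying it to a URI list is straightforward. The sketch below is not the DustBuster algorithm, which discovers such rules. It only applies hand-written substring substitution rules, and the function name and rule format are assumptions made for this example.

```python
def apply_rules(uris, rules):
    """Rewrite each URI with (old, new) substring rules and drop duplicates.

    The order of first appearance is preserved.
    """
    seen = {}
    for uri in uris:
        for old, new in rules:
            uri = uri.replace(old, new, 1)  # rewrite the first match only
        seen.setdefault(uri, None)
    return list(seen)


crawl_log = [
    "http://example.com/story?id=xyz",
    "http://example.com/story_xyz",
    "http://example.com/story?id=abc",
]
# Hypothetical rule derived from the example in the text.
rules = [("/story?id=", "/story_")]
unique_uris = apply_rules(crawl_log, rules)
```

Here the first two log entries collapse into the single form `http://example.com/story_xyz`, so only two URIs remain to be crawled.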


See also

* Uniform Resource Locator
* Fragment identifier
* Web crawler


References

* RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
* Uri Schonfeld, Ziv Bar-Yossef & Idit Keidar (2007). "Do not crawl in the dust: different URLs with similar text". Proceedings of the 16th international conference on World Wide Web. pp. 111–120. http://www2007.org/paper194.php