Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. The idea behind hubs and authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not themselves authoritative in the information they held, but were compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, while a good authority is a page that is linked to by many different hubs. The scheme therefore assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.


History


In journals

Many methods have been used to rank the importance of scientific journals. One such method is Garfield's impact factor. Journals such as ''Science'' and ''Nature'' are filled with numerous citations, giving these journals very high impact factors. Thus, when comparing two more obscure journals that have received roughly the same number of citations, but one of which has received many citations from ''Science'' and ''Nature'', that journal should be ranked higher. In other words, it is better to receive citations from an important journal than from an unimportant one.


On the Web

This phenomenon also occurs on the Internet. Counting the number of links to a page can give a general estimate of its prominence on the Web, but a page with very few incoming links may also be prominent, if some of those links come from the home pages of sites like Yahoo!, Google, or MSN. Because these sites are of very high importance but are also search engines, a page can be ranked much higher than its actual relevance.


Algorithm

In the HITS algorithm, the first step is to retrieve the most relevant pages for the search query. This set is called the ''root set'' and can be obtained by taking the top pages returned by a text-based search algorithm. A ''base set'' is generated by augmenting the root set with all the web pages that are linked from it and some of the pages that link to it. The web pages in the base set, together with all hyperlinks among those pages, form a focused subgraph. The HITS computation is performed only on this ''focused subgraph''. According to Kleinberg, the reason for constructing a base set is to ensure that most (or many) of the strongest authorities are included.

Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values of the pages that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked pages.

The algorithm performs a series of iterations, each consisting of two basic steps:

* Authority update: update each node's ''authority score'' to be equal to the sum of the ''hub scores'' of the nodes that point to it. That is, a node is given a high authority score by being linked from pages that are recognized as hubs for information.
* Hub update: update each node's ''hub score'' to be equal to the sum of the ''authority scores'' of the nodes that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.

The hub score and authority score for each node are calculated with the following procedure:

* Start with each node having a hub score and an authority score of 1.
* Run the authority update rule.
* Run the hub update rule.
* Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and dividing each authority score by the square root of the sum of the squares of all authority scores.
* Repeat from the second step as necessary.

HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web. However, it has some major differences:

* It is query-dependent: the hub and authority scores resulting from the link analysis are influenced by the search terms.
* As a corollary, it is executed at query time, not at indexing time, with the associated performance hit of query-time processing.
* It is not commonly used by search engines (though a similar algorithm was said to be used by Teoma, which was acquired by Ask Jeeves/Ask.com).
* It computes two scores per document, hub and authority, as opposed to a single score.
* It is processed on a small subset of ''relevant'' documents (a focused subgraph or base set), not all documents, as is the case with PageRank.


In detail

To begin the ranking, we let \mathrm{auth}(p) = 1 and \mathrm{hub}(p) = 1 for each page p . We consider two types of updates: the Authority Update Rule and the Hub Update Rule. To calculate the hub and authority scores of each node, repeated iterations of the Authority Update Rule and the Hub Update Rule are applied. A k-step application of the Hub-Authority algorithm entails applying, k times in a row, first the Authority Update Rule and then the Hub Update Rule.


Authority update rule

For each p , we update \mathrm{auth}(p) to \mathrm{auth}(p)=\displaystyle \sum\nolimits_{q \in P_{to}} \mathrm{hub}(q) , where P_{to} is the set of all pages which link to page p. That is, a page's authority score is the sum of the hub scores of all the pages that point to it.


Hub update rule

For each p , we update \mathrm{hub}(p) to \mathrm{hub}(p)=\displaystyle \sum\nolimits_{q \in P_{from}} \mathrm{auth}(q) , where P_{from} is the set of all pages which page p links to. That is, a page's hub score is the sum of the authority scores of all the pages it points to.


Normalization

The final hub and authority scores of the nodes are determined in the limit of infinitely many repetitions of the algorithm. Directly and iteratively applying the Hub Update Rule and Authority Update Rule leads to diverging values, so it is necessary to normalize the scores after every iteration. The values obtained from this process then converge.
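The normalized iteration can also be viewed in matrix form: with A the adjacency matrix of the focused subgraph (A[i][j] = 1 when page i links to page j), each full iteration multiplies the authority vector by A^T A and the hub vector by A A^T, so after normalization the authority scores converge to the principal eigenvector of A^T A and the hub scores to that of A A^T. A small NumPy sketch, on a made-up 3-node graph, illustrates this:

```python
import numpy as np

# Adjacency matrix of a tiny made-up graph: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

auth = np.ones(3)
hub = np.ones(3)
for _ in range(100):
    auth = A.T @ hub               # authority update: sum hub scores of in-neighbors
    auth /= np.linalg.norm(auth)   # normalize to unit Euclidean length
    hub = A @ auth                 # hub update: sum authority scores of out-neighbors
    hub /= np.linalg.norm(hub)

# Compare with the principal eigenvector of A^T A.
w, v = np.linalg.eigh(A.T @ A)
principal = np.abs(v[:, np.argmax(w)])
print(np.allclose(auth, principal, atol=1e-6))
```

This is ordinary power iteration, which is why normalizing after each step makes the scores converge while the unnormalized scores grow without bound.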


Pseudocode

 ''G'' := set of pages
 for each page ''p'' in ''G'' do
     ''p''.auth = 1 // ''p''.auth is the authority score of the page ''p''
     ''p''.hub = 1  // ''p''.hub is the hub score of the page ''p''
 for step from 1 to k do // run the algorithm for k steps
     norm = 0
     for each page ''p'' in ''G'' do // update all authority values first
         ''p''.auth = 0
         for each page ''q'' in ''p.incomingNeighbors'' do // ''p.incomingNeighbors'' is the set of pages that link to ''p''
             ''p''.auth += ''q''.hub
         norm += square(''p''.auth) // calculate the sum of the squared auth values to normalise
     norm = sqrt(norm)
     for each page ''p'' in ''G'' do // update the auth scores
         ''p''.auth = ''p''.auth / norm // normalise the auth values
     norm = 0
     for each page ''p'' in ''G'' do // then update all hub values
         ''p''.hub = 0
         for each page ''r'' in ''p.outgoingNeighbors'' do // ''p.outgoingNeighbors'' is the set of pages that ''p'' links to
             ''p''.hub += ''r''.auth
         norm += square(''p''.hub) // calculate the sum of the squared hub values to normalise
     norm = sqrt(norm)
     for each page ''p'' in ''G'' do // then update all hub values
         ''p''.hub = ''p''.hub / norm // normalise the hub values

The hub and authority values converge in the pseudocode above. The code below does not converge: without normalization the values grow without bound, so the algorithm can only be run for a limited number of steps. One way to get around this is to normalize the hub and authority values after each step, by dividing each authority value by the square root of the sum of the squares of all authority values, and dividing each hub value by the square root of the sum of the squares of all hub values. This is what the pseudocode above does.


Non-converging pseudocode

 ''G'' := set of pages
 for each page ''p'' in ''G'' do
     ''p''.auth = 1 // ''p''.auth is the authority score of the page ''p''
     ''p''.hub = 1  // ''p''.hub is the hub score of the page ''p''

 function HubsAndAuthorities(''G'')
     for step from 1 to k do // run the algorithm for k steps
         for each page ''p'' in ''G'' do // update all authority values first
             ''p''.auth = 0
             for each page ''q'' in ''p.incomingNeighbors'' do // ''p.incomingNeighbors'' is the set of pages that link to ''p''
                 ''p''.auth += ''q''.hub
         for each page ''p'' in ''G'' do // then update all hub values
             ''p''.hub = 0
             for each page ''r'' in ''p.outgoingNeighbors'' do // ''p.outgoingNeighbors'' is the set of pages that ''p'' links to
                 ''p''.hub += ''r''.auth
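The divergence of the unnormalized updates is easy to see numerically. The graph and iteration count below are made up purely for illustration:

```python
# Tiny made-up graph as adjacency lists of outgoing links.
out = {0: [1], 1: [0, 2], 2: [0]}
# Derive incoming neighbors from the outgoing lists.
incoming = {p: [q for q in out if p in out[q]] for p in out}

auth = {p: 1.0 for p in out}
hub = {p: 1.0 for p in out}
for _ in range(20):  # run the unnormalized updates for 20 steps
    auth = {p: sum(hub[q] for q in incoming[p]) for p in out}
    hub = {p: sum(auth[q] for q in out[p]) for p in out}

print(max(auth.values()))  # very large: the raw scores diverge without normalization
```

After only 20 steps the raw scores are already in the millions; the relative ordering of pages stabilizes, but the absolute values blow up, which is why the converging version normalizes after every step.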


See also

* PageRank




External links

* U.S. Patent 6,112,202
* Create a data search engine from a relational database
* Search engine in C# based on HITS