Ranking of query is one of the fundamental problems in

information retrieval Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...

(IR), the scientific/engineering discipline behind

search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...

s. Given a query and a collection of documents that match the query, the problem is to rank, that is, sort, the documents in according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and

recommender system A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information filtering system that provide suggestions for items that are most pertinent to a particular ...

s. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

History

The notion of page rank dates back to the 1940s and the idea originated in the field of economics. In 1941,

Wassily Leontief Wassily Wassilyevich Leontief (russian: Васи́лий Васи́льевич Лео́нтьев; August 5, 1905 – February 5, 1999), was a Soviet-American economist known for his research on input–output analysis and how changes in one e ...

developed an iterative method of valuing a country's sector based on the importance of other sectors that supplied resources to it. In 1965, Charles H Hubbell at the

University of California, Santa Barbara The University of California, Santa Barbara (UC Santa Barbara or UCSB) is a public land-grant research university in Santa Barbara, California with 23,196 undergraduates and 2,983 graduate students enrolled in 2021–2022. It is part of the U ...

, published a technique for determining the importance of individuals based on the importance of the people who endorse them. Gabriel Pinski and Francis Narin came up with an approach to rank journals. Their rule was that a journal is important if it is cited by other important journals.

Jon Kleinberg Jon Michael Kleinberg (born 1971) is an American computer scientist and the Tisch University Professor of Computer Science and Information Science at Cornell University known for his work in algorithms and networks. He is a recipient of the Nevanl ...

, a computer scientist at

Cornell University Cornell University is a private statutory land-grant research university based in Ithaca, New York. It is a member of the Ivy League. Founded in 1865 by Ezra Cornell and Andrew Dickson White, Cornell was founded with the intention to teac ...

, developed an almost identical approach to

PageRank PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. According ...

which was called Hypertext Induced Topic Search or HITS and it treated web pages as "hubs" and "authorities". Google’s PageRank algorithm was developed in 1998 by Google’s founders

Sergey Brin Sergey Mikhailovich Brin (russian: link=no, Сергей Михайлович Брин; born August 21, 1973) is an American business magnate, computer scientist, and internet entrepreneur, who co-founded Google with Larry Page. Brin was the ...

and

Larry Page Lawrence Edward Page (born March 26, 1973) is an American business magnate, computer scientist and internet entrepreneur. He is best known for co-founding Google with Sergey Brin. Page was the chief executive officer of Google from 1997 until ...

and it is a key part of Google’s method of ranking web pages in search results. All the above methods are somewhat similar as all of them exploit the structure of links and require an iterative approach.

Ranking models

Ranking functions are evaluated by a variety of means; one of the simplest is determining the

precision Precision, precise or precisely may refer to: Science, and technology, and mathematics Mathematics and computing (general) * Accuracy and precision, measurement deviation from true value and its scatter * Significant figures, the number of digit ...

of the first ''k'' top-ranked results for some fixed ''k''; for example, the proportion of the top 10 results that are relevant, on average over many queries. IR models can be broadly divided into three types: Boolean models or BIR, Vector Space Models, and

Probabilistic Models Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speakin ...

. Various comparisons between retrieval models can be found in the literature (e.g., ).

Boolean Models

Boolean Model or BIR is a simple baseline query model where each query follows the underlying principles of relational algebra with algebraic expressions and where documents are not fetched unless they completely match with each other. Since the query is either fetch the document (1) or doesn’t fetch the document (0), there is no methodology to rank them.

Vector Space Model

Since the Boolean Model only fetches complete matches, it doesn’t address the problem of the documents being partially matched. The

Vector Space Model Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing and r ...

solves this problem by introducing vectors of index items each assigned with weights. The weights are ranged from positive (if matched completely or to some extent) to negative (if unmatched or completely oppositely matched) if documents are present. Term Frequency - Inverse Document Frequency ( tf-idf) is one of the most popular techniques where weights are terms (e.g. words, keywords, phrases etc.) and dimensions is number of words inside corpus. The similarity score between query and document can be found by calculating cosine value between query weight vector and document weight vector using cosine similarity. Desired documents can be fetched by ranking them according to similarity score and fetched top k documents which has the highest scores or most relevant to query vector.

Probabilistic Model

In probabilistic model, probability theory has been used as a principal means for modeling the retrieval process in mathematical terms. The probability model of information retrieval was introduced by Maron and Kuhns in 1960 and further developed by Roberston and other researchers. According to Spack Jones and Willett (1997): The rationale for introducing probabilistic concepts is obvious: IR systems deal with natural language, and this is too far imprecise to enable a system to state with certainty which document will be relevant to a particular query. The model applies the theory of probability to information retrieval (An event has a possibility from 0 percent to 100 percent of occurring). i.e, in probability model, relevance is expressed in terms of probability. Here, documents are ranked in order of decreasing probability of relevance. It takes into the consideration of uncertainty element in the IR process. i.e., uncertainty about whether documents retrieved by the system are relevant to a given query. The probability model intends to estimate and calculate the probability that a document will be relevant to a given query based on some methods. The “event” in this context of information retrieval refers to the probability of relevance between a query and a document. Unlike other IR models, the probability model does not treat relevance as an exact miss-or-match measurement. The model adopts various methods to determine the probability of relevance between queries and documents. Relevance in the probability model is judged according to the similarity between queries and documents. The similarity judgment is further dependent on term frequency. Thus, for a query consisting of only one term (B), the probability that a particular document (Dm) will be judged relevant is the ratio of users who submit query term (B) and consider the document (Dm) to be relevant in relation to the number of users who submitted the term (B). As represented in Maron’s and Kuhn’s model, can be represented as the probability that users submitting a particular query term (B) will judge an individual document (Dm) to be relevant. According to Gerard Salton and Michael J. McGill, the essence of this model is that if estimates for the probability of occurrence of various terms in relevant documents can be calculated, then the probabilities that a document will be retrieved, given that it is relevant, or that it is not, can be estimated. Several experiments have shown that the probabilistic model can yield good results. However, such results have not been sufficiently better than those obtained using the Boolean or Vector Space model.

Evaluation Measures

The most common measures of evaluation are precision, recall, and f-score. They are computed using unordered sets of documents. These measures must be extended, or new measures must be defined, in order to evaluate the ranked retrieval results that are standard in modern search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve.

Precision

Precision measures the exactness of the retrieval process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the precision is given by:

\text=\frac

Recall

Recall is a measure of completeness of the IR process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the recall is given by:

\text=\frac

F1 Score

F1 Score tries to combine the precision and recall measure. It is the harmonic mean of the two. If P is the precision and R is the recall then the F-Score is given by: :

F_1 =  2 \cdot \frac

Page Rank Algorithm

The

algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on the links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value. The formulae is given below: :

PR(u) = \sum_ \frac

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set B_u (the set containing all pages linking to page u), divided by the amount ''L''(''v'') of links from page v.

HITS Algorithm

Similar to

, HITS uses Link Analysis for analyzing the relevance of the pages but only works on small sets of subgraph (rather than entire web graph) and as well as being query dependent. The subgraphs are ranked according to weights in hubs and authorities where pages that rank highest are fetched and displayed.

References

{{reflist *