Learning to rank. Slides from Tie-Yan Liu's talk at WWW 2009 conference ar
available online
or machine-learned ranking (MLR) is the application of

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

, typically supervised, semi-supervised or

reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...

, in the construction of ranking models for

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

systems. Training data may, for example, consist of lists of items with some

partial order In mathematics, especially order theory, a partial order on a set is an arrangement such that, for certain pairs of elements, one precedes the other. The word ''partial'' is used to indicate that not every pair of elements needs to be comparable ...

specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

Applications

In information retrieval

Ranking is a central part of many

problems, such as

document retrieval Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly natural language, unstructured text, such as newspaper articles, real estate records or paragraphs ...

, collaborative filtering,

sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subje ...

, and

online advertising Online advertising, also known as online marketing, Internet advertising, digital advertising or web advertising, is a form of marketing and advertising that uses the Internet to promote products and services to audiences and platform users. ...

. A possible architecture of a machine-learned search engine is shown in the accompanying figure. Training data consists of queries and documents matching them together with the relevance degree of each match. It may be prepared manually by human ''assessors'' (or ''raters'', as

Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...

calls them), who check results for some queries and determine

relevance Relevance is the connection between topics that makes one useful for dealing with the other. Relevance is studied in many different fields, including cognitive science, logic, and library and information science. Epistemology studies it in gener ...

of each result. It is not feasible to check the relevance of all documents, and so typically a technique called pooling is used — only the top few documents, retrieved by some existing ranking models are checked. This technique may introduce selection bias. Alternatively, training data may be derived automatically by analyzing ''clickthrough logs'' (i.e. search results which got clicks from users), ''query chains'', or such search engines' features as Google's (since-replaced) SearchWiki. Clickthrough logs can be biased by the tendency of users to click on the top search results on the assumption that they are already well-ranked. Training data is used by a learning algorithm to produce a ranking model which computes the relevance of documents for actual queries. Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used. First, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation, such as the

vector space model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...

Boolean model Any kind of logic, function, expression, or theory based on the work of George Boole is considered Boolean. Related to this, "Boolean" may refer to: * Boolean data type, a form of data with only two possible values (usually "true" and "false" ...

, weighted AND, or BM25. This phase is called ''top-

k

document retrieval'' and many heuristics were proposed in the literature to accelerate it, such as using a document's static quality score and tiered indexes.. Sectio
7.1
In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents.

In other areas

Learning to rank algorithms have been applied in areas other than information retrieval: * In

machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

for ranking a set of hypothesized translations; * In

computational biology Computational biology refers to the use of techniques in computer science, data analysis, mathematical modeling and Computer simulation, computational simulations to understand biological systems and relationships. An intersection of computer sci ...

for ranking candidate 3-D structures in protein structure prediction problems; * In

recommender system A recommender system (RecSys), or a recommendation system (sometimes replacing ''system'' with terms such as ''platform'', ''engine'', or ''algorithm'') and sometimes only called "the algorithm" or "algorithm", is a subclass of information fi ...

s for identifying a ranked list of related news articles to recommend to a user after he or she has read a current news article.

Feature vectors

For the convenience of MLR algorithms, query-document pairs are usually represented by numerical vectors, which are called ''

feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a data set. Choosing informative, discriminating, and independent features is crucial to produce effective algorithms for pattern re ...

s''. Such an approach is sometimes called ''bag of features'' and is analogous to the bag of words model and

used in information retrieval for representation of documents. Components of such vectors are called ''

feature Feature may refer to: Computing * Feature recognition, could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (machine learning), in statistics: individual measurable properties of the phenome ...

s'', ''factors'' or ''ranking signals''. They may be divided into three groups (features from

are shown as examples): * ''Query-independent'' or ''static'' features — those features, which depend only on the document, but not on the query. For example,

PageRank PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. Accordin ...

or document's length. Such features can be precomputed in off-line mode during indexing. They may be used to compute document's ''static quality score'' (or ''static rank''), which is often used to speed up search query evaluation. * ''Query-dependent'' or ''dynamic'' features — those features, which depend both on the contents of the document and the query, such as TF-IDF score or other non-machine-learned ranking functions. * '' Query-level features'' or ''query features'', which depend only on the query. For example, the number of words in a query. Some examples of features, which were used in the well-know
LETOR
dataset: * TF, TF-IDF, BM25, and language modeling scores of document's zones (title, body, anchors text, URL) for a given query; * Lengths and IDF sums of document's zones; * Document's

, HITS ranks and their variants. Selecting and designing good features is an important area in machine learning, which is called feature engineering.

Evaluation measures

There are several measures (metrics) which are commonly used to judge how well an algorithm is doing on training data and to compare the performance of different MLR algorithms. Often a learning-to-rank problem is reformulated as an optimization problem with respect to one of these metrics. Examples of ranking quality measures: * Mean average precision (MAP); * DCG and NDCG; * Precision@''n'', NDCG@''n'', where "@''n''" denotes that the metrics are evaluated only on top ''n'' documents; * Mean reciprocal rank; *

Kendall's tau In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non ...

; * Spearman's rho. DCG and its normalized variant NDCG are usually preferred in academic research when multiple levels of relevance are used. Other metrics such as MAP, MRR and precision, are defined only for binary judgments. Recently, there have been proposed several new evaluation metrics which claim to model user's satisfaction with search results better than the DCG metric: * Expected reciprocal rank (ERR); *

Yandex Yandex LLC ( rus, Яндекс, r=Yandeks, p=ˈjandəks) is a Russian technology company that provides Internet-related products and services including a web browser, search engine, cloud computing, web mapping, online food ordering, streaming ...

's pfound. Both of these metrics are based on the assumption that the user is more likely to stop looking at search results after examining a more relevant document, than after a less relevant document.

Approaches

Learning to Rank approaches are often categorized using one of three approaches: pointwise (where individual documents are ranked), pairwise (where pairs of documents are ranked into a relative order), and listwise (where an entire list of documents are ordered). Tie-Yan Liu of Microsoft Research Asia has analyzed existing algorithms for learning to rank problems in his book ''Learning to Rank for Information Retrieval''. He categorized them into three groups by their input spaces, output spaces, hypothesis spaces (the core function of the model) and loss functions: the pointwise, pairwise, and listwise approach. In practice, listwise approaches often outperform pairwise approaches and pointwise approaches. This statement was further supported by a large scale experiment on the performance of different learning-to-rank methods on a large collection of benchmark data sets. In this section, without further notice,

x

denotes an object to be evaluated, for example, a document or an image,

f(x)

denotes a single-value hypothesis,

h(\cdot)

denotes a bi-variate or multi-variate function and

L(\cdot)

denotes the loss function.

Pointwise approach

In this case, it is assumed that each query-document pair in the training data has a numerical or ordinal score. Then the learning-to-rank problem can be approximated by a regression problem — given a single query-document pair, predict its score. Formally speaking, the pointwise approach aims at learning a function

f(x)

predicting the real-value or ordinal score of a document

x

using the loss function

L(f; x_j, y_j)

. A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and

classification Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

algorithms can also be used in pointwise approach when they are used to predict the score of a single query-document pair, and it takes a small, finite number of values.

Pairwise approach

In this case, the learning-to-rank problem is approximated by a classification problem — learning a binary classifier

h(x_u, x_v)

that can tell which document is better in a given pair of documents. The classifier shall take two documents as its input and the goal is to minimize a loss function

L(h; x_u, x_v, y_)

. The loss function typically reflects the number and magnitude of inversions in the induced ranking. In many cases, the binary classifier

h(x_u, x_v)

is implemented with a scoring function

f(x)

. As an example, RankNet adapts a probability model and defines

h(x_u, x_v)

as the estimated probability of the document

x_u

has higher quality than

x_v

: :

P_(f) = 
  \text (f(x_u) - f(x_v)),

where

\text(\cdot)

is a

cumulative distribution function In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x. Ever ...

, for example, the standard logistic CDF, i.e. :

\text(x)
  =
  \frac.

Listwise approach

These algorithms try to directly optimize the value of one of the above evaluation measures, averaged over all queries in the training data. This is often difficult in practice because most evaluation measures are not continuous functions with respect to ranking model's parameters, and so continuous approximations or bounds on evaluation measures have to be used. For example the SoftRank algorithm. LambdaMART is a pairwise algorithm which has been empirically shown to approximate listwise objective functions.

List of methods

A partial list of published learning-to-rank algorithms is shown below with years of first publication of each method: : Note: as most supervised learning-to-rank algorithms can be applied to pointwise, pairwise and listwise case, only those methods which are specifically designed with ranking in mind are shown above.

History

Norbert Fuhr introduced the general idea of MLR in 1992, describing learning approaches in information retrieval as a generalization of parameter estimation; a specific variant of this approach (using polynomial regression) had been published by him three years earlier. Bill Cooper proposed

logistic regression In statistics, a logistic model (or logit model) is a statistical model that models the logit, log-odds of an event as a linear function (calculus), linear combination of one or more independent variables. In regression analysis, logistic regres ...

for the same purpose in 1992 and used it with his Berkeley research group to train a successful ranking function for TREC. Manning et al. suggest that these early works achieved limited results in their time due to little available training data and poor machine learning techniques. Several conferences, such as NeurIPS, SIGIR and ICML have had workshops devoted to the learning-to-rank problem since the mid-2000s (decade).

Practical usage by search engines

Commercial

web search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...

s began using machine-learned ranking systems since the 2000s (decade). One of the first search engines to start using it was

AltaVista AltaVista was a web search engine established in 1995. It became one of the most-used early search engines, but lost ground to Google and was purchased by Yahoo! in 2003, which retained the brand, but based all AltaVista searches on its own sear ...

(later its technology was acquired by

Overture Overture (from French ''ouverture'', "opening") is a music instrumental introduction to a ballet, opera, or oratorio in the 17th century. During the early Romantic era, composers such as Beethoven and Mendelssohn composed overtures which ...

, and then

Yahoo Yahoo (, styled yahoo''!'' in its logo) is an American web portal that provides the search engine Yahoo Search and related services including My Yahoo, Yahoo Mail, Yahoo News, Yahoo Finance, Yahoo Sports, y!entertainment, yahoo!life, an ...

), which launched a gradient boosting-trained ranking function in April 2003.

Bing Bing most often refers to: * Bing Crosby (1903–1977), American singer * Microsoft Bing, a web search engine Bing may also refer to: Food and drink * Bing (bread), a Chinese flatbread * Bing (soft drink), a UK brand * Bing cherry, a varie ...

's search is said to be powered b
RankNet
algorithm, which was invented at

Microsoft Research Microsoft Research (MSR) is the research subsidiary of Microsoft. It was created in 1991 by Richard Rashid, Bill Gates and Nathan Myhrvold with the intent to advance state-of-the-art computing and solve difficult world problems through technologi ...

in 2005. In November 2009 a Russian search engine

announcedYandex corporate blog entry about new ranking model "Snezhinsk"
(in Russian) that it had significantly increased its search quality due to deployment of a new proprietary MatrixNet algorithm, a variant of gradient boosting method which uses oblivious decision trees. Recently they have also sponsored a machine-learned ranking competition "Internet Mathematics 2009" based on their own search engine's production data. Yahoo has announced a similar competition in 2010. As of 2008,

Peter Norvig Peter Norvig (born 14 December 1956) is an American computer scientist and Distinguished Education Fellow at the Stanford Institute for Human-Centered AI. He previously served as a director of research and search quality at Google. Norvig is th ...

denied that their search engine exclusively relies on machine-learned ranking. Cuil's CEO, Tom Costello, suggests that they prefer hand-built models because they can outperform machine-learned models when measured against metrics like click-through rate or time on landing page, which is because machine-learned models "learn what people say they like, not what people actually like". In January 2017, the technology was included in the

open source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use and view the source code, design documents, or content of the product. The open source model is a decentrali ...

search engine Apache Solr. It is also available in the open source OpenSearch and

Elasticsearch Elasticsearch is a Search engine (computing), search engine based on Apache Lucene, a free and open-source search engine. It provides a distributed, Multitenancy, multitenant-capable full-text search engine with an HTTP web interface and schema ...

. These implementations make learning to rank widely accessible for enterprise search.

Vulnerabilities

Similar to recognition applications in

computer vision Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...

, recent neural network based ranking algorithms are also found to be susceptible to covert adversarial attacks, both on the candidates and the queries. With small perturbations imperceptible to human beings, ranking order could be arbitrarily altered. In addition, model-agnostic transferable adversarial examples are found to be possible, which enables black-box adversarial attacks on deep ranking systems without requiring access to their underlying implementations. Conversely, the robustness of such ranking systems can be improved via adversarial defenses such as the Madry defense.

References

{{reflist

External links

; Competitions and public datasets
LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval

Yandex's Internet Mathematics 2009

Yahoo! Learning to Rank Challenge

Microsoft Learning to Rank Datasets
Information retrieval techniques Machine learning Ranking functions