The Rocchio algorithm is based on a method of

relevance feedback Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not th ...

found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as

relevant Relevant is something directly related, connected or pertinent to a topic; it may also mean something that is current. Relevant may also refer to: * Relevant operator, a concept in physics, see renormalization group * Relevant, Ain, a commune ...

or irrelevant.Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: ''An Introduction to Information Retrieval'', page 163-167. Cambridge University Press, 2009. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the

search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...

's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the weights of the a, b, c variables listed below in the Algorithm section.

Algorithm

The formula and variable definitions for Rocchio relevance feedback are as follows:

\vec_m = a \,\vec_o + b\, \frac \sum_ \vec_j - c\, \frac \sum_ \vec_k

As demonstrated in the formula, the associated weights (a, b, c) are responsible for shaping the modified vector in a direction closer, or farther away, from the original query, related documents, and non-related documents. In particular, the values for b and c should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain terms from either the original query, related documents, or non-related documents, then the corresponding weight (a, b, c) value for the category should be set to 0. In the later part of the algorithm, the variables

D_

, and

D_

are presented to be sets of vectors containing the coordinates of related documents and non-related documents. Though

D_

and

D_

are not vectors themselves,

\vec_j

and

\vec_k

are the vectors used to iterate through the two sets and form vector

summation In mathematics, summation is the addition of a sequence of any kind of numbers, called ''addends'' or ''summands''; the result is their ''sum'' or ''total''. Beside numbers, other types of values can be summed as well: functions, vectors, mat ...

s. These sums are normalized (divided) by the size of their respective document set (

D_

D_

). In order to visualize the changes taking place on the modified vector, please refer to the image below. As the weights are increased or decreased for a particular category of documents, the coordinates for the modified vector begin to move either closer, or farther away, from the centroid of the document collection. Thus if the weight is increased for related documents, then the modified vectors

coordinate In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of the points or other geometric elements on a manifold such as Euclidean space. The order of the coordinates is si ...

s will reflect being closer to the centroid of related documents.

Time complexity

The time complexity for training and testing the algorithm are listed below and followed by the definition of each variable. Note that when in testing phase, the time complexity can be reduced to that of calculating the euclidean distance between a class centroid and the respective document. As shown by:

\Theta(\vert\mathbb\vert M_)

. Training =

\Theta(\vert\mathbb\vert L_+\vert\mathbb\vert\vert V\vert)

Testing =

\Theta( L_+\vert\mathbb\vert M_)= \Theta(\vert\mathbb\vert M_)

Usage

Though there are benefits to ranking documents as not-relevant, a

document ranking will result in more precise documents being made available to the user. Therefore, traditional values for the algorithm's weights (a, b, c) in

Rocchio classification In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied ...

are typically around a = 1, b = 0.8, and c = 0.1. Modern information retrieval systems have moved towards eliminating the non-related documents by setting c = 0 and thus only accounting for related documents. Although not all retrieval systems have eliminated the need for non-related documents, most have limited the effects on modified query by only accounting for strongest non-related documents in the

D_

set.

Limitations

The Rocchio algorithm often fails to classify multimodal classes and relationships. For instance, the country of

Burma Myanmar, ; UK pronunciations: US pronunciations incl. . Note: Wikipedia's IPA conventions require indicating /r/ even in British English although only some British English speakers pronounce r at the end of syllables. As John C. Wells, Joh ...

was renamed to

Myanmar Myanmar, ; UK pronunciations: US pronunciations incl. . Note: Wikipedia's IPA conventions require indicating /r/ even in British English although only some British English speakers pronounce r at the end of syllables. As John C. Wells, Joh ...

in 1989. Therefore, the two queries of "Burma" and "Myanmar" will appear much farther apart in the vector space model, though they both contain similar origins.

References

{{reflist
Relevance Feedback in Information Retrieval

Relevance Feedback and Query Expansion

Vector Space Classification

Search algorithms

Algorithm

Time complexity

Usage

Limitations

See also

References