The Rocchio algorithm is based on a method of
relevance feedback found in
information retrieval
Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
systems which stemmed from the
SMART Information Retrieval System The SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s. Many important concepts in information retrieval were developed as par ...
developed between 1960 and 1964. Like many other retrieval systems, the Rocchio
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
was developed using the
vector space model
Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing and r ...
. Its underlying assumption is that most users have a general conception of which documents should be denoted as
relevant
Relevant is something directly related, connected or pertinent to a topic; it may also mean something that is current.
Relevant may also refer to:
* Relevant operator, a concept in physics, see renormalization group
* Relevant, Ain, a commune ...
or irrelevant.
[Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: ''An Introduction to Information Retrieval'', page 163-167. Cambridge University Press, 2009.] Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the
search engine
A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
's
recall
Recall may refer to:
* Recall (bugle call), a signal to stop
* Recall (information retrieval), a statistical measure
* ''ReCALL'' (journal), an academic journal about computer-assisted language learning
* Recall (memory)
* ''Recall'' (Overwatch ...
, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a
query is dictated by the weights of the a, b, c variables listed below in the
Algorithm section.
Algorithm
The
formula
In science, a formula is a concise way of expressing information symbolically, as in a mathematical formula or a ''chemical formula''. The informal use of the term ''formula'' in science refers to the general construct of a relationship betwee ...
and variable definitions for Rocchio relevance feedback are as follows:
As demonstrated in the formula, the associated weights (a, b, c) are responsible for shaping the modified
vector
Vector most often refers to:
*Euclidean vector, a quantity with a magnitude and a direction
*Vector (epidemiology), an agent that carries and transmits an infectious pathogen into another living organism
Vector may also refer to:
Mathematic ...
in a direction closer, or farther away, from the original query, related documents, and non-related documents. In particular, the values for b and c should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain terms from either the original query, related documents, or non-related documents, then the corresponding weight (a, b, c) value for the category should be set to 0.
In the later part of the algorithm, the variables
, and
are presented to be sets of
vectors containing the coordinates of related documents and non-related documents. Though
and
are not vectors themselves,
and
are the vectors used to iterate through the two sets and form vector
summation
In mathematics, summation is the addition of a sequence of any kind of numbers, called ''addends'' or ''summands''; the result is their ''sum'' or ''total''. Beside numbers, other types of values can be summed as well: functions, vectors, mat ...
s. These sums are normalized (divided) by the size of their respective document set (
,
).
In order to visualize the changes taking place on the modified vector, please refer to the image below.
As the weights are increased or decreased for a particular category of documents, the coordinates for the modified vector begin to move either closer, or farther away, from the
centroid
In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any ob ...
of the document collection. Thus if the weight is increased for related documents, then the modified vectors
coordinate
In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine the position of the points or other geometric elements on a manifold such as Euclidean space. The order of the coordinates is sign ...
s will reflect being closer to the centroid of related documents.
Time complexity
The
time complexity
In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
for training and testing the algorithm are listed below and followed by the definition of each
variable
Variable may refer to:
* Variable (computer science), a symbolic name associated with a value and whose associated value may be changed
* Variable (mathematics), a symbol that represents a quantity in a mathematical expression, as used in many ...
. Note that when in testing phase, the time complexity can be reduced to that of calculating the
euclidean distance
In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.
It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefor ...
between a class
centroid
In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any ob ...
and the respective document. As shown by:
.
Training =
Testing =
Usage
Though there are benefits to ranking documents as not-relevant, a
relevant
Relevant is something directly related, connected or pertinent to a topic; it may also mean something that is current.
Relevant may also refer to:
* Relevant operator, a concept in physics, see renormalization group
* Relevant, Ain, a commune ...
document ranking will result in more precise documents being made available to the user. Therefore, traditional values for the algorithm's weights (a, b, c) in
Rocchio classification are typically around a = 1, b = 0.8, and c = 0.1. Modern
information retrieval
Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
systems have moved towards eliminating the non-related documents by setting c = 0 and thus only accounting for related documents. Although not all
retrieval systems have eliminated the need for non-related documents, most have limited the effects on modified query by only accounting for strongest non-related documents in the
set.
Limitations
The Rocchio algorithm often fails to classify multimodal classes and relationships. For instance, the country of
Burma
Myanmar, ; UK pronunciations: US pronunciations incl. . Note: Wikipedia's IPA conventions require indicating /r/ even in British English although only some British English speakers pronounce r at the end of syllables. As John Wells explai ...
was renamed to
Myanmar
Myanmar, ; UK pronunciations: US pronunciations incl. . Note: Wikipedia's IPA conventions require indicating /r/ even in British English although only some British English speakers pronounce r at the end of syllables. As John C. Wells, Joh ...
in 1989. Therefore, the two queries of "Burma" and "Myanmar" will appear much farther apart in the
vector space model
Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing and r ...
, though they both contain similar origins.
See also
*
Nearest centroid classifier
In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied ...
, aka Rocchio classifier
References
{{reflist
Relevance Feedback in Information RetrievalRelevance Feedback and Query ExpansionVector Space Classification
Search algorithms