Fuzzy retrieval techniques are based on the
Extended Boolean model and the
Fuzzy set
theory. There are two classical fuzzy retrieval models: Mixed Min and Max (MMM) and the Paice model. Neither model provides a way of evaluating query weights; the P-norms algorithm, however, does take them into account.
Mixed Min and Max model (MMM)
In fuzzy-set theory, an element has a varying degree of membership, say ''d<sub>A</sub>'', in a given set ''A'', instead of the traditional binary membership choice (is an element/is not an element).
In MMM each index term has a fuzzy set associated with it. A document's weight with respect to an index term ''A'' is considered to be the degree of membership of the document in the fuzzy set associated with ''A''. The degrees of membership for union and intersection are defined as follows in fuzzy set theory:
:<math>d_{A \cup B} = \max(d_A, d_B)</math>
:<math>d_{A \cap B} = \min(d_A, d_B)</math>
According to this, documents that should be retrieved for a query of the form ''A or B'' should be in the fuzzy set associated with the union of the two sets ''A'' and ''B''. Similarly, the documents that should be retrieved for a query of the form ''A and B'' should be in the fuzzy set associated with the intersection of the two sets. Hence, it is possible to define the similarity of a document to the ''or'' query to be ''max(d<sub>A</sub>, d<sub>B</sub>)'' and the similarity of the document to the ''and'' query to be ''min(d<sub>A</sub>, d<sub>B</sub>)''. The MMM model tries to soften the Boolean operators by considering the query-document similarity to be a linear combination of the ''min'' and ''max'' document weights.
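The strict fuzzy-set reading of ''or'' and ''and'' described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example document weights are hypothetical and do not come from the source.

```python
def fuzzy_or(weights):
    """Membership in the union of the terms' fuzzy sets: the largest weight."""
    return max(weights)


def fuzzy_and(weights):
    """Membership in the intersection of the terms' fuzzy sets: the smallest weight."""
    return min(weights)


# Hypothetical document with weight 0.8 for term A and 0.3 for term B.
doc_weights = {"A": 0.8, "B": 0.3}

or_score = fuzzy_or(doc_weights.values())    # similarity to "A or B"
and_score = fuzzy_and(doc_weights.values())  # similarity to "A and B"
```

With these weights the ''or'' query scores 0.8 and the ''and'' query scores 0.3, so a single weak term fully determines the ''and'' score. This is the hard behavior that MMM softens.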
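Given a document ''D'' with index-term weights ''d<sub>A1</sub>, d<sub>A2</sub>, ..., d<sub>An</sub>'' for terms ''A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub>'', and the queries:
''Q<sub>or</sub> = (A<sub>1</sub> or A<sub>2</sub> or ... or A<sub>n</sub>)''
''Q<sub>and</sub> = (A<sub>1</sub> and A<sub>2</sub> and ... and A<sub>n</sub>)''
the query-document similarity in the MMM model is computed as follows:
''SIM(Q<sub>or</sub>, D) = C<sub>or1</sub> * max(d<sub>A1</sub>, d<sub>A2</sub>, ..., d<sub>An</sub>) + C<sub>or2</sub> * min(d<sub>A1</sub>, d<sub>A2</sub>, ..., d<sub>An</sub>)''
''SIM(Q<sub>and</sub>, D) = C<sub>and1</sub> * min(d<sub>A1</sub>, d<sub>A2</sub>, ..., d<sub>An</sub>) + C<sub>and2</sub> * max(d<sub>A1</sub>, d<sub>A2</sub>, ..., d<sub>An</sub>)''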
where ''C<sub>or1</sub>, C<sub>or2</sub>'' are "softness" coefficients for the ''or'' operator, and ''C<sub>and1</sub>, C<sub>and2</sub>'' are softness coefficients for the ''and'' operator. Since we would like to give the maximum of the document weights more importance while considering an ''or'' query and the minimum more importance while considering an ''and'' query, generally we have ''C<sub>or1</sub> > C<sub>or2</sub>'' and ''C<sub>and1</sub> > C<sub>and2</sub>''. For simplicity it is generally assumed that ''C<sub>or1</sub> = 1 - C<sub>or2</sub>'' and ''C<sub>and1</sub> = 1 - C<sub>and2</sub>''.
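A minimal Python sketch of the MMM similarity follows, under the simplifying assumption stated above that the two coefficients of each operator sum to 1. The function name `mmm_similarity`, its parameters, and the example weights are hypothetical choices made for illustration.

```python
def mmm_similarity(weights, operator, c1):
    """Mixed Min and Max similarity of one document to a flat query.

    weights  -- the document's index-term weights for the query terms
    operator -- "or" or "and"
    c1       -- the first softness coefficient (C_or1 or C_and1);
                the second is assumed to be 1 - c1
    """
    weights = list(weights)
    if not weights:
        raise ValueError("at least one term weight is required")
    if not 0.0 <= c1 <= 1.0:
        raise ValueError("c1 must lie in [0, 1]")

    lowest, highest = min(weights), max(weights)
    c2 = 1.0 - c1
    if operator == "or":
        # The maximum carries coefficient C_or1, the minimum C_or2.
        return c1 * highest + c2 * lowest
    if operator == "and":
        # The minimum carries coefficient C_and1, the maximum C_and2.
        return c1 * lowest + c2 * highest
    raise ValueError("operator must be 'or' or 'and'")


# Hypothetical weights of one document for three query terms.
term_weights = [0.2, 0.9, 0.5]
or_score = mmm_similarity(term_weights, "or", c1=0.7)    # about 0.69
and_score = mmm_similarity(term_weights, "and", c1=0.6)  # about 0.48
```

Setting `c1` to 1 recovers the strict fuzzy ''max'' or ''min''. Smaller values let the opposite extreme contribute to the score.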
Lee and Fox's experiments indicate that the best performance usually occurs with ''C<sub>and1</sub>'' in the range [0.5, 0.8] and with ''C<sub>or1</sub>'' > 0.2. In general, the computational cost of MMM is low, and retrieval effectiveness is much better than with the Standard Boolean model.
Paice model
The Paice model is a general extension to the MMM model. Whereas the MMM model considers only the minimum and maximum weights for the index terms, the Paice model incorporates all of the term weights when calculating the similarity:
:<math>S(D,Q) = \frac{\sum_{i=1}^{n} r^{i-1} w_{di}}{\sum_{j=1}^{n} r^{j-1}}</math>
where ''r'' is a constant coefficient and the weights ''w<sub>di</sub>'' are arranged in ascending order for ''and'' queries and descending order for ''or'' queries. When ''n'' = 2 the Paice model shows the same behavior as the MMM model.
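The following Python sketch illustrates the Paice similarity. It assumes the usual form of the model: a weighted average in which the ''i''-th weight in sorted order receives the coefficient ''r'' raised to the power ''i'' - 1. The function name `paice_similarity` and the example weights are hypothetical.

```python
def paice_similarity(weights, operator, r):
    """Paice similarity of one document to a flat 'and' or 'or' query.

    weights  -- the document's index-term weights for the query terms
    operator -- "and" (weights sorted ascending) or "or" (sorted descending)
    r        -- the constant coefficient, assumed here to be non-negative
    """
    weights = list(weights)
    if not weights:
        raise ValueError("at least one term weight is required")
    if operator == "and":
        ordered = sorted(weights)
    elif operator == "or":
        ordered = sorted(weights, reverse=True)
    else:
        raise ValueError("operator must be 'or' or 'and'")

    # enumerate starts at 0, so r ** i corresponds to r^(i-1) for 1-based i.
    numerator = sum(r ** i * w for i, w in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator


# Hypothetical weights of one document for three query terms.
term_weights = [0.2, 0.9, 0.5]
and_score = paice_similarity(term_weights, "and", r=1.0)
or_score = paice_similarity(term_weights, "or", r=0.7)
```

With ''r'' = 1.0 every weight counts equally, so the ''and'' score is the plain mean of the weights. For two terms the result equals the MMM similarity with first coefficient 1 / (1 + ''r'').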
The experiments of Lee and Fox have shown that setting ''r'' to 1.0 for ''and'' queries and 0.7 for ''or'' queries gives good retrieval effectiveness. The computational cost of this model is higher than that of the MMM model. The MMM model only requires determining the ''min'' or ''max'' of a set of term weights each time an ''and'' or ''or'' clause is considered, which can be done in ''O(n)''. The Paice model requires the term weights to be sorted in ascending or descending order, depending on whether an ''and'' clause or an ''or'' clause is being considered, which requires at least an ''O(n log n)'' sorting algorithm. A good deal of floating-point calculation is needed too.
Improvements over the Standard Boolean model
Lee and Fox compared the Standard Boolean model with the MMM and Paice models on three test collections: CISI, CACM and INSPEC. These are the reported results for average mean precision improvement:
These are very good improvements over the Standard Boolean model. The MMM results are very close to those of Paice and P-norm, which indicates that MMM can be a very good technique, and it is the most efficient of the three.
Recent work
In 2005, Kang ''et al.'' devised a fuzzy retrieval system indexed by concept identification.
Under a pure tf-idf approach, even with stop words eliminated, some words are more relevant to the topic of a document than others, yet they receive the same weight whenever they have the same term frequency. Taking the user's intent in a query into account makes it possible to weight the terms of a document better. Each term can be identified as a concept in a certain lexical chain that reflects the importance of that concept for that document.
They report improvements over Paice and P-norm on the average
precision and recall
for the Top-5 retrieved documents.
Zadrozny revisited the fuzzy information retrieval model. He further extended the fuzzy extended Boolean model by:
* assuming linguistic terms as importance weights of keywords also in documents
* taking into account the uncertainty concerning the representation of documents and queries
* interpreting the linguistic terms in the representation of documents and queries, as well as their matching, in terms of Zadeh's fuzzy logic (calculus of linguistic statements)
* addressing some pragmatic aspects of the proposed model, notably the techniques of indexing documents and queries
The proposed model makes it possible to grasp both imprecision and uncertainty concerning the textual information representation and retrieval.
See also
* Information retrieval