The Extended Boolean model was described in a Communications of the ACM article appearing in 1983, by Gerard Salton, Edward A. Fox, and Harry Wu. The goal of the Extended Boolean model is to overcome the drawbacks of the Boolean model that has been used in

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

. The Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big. The idea of the extended model is to make use of partial matching and term weights as in the vector space model. It combines the characteristics of the

Vector Space Model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...

with the properties of

Boolean algebra In mathematics and mathematical logic, Boolean algebra is a branch of algebra. It differs from elementary algebra in two ways. First, the values of the variable (mathematics), variables are the truth values ''true'' and ''false'', usually denot ...

and ranks the similarity between queries and documents. This way a document may be somewhat relevant if it matches some of the queried terms and will be returned as a result, whereas in the Standard Boolean model it wasn't. Thus, the extended Boolean model can be considered as a generalization of both the Boolean and vector space models; those two are special cases if suitable settings and definitions are employed. Further, research has shown effectiveness improves relative to that for Boolean query processing. Other research has shown that relevance feedback and

query expansion Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves ...

can be integrated with extended Boolean query processing.

Definitions

In the Extended Boolean model, a document is represented as a vector (similarly to in the vector model). Each ''i''

dimension In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coo ...

corresponds to a separate term associated with the document. The weight of term associated with document is measured by its normalized

Term frequency Term may refer to: Language *Terminology, context-specific nouns or compound words **Technical term (or ''term of art''), used by specialists in a field ***Scientific terminology, used by scientists *Term (argumentation), part of an argument in d ...

and can be defined as:

w_=f_*\frac

where is inverse document frequency and the term frequency for term x in document j. The weight vector associated with document can be represented as:

\mathbf_ =_, w_, \ldots, w_/math>

The 2 Dimensions Example

Considering the space composed of two terms and only, the corresponding term weights are and . Thus, for query , we can calculate the similarity with the following formula:

sim(q_,d)=\sqrt

For query , we can use:

sim(q_,d)=1-\sqrt

Generalizing the idea and P-norms

We can generalize the previous 2D extended Boolean model example to higher t-dimensional space using Euclidean distances. This can be done using

P-norm In mathematics, the spaces are function spaces defined using a natural generalization of the -norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue , although according to the Bourbaki ...

s which extends the notion of distance to include p-distances, where is a new parameter. *A generalized conjunctive query is given by: :

q_=k_1 \lor^p k_2 \lor^p .... \lor^p k_t

*The similarity of

q_

and

d_j

can be defined as: :

sim(q_,d_j)=\sqrt /math>

*A generalized disjunctive query is given by:
: q_=k_1 \land^p k_2 \land^p .... \land^p k_t *The similarity of q_and d_j can be defined as:
: sim(q_,d_j)=1-\sqrt /math>

Examples

Consider the query . The similarity between query and document can be computed using the formula:

sim(q,d)=\sqrt /math>

Improvements over the Standard Boolean Model

Lee and Fox compared the Standard and Extended Boolean models with three test collections, CISI, CACM and INSPEC. Using P-norms they obtained an average precision improvement of 79%, 106% and 210% over the Standard model, for the CISI, CACM and INSPEC collections, respectively.
The P-norm model is computationally expensive because of the number of exponentiation operations that it requires but it achieves much better results than the Standard model and even Fuzzy retrieval techniques. The Standard Boolean model is still the most efficient.

References

{{DEFAULTSORT:Extended Boolean Model Information retrieval techniques