The (standard) Boolean model of information retrieval (BIR) is a classical
information retrieval
Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other co ...
(IR) model and, at the same time, the first and most-adopted one. It is used by many IR systems to this day. The BIR is based on
Boolean logic
In mathematics and mathematical logic, Boolean algebra is a branch of algebra. It differs from elementary algebra in two ways. First, the values of the variable (mathematics), variables are the truth values ''true'' and ''false'', usually denote ...
and classical
set theory
Set theory is the branch of mathematical logic that studies sets, which can be informally described as collections of objects. Although objects of any kind can be collected into a set, set theory, as a branch of mathematics, is mostly conce ...
in that both the documents to be searched and the user's query are conceived as sets of terms (a
bag-of-words model
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding g ...
). Retrieval is based on whether or not the documents contain the query terms.
Definitions
An ''index term'' is a word or expression'','' which may be
stemmed, describing or characterizing a document, such as a keyword given for a journal article. Let
be the set of all such index terms.
A ''document'' is any subset of
. Let
be the set of all documents.
A ''query'' is a Boolean expression
in normal form:
where
is true for
when
. (Equivalently,
could be expressed in
disjunctive normal form
In boolean logic, a disjunctive normal form (DNF) is a canonical normal form of a logical formula consisting of a disjunction of conjunctions; it can also be described as an OR of ANDs, a sum of products, or (in philosophical logic) a ''cluster c ...
.)
We seek to find the set of documents that satisfy
. This operation is called ''retrieval'' and consists of the following two steps:
: 1. For each
in
, find the set
of documents that satisfy
:
2. Then the set of documents that satisfy Q is given by:
Example
Let the set of original (real) documents be, for example
:
where
= "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."
= "
Bayesian decision theory
In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the pos ...
: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."
= "Bayesian
epistemology
Epistemology (; ), or the theory of knowledge, is the branch of philosophy concerned with knowledge. Epistemology is considered a major subfield of philosophy, along with other major subfields such as ethics, logic, and metaphysics.
Episte ...
: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
Let the set
of terms be:
Then, the set
of documents is as follows:
:
where
Let the query
be:
Then to retrieve the relevant documents:
# Firstly, the following sets
and
of documents
are obtained (retrieved):
# Finally, the following documents
are retrieved in response to
This means that the original document
(corresponding to
) is the answer to
.
Obviously, if there is more than one document with the same representation, every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).
Advantages
* Clean formalism
* Easy to implement
* Intuitive concept
* If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
*It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query.
Disadvantages
*
Exact matching may retrieve too few or too many documents
* Hard to translate a query into a Boolean expression
* All terms are equally weighted
* More like ''
data retrieval
Data retrieval means obtaining data from a Database Management System (DBMS) such as ODBMS. In this case, it is considered that data is represented in a structured way, and there is no ambiguity in data.
In order to retrieve the desired data ...
'' than ''information retrieval''
* Retrieval based on binary decision criteria with no notion of partial matching
* No ranking of the documents is provided (absence of a grading scale)
* Information need has to be translated into a Boolean expression, which most users find awkward
* The Boolean queries formulated by the users are most often too simplistic
* The model frequently returns either too few or too many documents in response to a user query
Data structures and algorithms
From a pure formal mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems should be solved that relate to algorithms and data structures, such as, for example, the choice of terms (manual or automatic selection or both),
stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morpholog ...
,
hash table
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s,
inverted file
In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of do ...
structure, and so on.
Hash sets
Another possibility is to use
hash sets. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with
bit vector
A bit array (also known as bitmask, bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level pa ...
s. On the worst-case performance can degrade from O(''n'') to O(''n''
2). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.
Signature file
Each document can be summarized by
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
representing the set of words in that document, stored in a fixed-length bitstring, called a signature.
The signature file contains one such
superimposed code
A superimposed code such as Zatocoding is a kind of hash code that was popular in marginal punched-card systems.
Marginal punched-card systems
Many names, some of them trademarked, have been used for marginal punched-card systems:
edge-notc ...
bitstring for every document in the collection.
Each query can also be summarized by a
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
representing the set of words in the query, stored in a bitstring of the same fixed length.
The query bitstring is tested against each signature.
[
Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao]
"Inverted Files Versus Signature Files for Text Indexing"
[
Bob Goodwin; et al]
"BitFunnel: Revisiting Signatures for Search"
2017.
[
Richard Startin]
"Bit-Sliced Signatures and Bloom Filters"
The signature file approached is used in
BitFunnel
BitFunnel is the search engine indexing algorithm and a set of components used in the Bing search engine, which were made open source in 2016. BitFunnel uses bit-sliced signatures instead of an inverted index in an attempt to reduce operations ...
.
Inverted file
An inverted index file contains two parts:
a vocabulary containing all the terms used in the collection,
and for each distinct term an inverted index that lists every document that mentions that term.
References
* {{ citation , last1= Lashkari , first1=A.H. , last2=Mahdavi , first2= F., last3=Ghomi , first3= V. , title=2009 International Conference on Information Management and Engineering , doi= 10.1109/ICIME.2009.101 , chapter= A Boolean Model in Information Retrieval for Search Engines , year=2009 , pages=385–389 , isbn=978-0-7695-3595-1 , s2cid=18147603
Mathematical modeling
Information retrieval techniques