The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one. The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms (a bag-of-words model). Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the Boolean conditions described by the query.


Definitions

An ''index term'' is a word or expression, which may be stemmed, describing or characterizing a document, such as a keyword given for a journal article. Let T = \{t_1, t_2,\ \ldots\} be the set of all such index terms. A ''document'' is any subset of T. Let D = \{D_1, D_2,\ \ldots\} be the set of all documents. Each document D_n is thus represented by the index terms it contains; for example, D_1 could contain the terms t_1 and t_2 from T (an example of this is given in the following section). Index terms should represent words that carry meaning and correspond to what the content of a document is about: terms like "the" and "like" appear in nearly all documents, whereas a term like "Bayesian" appears in only a small fraction of them, so rarer terms such as "Bayesian" are the better choice for the set T. This relates to entropy in information theory. There are multiple types of operations that can be applied to index terms used in queries to make them more generic and more relevant; one such operation is stemming.

A ''query'' is a Boolean expression Q in conjunctive normal form: Q = (W_1\ \lor\ W_2\ \lor\ \cdots)\ \land\ \cdots\ \land\ (W_i\ \lor\ W_{i+1}\ \lor\ \cdots), where W_i is true for D_j when t_i \in D_j. (Equivalently, Q could be expressed in disjunctive normal form.) A query is thus a selection of index terms from T combined with Boolean operators into a set of conditions, which are then applied to the set D of documents. We seek to find the set of documents that satisfy Q. This operation is called ''retrieval'' and consists of the following two steps:

: 1. For each W_j in Q, find the set S_j of documents that satisfy W_j: S_j = \{D_i \mid t_j \in D_i\}
: 2. Then the set of documents that satisfy Q is given by (S_1 \cup S_2 \cup \cdots) \cap \cdots \cap (S_i \cup S_{i+1} \cup \cdots), where \cup means ''OR'' and \cap means ''AND'' as Boolean operators.
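These two steps can be sketched directly in code. The following is a minimal illustration under the definitions above, with hypothetical function and variable names (not part of the model itself): each document is a set of index terms, and a query is given in conjunctive normal form as a list of OR-clauses.

```python
# Minimal sketch of Boolean retrieval (hypothetical names). Documents are sets
# of index terms; the query is a list of clauses, each clause a list of terms
# joined by OR, with the clauses themselves joined by AND.

def retrieve(documents, query_cnf):
    """Return the ids of documents that satisfy the CNF query."""
    result = set(documents)                      # start from all documents
    for clause in query_cnf:
        # S_j: documents containing at least one term of this OR-clause
        s_j = {doc_id for doc_id, terms in documents.items()
               if any(term in terms for term in clause)}
        result &= s_j                            # intersect across AND-clauses
    return result
```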


Example

Let the set of original (real) documents be, for example

: D = \{D_1,\ D_2,\ D_3\}

where

: D_1 = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

: D_2 = "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."

: D_3 = "Bayesian epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."

Let the set T of index terms be:

: T = \{"Bayes' principle",\ "probability",\ "decision-making",\ "Bayesian epistemology"\}

Then, the set D of documents is represented as follows:

: D = \{D_1,\ D_2,\ D_3\}

where

: D_1 = \{"Bayes' principle",\ "probability"\}
: D_2 = \{"probability",\ "decision-making"\}
: D_3 = \{"probability",\ "Bayesian epistemology"\}

Let the query Q be ("probability" AND "decision-making"):

: Q = "probability" \land "decision-making"

Then to retrieve the relevant documents:

# Firstly, the following sets S_1 and S_2 of documents D_i are obtained (retrieved): S_1 = \{D_1,\ D_2,\ D_3\} and S_2 = \{D_2\}, where S_1 corresponds to the documents which contain the term "probability" and S_2 to those which contain the term "decision-making".
# Finally, the following documents D_i are retrieved in response to Q: \{D_1,\ D_2,\ D_3\}\ \cap\ \{D_2\}\ =\ \{D_2\}, where the query looks for documents that are contained in both sets S_1 and S_2 using the intersection operator. This means that the original document D_2 is the answer to Q.

If there is more than one document with the same representation (the same subset of index terms t_n), every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).
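The same steps can be carried out with ordinary set operations. A small illustration using the document representations above (variable names are hypothetical):

```python
# The example above with Python set operations (illustrative names only).
docs = {
    "D1": {"Bayes' principle", "probability"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "Bayesian epistemology"},
}
S1 = {d for d, terms in docs.items() if "probability" in terms}      # {'D1', 'D2', 'D3'}
S2 = {d for d, terms in docs.items() if "decision-making" in terms}  # {'D2'}
print(S1 & S2)   # intersection implements AND -> {'D2'}
```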


Advantages

* Clean formalism
* Easy to implement
* Intuitive concept
* If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
* It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query.


Disadvantages

* Exact matching may retrieve too few or too many documents
* Hard to translate a query into a Boolean expression
* Ineffective for search-resistant concepts
* All terms are equally weighted
* More like ''data retrieval'' than ''information retrieval''
* Retrieval based on binary decision criteria with no notion of partial matching
* No ranking of the documents is provided (absence of a grading scale)
* Information need has to be translated into a Boolean expression, which most users find awkward
* The Boolean queries formulated by the users are most often too simplistic
* The model frequently returns either too few or too many documents in response to a user query


Data structures and algorithms

From a pure formal mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems should be solved that relate to algorithms and data structures, such as, for example, the choice of terms (manual or automatic selection or both), stemming, hash tables, inverted file structure, and so on.


Hash sets

Another possibility is to use hash sets. Each document is represented by a hash table which contains every single term of that document. Since the size of a hash table grows and shrinks in real time as terms are added and removed, each document occupies much less space in memory. However, performance is slower because the operations are more complex than with bit vectors: in the worst case, performance can degrade from O(''n'') to O(''n''^2). In the average case, the slowdown is not much worse than with bit vectors, and the space usage is much more efficient.
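As a small sketch of this representation, assuming Python's built-in set as the hash set (names are illustrative):

```python
# Each document keeps its terms in a hash set, so membership tests are O(1)
# on average and the structure grows or shrinks as terms are added or removed.

doc_terms = set("bayesian decision theory probability decision-making".split())

doc_terms.add("utility")        # adding a term enlarges the set in place
doc_terms.discard("theory")     # removing a term shrinks it

# Evaluating the query "probability" AND "decision-making" against this document:
print("probability" in doc_terms and "decision-making" in doc_terms)   # True
```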


Signature file

Each document can be summarized by a Bloom filter representing the set of words in that document, stored in a fixed-length bitstring called a signature. The signature file contains one such superimposed code bitstring for every document in the collection. Each query can also be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature (Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao, "Inverted Files Versus Signature Files for Text Indexing"; Bob Goodwin et al., "BitFunnel: Revisiting Signatures for Search", 2017; Richard Startin, "Bit-Sliced Signatures and Bloom Filters"). The signature file approach is used in BitFunnel.
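A minimal sketch of the idea follows, with an assumed signature length and hash count (not BitFunnel's actual implementation): a document matches a query when every bit set in the query signature is also set in the document signature, which can occasionally produce false positives that must be filtered out afterwards.

```python
# Signature-file sketch: fixed-length Bloom-filter bitstrings per document
# (SIGNATURE_BITS and NUM_HASHES are assumed parameters, not from the article).
import hashlib

SIGNATURE_BITS = 64
NUM_HASHES = 3

def signature(terms):
    """Superimpose (OR together) the hashed bit patterns of all terms."""
    sig = 0
    for term in terms:
        for i in range(NUM_HASHES):
            digest = hashlib.sha256(f"{i}:{term}".encode()).hexdigest()
            sig |= 1 << (int(digest, 16) % SIGNATURE_BITS)
    return sig

def matches(query_sig, doc_sig):
    # Every query bit must be present in the document signature;
    # false positives are possible, false negatives are not.
    return query_sig & doc_sig == query_sig

doc_sigs = {
    "D1": signature({"bayes", "probability"}),
    "D2": signature({"probability", "decision-making"}),
}
query_sig = signature({"probability", "decision-making"})
print([d for d, s in doc_sigs.items() if matches(query_sig, s)])   # likely ['D2']
```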


Inverted file

An inverted index file contains two parts: a vocabulary containing all the terms used in the collection, and, for each distinct term, an inverted list that records every document that mentions that term.
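A minimal sketch of this structure (hypothetical helper names): the vocabulary maps each term to the set of documents containing it, so an AND query intersects the per-term lists and an OR query unions them.

```python
# Inverted-file sketch: term -> set of documents containing that term.
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict mapping a document id to its set of terms."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index

documents = {
    "D1": {"bayes", "probability"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "epistemology"},
}
index = build_inverted_index(documents)

# "probability" AND "decision-making": intersect the two per-term document sets
print(index["probability"] & index["decision-making"])   # -> {'D2'}
```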

