Definitions
An ''index term'' is a word or expression, which may be stemmed, describing or characterizing a document, such as a keyword given for a journal article. Let $T$ be the set of all such index terms. A ''document'' is any subset of $T$. Let $D$ be the set of all documents.

$T$ is a list of words or short phrases (index terms). Each of them is named $t_n$, where $n$ is the position of the term in the list. You can think of $T$ as "Terms" and of $t_n$ as "index term ''n''".

Index terms can occur in documents. The documents in turn form a list $D$, in which each individual document is called $D_n$. A document $D_n$ can contain index terms $t_n$; for instance, $D_1$ ''could'' contain the terms $t_1$ and $t_2$ from $T$. There is an example of this in the following section.

Index terms should generally be words that carry more meaning and correspond to what an article or document is about. Terms like "the" and "like" would appear in nearly all documents, whereas "Bayesian" would appear in only a small fraction of documents. Therefore, rarer terms like "Bayesian" are a better choice for inclusion in $T$. This relates to
Example
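As a rough illustration of the definitions, the sketch below models index terms and documents as Python sets and evaluates one Boolean query against them. The term names, the three toy documents, and the `retrieve` helper are assumptions made for illustration only; they are not the documents of the worked example.

```python
# Hypothetical toy data: term names and documents are illustrative only.
T = {"bayes", "principle", "probability", "decision", "theory"}  # index terms

# Each document is a subset of T.
D = {
    "D1": frozenset({"bayes", "principle", "probability"}),
    "D2": frozenset({"bayes", "decision", "theory", "probability"}),
    "D3": frozenset({"decision", "theory"}),
}


def retrieve(query, documents):
    """Return the names of the documents for which the Boolean query holds."""
    return {name for name, terms in documents.items() if query(terms)}


def query(terms):
    """probability AND (principle OR decision) AND NOT theory"""
    return (
        "probability" in terms
        and ("principle" in terms or "decision" in terms)
        and "theory" not in terms
    )


result = retrieve(query, D)
print(sorted(result))  # ['D1']
```

Only `D1` satisfies the query: `D2` is excluded by the `NOT theory` clause and `D3` lacks `probability`. A document either matches or it does not; there is no partial match.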
Let the set of original (real) documents be, for example, $O = \{O_1, O_2, O_3\}$, where

$O_1$ = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

$O_2$ = "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."

$O_3$ = "Bayesian
Advantages
* Clean formalism
* Easy to implement
* Intuitive concept
* If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
* It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query.
Disadvantages
* Exact matching may retrieve too few or too many documents
* Hard to translate a query into a Boolean expression
* Ineffective for Search-Resistant Concepts
* All terms are equally weighted
* More like ''data retrieval'' than ''information retrieval''
* Retrieval based on binary decision criteria with no notion of partial matching
* No ranking of the documents is provided (absence of a grading scale)
* Information need has to be translated into a Boolean expression, which most users find awkward
* The Boolean queries formulated by the users are most often too simplistic
* The model frequently returns either too few or too many documents in response to a user query
Data structures and algorithms
From a purely formal mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems relating to algorithms and data structures must be solved, such as the choice of terms (manual or automatic selection or both), stemming,
Hash sets
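To make the comparison between hash sets and bit vectors concrete, here is a minimal sketch that stores the same document both ways and runs a conjunctive (AND) query on each. The vocabulary, the document, and the helper names `contains_all_bits` and `contains_all_set` are hypothetical and chosen only for illustration.

```python
# Hypothetical vocabulary and document; names are illustrative only.
vocabulary = ["bayes", "decision", "principle", "probability", "theory"]
position = {term: i for i, term in enumerate(vocabulary)}

doc_terms = ["bayes", "probability"]

# Bit-vector form: one bit per vocabulary term, set if the term occurs.
doc_bits = 0
for term in doc_terms:
    doc_bits |= 1 << position[term]

# Hash-set form: only the terms that actually occur are stored.
doc_set = set(doc_terms)


def contains_all_bits(bits, terms):
    """AND query on a bit vector: every query bit must be set."""
    mask = 0
    for term in terms:
        mask |= 1 << position[term]
    return (bits & mask) == mask


def contains_all_set(term_set, terms):
    """AND query on a hash set: every query term must be a member."""
    return all(term in term_set for term in terms)


print(contains_all_bits(doc_bits, ["bayes", "probability"]))  # True
print(contains_all_set(doc_set, ["bayes", "theory"]))         # False
```

The bit vector needs one bit for every term in the vocabulary regardless of how few terms a document contains, while the set stores only the terms present, at the cost of hashing on every lookup.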
Another possibility is to use hash sets. Each document is represented by a hash table that contains every term of that document. Because a hash table grows and shrinks dynamically as terms are added and removed, each document occupies much less memory than with a bit vector. However, performance is slower because the operations are more complex than with bit vectors. In the worst case, performance can degrade from O(''n'') to O(''n''²). In the average case, the slowdown relative to bit vectors is modest and the space usage is much more efficient.
Signature file
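A minimal sketch of a signature file follows, assuming 64-bit signatures, three bit positions per word, and salted SHA-256 digests as the hash family. These parameters, the toy documents, and the helper names (`word_positions`, `signature`, `may_match`) are illustrative choices, not something the model prescribes.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed fixed signature length
NUM_HASHES = 3       # assumed number of bit positions set per word


def word_positions(word):
    """Derive NUM_HASHES bit positions for a word from salted SHA-256 digests."""
    positions = []
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{word}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % SIGNATURE_BITS)
    return positions


def signature(words):
    """Superimpose the bits of every word into one fixed-length bitstring."""
    sig = 0
    for word in words:
        for pos in word_positions(word):
            sig |= 1 << pos
    return sig


def may_match(doc_sig, query_sig):
    """True if every query bit is also set in the document signature.

    True can be a false positive, so candidates still need checking
    against the real document; False is always definitive.
    """
    return (doc_sig & query_sig) == query_sig


# Hypothetical documents, for illustration only.
documents = {
    "D1": ["bayes", "principle", "probability"],
    "D2": ["decision", "theory", "utility"],
}
signature_file = {name: signature(words) for name, words in documents.items()}

query_sig = signature(["bayes", "probability"])
candidates = [name for name, sig in signature_file.items()
              if may_match(sig, query_sig)]
print(candidates)
```

A document that contains all query words is always reported as a candidate. A document that does not may occasionally be reported as well, because different words can set the same bits.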
Each document can be summarized by a Bloom filter representing the set of words in that document, stored in a fixed-length bitstring called a signature. The signature file contains one such superimposed-code bitstring for every document in the collection. Each query can likewise be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature. (Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao)
Inverted file
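The following sketch shows one way an inverted file could be built and queried, assuming a small in-memory collection. The document ids, the term lists, and the helpers `build_inverted_file`, `and_query`, and `or_query` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Hypothetical collection, for illustration only.
documents = {
    1: ["bayes", "principle", "probability"],
    2: ["bayes", "decision", "theory", "probability"],
    3: ["decision", "theory"],
}


def build_inverted_file(docs):
    """Map each distinct term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


inverted_file = build_inverted_file(documents)
vocabulary = sorted(inverted_file)  # all distinct terms in the collection


def and_query(index, terms):
    """Documents containing every query term: intersect the posting lists."""
    lists = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []


def or_query(index, terms):
    """Documents containing at least one query term: union of the lists."""
    result = set()
    for term in terms:
        result.update(index.get(term, ()))
    return sorted(result)


print(inverted_file["bayes"])                            # [1, 2]
print(and_query(inverted_file, ["bayes", "theory"]))     # [2]
print(or_query(inverted_file, ["principle", "theory"]))  # [1, 2, 3]
```

With this layout a Boolean query touches only the lists of the terms it mentions, instead of scanning every document in the collection.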
An inverted index file contains two parts: a vocabulary containing all the terms used in the collection and, for each distinct term, an inverted index that lists every document that mentions that term.
References
* Lashkari, A.H.; Mahdavi, F.; Ghomi, V. (2009), "A Boolean Model in Information Retrieval for Search Engines", ''2009 International Conference on Information Management and Engineering'', pp. 385–389, doi:10.1109/ICIME.2009.101, ISBN 978-0-7695-3595-1, S2CID 18147603