Binary Independence Model

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make estimating the probability of document/query similarity feasible.


Definitions

The Binary Independence Assumption is that documents are binary vectors: only the presence or absence of each term in a document is recorded. Terms are independently distributed in the set of relevant documents, and they are also independently distributed in the set of irrelevant documents. The representation is an ordered set of Boolean variables; that is, a document or query is represented by a vector with one Boolean element for each term under consideration. More specifically, a document ''d'' is represented by a vector x = (x_1, \ldots, x_m), where x_t = 1 if term ''t'' is present in the document ''d'' and x_t = 0 if it is not. With this simplification, many documents can have the same vector representation. Queries are represented in a similar way.

"Independence" signifies that terms in the document are considered independently of each other and no association between terms is modeled. This assumption is very limiting, but it has been shown to give good enough results in many situations. This independence is the "naive" assumption of a Naive Bayes classifier, where properties that imply each other are nonetheless treated as independent for the sake of simplicity. The assumption allows the representation to be treated as an instance of a vector space model by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.

The probability P(R|d,q) that a document is relevant derives from the probability of relevance of the term vector of that document, P(R|x,q). By using Bayes' rule we get:
: P(R|x,q) = \frac{P(x|R,q) \cdot P(R|q)}{P(x|q)}
where P(x|R=1,q) and P(x|R=0,q) are the probabilities that, if a relevant or nonrelevant document, respectively, is retrieved, that document's representation is ''x''. The exact probabilities cannot be known beforehand, so estimates from statistics about the collection of documents must be used.

P(R=1|q) and P(R=0|q) indicate the prior probability of retrieving a relevant or nonrelevant document, respectively, for a query ''q''. If, for instance, we knew the percentage of relevant documents in the collection, we could use it to estimate these probabilities. Since a document is either relevant or nonrelevant to a query, we have:
: P(R=1|x,q) + P(R=0|x,q) = 1
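
The binary representation and the Bayes' rule step described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the three-term vocabulary, the whitespace tokenizer, the per-term probabilities, and the prior are invented toy values, not estimates from any real collection. The likelihood P(x|R,q) is factored into per-term factors, which is what the independence assumption permits.

```python
from math import prod

def to_binary_vector(text, vocabulary):
    """Record only presence (1) or absence (0) of each vocabulary term."""
    tokens = set(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [1 if term in tokens else 0 for term in vocabulary]

def likelihood(x, term_probs):
    """P(x | R, q) under term independence: a product of per-term factors."""
    return prod(p if xi == 1 else 1.0 - p for xi, p in zip(x, term_probs))

def posterior_relevance(x, p_rel, p_nonrel, prior_rel):
    """P(R=1 | x, q) by Bayes' rule; P(x | q) is the sum over both values of R."""
    joint_rel = likelihood(x, p_rel) * prior_rel
    joint_nonrel = likelihood(x, p_nonrel) * (1.0 - prior_rel)
    return joint_rel / (joint_rel + joint_nonrel)

# Hypothetical toy values, chosen only for illustration.
vocabulary = ["binary", "independence", "model"]
p_rel = [0.8, 0.6, 0.5]     # assumed P(term present | R=1, q)
p_nonrel = [0.2, 0.1, 0.5]  # assumed P(term present | R=0, q)
prior_rel = 0.1             # assumed P(R=1 | q)

# Two different texts collapse to the same binary vector.
d1 = to_binary_vector("binary binary model", vocabulary)
d2 = to_binary_vector("A model that is binary", vocabulary)

posterior = posterior_relevance(d1, p_rel, p_nonrel, prior_rel)
print(d1, d2, posterior)
```

Term frequency and word order are discarded, so both texts map to the same vector and therefore receive the same estimated probability of relevance.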


Query Terms Weighting

Given a binary query and the dot product as the similarity function between a document and a query, the problem is to assign weights to the terms in the query such that retrieval effectiveness is high. Let p_i and q_i be the probabilities that a relevant document and an irrelevant document, respectively, contain the ''i''-th term. Yu and Salton, who first introduced BIM, proposed that the weight of the ''i''-th term be an increasing function of Y_i = \frac{p_i (1 - q_i)}{(1 - p_i) q_i}. Thus, if Y_i is higher than Y_j, the weight of term ''i'' will be higher than that of term ''j''. Yu and Salton showed that such a weight assignment to query terms yields better retrieval effectiveness than weighting all query terms equally. Robertson and Spärck Jones later showed that if the ''i''-th term is assigned the weight \log Y_i, then optimal retrieval effectiveness is obtained under the Binary Independence Assumption.

The name Binary Independence Model was coined by Robertson and Spärck Jones, who used the log-odds probability of the probabilistic relevance model to derive \log Y_i. Luk showed that the log-odds probability is rank equivalent to the probability of relevance (i.e., P(R|d,q)), so ranking by it obeys the probability ranking principle.
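
As a rough illustration of the weighting described above, the following Python sketch computes \log Y_i for each query term and ranks binary document vectors by their dot product with the weighted query. The values of p_i and q_i, the document vectors, and the function names are hypothetical; in practice the probabilities have to be estimated from collection statistics and must lie strictly between 0 and 1 for the logarithm to be defined.

```python
import math

def term_weight(p_i, q_i):
    """log Y_i with Y_i = p_i (1 - q_i) / ((1 - p_i) q_i); needs 0 < p_i, q_i < 1."""
    return math.log(p_i * (1.0 - q_i) / ((1.0 - p_i) * q_i))

def score(doc_vector, query_weights):
    """Dot product between a binary document vector and the weighted query."""
    return sum(x * w for x, w in zip(doc_vector, query_weights))

# Hypothetical per-term probabilities for a three-term query.
p = [0.8, 0.6, 0.5]  # assumed P(term i present | relevant)
q = [0.2, 0.1, 0.5]  # assumed P(term i present | irrelevant)
weights = [term_weight(p_i, q_i) for p_i, q_i in zip(p, q)]

# Hypothetical binary document vectors over the same three terms.
docs = {"d1": [1, 0, 1], "d2": [0, 1, 0], "d3": [1, 1, 0]}
ranking = sorted(docs, key=lambda name: score(docs[name], weights), reverse=True)
print(weights, ranking)
```

A term that is equally likely in relevant and irrelevant documents (p_i = q_i, as for the third term here) has Y_i = 1 and weight 0, so its presence does not change a document's score.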


See also

* Bag of words model



References

* Yu, C. T.; Salton, G. (1976). "Precision Weighting – An Effective Automatic Indexing Method". Journal of the ACM. 23: 76. doi:10.1145/321921.321930. hdl:1813/7313. http://ecommons.cornell.edu/bitstream/1813/7313/1/75-232.pdf
* Robertson, S. E.; Spärck Jones, K. (1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129. doi:10.1002/asi.4630270302.
* Luk, R. W. P. (2022). "Why is information retrieval a scientific discipline?". Foundations of Science. 27 (2): 427-453. doi:10.1007/s10699-020-09685-x.
* Robertson, S. E. (1977). "The Probability Ranking Principle in IR". Journal of Documentation. 33 (4): 294-304. doi:10.1108/eb026647.