HOME

TheInfoList



OR:

BitFunnel is the
search engine indexing Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and ...
algorithm and a set of components used in the Bing search engine, which were made open source in 2016. BitFunnel uses bit-sliced signatures instead of an
inverted index In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of do ...
in an attempt to reduce operations cost.


History

Progress on the implementation of BitFunnel was made public in early 2016, with the expectation that there would be a usable implementation later that year. In September 2016, the source code was made available via
GitHub GitHub, Inc. () is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous ...
. A paper discussing the BitFunnel algorithm and implementation was released as through the
Special Interest Group on Information Retrieval SIGIR is the Association for Computing Machinery's Special Interest Group on Information Retrieval. The scope of the group's specialty is the theory and application of computers to the acquisition, organization, storage, retrieval and distributi ...
of the
Association for Computing Machinery The Association for Computing Machinery (ACM) is a US-based international learned society for computing. It was founded in 1947 and is the world's largest scientific and educational computing society. The ACM is a non-profit professional member ...
in 2017 and won the Best Paper Award.


Components

BitFunnel consists of three major components: * BitFunnel – the text search/retrieval system itself * WorkBench – a tool for preparing text for use in BitFunnel * NativeJIT – a software component that takes expressions that use C
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
s and transforms them into highly optimized
assembly code In computer programming, assembly language (or assembler language, or symbolic machine code), often referred to simply as Assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence be ...


Algorithm


Initial problem and solution overview

The BitFunnel paper describes the "matching problem", which occurs when an algorithm must identify documents through the usage of keywords. The goal of the problem is to identify a set of matches given a corpus to search and a query of keyword terms to match against. This problem is commonly solved through
inverted index In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of do ...
es, where each searchable item is maintained with a
map A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes. Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Although ...
of keywords. In contrast, BitFunnel represents each searchable item through a signature. A signature is a sequence of bits which describe a
Bloom filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
of the searchable terms in a given searchable item. The bloom filter is constructed through hashing through several bit positions.


Theoretical implementation of bit-string signatures

The signature of a document (D) can be described as the logical-or of its term signatures: \overrightarrow=\bigcup \limits_ \overrightarrow Similarly, a query for a document (Q) can be defined as a union: \overrightarrow=\bigcup \limits_ \overrightarrow Additionally, a document D is a member of the set ''M when the following condition is satisfied: \overrightarrow \cap \overrightarrow = \overrightarrow This knowledge is then combined to produce a formula where ''M is identified by documents which match the query signature: M'=\ These steps and their proofs are discussed in the 2017 paper.


Pseudocode for bit-string signatures

This algorithm is described in the 2017 paper. :\begin M'=\emptyset \\ \texttt\ \texttt\ D \in C\ \texttt \\ \qquad \texttt\ \overrightarrow \cap \overrightarrow = \overrightarrow\ \texttt \\ \qquad \qquad M' = M' \cup \ \\ \qquad \texttt \\ \texttt \end


References


External links


BitFunnel · GitHubBitFunnel · BitFunnel
{{Microsoft Research Data management Search algorithms Database index techniques Free and open-source software Microsoft free software Microsoft Research