HOME

TheInfoList



OR:

A Bloom filter is a space-efficient
probabilistic Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a
set Set, The Set, SET or SETS may refer to: Science, technology, and mathematics Mathematics *Set (mathematics), a collection of elements *Category of sets, the category whose objects and morphisms are sets and total functions, respectively Electro ...
.
False positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result ...
matches are possible, but
false negatives A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result ...
are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free
hashing Hash, hashes, hash mark, or hashing may refer to: Substances * Hash (food), a coarse mixture of ingredients * Hash, a nickname for hashish, a cannabis product Hash mark * Hash mark (sports), a marking on hockey rinks and gridiron football fiel ...
techniques were applied. He gave the example of a
hyphenation algorithm Syllabification () or syllabication (), also known as hyphenation, is the separation of a word into syllables, whether spoken, written or signed. Overview The written separation into syllables is usually marked by a hyphen when using English or ...
for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient
core memory Core or cores may refer to: Science and technology * Core (anatomy), everything except the appendages * Core (manufacturing), used in casting and molding * Core (optical fiber), the signal-carrying portion of an optical fiber * Core, the central ...
, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses. More generally, fewer than 10
bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s per element are required for a 1% false positive probability, independent of the size or number of elements in the set.


Algorithm description

An ''empty Bloom filter'' is a
bit array A bit array (also known as bitmask, bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level p ...
of bits, all set to 0. There must also be different
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
s defined, each of which
maps A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes. Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Although ...
or hashes some set element to one of the array positions, generating a uniform random distribution. Typically, is a small constant which depends on the desired false error rate , while is proportional to and the number of elements to be added. To ''add'' an element, feed it to each of the hash functions to get array positions. Set the bits at all these positions to 1. To ''query'' for an element (test whether it is in the set), feed it to each of the hash functions to get array positions. If ''any'' of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, ''or'' the bits have by chance been set to 1 during the insertion of other elements, resulting in a
false positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result ...
. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem. The requirement of designing different independent hash functions can be prohibitive for large . For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass different initial values (such as 0, 1, ...,  − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger and/or , independence among the hash functions can be relaxed with negligible increase in false positive rate. (Specifically, show the effectiveness of deriving the indices using enhanced double hashing and triple hashing, variants of
double hashing Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is ...
that are effectively simple random number generators seeded with the two or three hash values.) Removing an element from this simple Bloom filter is impossible because there is no way to tell which of the bits it maps to should be cleared. Although setting any one of those bits to zero suffices to remove the element, it would also remove any other elements that happen to map onto that bit. Since the simple algorithm provides no way to determine whether any other elements have been added that affect the bits for the element to be removed, clearing any of the bits would introduce the possibility of false negatives. One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.


Space and time advantages

While risking false positives, Bloom filters have a substantial space advantage over other data structures for representing sets, such as
self-balancing binary search tree In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.Donald ...
s,
trie In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes def ...
s,
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s, or simple
arrays An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
or
linked list In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes whic ...
s of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings ( are an exception since they can share storage between elements with equal prefixes). However, Bloom filters do not store the data items at all, and a separate solution must be provided for the actual storage. Linked structures incur an additional linear space overhead for pointers. A Bloom filter with a 1% error and an optimal value of , in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element. However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic
bit array A bit array (also known as bitmask, bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level p ...
, which requires only one bit for each potential element. Hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with . Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, , completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its lookups are independent and can be parallelized. To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when . If , then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The
information content In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative wa ...
of the array relative to its size is low. The generalized Bloom filter ( greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters ( and ) are chosen well, about half of the bits will be set, and these will be apparently random, minimizing redundancy and maximizing information content.


Probability of false positives

Assume that a
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
selects each array position with equal probability. If ''m'' is the number of bits in the array, the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is :1-\frac. If ''k'' is the number of hash functions and each has no significant correlation between each other, then the probability that the bit is not set to 1 by any of the hash functions is :\left(1-\frac\right)^k. We can use the well-known identity for ''e''−1 :\lim_ \left(1 - \frac\right)^m = \frac to conclude that, for large ''m'', :\left(1-\frac\right)^k = \left(\left(1-\frac\right)^m\right)^ \approx e^. If we have inserted ''n'' elements, the probability that a certain bit is still 0 is :\left(1-\frac\right)^ \approx e^; the probability that it is 1 is therefore :1-\left(1-\frac\right)^ \approx 1 - e^. Now test membership of an element that is not in the set. Each of the ''k'' array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
to erroneously claim that the element is in the set, is often given as :\varepsilon = \left(1-\left -\frac\right\right)^k \approx \left( 1-e^ \right)^k. This is not strictly correct as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation we have that the probability of false positives decreases as ''m'' (the number of bits in the array) increases, and increases as ''n'' (the number of inserted elements) increases. The true probability of a false positive, without assuming independence, is : \frac\sum_^m i^k i! \left\ where the denote
Stirling numbers of the second kind In mathematics, particularly in combinatorics, a Stirling number of the second kind (or Stirling partition number) is the number of ways to partition a set of ''n'' objects into ''k'' non-empty subsets and is denoted by S(n,k) or \textstyle \lef ...
. An alternative analysis arriving at the same approximation without the assumption of independence is given by Mitzenmacher and Upfal. After all ''n'' items have been added to the Bloom filter, let ''q'' be the fraction of the ''m'' bits that are set to 0. (That is, the number of bits still set to 0 is ''qm''.) Then, when testing membership of an element not in the set, for the array position given by any of the ''k'' hash functions, the probability that the bit is found set to 1 is 1-q. So the probability that all ''k'' hash functions find their bit set to 1 is (1 - q)^k. Further, the expected value of ''q'' is the probability that a given array position is left untouched by each of the ''k'' hash functions for each of the ''n'' items, which is (as above) : E = \left(1 - \frac\right)^. It is possible to prove, without the independence assumption, that ''q'' is very strongly concentrated around its expected value. In particular, from the
Azuma–Hoeffding inequality In probability theory, the Azuma–Hoeffding inequality (named after Kazuoki Azuma and Wassily Hoeffding) gives a concentration result for the values of martingales that have bounded differences. Suppose \ is a martingale (or super-martingale) ...
, they prove that : \Pr(\left, q - E \ge \frac) \le 2\exp(-2\lambda^2/kn) Because of this, we can say that the exact probability of false positives is : \sum_ \Pr(q = t) (1 - t)^k \approx (1 - E ^k = \left(1-\left -\frac\right\right)^k \approx \left( 1-e^ \right)^k as before.


Optimal number of hash functions

The number of hash functions, ''k'', must be a positive integer. Putting this constraint aside, for a given ''m'' and ''n'', the value of ''k'' that minimizes the false positive probability is :k = \frac \ln 2. The required number of bits, ''m'', given ''n'' (the number of inserted elements) and a desired false positive probability ''ε'' (and assuming the optimal value of ''k'' is used) can be computed by substituting the optimal value of ''k'' in the probability expression above: :\varepsilon = \left( 1-e^ \right)^ which can be simplified to: :\ln\varepsilon = -\frac \left(\ln 2\right)^2. This results in: :m=-\frac So the optimal number of bits per element is :\frac=-\frac\approx-1.44\log_2\varepsilon with the corresponding number of hash functions ''k'' (ignoring integrality): :k=-\frac=-\log_2\varepsilon. This means that for a given false positive probability ''ε'', the length of a Bloom filter ''m'' is proportionate to the number of elements being filtered ''n'' and the required number of hash functions only depends on the target false positive probability ''ε''. The formula m=-\frac is approximate for three reasons. First, and of least concern, it approximates 1 - \frac as e^, which is a good asymptotic approximation (i.e., which holds as ''m'' →∞). Second, of more concern, it assumes that during the membership test the event that one tested bit is set to 1 is independent of the event that any other tested bit is set to 1. Third, of most concern, it assumes that k = \frac \ln 2 is fortuitously integral. Goel and Gupta, however, give a rigorous upper bound that makes no approximations and requires no assumptions. They show that the false positive probability for a finite Bloom filter with ''m'' bits ( m > 1), ''n'' elements, and ''k'' hash functions is at most :\varepsilon \leq \left( 1-e^ \right)^k. This bound can be interpreted as saying that the approximate formula \left( 1-e^ \right)^k can be applied at a penalty of at most half an extra element and at most one fewer bit.


Approximating the number of items in a Bloom filter

showed that the number of items in a Bloom filter can be approximated with the following formula, : n^* = -\frac \ln \left 1 - \frac \right where n^* is an estimate of the number of items in the filter, m is the length (size) of the filter, k is the number of hash functions, and X is the number of bits set to one.


The union and intersection of sets

Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets. Bloom filters can be used to approximate the size of the intersection and union of two sets. showed that for two Bloom filters of length , their counts, respectively can be estimated as : n(A^*) = -\frac \ln \left 1 - \frac m \right and : n(B^*) = -\frac \ln \left 1 - \frac m \right The size of their union can be estimated as : n(A^*\cup B^*) = -\frac \ln \left 1 - \frac m \right where n(A \cup B) is the number of bits set to one in either of the two Bloom filters. Finally, the intersection can be estimated as : n(A^*\cap B^*) = n(A^*) + n(B^*) - n(A^* \cup B^*), using the three formulas together.


Interesting properties

* Unlike a standard
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
using
open addressing Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternative locations in the array (the ''probe sequence'') until either the t ...
for collision resolution, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure "filling up". However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point ''all'' queries yield a positive result. With open addressing hashing, false positives are never produced, but performance steadily deteriorates until it approaches linear search. *
Union Union commonly refers to: * Trade union, an organization of workers * Union (set theory), in mathematics, a fundamental operation on sets Union may also refer to: Arts and entertainment Music * Union (band), an American rock group ** ''Un ...
and
intersection In mathematics, the intersection of two or more objects is another object consisting of everything that is contained in all of the objects simultaneously. For example, in Euclidean geometry, when two lines in a plane are not parallel, their i ...
of Bloom filters with the same size and set of hash functions can be implemented with bitwise OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersect operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets. * Some kinds of
superimposed code A superimposed code such as Zatocoding is a kind of hash code that was popular in marginal punched-card systems. Marginal punched-card systems Many names, some of them trademarked, have been used for marginal punched-card systems: edge-notc ...
can be seen as a Bloom filter implemented with physical
edge-notched card Edge-notched cards or edge-punched cards are a system used to store a small amount of binary or logical data on paper index cards, encoded via the presence or absence of notches in the edges of the cards. The notches allowed efficient sorting and s ...
s. An example is
Zatocoding A superimposed code such as Zatocoding is a kind of hash code that was popular in marginal punched-card systems. Marginal punched-card systems Many names, some of them trademarked, have been used for marginal punched-card systems: edge-notch ...
, invented by
Calvin Mooers Calvin Northrup Mooers (October 24, 1919 – December 1, 1994), was an American computer scientist known for his work in information retrieval and for the programming language TRAC. Early life Mooers was a native of Minneapolis, Minnesota, atte ...
in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category.


Examples

*
Fruit flies Fruit fly may refer to: Organisms * Drosophilidae, a family of small flies, including: ** ''Drosophila'', the genus of small fruit flies and vinegar flies ** ''Drosophila melanogaster'' or common fruit fly ** ''Drosophila suzukii'' or Asian fruit ...
use a modified version of Bloom filters to detect novelty of odors, with additional features including similarity of novel odor to that of previously experienced examples, and time elapsed since previous experience of the same odor. *The servers of
Akamai Technologies Akamai Technologies, Inc. is an American content delivery networkJ. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. (CDN), cybersecurity, and cloud service company, providing web and Internet security services. Akamai's Inte ...
, a
content delivery Digital distribution, also referred to as content delivery, online distribution, or electronic software distribution, among others, is the delivery or distribution of digital media content (media), content such as Sound recording and reproductio ...
provider, use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates. * Google
Bigtable Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio. History Bigtable development began in 2004.. It is now used by a number of Googl ...
,
Apache HBase HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File Syst ...
and
Apache Cassandra Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandr ...
and
PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation. * The
Google Chrome Google Chrome is a cross-platform web browser developed by Google. It was first released in 2008 for Microsoft Windows, built with free software components from Apple WebKit and Mozilla Firefox. Versions were later released for Linux, macOS ...
web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result). * Microsoft
Bing (search engine) Microsoft Bing (commonly known as Bing) is a web search engine owned and operated by Microsoft. The service has its origins in Microsoft's previous search engines: MSN Search, Windows Live Search and later Live Search. Bing provides a variety ...
uses multi-level hierarchical Bloom filters for its search index,
BitFunnel BitFunnel is the search engine indexing algorithm and a set of components used in the Bing search engine, which were made open source in 2016. BitFunnel uses bit-sliced signatures instead of an inverted index in an attempt to reduce operations ...
. Bloom filters provided lower cost than the previous Bing index, which was based on
inverted files In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of do ...
. * The
Squid True squid are molluscs with an elongated soft body, large eyes, eight arms, and two tentacles in the superorder Decapodiformes, though many other molluscs within the broader Neocoleoidea are also called squid despite not strictly fitting t ...
Web Web most often refers to: * Spider web, a silken structure created by the animal * World Wide Web or the Web, an Internet-based hypertext system Web, WEB, or the Web may also refer to: Computing * WEB, a literate programming system created by ...
Proxy
Cache Cache, caching, or caché may refer to: Places United States * Cache, Idaho, an unincorporated community * Cache, Illinois, an unincorporated community * Cache, Oklahoma, a city in Comanche County * Cache, Utah, Cache County, Utah * Cache County ...
uses Bloom filters for cache digests. *
Bitcoin Bitcoin ( abbreviation: BTC; sign: ₿) is a decentralized digital currency that can be transferred on the peer-to-peer bitcoin network. Bitcoin transactions are verified by network nodes through cryptography and recorded in a public distr ...
used Bloom filters to speed up wallet synchronization until privacy vulnerabilities with the implementation of Bloom filters were discovered. * The
Venti Venti is a network storage system that permanently stores data blocks. A 160-bit SHA-1 hash of the data (called ''score'' by Venti) acts as the address of the data. This enforces a ''write-once'' policy since no other data block can be found wi ...
archival storage system uses Bloom filters to detect previously stored data. * The
SPIN model checker SPIN is a general tool for verifying the correctness of concurrent software models in a rigorous and mostly automated fashion. It was written by Gerard J. Holzmann and others in the original Unix group of the Computing Sciences Research Center ...
uses Bloom filters to track the reachable state space for large verification problems. * The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called Bloom join in the database literature). * The
Exim Exim is a mail transfer agent (MTA) used on Unix-like operating systems. Exim is free software distributed under the terms of the GNU General Public License, and it aims to be a general and flexible mailer with extensive facilities for checking ...
mail transfer agent (MTA) uses Bloom filters in its rate-limit feature. *
Medium Medium may refer to: Science and technology Aviation *Medium bomber, a class of war plane * Tecma Medium, a French hang glider design Communication * Media (communication), tools used to store and deliver information or data * Medium of ...
uses Bloom filters to avoid recommending articles a user has previously read. *
Ethereum Ethereum is a decentralized, open-source blockchain with smart contract functionality. Ether (Abbreviation: ETH; sign: Ξ) is the native cryptocurrency of the platform. Among cryptocurrencies, ether is second only to bitcoin in market capita ...
uses Bloom filters for quickly finding logs on the Ethereum blockchain. *
Grafana Grafana is a multi-platform open source analytics and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources. A licensed Grafana Enterprise version with additional ...
Tempo uses Bloom filters to improve query performance by storing bloom filters for each backend block. These are accessed on each query to determine the blocks containing data that meets the supplied search criteria


Alternatives

Classic Bloom filters use 1.44\log_2(1/\varepsilon) bits of space per inserted key, where \varepsilon is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only \log_2(1/\varepsilon) per key. Hence Bloom filters use 44% more space than an equivalent optimal data structure. Instead, Pagh et al. provide an optimal-space data structure. Moreover, their data structure has constant
locality of reference In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference localit ...
independent of the false positive rate, unlike Bloom filters, where a lower false positive rate \varepsilon leads to a greater number of memory accesses per query, \log(1/\varepsilon). Also, it allows elements to be deleted without a space penalty, unlike Bloom filters. The same improved properties of optimal space usage, constant locality of reference, and the ability to delete elements are also provided by the cuckoo filter of , an open source implementation of which is available. describe a probabilistic structure based on
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s, hash compaction, which identify as significantly more accurate than a Bloom filter when each is configured optimally. Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time. have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to locate the k hash values associated with each key into one or two blocks having the same size as processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory
cache misses In computing, a cache ( ) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewher ...
. The proposed variants have however the drawback of using about 32% more space than classic Bloom filters. The space efficient variant relies on using a single hash function that generates for each key a value in the range \left ,n/\varepsilon\right/math> where \varepsilon is the requested false positive rate. The sequence of values is then sorted and compressed using
Golomb coding Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Alphabets following a geometric distribution will have a Golomb code as an optimal prefix code, making Golomb ...
(or some other compression technique) to occupy a space close to n\log_2(1/\varepsilon) bits. To query the Bloom filter for a given key, it will suffice to check if its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem the sequence of values is divided into small blocks of equal size that are compressed separately. At query time only half a block will need to be decompressed on average. Because of decompression overhead, this variant may be slower than classic Bloom filters but this may be compensated by the fact that a single hash function needs to be computed. Another alternative to classic Bloom filter is the cuckoo filter, based on space-efficient variants of
cuckoo hashing Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick ...
. In this case, a hash table is constructed, holding neither keys nor values, but short fingerprints (small hashes) of the keys. If looking up the key finds a matching fingerprint, then key is probably in the set. One useful property of cuckoo filters is that they are updatable; entries may be dynamically added (with a small probability of failure due to the hash table being full) and removed. describes an approach called an xor filter, where they store fingerprints in a particular type of
perfect hash In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function. Perfect hash functions may be used to impl ...
table, producing a filter which is more memory efficient (1.23\log_2(1/\varepsilon) bits per key) and faster than Bloom or cuckoo filters. (The time saving comes from the fact that a lookup requires exactly three memory accesses, which can all execute in parallel.) However, filter creation is more complex than Bloom and cuckoo filters, and it is not possible to modify the set after creation.


Extensions and applications

There are over 60 variants of Bloom filters, many surveys of the field, and a continuing churn of applications (see e.g., Luo, ''et al'' ). Some of the variants differ sufficiently from the original proposal to be breaches from or forks of the original data structure and its philosophy. A treatment which unifies Bloom filters with other work on
random projections In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are known for their power, simplicity, and low error rates when compared ...
,
compressive sensing Compressed sensing (also known as compressive sensing, compressive sampling, or sparse sampling) is a signal processing technique for efficiently acquiring and reconstructing a Signal (electronics), signal, by finding solutions to Underdetermined ...
, and
locality sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...
remains to be done (though see Dasgupta, ''et al'' for one attempt inspired by neuroscience).


Cache filtering

Content delivery network A content delivery network, or content distribution network (CDN), is a geographically distributed network of proxy servers and their data centers. The goal is to provide high availability and performance by distributing the service spatially re ...
s deploy
web cache A Web cache (or HTTP cache) is a system for optimizing the World Wide Web. It is implemented both client-side and server-side. The caching of multimedias and other files can result in less overall delay when browsing the Web. Parts of the syste ...
s around the world to cache and serve web content to users with greater performance and reliability. A key application of Bloom filters is their use in efficiently determining which web objects to store in these web caches. Nearly three-quarters of the URLs accessed from a typical web cache are "one-hit-wonders" that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request. The use of a Bloom filter in this fashion significantly reduces the disk write workload, since most one-hit-wonders are not written to the disk cache. Further, filtering out the one-hit-wonders also saves cache space on disk, increasing the cache hit rates.


Avoiding false positives in a finite universe

Kiss ''et al'' described a new construction for the Bloom filter that avoids false positives in addition to the typical non-existence of false negatives. The construction applies to a finite universe from which set elements are taken. It relies on existing non-adaptive combinatorial group testing scheme by Eppstein, Goodrich and Hirschberg. Unlike the typical Bloom filter, elements are hashed to a bit array through deterministic, fast and simple-to-calculate functions. The maximal set size for which false positives are completely avoided is a function of the universe size and is controlled by the amount of allocated memory.


Counting Bloom filters

Counting filters provide a way to implement a ''delete'' operation on a Bloom filter without recreating the filter afresh. In a counting filter, the array positions (buckets) are extended from being a single bit to being a multibit counter. In fact, regular Bloom filters can be considered as counting filters with a bucket size of one bit. Counting filters were introduced by . The insert operation is extended to ''increment'' the value of the buckets, and the lookup operation checks that each of the required buckets is non-zero. The delete operation then consists of decrementing the value of each of the respective buckets.
Arithmetic overflow Arithmetic () is an elementary part of mathematics that consists of the study of the properties of the traditional operations on numbers—addition, subtraction, multiplication, division, exponentiation, and extraction of roots. In the 19th cen ...
of the buckets is a problem and the buckets should be sufficiently large to make this case rare. If it does occur then the increment and decrement operations must leave the bucket set to the maximum possible value in order to retain the properties of a Bloom filter. The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more space than static Bloom filters. In contrast, the data structures of and also allow deletions but use less space than a static Bloom filter. Another issue with counting filters is limited scalability. Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted. introduced a data structure based on d-left hashing that is functionally equivalent but uses approximately half as much space as counting Bloom filters. The scalability issue does not occur in this data structure. Once the designed capacity is exceeded, the keys could be reinserted in a new hash table of double size. The space efficient variant by could also be used to implement counting filters by supporting insertions and deletions. introduced a new general method based on variable increments that significantly improves the false positive probability of counting Bloom filters and their variants, while still supporting deletions. Unlike counting Bloom filters, at each element insertion, the hashed counters are incremented by a hashed variable increment instead of a unit increment. To query an element, the exact values of the counters are considered and not just their positiveness. If a sum represented by a counter value cannot be composed of the corresponding variable increment for the queried element, a negative answer can be returned to the query.
Kim et al. (2019)
shows that false positive of Counting Bloom filter decreases from k=1 to a point defined k_, and increases from k_to positive infinity, and finds k_as a function of count threshold.


Decentralized aggregation

Bloom filters can be organized in distributed
data structures In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
to perform fully decentralized computations of
aggregate function In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value. Common aggregate functions include: * Average (i.e., arithmetic mean) * C ...
s. Decentralized aggregation makes collective measurements locally available in every node of a distributed network without involving a centralized computational entity for this purpose.


Distributed Bloom filters

Parallel Bloom filters can be implemented to take advantage of the multiple
processing element This glossary of computer hardware terms is a list of definitions of terms and concepts related to computer hardware, i.e. the physical and structural components of computers, architectural issues, and peripheral devices. A ...
s (PEs) present in parallel shared-nothing machines. One of the main obstacles for a parallel Bloom filter is the organization and communication of the unordered data which is, in general, distributed evenly over all PEs at the initiation or at batch insertions. To order the data two approaches can be used, either resulting in a Bloom filter over all data being stored on each PE, called replicating bloom filter, or the Bloom filter over all data being split into equal parts, each PE storing one part of it. For both approaches a "Single Shot" Bloom filter is used which only calculates one hash, resulting in one flipped bit per element, to reduce the communication volume. Distributed Bloom filters are initiated by first hashing all elements on their local PE and then sorting them by their hashes locally. This can be done in linear time using e.g.
Bucket sort Bucket sort, or bin sort, is a sorting algorithm that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the b ...
and also allows local duplicate detection. The sorting is used to group the hashes with their assigned PE as separator to create a Bloom filter for each group. After encoding these Bloom filters using e.g.
Golomb coding Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Alphabets following a geometric distribution will have a Golomb code as an optimal prefix code, making Golomb ...
each bloom filter is sent as packet to the PE responsible for the hash values that where inserted into it. A PE p is responsible for all hashes between the values p*(s/, \text, ) and (p+1)*(s/, \text, ), where s is the total size of the Bloom filter over all data. Because each element is only hashed once and therefore only a single bit is set, to check if an element was inserted into the Bloom filter only the PE responsible for the hash value of the element needs to be operated on. Single insertion operations can also be done efficiently because the Bloom filter of only one PE has to be changed, compared to Replicating Bloom filters where every PE would have to update its Bloom filter. By distributing the global Bloom filter over all PEs instead of storing it separately on each PE the Bloom filters size can be far larger, resulting in a larger capacity and lower false positive rate.
Distributed Bloom filters can be used to improve duplicate detection algorithms by filtering out the most 'unique' elements. These can be calculated by communicating only the hashes of elements, not the elements themselves which are far larger in volume, and removing them from the set, reducing the workload for the duplicate detection algorithm used afterwards. During the communication of the hashes the PEs search for bits that are set in more than one of the receiving packets, as this would mean that two elements had the same hash and therefore could be duplicates. If this occurs a message containing the index of the bit, which is also the hash of the element that could be a duplicate, is sent to the PEs which sent a packet with the set bit. If multiple indices are sent to the same PE by one sender it can be advantageous to encode the indices as well. All elements that didn't have their hash sent back are now guaranteed to not be a duplicate and won't be evaluated further, for the remaining elements a Repartitioning algorithm can be used. First all the elements that had their hash value sent back are sent to the PE that their hash is responsible for. Any element and its duplicate is now guaranteed to be on the same PE. In the second step each PE uses a sequential algorithm for duplicate detection on the receiving elements, which are only a fraction of the amount of starting elements. By allowing a false positive rate for the duplicates, the communication volume can be reduced further as the PEs don't have to send elements with duplicated hashes at all and instead any element with a duplicated hash can simply be marked as a duplicate. As a result, the false positive rate for duplicate detection is the same as the false positive rate of the used bloom filter. The process of filtering out the most 'unique' elements can also be repeated multiple times by changing the hash function in each filtering step. If only a single filtering step is used it has to archive a small false positive rate, however if the filtering step is repeated once the first step can allow a higher false positive rate while the latter one has a higher one but also works on less elements as many have already been removed by the earlier filtering step. While using more than two repetitions can reduce the communication volume further if the number of duplicates in a set is small, the payoff for the additional complications is low. Replicating Bloom filters organize their data by using a well known
hypercube In geometry, a hypercube is an ''n''-dimensional analogue of a square () and a cube (). It is a closed, compact, convex figure whose 1- skeleton consists of groups of opposite parallel line segments aligned in each of the space's dimensions, ...
algorithm for gossiping, e.g. First each PE calculates the Bloom filter over all local elements and stores it. By repeating a loop where in each step i the PEs send their local Bloom filter over dimension i and merge the Bloom filter they receive over the dimension with their local Bloom filter, it is possible to double the elements each Bloom filter contains in every iteration. After sending and receiving Bloom filters over all \log , \text, dimensions each PE contains the global Bloom filter over all elements. Replicating Bloom filters are more efficient when the number of queries is much larger than the number of elements that the Bloom filter contains, the break even point compared to Distributed Bloom filters is approximately after , \text, * , \text, / \log_ , \text, accesses, with f^\text as the false positive rate of the bloom filter.


Data synchronization

Bloom filters can be used for approximate
data synchronization Data synchronization is the process of establishing consistency between source and target data stores, and the continuous harmonization of the data over time. It is fundamental to a wide variety of applications, including file synchronization and m ...
as in . Counting Bloom filters can be used to approximate the number of differences between two sets and this approach is described in .


Bloom Filters for streaming data

Bloom filters can be adapted to the context of streaming data. For instance, proposed Stable Bloom filters, which consist of a counting Bloom filter where insertion of a new element sets the associated counters to a value , and then only a fixed amount of counters are decreased by 1, hence the memory mostly contains information about recent elements (intuitively, one could assume that the lifetime of an element inside a SBF of counters is around c \tfrac s N). Another solution is the Aging Bloom filter, that consists of two Bloom filter each occupying half the total available memory: when one filter is full, the second filter is erased and newer elements are then added to this newly empty filter. However, it has been shown that no matter the filter, after insertions, the sum of the false positive FP and false negative FN probabilities is bounded below by \scriptstyle FP + FN \geq 1 - \frac where is the amount of all possible elements (the alphabet size), the memory size (in bits), assuming n > m. This result shows that for big enough and going to infinity, then the lower bound converges to FP + FN = 1, which is the characteristic relation of a random filter. Hence, after enough insertions, and if the alphabet is too big to be stored in memory (which is assumed in the context of probabilistic filters), it is impossible for a filter to perform better than randomness. This result can be leveraged by only expecting a filter to operate on a sliding window rather than the whole stream. In this case, the exponent in the formula above is replaced by , which gives a formula that might deviate from 1, if is not too small.


Bloomier filters

designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an
associative array In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In mathematical terms an as ...
. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a ''false positive'' is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that ''is'' in the map.


Compact approximators

proposed a
lattice Lattice may refer to: Arts and design * Latticework, an ornamental criss-crossed framework, an arrangement of crossing laths or other thin strips of material * Lattice (music), an organized grid model of pitch ratios * Lattice (pastry), an ornam ...
-based generalization of Bloom filters. A compact approximator associates to each key an element of a lattice (the standard Bloom filters being the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice elements. When adding a new association between a key and an element of the lattice, they compute the maximum of the current contents of the k array locations associated to the key with the lattice element. When reading the value associated to a key, they compute the minimum of the values found in the k locations associated to the key. The resulting value approximates from above the original value.


Parallel Partitioned Bloom Filters

This implementation used a separate array for each hash function. This method allows for parallel hash calculations for both insertions and inquiries.


Scalable Bloom filters

proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability. The technique is based on sequences of standard Bloom filters with increasing capacity and tighter false positive probabilities, so as to ensure that a maximum false positive probability can be set beforehand, regardless of the number of elements to be inserted.


Spatial Bloom filters

Spatial Bloom filters (SBF) were originally proposed by as a data structure designed to store
location information Mobile phone tracking is a process for identifying the location of a mobile phone, whether stationary or moving. Localization may be effected by a number of technologies, such as the multilateration of radio signals between (several) cell towers o ...
, especially in the context of cryptographic protocols for location
privacy Privacy (, ) is the ability of an individual or group to seclude themselves or information about themselves, and thereby express themselves selectively. The domain of privacy partially overlaps with security, which can include the concepts of a ...
. However, the main characteristic of SBFs is their ability to store multiple sets in a single data structure, which makes them suitable for a number of different application scenarios. Membership of an element to a specific set can be queried, and the false positive probability depends on the set: the first sets to be entered into the filter during construction have higher false positive probabilities than sets entered at the end. This property allows a prioritization of the sets, where sets containing more "important" elements can be preserved.


Layered Bloom filters

A layered Bloom filter consists of multiple Bloom filter layers. Layered Bloom filters allow keeping track of how many times an item was added to the Bloom filter by checking how many layers contain the item. With a layered Bloom filter a check operation will normally return the deepest layer number the item was found in.


Attenuated Bloom filters

An attenuated Bloom filter of depth D can be viewed as an array of D normal Bloom filters. In the context of service discovery in a network, each node stores regular and attenuated Bloom filters locally. The regular or local Bloom filter indicates which services are offered by the node itself. The attenuated filter of level i indicates which services can be found on nodes that are i-hops away from the current node. The i-th value is constructed by taking a union of local Bloom filters for nodes i-hops away from the node. Let's take a small network shown on the graph below as an example. Say we are searching for a service A whose id hashes to bits 0,1, and 3 (pattern 11010). Let n1 node to be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located. By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further away.


Chemical structure searching

Bloom filters are often used to search large chemical structure databases (see
chemical similarity Chemical similarity (or molecular similarity) refers to the similarity of chemical elements, molecules or chemical compounds with respect to either structural or functional qualities, i.e. the effect that the chemical compound has on reaction partn ...
). In the simplest case, the elements added to the filter (called a fingerprint in this field) are just the atomic numbers present in the molecule, or a hash based on the atomic number of each atom and the number and type of its bonds. This case is too simple to be useful. More advanced filters also encode atom counts, larger substructure features like carboxyl groups, and graph properties like the number of rings. In hash-based fingerprints, a hash function based on atom and bond properties is used to turn a subgraph into a PRNG seed, and the first output values used to set bits in the Bloom filter. Molecular fingerprints started in the late 1940s as way to search for chemical structures searched on punched cards. However, it wasn't until around 1990 that Daylight Chemical Information Systems, Inc. introduced a hash-based method to generate the bits, rather than use a precomputed table. Unlike the dictionary approach, the hash method can assign bits for substructures which hadn't previously been seen. In the early 1990s, the term "fingerprint" was considered different from "structural keys", but the term has since grown to encompass most molecular characteristics which can be used for a similarity comparison, including structural keys, sparse count fingerprints, and 3D fingerprints. Unlike Bloom filters, the Daylight hash method allows the number of bits assigned per feature to be a function of the feature size, but most implementations of Daylight-like fingerprints use a fixed number of bits per feature, which makes them a Bloom filter. The original Daylight fingerprints could be used for both similarity and screening purposes. Many other fingerprint types, like the popular ECFP2, can be used for similarity but not for screening because they include local environmental characteristics that introduce false negatives when used as a screen. Even if these are constructed with the same mechanism, these are not Bloom filters because they cannot be used to filter.


See also

* Count–min sketch * Feature hashing *
MinHash In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by , and initially used in the AltaV ...
* Quotient filter *
Skip list In computer science, a skip list (or skiplist) is a probabilistic data structure that allows \mathcal(\log n) average complexity for search as well as \mathcal(\log n) average complexity for insertion within an ordered sequence of n elements. ...
* Bloom filters in bioinformatics * Cuckoo filter


Notes


References

* * * * * * * * * * * ** * * * * * * * * * * *. Open source implementatio
available on github
*. A preliminary version appeared at SIGCOMM '98. * * * * * * * * * * * * * *. Prototype implementatio
available on github
* * * * * * * * *


External links



Detailed Bloom Filter explanation using
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offici ...

Why Bloom filters work the way they do (Michael Nielsen, 2012)Bloom Filters — A Tutorial, Analysis, and Survey (Blustein & El-Maazawi, 2002)
at Dalhousie University

from a
University of Wisconsin–Madison A university () is an educational institution, institution of higher education, higher (or Tertiary education, tertiary) education and research which awards academic degrees in several Discipline (academia), academic disciplines. Universities ty ...
website
"More Optimal Bloom Filters", Ely Porat (Nov/2007) Google TechTalk video
on
YouTube YouTube is a global online video platform, online video sharing and social media, social media platform headquartered in San Bruno, California. It was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim. It is owned by ...
{{DEFAULTSORT:Bloom Filter Hashing Probabilistic data structures Lossy compression algorithms Hash based data structures