00If you ever used Cassandra, you know that is well known at being extremely fast in writes and reads, and that is due in part to a data structure called Bloom Filter.
Bloom Filter is an extremely efficient way of asking if a data exists in a set or not (which Cassandra uses to avoid useless disk accesses, which is the slowest part). What’s the downside? It is a probabilistic algorithm and may have false positive (although never false negatives).
It consists of two elements:
- An array of m bits, initialized to zero.
- A set of k hash functions that, given some data, generate numbers between 0 and m-1.
When we insert data it will go through the hash functions, which will return array positions where we change their values to 1. So, whenever new data arrives, to know for sure that the item does not exist in the set we just have to check that any of the positions in the array generated by the hash functions returns 0.
Let’s try a little game to understand this.
Imagine a game of “Guess who?”, but with multiple suspects. Player A adds suspects, and Player B asks Player A questions based on the attributes of the suspects. In this case the suspects will be the data, the attributes hash functions, and the total list of attributes that keep the player A is the array of bits. These will be our possible suspects:
Suppose that player A adds as a suspect Bill Gates. This now means that Player A has the attributes “Rich”, “Philanthropist” and “Glasses”. Therefore, when the Player B asks for Steve Jobs, although the properties of “Glasses” and “Rich” are on the list, “American” isn’t, so we can be sure that Steve Jobs is not a suspect.
But what if we now add Mark Zuckerberg? In doing so, we add two new properties, so now our attributes list is as follows:
“Rich”, “Philanthropist”, “Glasses”, “American” and “Young”.
So, now, what happens if we ask for Steve Jobs’s attributes? Because all his attributes are listed, we could think that he is one of the suspects, but is not the case. So we can be sure when someone is not a suspect, but never it is. If we add Steve Jobs to our list it will not be modified. So what happens if we ask Stephen Hawking? Even with three suspects already, because Stephen Hawking has two attributes that have not yet been inserted, we can be sure that he is not one of the suspects.
And that’s how it works (on the surface). If you want more information in depth, like for example how to reduce the false positives rate, or how to deal with deletions here’s an interesting page.
And if you feel like playing a bit with it, on this website you can find several implementations in python, that may be useful when you want to handle large data such as … data from the Titanic