Today, we are going to address a prerequisite of Code Kata 5. Because it is a problem solving dealing with a specific data structure, we need to talk about the data structure. In this case a Bloom Filter.
Back when I saw this for the first time, it was crazy neat. I implemented it from my textbook and it was amazing. Unfortunately, I have never had occasion to implement this in real life. There has not been occasion for a spell check. Never have I needed a search doodad nor would I have kept my jobs in years past saying the data is definitely, maybe there.
Explaining a Bloom Filter
So now we have established you should not view me as an expert, lets pretend I am and dig into a Bloom filter. It is a pretty simple concept.
Take a value like the word “Frog”. We will hash this value, or in simple terms, turn it into a single integer. That value will be the index of a bit array that has been initialized to all 0’s. Because our dictionary value, in this case the word value is present in that index of the array, we switch that index to 1. We will repeat the process x number of times.
Let me clarify.
Bloom Filter Example
Lets say we have an array of bits 25 units long. That array will start life as an array of 25 0’s.
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Assuming I am not running into a counting problem, that should be 25 units of 0. Next take your word, Frog, and send it through the first hash value and lets say your first hash return 3. We will have a one at index 3 or the fourth value. Remember that indexes start at 0, not 1.
{0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Now it is time for the second go. The next hash gives us 20 so we take index 20 and turn it into a 1.
{0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0}
Finally, lets give it one last go through an additional hash and get a value of 13. We switch the 13th index to 1 and get a bit array that looks something like this.
{0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
That was fun. We could repeat this same process with another word like, whale. Though the more values you add the more slots will have 1’s The more 1’s present the more error rate will exist for false positives. So lets say well return in its three passes through the hash function, 4, 20, 1. Our array will look like this.
{0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
Notice how 20 stays the same despite the second pull on it. It is a bit array so only 0 and 1 allowed.
What to do with your array?
So we have this array, what can we do with it? We can take a series of values and see if they are not in the array.
So lets take, Frog, Goat, Sheep and Whale. Take these values and run the three hashes on them. Keep the seed value the same so you will get repeatable results. Lets look at the results:
- Frog 3, 20, 13
- Goat 4, 6, 8
- Sheep 3, 20, 19
- Yak = 4, 20, 13
- Whale, 24, 0, 15
Here is our array.
{0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
Frog has a value of 3,20 and 13. All three have 1’s so it will come back true.
As you can see below, Goat comes back as 4,6,8 which is one 1 and two 0’s. This will return false.
{0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
Sheep has 3, 20 and 19. This comes back as two 1’s and 1 zero. Bloom filters are not a democracy and the majority does not carry the day. This also comes back as a fail so false.
{0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
Yak has 4,20 and 13. It comes back with three 1’s so it passes with a true.
{0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
Finally we have Whale at 24, 0 and 15. This one has no 1’s and all 0’s so it really fails.
{0,1,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0}
Did you hear the one about the sneaky Yak?
We did not add the yak to our list of values but he came through anyway. This is the downside of the Bloom filter. It can only tell you no and not yes. In this case the hashes said yes. We said no but the hashes win. For our trouble we get a false positive.
Did you catch the part where Whale failed?
I bet at least some of you say, wait your example is wrong. We added whale to the list. It is true, whale was added to the list. Whale was not. It is very case sensitive. W and w are not equal. Something to watch out for in your implementations.
If you want to improve the accuracy, you can add more hashes but really the best thing to do is give it more bits in its array. That way your hashes can return better values.
In conclusion, this is such a fun thing. It kind of bums me out that I have never gotten to play with them in anything but an academic sense. On the next post, I will show some code to implement a simple Bloom filter. I hope my explanation was clear. Let me know in the comments below.