As we saw in a previous post, we have a basic implementation of a Bloom Filter but things are just not good for something more complex. There are several issues that we need to correct before we can setup new hashes. So in the spirit of code review, lets tear into the code so we can set something up that will work even better. Lets dig into how to improve the Bloom Filter code.
Wait a minute, you just wrote that code a couple of days ago. What kind of screw up are you?
Yes, I wrote the code a few days ago. It works too. I wrote the code in an iterative way so I try to improve things as I catch them and don’t worry too much about getting it technically correct the first time. I work for startups and most are not heavily financed. They rather having something that can work now to fund raise on then have something that is technically correct 3 weeks later.
Every sprint, I fix the things I notice that was wrong in the previous iteration and then move into the new features. Keep pushing that direction and everything generally works out. Every sprint improves the old code. While I can’t say for sure, I think that attitude make my code tend towards more maintainable since I am literally planning on maintaining it while it’s in construction.
Anyway, lets talk about the problems I see.
Problems
If we were to try to do Code Kata 5, we would quickly run out of memory. A big part of that is the byte rather than bit array used. Also we would like a Bloom filter hash algorithm to hash relatively evenly distributed. We will deal with making the hash distribute evenly in another post but it is something to think about in this one. Lets start digging into it. Here is where we are.
Original Bloom Filter
using System.Collections.Generic; using System.Text; namespace CodeKata5 { public class Filter { byte[] bloom; int maxMultipler; List<string> words; public Filter(List<string> Words, int MaxMultipler) { words = Words; maxMultipler = MaxMultipler; } public static int hash(string input, int multipler) { int value = 0; byte[] temp = Encoding.ASCII.GetBytes(input); foreach (var i in temp) { value += i; } return value * multipler; } public int getArraySize() { return bloom.Length; } private int getMaxArraySize() { int maxSize = 0; int temp; foreach(string i in words) { temp = hash(i, maxMultipler); if(temp > maxSize) { maxSize = temp; } } return maxSize + 1; } private bool isBiggerThanBloomFilter(string word) { int maxSize = hash(word, maxMultipler); if (maxSize > bloom.Length) { return true; } return false; } public void buildBloomFilter() { bloom = new byte[getMaxArraySize()]; foreach (string i in words) { for(int r = 1; r <= maxMultipler; r++) { bloom[hash(i, r)] = 1; } } } public bool QueryBloomFilter(string word) { bool result = false; if (isBiggerThanBloomFilter(word)) { return false; } for (int r = 1; r <= maxMultipler; r++) { if (bloom[hash(word, r)] == 1) { result = true; } else { result = false; } } return result; } } }
First, lets tackle the even distributed setup. We will want to be able to easily flip our code between multiple hashes and their max array size. Normally, we want to establish the Bloom filter as something as big as we want, rather than let it figure it out itself. So lets give ourselves a function and global boolean that will control that. We can then take that function and replace it with our current array size setting function to give us control as we expand this function. So we will move the original getMaxArraySize to calcMaxArraySize then let it check the global and return the hard coded desired size of the filter.
private int getMaxArraySize() { if (needCalcMaxArray) { return calcMaxArraySize(); } return 255; } private int calcMaxArraySize() { int maxSize = 0; int temp; foreach(string i in words) { temp = hash(i, maxMultipler); if(temp > maxSize) { maxSize = temp; } } return maxSize + 1; }
Speaking of globals we need to add our global variable controlling the length.
bool needCalcMaxArray = false;
Finally lets add a new global for max filter size and swap out our hard coded 255 for it. Lets see what we have so far.
using System.Collections.Generic; using System.Text; namespace CodeKata5 { public class Filter { byte[] bloom; int maxMultipler; List<string> words; bool needCalcMaxArray = false; int maxFilterSize = 255; public Filter(List<string> Words, int MaxMultipler) { words = Words; maxMultipler = MaxMultipler; } public static int hash(string input, int multipler) { int value = 0; byte[] temp = Encoding.ASCII.GetBytes(input); foreach (var i in temp) { value += i; } return value * multipler; } public int getArraySize() { return bloom.Length; } private int getMaxArraySize() { if (needCalcMaxArray) { return calcMaxArraySize(); } return maxFilterSize; } private int calcMaxArraySize() { int maxSize = 0; int temp; foreach(string i in words) { temp = hash(i, maxMultipler); if(temp > maxSize) { maxSize = temp; } } return maxSize + 1; } private bool isBiggerThanBloomFilter(string word) { int maxSize = hash(word, maxMultipler); if (maxSize > bloom.Length) { return true; } return false; } public void buildBloomFilter() { bloom = new byte[getMaxArraySize()]; foreach (string i in words) { for(int r = 1; r <= maxMultipler; r++) { bloom[hash(i, r)] = 1; } } } public bool QueryBloomFilter(string word) { bool result = false; if (isBiggerThanBloomFilter(word)) { return false; } for (int r = 1; r <= maxMultipler; r++) { if (bloom[hash(word, r)] == 1) { result = true; } else { result = false; } } return result; } } }
Make sure to swap your needCalcMaxArray variable to true. If we calculate, we discover that the array is 3769 bytes in size. Lets see if we can take this down a notch by swapping bytes for bits.
A Bit for a Byte
Part of the reason, I used bytes was the fact that I was trying to keep to an array and there is no bit type to array off of in my experience within C#. What we lack in bits, we make up for in BitArrays. So make sure you have using System.Collections at the top and lets dig into the BitArray.
Here is what we did to improve the Bloom Filter
using System.Collections.Generic; using System.Text; using System.Collections; namespace CodeKata5 { public class Filter { BitArray bloom; int maxMultipler; List<string> words; bool needCalcMaxArray = true; int maxFilterSize = 255; public Filter(List<string> Words, int MaxMultipler) { words = Words; maxMultipler = MaxMultipler; } public static int hash(string input, int multipler) { int value = 0; byte[] temp = Encoding.ASCII.GetBytes(input); foreach (var i in temp) { value += i; } return value * multipler; } public int getArraySize() { return bloom.Length; } private int getMaxArraySize() { if (needCalcMaxArray) { return calcMaxArraySize(); } return maxFilterSize; } private int calcMaxArraySize() { int maxSize = 0; int temp; foreach(string i in words) { temp = hash(i, maxMultipler); if(temp > maxSize) { maxSize = temp; } } return maxSize + 1; } private bool isBiggerThanBloomFilter(string word) { int maxSize = hash(word, maxMultipler); if (maxSize > bloom.Length) { return true; } return false; } public void buildBloomFilter() { bloom = new BitArray(getMaxArraySize()); foreach (string i in words) { for(int r = 1; r <= maxMultipler; r++) { bloom[hash(i, r)] = true; } } } public bool QueryBloomFilter(string word) { bool result = false; if (isBiggerThanBloomFilter(word)) { return false; } for (int r = 1; r <= maxMultipler; r++) { if (bloom[hash(word, r)]) { result = true; } else { result = false; } } return result; } } }
So we have swapped out the byte array for a BitArray. The organization is a bit different because it is an array of bits that thinks in true or false. The joys of 0 and 1’s in your world. We swapped out the byte for bits and swapped setting things to 1 for setting things to true. While we still get 3769, I believe this is bits now rather than byses which should shrink it down by a factor of 8.
There is no doubt more things that can be cleaned up but I wanted to hit the couple of big ones that I found. If you see something that you don’t like then drop it down in the comments below. I am always curious to see what new things a new set of eyes can bring to improve my code and improve this Bloom Filter.