Now that we have an improved code base, we need to address the Bloom filters and improve their hashes. By improve, I mean move to hashes that let us cap the size of the filter, which is something you want in a classical implementation of this filter. That also means we need to distribute the results over a static range, hopefully evenly. Without a stronger statistical background I cannot prove these hashes work optimally, but I think they are relatively fast. From my limited understanding, a cryptographic hash would give better distribution, but at the cost of speed. So pick your poison; me, I will stick with modified versions of what I already have, with some help on evening out the spread to keep it compact. Let's dig into the Bloom filter hashes.
Original Hash
public static int hash(string input, int multiplier)
{
    int value = 0;
    byte[] temp = Encoding.ASCII.GetBytes(input);
    foreach (var i in temp)
    {
        value += i;
    }
    return value * multiplier;
}
First, we correct the earlier misspelling of multiplier. With that out of the way, here is the original algorithm: we get the ASCII byte for each character of the input and add them together, then multiply the sum by the given multiplier and return that value. This is not optimized for size or distribution, so let's improve its spread and allow a smaller array.
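To see how bad the spread is, here is a small standalone sketch of the original algorithm (the class and method names here are mine, for illustration). Any two anagrams sum to the same value and therefore always collide, and a long word pushes the index far out, which is what forces the big bit array.

```csharp
using System;
using System.Text;

public class HashDemo
{
    // The original hash: sum of the ASCII byte values, times a multiplier.
    public static int Hash(string input, int multiplier)
    {
        int value = 0;
        foreach (byte b in Encoding.ASCII.GetBytes(input))
        {
            value += b;
        }
        return value * multiplier;
    }

    static void Main()
    {
        // Anagrams contain the same bytes, so they always collide.
        Console.WriteLine(HashDemo.Hash("listen", 4)); // 2620
        Console.WriteLine(HashDemo.Hash("silent", 4)); // 2620
        // A long word forces a very large array index.
        Console.WriteLine(HashDemo.Hash("pneumonoultramicroscopicsilicovolcanoconiosis", 4));
    }
}
```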
Hash 1
public static int hash1(string input)
{
    int value = 0;
    byte[] temp = Encoding.ASCII.GetBytes(input);
    foreach (var i in temp)
    {
        value += i;
    }
    Random r = new Random(value);
    return r.Next(1, 255);
}
Here is iteration number one. We killed off the multiplier and replaced it with a random number. A funny thing about a computer's random number generator: if you give it the same seed value, you get the same "random" value back out. As a result, we can distribute this out over a fixed range (note that Next's upper bound is exclusive, so the results actually fall between 1 and 254). Hopefully this will give us a more even distribution than before, since in theory it will be random.
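The seed trick above is the whole idea, so here is a tiny sketch of it in isolation: two generators built from the same seed produce the same sequence, which is what makes a seeded Random usable as a repeatable hash.

```csharp
using System;

public class SeedDemo
{
    static void Main()
    {
        // Same seed in, same sequence out: the "random" draw is repeatable.
        var a = new Random(655);
        var b = new Random(655);
        Console.WriteLine(a.Next(1, 255) == b.Next(1, 255)); // True
    }
}
```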
Wouldn't it be nice to simplify the logic even more? There is a built-in hash function in .NET, but it gives really big values, and values that are sometimes negative. We can still work with it.
Hash 2
public static int hash2(string input)
{
    int hash = Math.Abs(input.GetHashCode());
    Random random = new Random(hash);
    return random.Next(1, 255);
}
Doesn't that look a bunch prettier? We can overcome the occasionally negative value of GetHashCode by taking its absolute value. Now hash holds a really big positive number. We need it smaller and evenly distributed, so we take the trick from the first modification and run it through a random number generator. This is a small, clean function, but it still has a problem: we cannot run it multiple times and get different values, so we would need multiple algorithms. Let's fix this.
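A quick sketch of the limitation (method name is mine): every call with the same word lands on the same index, so a filter built from hash2 alone would only ever set one bit per word, no matter how many times we call it.

```csharp
using System;

public class Hash2Demo
{
    // hash2 from the post: one input always maps to one index.
    public static int Hash2(string input)
    {
        int hash = Math.Abs(input.GetHashCode());
        return new Random(hash).Next(1, 255);
    }

    static void Main()
    {
        // Repeated calls cannot produce the several distinct
        // indices a Bloom filter wants per word.
        Console.WriteLine(Hash2Demo.Hash2("kata") == Hash2Demo.Hash2("kata")); // True
    }
}
```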
Hash 3
public static int hash3(string input, int multiplier)
{
    int hash = Math.Abs(input.GetHashCode());
    Random random = new Random(hash * multiplier);
    return random.Next(1, 255);
}
Multiply the seed by the multiplier and you get a relatively well-distributed hashing algorithm. Of all the Bloom filter hashes, this one is the most elegant and the simplest. Let's implement it in the program.
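Here is hash3 in action as a self-contained sketch: one algorithm, several multipliers, several indices per word. One caveat worth knowing: on newer .NET runtimes, string.GetHashCode is randomized per process, so these indices are only stable within a single run, which is fine for a filter built and queried in the same process.

```csharp
using System;

public class Hash3Demo
{
    // hash3 from the post: seed a Random with the framework hash code
    // scaled by the multiplier, then draw one value in the filter's range.
    public static int Hash3(string input, int multiplier)
    {
        int hash = Math.Abs(input.GetHashCode());
        Random random = new Random(hash * multiplier);
        return random.Next(1, 255);
    }

    static void Main()
    {
        // Each multiplier plays the role of a separate hash function.
        for (int r = 1; r <= 4; r++)
        {
            Console.WriteLine(Hash3Demo.Hash3("kata", r));
        }
    }
}
```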
All the Bloom Filter Hashes
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

namespace CodeKata5
{
    public class Filter
    {
        BitArray bloom;
        int maxMultiplier;
        List<string> words;

        public Filter(List<string> Words, int MaxMultiplier)
        {
            words = Words;
            maxMultiplier = MaxMultiplier;
        }

        public static int hash(string input, int multiplier)
        {
            int value = 0;
            byte[] temp = Encoding.ASCII.GetBytes(input);
            foreach (var i in temp)
            {
                value += i;
            }
            return value * multiplier;
        }

        public static int hash1(string input)
        {
            int value = 0;
            byte[] temp = Encoding.ASCII.GetBytes(input);
            foreach (var i in temp)
            {
                value += i;
            }
            Random r = new Random(value);
            return r.Next(1, 255);
        }

        public static int hash2(string input)
        {
            int hash = Math.Abs(input.GetHashCode());
            Random random = new Random(hash);
            return random.Next(1, 255);
        }

        public static int hash3(string input, int multiplier)
        {
            int hash = Math.Abs(input.GetHashCode());
            Random random = new Random(hash * multiplier);
            return random.Next(1, 255);
        }

        public int getArraySize()
        {
            return bloom.Length;
        }

        private int getMaxArraySize()
        {
            // Scan every multiplier, not just the last one: hash3 is not
            // monotone in the multiplier, so any round can produce the
            // largest index.
            int maxSize = 0;
            foreach (string i in words)
            {
                for (int r = 1; r <= maxMultiplier; r++)
                {
                    int temp = hash3(i, r);
                    if (temp > maxSize)
                    {
                        maxSize = temp;
                    }
                }
            }
            return maxSize + 1;
        }

        private bool isBiggerThanBloomFilter(string word)
        {
            // If any round's index falls outside the array, the word
            // cannot have been added.
            for (int r = 1; r <= maxMultiplier; r++)
            {
                if (hash3(word, r) >= bloom.Length)
                {
                    return true;
                }
            }
            return false;
        }

        public void buildBloomFilter()
        {
            bloom = new BitArray(getMaxArraySize());
            foreach (string i in words)
            {
                for (int r = 1; r <= maxMultiplier; r++)
                {
                    bloom[hash3(i, r)] = true;
                }
            }
        }

        public bool QueryBloomFilter(string word)
        {
            if (isBiggerThanBloomFilter(word))
            {
                return false;
            }
            // Every round must hit a set bit; a single miss means "not present".
            for (int r = 1; r <= maxMultiplier; r++)
            {
                if (!bloom[hash3(word, r)])
                {
                    return false;
                }
            }
            return true;
        }
    }
}
Let's do a couple of things to clean this up a bit more.
First, let's add a maxFilterSize field to the class to get rid of all those hard-coded 255 magic numbers.
With the new Bloom filter hashes, we no longer have a guarantee of where the maximum index will appear. Before, we took a sum and multiplied it, so the fourth run was always the biggest; now we have to tell the filter which sizing strategy to use. So let's add a boolean to the class: true means calculate the maximum array size from the words (which is what the original hash needs), false means use the fixed maxFilterSize (which is enough for the rest). In getMaxArraySize, we move its old contents to calcMaxArraySize and use the boolean to pick which method to use. Let's look at what we have created.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

namespace CodeKata5
{
    public class Filter
    {
        BitArray bloom;
        int maxMultiplier;
        List<string> words;
        bool needCalcMaxArray = false;
        static int maxFilterSize = 255;

        public Filter(List<string> Words, int MaxMultiplier)
        {
            words = Words;
            maxMultiplier = MaxMultiplier;
        }

        public static int hash(string input, int multiplier)
        {
            int value = 0;
            byte[] temp = Encoding.ASCII.GetBytes(input);
            foreach (var i in temp)
            {
                value += i;
            }
            return value * multiplier;
        }

        public static int hash1(string input)
        {
            int value = 0;
            byte[] temp = Encoding.ASCII.GetBytes(input);
            foreach (var i in temp)
            {
                value += i;
            }
            Random r = new Random(value);
            return r.Next(1, maxFilterSize);
        }

        public static int hash2(string input, int multiplier)
        {
            int hash = Math.Abs(input.GetHashCode());
            Random random = new Random(hash);
            return random.Next(1, maxFilterSize);
        }

        public static int hash3(string input, int multiplier)
        {
            int hash = Math.Abs(input.GetHashCode());
            Random random = new Random(hash * multiplier);
            return random.Next(1, maxFilterSize);
        }

        public int getArraySize()
        {
            return bloom.Length;
        }

        private int getMaxArraySize()
        {
            if (needCalcMaxArray)
            {
                return calcMaxArraySize();
            }
            return maxFilterSize;
        }

        private int calcMaxArraySize()
        {
            // Scan every multiplier, not just the last one: hash3 is not
            // monotone in the multiplier, so any round can produce the
            // largest index.
            int maxSize = 0;
            foreach (string i in words)
            {
                for (int r = 1; r <= maxMultiplier; r++)
                {
                    int temp = hash3(i, r);
                    if (temp > maxSize)
                    {
                        maxSize = temp;
                    }
                }
            }
            return maxSize + 1;
        }

        private bool isBiggerThanBloomFilter(string word)
        {
            // If any round's index falls outside the array, the word
            // cannot have been added.
            for (int r = 1; r <= maxMultiplier; r++)
            {
                if (hash3(word, r) >= bloom.Length)
                {
                    return true;
                }
            }
            return false;
        }

        public void buildBloomFilter()
        {
            bloom = new BitArray(getMaxArraySize());
            foreach (string i in words)
            {
                for (int r = 1; r <= maxMultiplier; r++)
                {
                    bloom[hash3(i, r)] = true;
                }
            }
        }

        public bool QueryBloomFilter(string word)
        {
            if (isBiggerThanBloomFilter(word))
            {
                return false;
            }
            // Every round must hit a set bit; a single miss means "not present".
            for (int r = 1; r <= maxMultiplier; r++)
            {
                if (!bloom[hash3(word, r)])
                {
                    return false;
                }
            }
            return true;
        }
    }
}
Now we are ready for the Code Kata 5 challenge. We have a basic skeleton to work the problem from and have tried a couple of different approaches to solving the issue. Just to show you the code as it is currently working: notice how much smaller my filter is now than a few days ago. We have it down from over 3 KB to 255 B. That is what I call shrinkage. Best of all, it still works.
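To close, here is an end-to-end sketch of the whole idea, condensed into a few static methods so it runs on its own (the Build/Query names are mine; the Filter class in the listing above does the same job). Build sets several bits per word; Query only says "maybe present" when every one of a word's bits is set, and false positives are always possible with a Bloom filter.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class BloomUsage
{
    // hash3 from the post, reproduced so this sketch is self-contained.
    public static int Hash3(string input, int multiplier)
    {
        int hash = Math.Abs(input.GetHashCode());
        return new Random(hash * multiplier).Next(1, 255);
    }

    // Build: set maxMultiplier bits per word in a 255-slot array.
    public static BitArray Build(List<string> words, int maxMultiplier)
    {
        var bloom = new BitArray(255);
        foreach (string w in words)
        {
            for (int r = 1; r <= maxMultiplier; r++)
            {
                bloom[Hash3(w, r)] = true;
            }
        }
        return bloom;
    }

    // Query: "maybe present" only if every round hits a set bit.
    public static bool Query(BitArray bloom, string word, int maxMultiplier)
    {
        for (int r = 1; r <= maxMultiplier; r++)
        {
            if (!bloom[Hash3(word, r)])
            {
                return false;
            }
        }
        return true;
    }

    static void Main()
    {
        var words = new List<string> { "apple", "banana", "cherry" };
        var bloom = BloomUsage.Build(words, 4);
        Console.WriteLine(BloomUsage.Query(bloom, "banana", 4)); // True: it was added
        Console.WriteLine(BloomUsage.Query(bloom, "durian", 4)); // usually False, but not guaranteed
    }
}
```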