This is the archived version of this course from the Spring 2013 semester. You might find something more recent by visitning my teaching page.
Write a program that uses a Bloom filter to check whether passwords are any good. Your program should display a prompt for the user to enter a password, and then if the password is either the same as (1) a common English word, or (2) a previously entered password, then you reject the password. If the password is accepted, you add it to the Bloom filter and move on.
Example:
1st password entered: edification
REJECTED (common English word)
2nd password entered: Eddie vacation
ACCEPTED
3rd password entered: CapitALIZATION
REJECTED (common English word)
4th password entered: roche
ACCEPTED
5th password entered: tobago
REJECTED (false positive in Bloom filter)
6th password entered: roche
REJECTED (previously entered)
7th password entered: roche vacation
ACCEPTED
Of course, in the actual program, the reason a password is rejected would not be listed! But the false positives should be rare - in particular, your Bloom filter should have a probability of false positives less than 10%.
You will need to choose the following for your program to work properly:
A function that maps strings to integers. More detailed instructions on how to do this in C++, Java, and Python are below. In any case, your hash function should produce 32-bit integers.
A list of common English words. Use the dictionary that is found in any standard Linux distribution at /usr/share/dict/words
. These words are listed one per line, all lower-case.
A universal family of hash functions. You should use the Carter & Wegman hash functions, which choose a fixed large prime \(p\), and then choose random \(a\) and \(c\) so that
\[h(x) = ((ax + c) \bmod p) \bmod s\]
Since \(p\) is "fixed" here, you can choose \(p\) to be any large prime of your choosing. "Large prime" here means that \(p\) should be at least \(2^{31}\).
Your program must randomly choose values for \(a\) and \(c\) every time it is run. This means that the false positives should be different every time the program is run.
A properly-seeded PRNG to choose values \(a\) and \(c\) for each of your \(k\) hash functions. You should be pros at this by now.
A Bloom filter table size \(s\) and number of hash functions \(k\) that will guarantee at most \(1/10\) probability of getting a false positive. You should try to make \(s\) and \(k\) as small as possible to achieve this error probability.
You can run wc /usr/share/dict/words
to discover that there are 98569 English words in this dictionary that will go into your Bloom filter. You can assume that there are less than 1000 new passwords that will be entered, and so you can use \(n=100000\) to do the calculations required to determine \(s\) and \(k\).
Explain your choices for \(p\), \(s\), and \(k\) very clearly in comments in your code. Be sure to give references to any sources that you used.
hash
that will do this for you. For example, hash('corned beef')
returns 1456869758 on the lab machines.hashCode()
method in the String
class does the same thing, more or less. You just do like "corned beef".hashCode()
to get an int
from that string.C++: There is no built in hashing in the current C++ standard. You may use this handy-dandy function instead:
int hash(string s) {
int mulbase = 97; // arbitrarily-chosen prime number
int curpow = 1; // this equals 97^i at step i.
int result = 0;
for (int i=0; i<s.length(); ++i) {
curpow *= mulbase;
result += s[i]*curpow;
}
return result;
}
Submit your program according to the instructions on the submit page.