This is the archived website of SY 301 from the Fall 2016 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Project 2: Rainbow Table

Due before 23:59 on Tuesday, November 1
Submit using the submit program as 301 proj 2.

1 Intro: Shadow files

Keeping passwords secret is a big, and very difficult, part of keeping many systems secure. It would obviously be a very poor idea to store a user's password in plaintext on the system. But then how can the system know that someone's entered password is correct?

The way this normally works is that in place of storing the password, the system instead stores a hash of the password in a shadow file. This way, when somebody enters their password, the system merely needs to hash that string, and compare the hash to the stored hash. For an example of this, on a linux machine or virtual machine that you have sudo access on, have a look at /etc/shadow.

The advantage of this approach is that even if that shadow file is compromised, attackers can't reverse-compute the password given the hash, so the password remains safe... or does it?

People don't use random passwords. In fact, there is a great deal of repetition of passwords across users, and across sites. So, suppose you had a shadow file. Couldn't you pre-compute the hashes of common passwords, and see if any of the hashes match the ones in your file? Of course you could.

There are ways to combat this (such as salting your passwords), but unfortunately many websites still don't do that, and a number of recent, high-profile breaches have resulted in users' passwords being revealed from exactly this kind of attack. Since these brute-force tables remain an often-effective technique for discovering passwords, and in order to understand that we want to know exactly how fast such an attack can be!

2 Data Files and Starter Code

The following paragraphs refer to the files inside a tarball called proj2.tar.gz, which you can extract by running tar xzvf proj2.tar.gz. You can also access some of the files with the links below. As a warning, the contents contain real-world data, with the associated profanity, racism, etc.

You can get all these files at once by downloading the file proj2.tar.gz and running tar xzvf proj2.tar.gz

RockYou is a Silicon Valley company that deals with web advertising, social media games, and other techno-buzzwordy-things. What they do isn't so relevant to us; what is relevant is that they are historically just atrocious at security. In 2009, their account passwords (which they stored in plaintext) were stolen and made public; this set of passwords is now one of the go-to datasets to see the relative frequency of user-defined passwords. rockyouwithcount.txt contains those passwords, along with the number of times that password was used. mediumrockyouwithcount.txt and shortrockyouwithcount.txt are subsets of that list, useful for when you don't want to deal with 14 million entries. These files are sorted by frequency.

I have also gone ahead and precomputed the MD5 hashes for those passwords. You can see those hashes in hashedRockyou.txt (again, there medium and short versions, corresponding to the *rockyouwithcount files). These are (almost) sorted by hash value, though there are a few exceptions.

Finally, you can generate a shadow file using shadowgen.py, like so:

$ python3 shadowgen.py rockyouwithcount.txt 5000 shadow.txt answers.txt

Once this is done running, shadow.txt will contain 5000 username-hash pairs, where the usernames are random English words, and the passwords are randomly selected from the list rockyouwithcount.txt, in the same frequency as indicated by their counts. In other words, the hash for the password 123456 will show up much more frequently than the hash for pookie168. answers.txt will contain the username-password pairs in plain text.

3 The Assignment

Your goal on this is to identify the passwords associated with the hashes in a shadow file. So, what you'll need is a Map, where the keys are hashes, and the values are plaintext passwords. The faster your Map works, the faster your program will run. I've started your code for you in brute.py. All your code must go in the indicated location, so that it can be timed without taking into account the hard drive access or the time it takes to read the hashes and print the answers.

It's fine to use any of the built-in Python modules as long as you've built your own data structures and written that part of the code yourself from scratch. And as usual, keep in mind the course honor policy on collaboration for projects.

Step 0: Get the program working using python's built-in map, the dictionary. This will help you understand just what it is you're doing with this project.

Step 1: up to 72 points: Implement your map, and get your program working, using your Treap from Lab 8, or a BST with randomly shuffled input order.

Step 2: up to 95 points: Use anything else we've learned which is faster (runs reasonably quickly on the medium-sized password list). The only exception is Treaps or using a BST with shuffled input (too easy). This data structure must be entirely implemented yourself! Your program should be able to handle the medium-sized list.

Step 3: up to 100 points: Add something we haven't learned. Maybe it's a data structure you found online (you may learn about data structures from anywhere you want, but it must be cited, and the implementation must be your own). Maybe it's some way of leveraging the password counts. Maybe it's parallelism. Try your faster program on the full list and large shadow files (30,000 or so users).

BONUS for everyone: If anyone in the class gets their program to run faster than mine, everyone will get +2 points on this project.

4 Workflow

To be explicit, for a program that works correctly, your workflow will look like this:

Generate a shadow file to be cracked, and the plaintext passwords associated with that shadow file, using the shadowgen.py program that I'm giving you.
Given that shadow file, crack it with brute.py. The output should be username/password pairs, one per line, separated by a tab character, just like the plaintext "answers" file created by shadowgen.py.

Include a README.txt which details which step you've achieved, and what you've implemented to make it work. What are the runtimes of add and get in your map? It's OK to include any other *.py files in your submission, but the code in brute.py should represent the best solution you have.