This is the archived website of SY 301 from the Fall 2016 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 11: Graph hops

  • Due before 23:59 Wednesday, November 16
  • Submit using the submit program as 301 lab 11.

1 Metadata Collection

It has been well covered that the NSA and other intelligence agencies collect a huge amount of communications metadata, which contains the records of who has contacted whom, but are legally constrained in how they access and use this data. One of the constraints is that they may only access the metadata of individuals who are three "hops" from an individual for whom the NSA has "reasonable articulable suspicion."

If person A is suspicious, and he or she has contacted person B, then B is one hop from A. If B then contacts C, then C is two hops from A.

The metadata, of course, creates a directed graph detailing who has contacted whom, and our graph search algorithms make a basis for a magnificent way of answering lots of interesting questions about who may and may not be investigated.

2 Our Data

Download the file lab11.tar.gz and extract it using the command tar xzvf lab11.tar.gz. You will see two files, one with real data and one with synthetic data.

The real data is from email conversations in the EU. It has been converted to a dot file for you (email.dot)

The other file (small.dot) is made-up synthetic data for easier testing and development. Here is a visualization of that graph:

To understand this graph, consider individuals 0 and 3. Inside the dot file, you will see a line 0 -> 3;. This means that person 0 has contacted person 3. Person 3 is therefore one hop from person 0. Because the arrow does not go from person 3 to person 0, person 0 is NOT one hop away from person 3.

3 The Assignment

Create a file graph.py with a class called Graph (as well as any other classes you might want). Your Graph class should have:

  • a constructor that takes the filename of a .dot file as an argument and creates the corresponding graph. Note that unlike last week, these dot files store directed graphs!
  • a method hopper(self, name, hops) for which name is an integer vertex id (for the graph above, these could be the integers 0-10, NOT the strings '0'-'10'), and hops is an integer indicating the number of allowed hops. This method should return a list of the names of all vertices which are hops distance from name or closer. For example, calling g.hopper(0,1) on the above graph should return the list [0,1,2,3], in any order (notice these are integers, not strings). Similarly, calling g.hopper(3,2) should return [3,4,5,6], again in any order.
  • A method canSurveil(self, suspect, target, hops), which returns True if target is within hops hops from suspect, and False otherwise. All three inputs should be integers.

It will be convenient to use small.dot when developing your methods initially, so that you can easily follow along what your program should be doing.

But be sure to test your program on the real dataset as well — one common mistake will create an exception if you run g.hopper(3,2) on the real dataset). You may find Python's implementation of a Queue, deque, to be useful.

4 More interesting questions

The rest of this lab is optional for your enrichment and enjoyment. Be sure to work on this only after getting everything else perfect. You should still submit this work, but make sure it doesn't break any of the parts you already had working correctly!

Add the following methods to Graph:

  • distance(self, suspect, target), which returns the smallest number of hops to reach the given target from the given suspect.
  • percentage(self, suspect, hops), which returns a float indicating the percentage of individuals in the data set which can be reached by taking hops hops from suspect.
  • likelyPercentage(self, hops), which returns a float indicating the expected percentage of individuals which can be reached in hops hops from a random suspect. In other words, this is the average of the above method percentage, given an input of all valid ids as suspect.