Lab 11: Graph hops
- Due before 23:59 Wednesday, November 16
- Submit using the submit program
as
301 lab 11
.
1 Metadata Collection
It has been well covered that the NSA and other intelligence agencies collect a huge amount of communications metadata, which contains the records of who has contacted whom, but are legally constrained in how they access and use this data. One of the constraints is that they may only access the metadata of individuals who are three "hops" from an individual for whom the NSA has "reasonable articulable suspicion."
If person A is suspicious, and he or she has contacted person B, then B is one hop from A. If B then contacts C, then C is two hops from A.
The metadata, of course, creates a directed graph detailing who has contacted whom, and our graph search algorithms make a basis for a magnificent way of answering lots of interesting questions about who may and may not be investigated.
2 Our Data
Download the file lab11.tar.gz and extract it using
the command tar xzvf lab11.tar.gz
. You will see two files, one with real data
and one with synthetic data.
The real data is from email conversations in the EU. It has been converted to a dot file for you (email.dot)
The other file (small.dot) is made-up synthetic data for easier testing and development. Here is a visualization of that graph:
To understand this graph, consider individuals 0 and 3. Inside the dot
file, you will see a line 0 -> 3;
. This means that person 0 has
contacted person 3. Person 3 is therefore one hop from person 0. Because the
arrow does not go from person 3 to person 0, person 0 is NOT one hop away from
person 3.
3 The Assignment
Create a file graph.py
with a class called Graph
(as well as any other classes you might want).
Your Graph
class should have:
- a constructor that takes the filename of a .dot file as an argument and creates the corresponding graph. Note that unlike last week, these dot files store directed graphs!
- a method
hopper(self, name, hops)
for whichname
is an integer vertex id (for the graph above, these could be the integers 0-10, NOT the strings '0'-'10'), andhops
is an integer indicating the number of allowed hops. This method should return a list of the names of all vertices which arehops
distance fromname
or closer. For example, callingg.hopper(0,1)
on the above graph should return the list[0,1,2,3]
, in any order (notice these are integers, not strings). Similarly, callingg.hopper(3,2)
should return[3,4,5,6]
, again in any order. - A method
canSurveil(self, suspect, target, hops)
, which returnsTrue
iftarget
is withinhops
hops fromsuspect
, and False otherwise. All three inputs should be integers.
It will be convenient to use small.dot
when developing your methods initially,
so that you can easily follow along what your program should be doing.
But be sure to test your program on the real dataset as well — one common
mistake will create an exception if you run g.hopper(3,2)
on the real
dataset). You may find Python's implementation of a Queue, deque,
to be useful.
4 More interesting questions
The rest of this lab is optional for your enrichment and enjoyment. Be sure to work on this only after getting everything else perfect. You should still submit this work, but make sure it doesn't break any of the parts you already had working correctly!
Add the following methods to Graph
:
distance(self, suspect, target)
, which returns the smallest number of hops to reach the given target from the given suspect.percentage(self, suspect, hops)
, which returns a float indicating the percentage of individuals in the data set which can be reached by takinghops
hops fromsuspect
.likelyPercentage(self, hops)
, which returns a float indicating the expected percentage of individuals which can be reached inhops
hops from a random suspect. In other words, this is the average of the above methodpercentage
, given an input of all valid ids as suspect.