Lab 2: Enron emails
1 Overview
In today’s lab we will examine a vast treasure trove of emails released from the Enron corporation in 2002.
These emails were released into the public domain by the Federal Energy Regulatory Commission (FERC) as part of their investigation into what has been called the largest corporate scandal in U.S. history.
In short, Enron was an extremely successful energy trading company, with more than 20,000 employees and worth around $70 billion at the turn of the century. But it turns out much of their supposed worth was inflated or flat-out fraudulent through shady accounting practices and various financial shenanigans. When the scandal broke in late 2001, Enron’s stock price quickly lost more than 99% of its value and the company declared bankruptcy the next month.
While this historical context is important, our focus will mainly be on the insights we can glean from this large dataset of real human communication.
1.1 Deadlines
- Milestone: 2359 on Monday, 5 February
- Complete lab: 2359 on Monday, 12 February
1.2 Learning goals
- Examine and gain insights from a large, raw, real-world dataset
- Use important data science command-line tools such as find and grep on hundreds of thousands of text files
- Learn how to process raw data using a combination of command-line and Python tools
- Use Python libraries to create simple visualizations
- Develop your own questions that can be investigated with data
2 Preliminaries
2.1 Notebook file to fill in
Here is the Jupyter notebook file that you will need to fill in and submit for today’s lab: lab02.ipynb
If you want, you can use the terminal to download the blank file directly to your lab folder with this command:
wget "https://roche.work/212/lab/enron/lab02.ipynb"
2.2 Bash code cells in Jupyter
Today’s lab is all about using the command line (bash). So, to show how you computed each answer, we need to see what bash commands you ran!
To do that, you just make a new code cell and add the following “cell magic” command at the top:
%%bash
Then Jupyter will know to treat the contents of that cell just like a bash script, and you can run it right there in the notebook and see the output - neat!
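For example, a minimal cell might look like the sketch below; the two commands here are just placeholders to show the shape of a bash cell, not anything you need for the lab.
%%bash
# show the current working directory path
pwd
# list the files and folders in the current directory
ls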
It is worth noting that each bash cell is separate from the others, so things like assigning variables or changing directories sadly don’t carry over from one cell to the next. (However, there is a separate way to change directories; more on that next.)
My advice is to actually do your “exploratory” work on the terminal itself, when you are trying to figure out what commands to run or what the data looks like, etc. Then, when you have the command or commands figured out to answer a question, copy them into a code cell with the %%bash magic to show your work and later turn it in.
2.3 Accessing the datasets
The full Enron dataset is very large, about 500,000 files with a total size of 3GB. You should not download this to your lab machine because that would take too long and use too much storage for every student to have their own copy of the full dataset.
Instead, you can use a copy which has been downloaded and extracted already, available from the lab machines or from ssh.cs.usna.edu in the read-only folder
/home/mids/SD212/enron
But getting started exploring this data will be difficult due to its huge size. To help, we have two smaller datasets available. Start with the smallest dataset to make your life easier.
1% of original size, about 5,000 emails:
/home/mids/SD212/enron.01
10% of original size, about 50,000 emails:
/home/mids/SD212/enron.1
At the top of the notebook file you downloaded, there is a special cell to change the working directory; it reads
%cd /home/mids/SD212/enron.01
All of your bash code cells will start from this directory. That’s actually very convenient: when you’re ready to “move up” to the 10% or 100% dataset, it should just mean changing that %cd command and then re-running the rest of the cells!
2.4 Email structure
You should explore the subfolders and files in the enron (or enron.1 or enron.01) directory using the command-line tools ls, cd, and less.
Each regular file in a subdirectory represents a single email that an employee sent or received. Emails are stored as plain-text files with a few special formatting rules, the most important of which are the email headers that occur at the top of the file.
For example, check out the file shapiro-r/federal_gov_t_affairs/35. It shows an email thread between some Enron employees and an energy industry insider about (of all things) exchanging a fax number and downplaying the rumors of Enron’s vast fraud in October 2001.
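One quick way to peek at the header block of that file is something like the cell below; the choice of 15 lines is arbitrary, just enough to cover the main headers.
%%bash
# show the first few lines of the example email, which are its headers
head -n 15 shapiro-r/federal_gov_t_affairs/35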
The most recent headers are always at the top of the email:
Message-ID: <23609681.1075862231979.JavaMail.evans@thyme>
Date: Wed, 31 Oct 2001 07:02:27 -0800 (PST)
From: pr <.palmer@enron.com>
To: richard.shapiro@enron.com, j..kean@enron.com, linda.robertson@enron.com
Subject: FW: Here from scratch ... what's your fax #... we can try the old
fashioned way!
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Palmer, Mark A. (PR) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MPALMER>
X-To: Shapiro, Richard </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Rshapiro>, Kean, Steven J. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Skean>, Robertson, Linda </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lrobert3>
Most important for this lab are the first Date header line (which shows when the email was sent or received), the first Subject header line, and the first From, To, and possibly CC header lines. There will be many more header lines (especially those starting with X, like X-To etc.) that you can ignore.
After the initial group of header lines is the actual body of the email, i.e., whatever message is being sent.
Most emails (including this one) may contain many forwards or replies, which are other, older emails copied below the current one. These will usually be separated with a line that has a bunch of dashes like
-----Original Message-----
There might be more header lines like To: or Date: that show up lower down in the forwarded or replied-to emails; this is why, in your analysis, you mostly want to focus on the first header line of interest.
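As a concrete sketch of pulling out only the first header of a given type, something along these lines should work (the -m 1 option tells grep to stop after its first match in the file):
%%bash
# print only the FIRST Date: header line from the example email
grep -m 1 '^Date:' shapiro-r/federal_gov_t_affairs/35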
2.5 The find command
(This section may be useful when you are getting stuck. Consider it a supplemental reference on how to effectively use find and grep for some problems in this lab. You might want to skip it for now and come back later when you need it.)
For many problems, especially when using the larger datasets, you will need to make effective use of the find command, often combined with other command-line tools.
For example, if I want to search every .txt file in a folder fol for references to the first state, I could do this with a one-liner:
grep 'Delaware' fol/*.txt
This is OK, but at some point it will stop working if there are too many files, because the operating system puts a limit on how long a command line can be.
We could do the same thing with a loop like this:
for file in fol/*.txt
do
grep -H 'Delaware' "$file"
done
Here I used the -H option to grep (one of many useful options you can read about in the grep man page!) to force grep to display the names of files with matches.
This loop now works for any number of files, but it still has two shortcomings. First, it’s kind of slow, because the grep command needs to be executed separately once per file; and second, it only searches the txt files directly under the directory fol, but not the files in any sub-directories.
Instead, we can do this search much more efficiently with find as follows:
find fol -name '*.txt' -exec 'grep' '-H' 'Delaware' '{}' '+'
This is a little bit harder to read, but the result will run much more quickly, and it will also examine all the subfolders automatically. The syntax is basically like this:
find [folders_to_search] [conditions] -exec [command_line_to_execute] '{}' '+'
In the example above, it is searching the directory fol with the condition of “any file whose name ends in .txt”, and for each such file, it is executing the same grep command you can see in the for loop above.
You might be asking: what are the '{}' and '+' at the end of the command line? The first one, '{}', is a placeholder for the name of each file. It plays the same role as the variable reference "$file" in the for loop previously, except that with find we don’t get to name the variable.
And the '+' is really just telling find “that’s where the -exec option ends”. You can also use a ';' instead (and sometimes you have to), but the '+' way is much faster for commands like grep because it combines many files (but not too many!) into a single call to grep, instead of calling grep once per file.
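For illustration, the same search with the ';' terminator would look like the sketch below; note that this version runs grep once per file, so it will be much slower on a big dataset.
find fol -name '*.txt' -exec 'grep' '-H' 'Delaware' '{}' ';'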
3 Initial questions
The first two questions are kind of general and apply to the entire lab. They are both optional (except if you used any resources that need to be documented as per the course honor policy).
What sources of help (if any) did you utilize to complete this lab? Please be specific.
What did you think of the lab overall? We hope that it was challenging but instructive, and maybe even fun. Again, if you can be specific that is helpful!
4 Single file (30 pts)
These preliminary problems ask you to find and actually read through a few specific emails. You can use the 1% dataset to answer these problems.
Useful bash tools for these questions: ls, cd, less, grep
What is the name of Scott Neal’s fraternity brother who plays the accordion?
(Remember, for these problems, you need to do two things. First, make a new code cell below each question, starting with %%bash, and put the commands there which you used to answer the question. Second, edit the markdown cell with the question itself and add your answer in text.)
Enron employee John Arnold sent an email discussing oil prices saying (among other things), “Who cares if we nuke afghanistan?”. On what date was this email sent? Type your date as YYYY/MM/DD.
On January 30, 2002, Mark Germann from Sacramento sent an email to a single Enron employee urging him to donate his ill-gotten gains.
What was the LAST NAME of the employee to whom Mr. Germann sent his letter?
After receiving it, what did that employee do with Mr. Germann’s email?
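In case it helps to picture the shape of such an answer cell, here is a minimal sketch; the keyword is just a made-up placeholder, not a hint toward any of the questions.
%%bash
# recursively and case-insensitively search every file under the current
# directory for a (hypothetical) keyword, listing the files that contain it
grep -r -i -l 'some_keyword' .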
4.1 Submit what you have so far
Submit your work so far:
submit -c=sd212 -p=lab02 lab02.ipynb
or
club -csd212 -plab02 lab02.ipynb
or use the web interface
5 Make it count (30 pts)
These questions ask you to compute counts of the number of emails with certain properties. Remember from the dataset description above that there are 3 sizes of datasets: enron is the full one; enron.1 has only 10% of the emails; and enron.01 has just 1% of them.
Start with the 1% dataset and work your way up.
Your answers will change for larger datasets, and the submit auto-testing scripts account for this. Remember, changing your notebook to work in the larger directory should just require changing the %cd command at the top of the notebook. Be sure to replace the actual answers in the markdown cells when you get it working for larger sizes!
Useful bash tools for these questions: find, grep, wc
What is the total number of emails in the dataset you are looking at?
Hint: don’t count the folders. Look at the documentation for the -type flag to find.
How many emails were sent by Enron employees in the dataset during the year 1999?
(For this question, only consider emails in subfolders named ‘sent’ or ‘sent_items’)
Hint: remember to look at the first time the Date header appears in the email. For that purpose, the -m option to grep might be useful.
Hint 2: One oddball employee has his sent folder in a sub-subfolder. Make sure you don’t miss it!
How many emails mention golf in the subject line?
Hint: You will probably want to grep twice: first to extract the first Subject: header line, and secondly to do a case-insensitive search for “golf” in each of those lines. Check out the -i and -c flags to grep.
How many emails contain profanity?
There is some flexibility in your definition of “profanity” here, but try to do your best to capture the kind of words that would be “bleeped” on network TV or radio, without double-counting. This research paper contains a classic list of such words.
Hint: make a regular expression for profanity for grep. Use the \| “alternation” operator to allow multiple possibilities.
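For reference on that alternation syntax (together with the -i and -c flags mentioned above), here is a small sketch using harmless placeholder words and a hypothetical file path:
%%bash
# count lines in one (hypothetical) file that mention any of several
# placeholder words; \| is alternation in grep's basic regular expressions,
# -i ignores case, and -c reports the number of matching lines
grep -i -c 'cat\|dog\|hamster' some_folder/some_file.txt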
5.1 Submit what you have so far
Submit your work so far:
submit -c=sd212 -p=lab02 lab02.ipynb
or
club -csd212 -plab02 lab02.ipynb
or use the web interface
5.2 Milestone
For this lab, the milestone means everything up to this point using the 10% dataset, which includes the following auto-tests:
nb_part4
nb_part5a
Remember: this milestone is not necessarily the half-way point. Keep going!
6 Choose your own adventure (40 pts)
For the last part of the lab, you will come up with your own question and try to answer it with the data. This will involve three steps:
- First you have to come up with a good question to ask!
- Next you will use bash to create a csv file which gathers the data needed to answer your question from the actual emails
- Finally you will make a graph using Python based on reading that csv file. Of course, the goal of your graph is to clearly present the evidence to answer your original question.
What question are you trying to answer or investigate using this data?
Create a bash cell (below this question) in your notebook that goes through the email data and creates a file mydata.csv which has the info needed to analyze your question.
Briefly explain the contents of your mydata.csv file. What does each row and column represent?
Which dataset did you use to create your mydata.csv file?
A. 1% dataset (smallest)
B. 10% dataset
C. 100% dataset (largest)
(Write just the letter of your answer below.)
Create a Python code cell that uses the data science libraries we have seen, such as Pandas and Plotly, to read in your mydata.csv file and create a graph from it. Save your graph as an image mygraph.png (with 1000x600 dimensions) to turn in.
Having completed your analysis, what conclusions (if any) can you draw about the original question? If you had time, what improvements would you make, or what would be the next thing you would investigate?
More details and tips for each part follow below.
6.1 Your question
We are being intentionally open-ended about what question you can ask. Note that you will not be graded on how compelling the results are, but on how interesting the question you are asking is, and on how accurate and clear your methods of analysis are.
Use these guidelines when thinking of what to investigate:
- Make sure it’s your unique idea! It shouldn’t be the same exact thing someone else is doing, or similar to the example below.
- It should involve many/most emails in the dataset, at least for searching. For example, if your study includes only the emails sent by Jeff Skilling on February 3, 2001, then it’s probably not interesting.
- It may involve some outside additional dataset, but that is not necessary or even encouraged.
If you are coming up blank with ideas, here are a few directions you might go in: unusual word usage; messaging patterns (who sends emails to who else); connections with real-world events of 1999-2001 (Bush v Gore election, 9/11 attacks, release of the movie Office Space); chain letters; scams and spam; differences between Enron-internal vs external emails; who works on weekends or holidays; racist or sexist language; mentions of certain schools or organizations.
6.2 Data gathering using bash
This should be similar to what you have been doing in the rest of the lab, with a few key differences:
- There is a %cd command at the end to take you back to your lab directory and out of the Enron one. This is important because you want to save and read your mydata.csv file in the place where your lab files live, not in the shared folder with the email datasets.
- You probably want to use the echo command to create the header line for your csv file, and then some other command(s) in bash to actually write the content lines below that.
- But there is a problem with using the >mydata.csv redirection in bash, since that will overwrite the file every time. Instead, after writing the header line with >mydata.csv, you can write the rest of the csv file with >>mydata.csv. The >> tells bash to append to the file rather than overwrite it. Like this:
echo "my,header,row" >mydata.csv
some_command_with_lots_of_output >>mydata.csv
- Be careful of delimiters if you deal with dates! The date lines in the emails contain commas and colons, so you might have to choose something else, like a semicolon, as the delimiter for the csv file you create.
- This is real data, and it does have some errors in the dates listed in emails. This may or may not affect you depending on what kind of analysis you are doing. But if something comes up when you try to read the csv file in Python, you may have to come back to this step and modify your bash commands to fix or eliminate the faulty data.
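To make the shape of this step concrete, here is a minimal sketch of a data-gathering cell. The question it pretends to answer, the column names, and the OUT path are all made up for illustration, and it writes directly to an explicit path rather than relying on a %cd at the end; adapt every piece of it to your own question and folders.
%%bash
# hypothetical example: one row per email, recording the file name and its
# first Subject: header line (the "Subject:" prefix is left in for simplicity),
# semicolon-delimited to avoid comma trouble
# OUT is a made-up location -- point it at your own lab folder instead
OUT="$HOME/sd212/lab02/mydata.csv"
echo "file;subject" >"$OUT"
find . -type f | while read -r f
do
    subject=$(grep -m 1 '^Subject:' "$f")
    echo "$f;$subject" >>"$OUT"
done
On the full dataset you would probably want the faster find ... -exec pattern from section 2.5 instead of a per-file loop, plus whatever cleaning your particular question needs.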
6.3 Analysis and graphing using Python
Coding-wise, this part should be similar to what we did on the previous lab and in much of SD211. Of course, you may need to read some documentation yourself to figure out how to make the graph look the way you want. Everyone might create different kinds of graphs since you have different questions!
Keep in mind that the csv file you are processing in Python was created by you in the previous step! So there is kind of a “negotiation” between what kinds of processing or cleaning you do in bash when creating the csv, versus what kinds of processing you do in Python when reading it in. The choice is yours, but just remember that if you are getting annoyed with one part (bash or Python), an option you have is to try to deal with the issue in the other part if that makes it easier.
6.4 Sample bash/csv/python/graph
The end of the notebook file contains a simple sample analysis of how often the word “food” is mentioned at different times of day. Check it out for some guidance on the kind of thing we are looking for.
But don’t try to make your solution look like this. You will have a different (and hopefully better) question you are asking and answering! So for example, your graph might be a completely different type — it doesn’t need to be a bar chart! Look at the plotly express gallery for inspiration.
6.5 To submit your work
submit -c=sd212 -p=lab02 mydata.csv mygraph.png lab02.ipynb
or
club -csd212 -plab02 mydata.csv mygraph.png lab02.ipynb
or use the web interface