Lab 2: Enron emails
1 Overview
In today’s lab we will examine a vast trove of emails from the Enron corporation, released to the public in 2002.
These emails were released into the public domain by the Federal Energy Regulatory Commission (FERC) as part of their investigation into what has been called the largest corporate scandal in U.S. history.
In short, Enron was an extremely successful energy trading company, with more than 20,000 employees and worth around $70 billion at the turn of the century. But it turns out much of their supposed worth was inflated or flat-out fraudulent, propped up by shady accounting practices and various financial shenanigans. When the scandal broke in late 2001, Enron’s stock price quickly lost more than 99% of its value and the company declared bankruptcy the next month.
While this historical context is important, our focus will mainly be on the insights we can glean from this large dataset of real human communication.
1.1 Deadlines
- Milestone: 2359 on Tuesday, 7 February
- Complete lab: 2359 on Sunday, 12 February
1.2 Learning goals
- Examine and gain insights from a large, raw, real-world dataset
- Use important data science command-line tools such as find and grep on hundreds of thousands of text files
- Learn how to process raw data using a combination of command-line and Python tools
- Use Python libraries to create simple visualizations
- Develop your own questions that can be investigated with data
2 Preliminaries
2.1 Markdown file to fill in
Here is the file with questions to fill in and submit for today’s lab: lab02.md
You can run this wget command to download the blank md file directly from the command line:
wget "https://roche.work/courses/s23sd212/lab/md/lab02.md"
The first two questions are kind of general and apply to the entire lab. They are both optional (except if you used any resources that need to be documented as per the course honor policy).
- What sources of help (if any) did you utilize to complete this lab? Please be specific.
- What did you think of the lab overall? We hope that it was challenging but instructive, and maybe even fun. Again, if you can be specific that is helpful!
2.2 Accessing the datasets
The full Enron dataset is very large, about 500,000 files with a total size of 3GB. You should not download this to your lab machine because that would take too long and use too much storage for every student to have their own copy of the full dataset.
Instead, you can use a copy which has been downloaded and extracted already, available from the lab machines or from midn.cs.usna.edu in the read-only folder:
/home/mids/SD212/enron
But the huge size of this data makes it difficult to get started. To help, we have two smaller datasets available. Start with the smallest dataset to make your life easier.
- 1% of original size, about 5,000 emails: /home/mids/SD212/enron.01
- 10% of original size, about 50,000 emails: /home/mids/SD212/enron.1
2.3 Email structure
You should explore the subfolders and files in the enron (or enron.1 or enron.01) directory using the command-line tools ls, cd, and less.
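For example, a first exploration of the 1% dataset might look something like this (shapiro-r is just the sample mailbox discussed below; any employee folder will do):

cd /home/mids/SD212/enron.01
ls                                   # one subfolder per employee
cd shapiro-r
ls                                   # that employee's mail folders
less federal_gov_t_affairs/35.       # page through one email; press q to quit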
Each regular file in a subdirectory represents a single email that an employee sent or received. Emails are stored as plain-text files with a few special formatting rules, the most important of which are the email headers that occur at the top of the file.
For example, check out the file shapiro-r/federal_gov_t_affairs/35. It shows an email thread between some Enron employees and an energy industry insider about (of all things) exchanging a fax number and downplaying the rumors of Enron’s vast fraud in October 2001.
The most recent headers are always at the top of the email:
Message-ID: <23609681.1075862231979.JavaMail.evans@thyme>
Date: Wed, 31 Oct 2001 07:02:27 -0800 (PST)
From: pr <.palmer@enron.com>
To: richard.shapiro@enron.com, j..kean@enron.com, linda.robertson@enron.com
Subject: FW: Here from scratch ... what's your fax #... we can try the old
fashioned way!
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Palmer, Mark A. (PR) </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MPALMER>
X-To: Shapiro, Richard </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Rshapiro>, Kean, Steven J. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Skean>, Robertson, Linda </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Lrobert3>
Most important for this lab are the first Date header line (which shows when the email was sent or received), the first Subject header line, and the first From, To, and possibly CC header lines.
There will be many more header lines (especially those starting with X, like X-To etc.) that you can ignore.
After the initial group of header lines is the actual body of the email, i.e., whatever message is being sent.
Most emails (including this one) may contain many forwards or replies, which are older emails copied below the current one. These will usually be separated with a line that has a bunch of dashes, like -----Original Message-----.
There might be more header lines like To: or Date: that show up lower in the forwarded or replied-to emails; this is why, in your analysis, you mostly want to focus on the first header line of interest.
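For example, this one-liner prints only the first Date header of the sample email above; the -m 1 option makes grep stop after the first match, so Date lines in any forwarded messages below it are ignored:

grep -m 1 '^Date:' shapiro-r/federal_gov_t_affairs/35.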
2.4 The find command
(This section may be useful when you are getting stuck. Consider it a supplemental reference on how to effectively use find and grep for some problems in this lab. You might want to skip it for now and come back later when you need it.)
For many problems, especially when using the larger datasets, you will need to make effective use of the find command, often combined with other command-line tools.
For example, if I want to search every .txt file in a folder fol for references to the first state, I could do it with a one-liner:
grep 'Delaware' fol/*.txt
This is OK, but it will stop working at some point if there are too many files, because the operating system puts a limit on how long the command line can be.
We could do the same thing with a loop like this:
for file in fol/*.txt
do
grep -H 'Delaware' "$file"
done
Here I used the -H option to grep (one of many useful options you can read about in the grep man page!) to force grep to display the names of files with matches.
This loop now works for any number of files, but it still has two shortcomings. First, it’s kind of slow, because the grep command needs to be executed separately once per file; and second, it only searches the txt files directly under the directory fol, but not the files in any sub-directories.
Instead, we can do this search much more efficiently with find as follows:
find fol -name '*.txt' -exec 'grep' '-H' 'Delaware' '{}' '+'
This is a little bit harder to read, but the result will run much more quickly, and it will also examine all the subfolders automatically. The syntax is basically like this:
find [folders_to_search] [conditions] -exec [command_line_to_execute] '{}' '+'
In the example above, it is searching the directory fol with the condition of “any file whose name ends in .txt”, and for each such file, it is executing the same grep command you can see in the for loop above.
You might be asking: what are the '{}' and '+' at the end of the command line? The first one, '{}', is a placeholder for the name of each file. It plays the same role as the variable reference "$file" in the for loop previously, except that with find we don’t get to name the variable.
And the '+' is really just telling find “that’s where the -exec option ends”. You can also use a ';' instead (and sometimes you have to), but the '+' way is much faster for commands like grep because it combines many files (but not too many!) into a single call to grep, instead of calling grep once per file.
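To make the difference concrete, here are the two forms side by side (the same search as above; only the terminator changes):

find fol -name '*.txt' -exec grep -H 'Delaware' '{}' '+'   # many files per grep call
find fol -name '*.txt' -exec grep -H 'Delaware' '{}' ';'   # one grep call per file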
3 Single file (20 pts)
These preliminary problems ask you to find and actually read through a few specific emails. You can use the 1% dataset to answer these problems.
Useful bash tools for these questions: ls, cd, less, grep
- What is the name of Scott Neal’s fraternity brother who plays the accordion?
- Enron employee John Arnold sent an email discussing oil prices saying (among other things), “Who cares if we nuke afghanistan?”. On what date was this email sent? Type your date as YYYY/MM/DD.
- On January 30, 2002, Mark Germann from Sacramento sent an email to a single Enron employee urging him to donate his ill-gotten gains. What was the LAST NAME of the employee to whom Mr. Germann sent his letter? After receiving it, what did that employee do with Mr. Germann’s email?
3.1 Submit what you have so far
Submit your work so far:
submit -c=sd212 -p=lab02 lab02.md
or
club -csd212 -plab02 lab02.md
or use the web interface
4 Make it count (25 pts)
These questions ask you to compute counts of the number of emails with certain properties. Remember from the dataset description above that there are 3 sizes of datasets: enron is the full one; enron.1 has only 10% of the emails; and enron.01 has just 1% of them.
Start with the 1% dataset. You can come back later and replace these answers with the 10% or 100% dataset for more possible points.
Save the bash lines you use in a file called counts.sh. Be sure to add helpful comments to explain what line(s) you used to answer each question. (Think of this as “showing your work” - the precise format of how the bash lines are organized and how the output is presented is up to you.)
Useful bash tools for these questions: find, grep, wc
- What is the total number of emails in the dataset you are looking at?
Hint: don’t count the folders. Look at the documentation for the -type flag to find. (A sketch illustrating these hints appears after this list.)
- How many emails were sent by Enron employees in the dataset during the year 1999? (For this question, only consider emails in subfolders named ‘sent’ or ‘sent_items’.)
Hint: remember to look at the first time the Date header appears in the email. For that purpose, the -m option to grep might be useful.
Hint 2: One oddball employee has his sent folder in a sub-subfolder. Make sure you don’t miss it!
- How many emails mention golf in the subject line?
Hint: You will probably want to grep twice: first to extract the first Subject: header line, and second to do a case-insensitive search for “golf” in each of those lines. Check out the -i and -c flags to grep.
- How many emails contain profanity? There is some flexibility in your definition of “profanity” here, but try your best to capture the kind of words that would be “bleeped” on network TV or radio, without double-counting. This research paper contains a classic list of such words.
Hint: make a regular expression for profanity for grep. Use the \| “alternation” operator to allow multiple possibilities.
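Here is that promised sketch: hedged example command patterns for these counting questions. The dataset path is just the 1% copy, and the profanity regex is a placeholder you must replace with your own word list:

# counts.sh: an illustrative sketch; other approaches are fine too
D=/home/mids/SD212/enron.01          # swap in enron.1 or enron later

# Total emails: -type f matches only regular files, not folders
find "$D" -type f | wc -l

# Sent during 1999: first Date header of each email under a sent folder
find "$D" \( -path '*/sent/*' -o -path '*/sent_items/*' \) -type f \
    -exec grep -m 1 '^Date:' '{}' '+' | grep -c ' 1999 '

# Golf in the subject: first Subject header, then case-insensitive count
find "$D" -type f -exec grep -m 1 '^Subject:' '{}' '+' | grep -i -c 'golf'

# Profanity: -l lists each matching file once, so no double-counting
find "$D" -type f -exec grep -l -i 'badword1\|badword2' '{}' '+' | wc -l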
4.1 Submit what you have so far
Submit your work so far:
submit -c=sd212 -p=lab02 counts.sh lab02.md
or
club -csd212 -plab02 counts.sh lab02.md
or use the web interface
4.2 Milestone
For this lab, the milestone means everything up to this point, which includes the following auto-tests:
md_part3
md_part4a
md_part4b
sh_counts
Remember: this milestone is not necessarily the half-way point. Keep going!
5 Profane and profane accessories (35 pts)
Now let’s do some data analysis! We want to build off the last question and analyze how the use of profanity varies by hour of the day.
To do this, you will work in three steps:
- First, write a bash script makecsv.sh that goes through the emails in the dataset and extracts two pieces of information from each email: the date when the email was sent, and an indication of whether that email contains profanity. Your bash script should create one or more csv files with one row per email and columns for the date and for the profanity. (A hedged skeleton of such a script is sketched just after this list.)
- Next, write a python program graph.py that reads in the csv file(s) your bash script created, and produces a histogram showing the number of emails that mention profanity, grouped by which hour of the day they were sent (0 up to 23).
- Finally, run it! Run your bash script to create the csv file or files, then run your python program to make the graph. Save your graph as profanity.png to submit.
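To give a feel for the overall shape, here is that skeleton for makecsv.sh. Everything in it (the '|' delimiter, the file and column names, and the placeholder profanity regex) is an illustrative assumption, not a requirement; the tips below explain some of the choices:

# makecsv.sh: illustrative skeleton only; adapt it to your own design
D=/home/mids/SD212/enron.01
echo 'file|date|swears' > emails.csv         # write the header row first
find "$D" -type f | while read -r f
do
    # first Date header, with the "Date: " prefix stripped off; you may
    # also want to strip the timezone here (see the tips below)
    d=$(grep -m 1 '^Date:' "$f" | sed 's/^Date: //')
    # how many lines match the (placeholder) profanity regex, 0 if none
    n=$(grep -c -i 'badword1\|badword2' "$f")
    echo "$f|$d|$n" >> emails.csv            # append one row per email
done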
As before, we strongly recommend you start with the small dataset. Work through the entire problem on the smallest dataset, and then try on the larger datasets (to earn more points) once your code is working perfectly.
There are multiple ways to tackle this, and it will be challenging. You are ready for the challenge! Here are a few tips and suggestions that may be helpful.
- All you really need is a 1 or 0 telling you whether the email has profanity or not, but it is probably easier to count the number of lines with profanity using the -c option to grep. Also remember the -H option forces grep to print the filename.
- When you create a .csv file, you have to make the header row yourself. The usual way to do this is to redirect output into the csv filename you want. We can use the >> redirect to append to the file after writing the first line, like:
echo "my,header,row" >file.csv
some_command_with_lots_of_output >>file.csv
- Be careful of the delimiters that you will use! The default way to separate columns in a csv file is a comma. This can work, but be careful because the dates in the emails might also contain commas already. Some command-line tools (such as grep -c) use a colon, but there are also colons in the date lines of the emails! There are multiple ways to deal with this, which will be sort of a negotiation between your bash script (to create the csv file(s)) and your python program (to read and process them). The thing to remember is that you can make the delimiter be whatever character you like, as long as it doesn’t occur within any of the columns, and as long as you tell pandas.read_csv() to use the same thing.
- As you have learned, parsing datetime objects can be kind of a pain in python, especially when time zones get involved (as they are for these emails). For the purposes of this question, we actually don’t care about the timezone at all and want to ignore it. My suggestion is to save yourself some headaches and strip off the timezone information in your bash script before saving it to the csv file. That way, pandas won’t get tripped up reading them in.
- This is real data, and it does have some errors in the dates listed in the emails. When you try to call pd.to_datetime(), look carefully at the error message to try and figure out what’s going on. Rather than just ignore the errors, you should be able to correct them in your bash script!
- If you choose to create two separate csv files (one for profanity and one for the dates), make sure that they both contain one line for every file, in the same order. Otherwise, merging the two dataframes in pandas will be much more complicated.
- To create the histogram, use plotly express and the px.histogram() function we have seen before. The dataframe you pass in should already be filtered by row to only include emails that contain profanity. Make a column for the hour each email was sent, and set that column name as the x option for px.histogram(). (Note, there is no need for bar grouping or colors; just a simple histogram with one bar for each hour of the day is good.)
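Putting these tips together, here is a minimal sketch of what graph.py could look like, assuming the '|' delimiter and the file/column names from the makecsv.sh skeleton above:

# graph.py: illustrative sketch matching the makecsv.sh skeleton above
import pandas as pd
import plotly.express as px

df = pd.read_csv('emails.csv', sep='|')   # same delimiter the bash script used
df['date'] = pd.to_datetime(df['date'])   # assumes timezones were stripped in bash
df['hour'] = df['date'].dt.hour           # hour sent, 0 through 23
profane = df[df['swears'] > 0]            # only emails containing profanity
fig = px.histogram(profane, x='hour')     # one bar per hour of the day
fig.write_image('profanity.png')          # png export needs the kaleido package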
Your final graph might look something like this, except of course that the data here is entirely made up (and much “cleaner” and more dramatic than the true story):
Take a few minutes to answer these questions in the markdown file after completing this part:
- Explain in just a few sentences your approach to solving this problem, and how your code works. Be sure to mention which dataset (1%, 10%, or full) you managed to get it working for.
- Looking at your final graph, which time(s) of day were employees most likely to use profanity? What could explain this?
- Can you think of any shortcomings or biases in this analysis? If so, explain briefly what you might do to correct it.
5.1 Submit what you have so far
Submit your work so far:
submit -c=sd212 -p=lab02 counts.sh makecsv.sh graph.py profanity.png lab02.md
or
club -csd212 -plab02 counts.sh makecsv.sh graph.py profanity.png lab02.md
or use the web interface
6 Choose your own adventure (20 pts)
6.1 Your question
Think of another interesting question or connection which you can interrogate using this dataset, and investigate it!
We are being intentionally open-ended about what you do here. Note that you will not be graded on how compelling the results are, but on how interesting the question you are asking is, and on how accurate and clear your methods of analysis are.
Use these guidelines when thinking of what to investigate:
- It probably shouldn’t have anything to do with profanity. Think of a different idea!
- It should involve many/most emails in the dataset, at least for searching. For example, if your study includes only the emails sent by Jeff Skilling on February 3, 2001, then it’s probably not interesting.
- It may involve some outside additional dataset, but that is not necessary or even encouraged.
If you are coming up blank with ideas, here are a few directions you might go in: unusual word usage; messaging patterns (who sends emails to whom); connections with real-world events of 1999-2001 (Bush v Gore election, 9/11 attacks, release of the movie Office Space); chain letters; scams and spam; differences between Enron-internal vs external emails; who works on weekends or holidays; racist or sexist language; mentions of certain schools or organizations.
- What question are you trying to answer or investigate using this data?
6.2 Your analysis
To perform your analysis, you should follow the same rough outline from the previous part:
- Write a bash script mybash.sh that gathers the information about what you are investigating from the emails in the dataset, creating one or more csv files.
- Write a Python program mypy.py that reads these csv file(s) and creates a nice visualization of what you are trying to investigate.
- Save your graph to mygraph.png.
- Having completed your analysis, what conclusions (if any) can you draw about the original question? If you had time, what improvements would you make or what would be the next thing you would investigate?
6.3 Submit your work
submit -c=sd212 -p=lab02 counts.sh makecsv.sh graph.py profanity.png mybash.sh mypy.py mygraph.png lab02.md
or
club -csd212 -plab02 counts.sh makecsv.sh graph.py profanity.png mybash.sh mypy.py mygraph.png lab02.md
or use the web interface