Lab 3: Credit card agreements
1 Overview
Today’s lab is going to look at a whole lot of credit card agreements. Yes, that is the long booklet of important-looking information that no one reads. So let’s get the computer to read them for us!
The data we will use comes from the Consumer Financial Protection Bureau, which apparently collects the card agreements for all credit cards issued by U.S. banks each quarter. Because these are pdf files instead of plain text, we will use Python’s pypdf library to read them.
1.1 Deadlines
- Milestone: 2359 on Monday, 19 February
- Complete lab: 2359 on Friday, 23 February
1.2 Learning goals
- Gain experience in error handling from examining real-world data files which may be mis-formatted
- Use a popular Python library to scrape text from PDF files
- Use regular expressions to search for key phrases
- Use Python libraries to create simple visualizations
- Develop your own questions that can be investigated with data
2 Preliminaries
2.1 Markdown file to fill in
Unlike the last two labs, we will not be using Jupyter notebooks for this one. Instead, you will write your code in plain old .py files, and put your answers to the questions in a .md file.
Here is the file with questions to fill in and submit for today’s lab: lab03.md
You can run this wget command to download the blank md file directly from the command line:
wget "https://roche.work/courses/s24sd212/lab/md/lab03.md"
The first two questions are kind of general and apply to the entire lab. They are both optional (except if you used any resources that need to be documented as per the course honor policy).
- What sources of help (if any) did you utilize to complete this lab? Please be specific.
- What did you think of the lab overall? Again, if you can be specific that is helpful!
2.2 Accessing the datasets
The full dataset available from the CFPB site is a little less than 1GB, which is once again a bit large for you each to download and store separately.
So again, we have downloaded the raw data for you in the following folder on ssh.cs.usna.edu and mounted on all lab machines:
/home/mids/SD212/cc
Although this lab’s dataset has only a few thousand files, each of them is a PDF which takes longer to process, so it will be convenient to have some smaller datasets to get your code working with initially:
- 1% of original size, about 36 PDFs: /home/mids/SD212/cc.01
- 10% of original size, about 360 PDFs: /home/mids/SD212/cc.1
3 Python background
3.1 Writing files in Python
We have usually been just reading files in Python programs, but Python can also be used to create new text files pretty easily.
The syntax is like this:
filename = 'somefile.txt'
with open(filename, 'w') as fout:
    print('First line of my file', file=fout)
    n = 2
    print('line', n, 'of my file', file=fout)
Notice three crucial differences from opening a file for reading:
- We usually put the opening in a with block. That ensures that if your program crashes, the data written so far will actually be saved to the file. Very useful!
- You pass the 'w' argument to the open() function so that the file is opened in “writing” mode.
- To add a line of output to the file, use the print() function as normal, but with an extra named parameter file= for the name of your opened file handle.
3.2 Navigating directories in Python with pathlib
We have spent a lot of time writing bash commands to crawl through directories, but not so much with Python.
To go through directories and sub-directories in Python, find out whether files exist, etc., we can use the pathlib module.
Normally we represent file and directory names in Python using a normal string, like myfolder/myfile.txt. With pathlib, we can instead use the Path class to represent file and directory names, and then we get some convenient methods:
- Path(str): Turn a string into a Path object
- str(p): Turn a Path object back into a regular string
- p.iterdir(): Loop over the files (and subdirectories) of a given directory path p
- p.is_dir(): Returns true or false depending on whether p is a directory
- p.is_file(): Returns true or false depending on whether p is a normal file
- p.open(): Opens the file named by the Path object p. (It’s the same as doing open(str(p)).)
For a complete example, here is a Python program which uses pathlib to go through a directory books, open every .txt file in that directory, and print out how many lines that file has. It’s the equivalent of the bash one-liner wc -l books/*.txt:
from pathlib import Path

booksdir = Path('books')
for filepath in booksdir.iterdir():
    # filepath represents a single file or subfolder inside books/
    if str(filepath).endswith('.txt'):
        # count lines in the file
        handle = filepath.open()
        count = 0
        for line in handle:
            count += 1
        # print out the filename and number of lines
        print(filepath, count)
3.3 Scraping PDF files with pypdf
PDF files are binary files (not plain-text), so we can’t read through them and use string-processing tools like we are used to.
Instead, we can use the pypdf library to extract the text from a PDF file, and then use our normal Python skills to work with that text.
First, you need to install this library in mamba:
mamba activate sd212
mamba install pypdf pycryptodome
Now let’s see an example of using pypdf to extract the text from MIDREGS. First we can download the MIDREGS pdf and save it as midregs.pdf, from the command line using wget:
wget -O midregs.pdf 'https://www.usna.edu/Commandant/Directives/Instructions/5000-5999/CMDTMIDNINST_5400.6Y_-_MIDSHIPMEN_REGULATIONS_MANUAL.pdf'
Then the following Python program will extract the text from each page, and tell us which pages contain the word “fun”:
from pypdf import PdfReader

# start by opening the file and creating a PdfReader object
rdr = PdfReader('midregs.pdf')
# go through each page and look for fun
pagenum = 1
for page in rdr.pages:
    # get a regular Python string for all the text on this page of the pdf
    text = page.extract_text()
    if 'fun' in text:
        print(pagenum)
    pagenum += 1
4 Not-so-hidden fees (30 pts)
To get started, write a python program fees.py that goes through the PDFs in the 1% dataset cc.01 and counts how many times the word fee or fees appears, in total.
Here’s one way to tackle this:
1. Start with a single PDF file, maybe “cc.01/FIRST NATIONAL BANK/FNCLR-D-0922(AFTI) - FNCLR-D-0922 (AFTI).pdf”.
2. Copy the pypdf example above and modify it to work with this file. Run it! In this file, the string “fun” appears on pages 2 and 5.
3. Modify your program so that it uses a regex and counts instances of the string “fee” instead of “fun”. (Look back in your notes for how to use the re library and the findall() method.) This file has 22 occurrences of the string “fee”. (Note: the page numbers don’t actually matter anymore at this point, so you should be able to simplify your code!)
4. Tweak your regex so that it only matches the entire word “fee” or “fees”, ignoring case. There should be 33 occurrences in this example file. (Hint: look at the re module documentation to see how you can tell Python to ignore case in a call to findall().)
5. Now loop over all the pdf files in all subdirectories of cc.01, using the pathlib module as in the example above. Get your fees.py program to print out the total count at the end.

Note: some of the pdf files are improperly formatted and will give an error message when you try to open them with the PdfReader. Use proper error handling so that your program just ignores such files and moves on to the next one.

Another note: Extracting text from PDF files is kind of slow! Even with the 1% database cc.01, it might take around 1 minute to successfully find the total count.
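The whole-word, case-insensitive matching in steps 3 and 4 can be sketched with a small helper like the following (the regex and sample text here are illustrative, not the official solution):

```python
import re

# \b marks a word boundary, s? allows "fee" or "fees",
# and re.IGNORECASE makes the match case-insensitive
FEE_PATTERN = re.compile(r'\bfees?\b', re.IGNORECASE)

def count_fees(text):
    """Count whole-word occurrences of 'fee' or 'fees' in a string."""
    return len(FEE_PATTERN.findall(text))

# quick sanity check on a made-up snippet
sample = "Annual Fee: $95. No hidden fees! (Feedback is not a fee.)"
print(count_fees(sample))  # → 3; "Feedback" is not counted
```

In your fees.py, you would call a helper like this on the text of every page of every PDF and add up the results.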
Now fill in this question in the markdown file:
- What is the total number of times the word “fee” or “fees” appears in the 1% dataset cc.01?
4.1 Submit what you have so far
Save your files and submit your work so far:
submit -c=sd212 -p=lab03 lab03.md fees.py
or
club -csd212 -plab03 lab03.md fees.py
or use the web interface
4.2 Milestone
For this lab, the milestone means everything up to this point, which includes the following auto-tests:
- part4-md
- part4-feespy
This milestone is not the half-way point. Keep going!
5 State of the (credit) union (30 pts)
Copy your fees.py program to a new program states.py for this part.
We want to analyze the locations of the banks that are issuing all of these credit cards. Most of the PDFs contain one or more mailing addresses, presumably corresponding to where the bank is located.
Your states.py program should:
1. Loop over all the pdf files in one of the datasets (start with the 1% dataset in cc.01 and work your way up after it’s working perfectly).
2. Use pypdf to extract the text of each page of each pdf file into a string.
3. Use a regular expression to look through the text of each page for something that looks like a state abbreviation (two capital letters), followed by a single space, followed by a zip code (5 digits). Be sure not to include anything else; for example PO BOX 89909 should not count as a state OX, and NMLSR ID 399801 should not count for Idaho since that’s a 6-digit code.
4. Print all the state abbreviations that you find, one per line, with repeats, to a new text file cardstates.txt as you go.
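One plausible regex for the matching step (a sketch, not the only correct answer) uses a word boundary so the two capital letters can’t be the tail of a longer word like BOX, and a negative lookahead so a 6-digit number can’t pass as a zip code:

```python
import re

# \b blocks matches inside longer words like "BOX"; (?!\d) rejects
# 6-digit codes like "NMLSR ID 399801"; the parentheses capture just
# the two-letter state abbreviation
STATE_ZIP = re.compile(r'\b([A-Z]{2}) \d{5}(?!\d)')

sample = "PO BOX 89909, Wilmington, DE 19801. NMLSR ID 399801."
print(STATE_ZIP.findall(sample))  # → ['DE']
```

Because of the capturing group, findall() returns just the abbreviations, which you can print one per line to cardstates.txt.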
After running your states.py program, you should be able to easily use a single bash command-line to answer this question using the cardstates.txt file your program generated:
- For the largest dataset you have working so far, how many times does the state OR appear?
Aside: When you run this program on the largest dataset, you will see some extra messages printed to the command line such as “incorrect startxref pointer” and “Multiple definitions in dictionary”. These are reporting issues with the PDF files themselves which pypdf is able to ignore and move past, so you can ignore them too!
But of course this is Python so we can ultimately control everything! There are three lines of code you can add at the top of your program to prevent these messages altogether: read the documentation here.
5.1 Submit what you have so far
Save your files and submit your work so far:
submit -c=sd212 -p=lab03 lab03.md fees.py states.py cardstates.txt
or
club -csd212 -plab03 lab03.md fees.py states.py cardstates.txt
or use the web interface
6 Population control (30 pts)
We want to compare the prevalence of each state in the credit card agreements with the population of that state. For that, we first need another piece of data, the population of each state.
Bottom line: The goal of this part is to write a program tally.py that produces (and prints) a Pandas DataFrame with at least three columns: the state abbreviation, state population, and count of how many times that state was mentioned in the credit card agreement PDFs.
Below I have some steps and suggestions of how to make this DataFrame, but you are encouraged to stop now and try to figure it out your own way!
6.1 Find population data
Find a free dataset from the web that has data on the population of each state. Searching for something like “us state populations csv” should lead you to what you need.
Important: You want the state populations to be listed by abbreviation, to match up with the cardstates.txt file you already have. Depending on where you get your data, you might need to download a separate CSV with just the abbreviations in order to get this right. Feel free to use your command-line skills or anything else if needed to fix up the data.
Eventually you should create a file called pops.csv that is a normal CSV file with at least two columns: state abbreviations (like DE or MD) and population numbers.
Fill in this question when you’re finished.
- What website did you get your state population data from? Was the csv file perfect as-is or did you have to do some “massaging” to get what we need?
6.2 Tally up
Create a new python program tally.py. This program should read the big list of state names in your cardstates.txt from the previous part, and turn this into a Pandas DataFrame that has two columns, one for the state names, and one for the count of how many times that state appeared in the CC agreements.
Here are some Pandas functions and options that will be useful to get this done:
- read_csv(): You know this function well, but we haven’t used it like this before. The cardstates.txt file can be considered like a single-column csv file, since each line should just be a single state abbreviation with no commas. You will want to specify an option like names=['state'] or header=None to tell Pandas that there is no header line in your file.
- value_counts(): Call this function on a Pandas series to combine entries with the same name and get how many times each label appears in that series. What gets returned is a new, smaller series, which is indexed by the original series values, and where the series now contains integers counting up how many times each thing occurs.
- reset_index(): This one is new. What it does is take a single-column Pandas Series and turn it into a two-column DataFrame, where the first column is the index from the series. Using reset_index after a call to value_counts can be especially useful!
Your goal here is to get a Pandas DataFrame with a column for the state name and a column for how many times that state appeared in cardstates.txt.
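Putting those three pieces together might look like the sketch below. The in-memory StringIO stands in for your real cardstates.txt, and note that the column names produced by reset_index differ between pandas versions, so you may want to rename them afterwards:

```python
import io
import pandas as pd

# stand-in for cardstates.txt: one abbreviation per line, repeats allowed
fake_file = io.StringIO("MD\nDE\nMD\nVA\nMD\n")

states = pd.read_csv(fake_file, names=['state'])  # names= says there is no header row
counts = states['state'].value_counts()           # Series indexed by state abbreviation
df = counts.reset_index()                         # back to a two-column DataFrame
print(df)  # MD appears 3 times, DE and VA once each
```

In your actual tally.py you would pass the filename 'cardstates.txt' to read_csv instead of the StringIO object.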
6.3 Combine
Now it’s time to add the logic to your tally.py program so that it combines the state population data with the counts from the credit card agreements.
As usual, there are many possible ways to do this! Here’s one way:
At this point you should have a DataFrame with state abbreviations and populations, and a completely separate DataFrame that just has the state names and counts.
We want to combine these and “match up” the rows that correspond to the same state. The first thing to do is to get the columns for the state abbreviations to have the same column name in both DataFrames.
You can use the rename method of a DataFrame to fix up the column names.
Next we want to use the Pandas merge operation. Because this is a really sophisticated function (which we will learn more about later in the semester), I’m going to show you exactly how you want to use it, via a small illustrative example:
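Here is a small illustrative sketch of that merge, with made-up numbers and assumed column names (yours may differ):

```python
import pandas as pd

# toy population table (made-up numbers, not real census data)
pops = pd.DataFrame({'state': ['DE', 'MD', 'VA'],
                     'pop': [1000, 6000, 8000]})
# toy counts from the agreements; VA is deliberately missing
counts = pd.DataFrame({'state': ['DE', 'MD'],
                       'count': [7, 3]})

# a left merge keeps every state from pops; states with no
# agreements get NaN in the count column
merged = pops.merge(counts, on='state', how='left')
merged['count'] = merged['count'].fillna(0)
print(merged)  # VA gets a count of 0
```

The key points are merging on the shared 'state' column and using how='left' so that states with no agreements are kept rather than dropped.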
Make sure you do a .fillna(0) to change the NaNs for states that didn’t have any credit cards into zero counts, as they should be.
6.4 Question
Use your DataFrame to do a very simple analysis and answer this question:
- Which state has the highest ratio of credit card agreements per population? Write just the 2-letter abbreviation of the state.
6.5 Submit what you have so far
Save your files and submit your work so far:
submit -c=sd212 -p=lab03 lab03.md fees.py states.py cardstates.txt tally.py
or
club -csd212 -plab03 lab03.md fees.py states.py cardstates.txt tally.py
or use the web interface
7 Graph it (10 pts)
Modify your tally.py program so that it displays a scatterplot with population on the x-axis and the credit card agreement count on the y-axis. Each state should appear as a single dot, so (for example) a state on the bottom-right would have a high population and a low number of credit card companies.
Making the scatterplot using Plotly Express should be very simple if you have your DataFrame worked out from the last part. The second example on this page shows how to make a scatterplot from a dataframe and is a great starting point. A more complete description of the px.scatter() function is here.
Add a trendline to your scatterplot (look at the documentation pages above to find the right option). The trendline will show sort of the “average” relationship between state population and credit card agreements.
Once you are happy with how your graph looks, save it to a file called scatter.png. Then answer two final questions:
- What are the most significant outliers in the dataset? In your graph, these would be the states that are farthest from the trendline. What states are way above or below the “typical” population-scaled average?
- Choosing one outlier state you identified from the previous problem, try to do some quick research to make a plausible explanation of why that state has so many or so few credit card companies relative to its population.
(For example, some states have different tax laws, lending regulations, or court systems that may make them more or less attractive to the credit card companies.)
7.1 Submit your work
Save your files and submit your work:
submit -c=sd212 -p=lab03 lab03.md fees.py states.py cardstates.txt tally.py scatter.png
or
club -csd212 -plab03 lab03.md fees.py states.py cardstates.txt tally.py scatter.png
or use the web interface