This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 3: Credit card agreements

1 Overview

Today’s lab is going to look at a whole lot of credit card agreements. Yes, that is the long booklet of important-looking information that no one reads. So let’s get the computer to read them for us!

The data we will use comes from the Consumer Financial Protection Bureau, which apparently collects the card agreements for all credit cards issued by U.S. banks each quarter. Because these are PDF files instead of plain text, we will use Python’s pypdf library to read them.

1.1 Deadlines

  • Milestone: 2359 on Tuesday, 21 February
  • Complete lab: 2359 on Sunday, 26 February

1.2 Learning goals

  • Gain experience in error handling from examining real-world data files which may be mis-formatted
  • Use a popular Python library to scrape text from PDF files
  • Use regular expressions to search for key phrases
  • Use Python libraries to create simple visualizations
  • Develop your own questions that can be investigated with data

2 Preliminaries

2.1 Markdown file to fill in

Here is the file with questions to fill in and submit for today’s lab: lab03.md

You can run this wget command to download the blank md file directly from the command line:

wget "https://roche.work/courses/s23sd212/lab/md/lab03.md"

The first two questions are kind of general and apply to the entire lab. They are both optional (except if you used any resources that need to be documented as per the course honor policy).

  1. What sources of help (if any) did you utilize to complete this lab? Please be specific.

  2. What did you think of the lab overall? Again, if you can be specific that is helpful!

2.2 Accessing the datasets

The full dataset available from the CFPB site is a little less than 1GB, which is once again a bit large for you each to download and store separately.

So again, we have downloaded the raw data for you in the following folder on midn.cs.usna.edu and mounted on all lab machines:

/home/mids/SD212/cc

Although this lab’s dataset has only a few thousand files, each of them is a PDF which takes a few seconds to process, so it will be convenient to have some smaller datasets to get your code working with initially:

  • 1% of original size, about 36 PDFs: /home/mids/SD212/cc.01

  • 10% of original size, about 360 PDFs: /home/mids/SD212/cc.1

3 Python background

3.1 Writing files in Python

We have usually been just reading files in Python programs, but Python can also be used to create new text files pretty easily.

The syntax is like this:

filename = 'somefile.txt'

with open(filename, 'w') as fout:
    print('First line of my file', file=fout)
    n = 2
    print('line', n, 'of my file', file=fout)

Notice three crucial differences from opening a file for reading:

  • We usually put the opening in a with block. That ensures the file is properly closed when the block ends, even if your program crashes partway through, so the data written so far will actually be saved to the file. Very useful!
  • You pass the 'w' argument to the open() function so that the file is opened in “writing” mode.
  • To add a line of output to the file, use the print() function as normal, but with an extra named parameter file= for the name of your opened file handle.
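To convince yourself that the lines really land in the file, here is a small sketch that writes the same two lines and then reads them back. The temporary directory is just an assumption to keep the example from cluttering your lab folder; any writable location would do.

```python
import os
import tempfile

# Work in a throwaway directory (an assumption for this sketch only)
workdir = tempfile.mkdtemp()
filename = os.path.join(workdir, 'somefile.txt')

with open(filename, 'w') as fout:
    print('First line of my file', file=fout)
    n = 2
    print('line', n, 'of my file', file=fout)

# The with block has ended, so the file is closed and safe to read back
with open(filename) as fin:
    lines = fin.read().splitlines()

print(lines)
```

Note that print() still puts spaces between its arguments and a newline at the end, exactly as it does when printing to the screen.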

3.2 Directories and files with pathlib

We have spent a lot of time writing bash commands to crawl through directories, but not so much with Python.

To go through directories and sub-directories in Python, find out whether files exist, etc., we can use the pathlib module.

Normally we represent file and directory names in Python using a normal string, like myfolder/myfile.txt. With pathlib, we can instead use the Path class to represent file and directory names, and then we get some convenient methods:

  • Path(str): Turn a string into a Path object
  • str(p): Turn a Path object back into a regular string
  • p.iterdir(): Loop over the files (and subdirectories) of a given directory path p
  • p.is_dir(): Returns true or false depending on whether p is a directory
  • p.is_file(): Returns true or false depending on whether p is a normal file
  • p.open(): Opens the file named by the Path object p. (It’s the same as doing open(str(p)).)

For a complete example, here is a Python program which uses pathlib to go through a directory books, open every .txt file in that directory, and print out how many lines that file has. It’s the equivalent of the bash one-liner

wc -l books/*.txt

from pathlib import Path

booksdir = Path('books')
for filepath in booksdir.iterdir():
    # filepath represents a single file or subfolder inside books/
    if str(filepath).endswith('.txt'):
        # count lines in the file; the with block closes it when we're done
        with filepath.open() as handle:
            count = 0
            for line in handle:
                count += 1
        # print out the filename and number of lines
        print(filepath, count)
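The example above only looks at files directly inside books/. Later in this lab the PDFs sit inside subfolders, so here is a rough sketch of one way to go deeper using only the methods listed above. The function name find_files and the little made-up directory tree are assumptions for illustration, not part of the lab files.

```python
import tempfile
from pathlib import Path

def find_files(folder, ending):
    """Return every file under folder (at any depth) whose name ends with ending."""
    found = []
    for entry in folder.iterdir():
        if entry.is_dir():
            # entry is a subfolder, so search inside it too
            found.extend(find_files(entry, ending))
        elif entry.is_file() and str(entry).endswith(ending):
            found.append(entry)
    return sorted(found)

# Build a tiny made-up directory tree to try it on
top = Path(tempfile.mkdtemp())
(top / 'bank_a').mkdir()
(top / 'bank_b').mkdir()
(top / 'bank_a' / 'card1.pdf').write_text('placeholder, not a real PDF')
(top / 'bank_b' / 'card2.pdf').write_text('placeholder, not a real PDF')
(top / 'bank_b' / 'notes.txt').write_text('not a pdf at all')

for path in find_files(top, '.pdf'):
    print(path.relative_to(top))
```

The pathlib module also has an rglob() method that does this kind of recursive search for you, for example folder.rglob('*.pdf'), if you would rather not write the recursion yourself.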

3.3 Scraping PDF files with pypdf

PDF files are binary files (not plain-text), so we can’t read through them and use string-processing tools like we are used to.

Instead, we can use the pypdf library to extract the text from a PDF file, and then use our normal Python skills to work with that text.

First, you need to install this library in mamba:

mamba activate sd212
mamba install pypdf pycryptodome

Now let’s see an example of using pypdf to extract the text from MIDREGS. First we can download the MIDREGS pdf and save it as midregs.pdf, from the command line using wget:

wget -O midregs.pdf 'https://www.usna.edu/Commandant/Directives/Instructions/5000-5999/CMDTMIDNINST_5400.6Y_-_MIDSHIPMEN_REGULATIONS_MANUAL.pdf'

Then the following Python program will extract the text from each page, and tell us which pages contain the word “fun”:

from pypdf import PdfReader

# start by opening the file and creating a PdfReader object
rdr = PdfReader('midregs.pdf')

# go through each page and look for fun
pagenum = 1
for page in rdr.pages:
    # get a regular Python string for all the text on this page of the pdf
    text = page.extract_text()
    if 'fun' in text:
        print(pagenum)
    pagenum += 1

4 Not-so hidden fees (30 pts)

To get started, write a Python program fees.py that goes through the PDFs in the 1% dataset cc.01 and counts how many times the word fee or fees appears, in total.

Here’s one way to tackle this:

  • Start with a single PDF file, maybe

    cc.01/FIRST NATIONAL BANK/FNCLR-D-0922(AFTI) - FNCLR-D-0922 (AFTI).pdf

  • Copy the pypdf example above and modify it to work with this file. Run it! In this file, the string “fun” appears on pages 2 and 5.

  • Modify your program so that it uses a regex and counts instances of the string “fee” instead of “fun”. (Look back in your notes for how to use the re library and the findall() method.)

    This file has 22 occurrences of the string “fee”.

    (Note: the page numbers don’t actually matter anymore at this point, so you should be able to simplify your code!)

  • Tweak your regex so that it only matches the entire word “fee” or “fees”, ignoring case.

    There should be 33 occurrences in this example file.

    (Hint: look at the re module documentation to see how you can tell Python to ignore case in a call to findall().)

  • Now loop over all the pdf files in all subdirectories of cc.01, using the pathlib module as in the example above. Get your fees.py program to print out the total count at the end.

    Note: some of the pdf files are improperly formatted and will give an error message when you try to open them with the PdfReader. Use proper error handling so that your program just ignores such files and moves on to the next one.

    Another note: Extracting text from PDF files is kind of slow! Even with the 1% dataset cc.01, it might take around 1 minute to successfully find the total count.
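If the regex and error-handling steps feel abstract, here is a minimal sketch of both ideas on made-up text, with no PDFs involved. The names count_fees, total_fees, and fake_get_text are hypothetical; fake_get_text simply stands in for whatever pypdf-based text extraction you write, and it raises an error for one "file" to imitate a badly formatted PDF.

```python
import re

# \b marks a word boundary, s? makes the plural optional,
# and re.IGNORECASE lets "Fee" and "FEES" match too
FEE_RE = re.compile(r'\bfees?\b', re.IGNORECASE)

def count_fees(text):
    """Count whole-word occurrences of fee or fees in one string."""
    return len(FEE_RE.findall(text))

def total_fees(names, get_text):
    """Add up counts over many files, skipping any file that raises an error."""
    total = 0
    skipped = []
    for name in names:
        try:
            total += count_fees(get_text(name))
        except Exception as err:
            # a bad file should not stop the whole run
            print('skipping', name, '-', err)
            skipped.append(name)
    return total, skipped

# Made-up stand-in for real PDF text extraction
fake_texts = {
    'a.pdf': 'Annual Fee: none. Late fees apply. Coffee is free.',
    'b.pdf': 'FEES may change.',
}

def fake_get_text(name):
    if name not in fake_texts:
        raise ValueError('cannot read ' + name)
    return fake_texts[name]

total, skipped = total_fees(['a.pdf', 'broken.pdf', 'b.pdf'], fake_get_text)
print(total, skipped)
```

Catching the very general Exception keeps the sketch short. In your own program you may prefer a narrower exception type once you have seen which errors the bad PDFs actually cause.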

Now fill in this question in the markdown file:

  1. What is the total number of times the word “fee” or “fees” appears in the 1% dataset cc.01?

4.1 Submit what you have so far

Save your files and submit your work so far:

submit -c=sd212 -p=lab03 lab03.md fees.py

or

club -csd212 -plab03 lab03.md fees.py

or use the web interface

4.2 Milestone

For this lab, the milestone means everything up to this point, which includes the following auto-tests:

part4-md
part4-feespy

This milestone is not the half-way point. Keep going!

5 State of the (credit) union (30 pts)

Copy your fees.py program to a new program states.py for this part.

We want to analyze the locations of the banks that are issuing all of these credit cards. Most of the PDFs contain one or more mailing addresses, presumably corresponding to where the bank is located.

Your states.py program should:

  • Loop over all the pdf files in one of the datasets (start with the 1% dataset in cc.01 and work your way up after it’s working perfectly)

  • Use pypdf to extract the text of each page of each pdf file into a string.

  • Use a regular expression to look through the text of each page for something that looks like a state abbreviation (two capital letters), followed by a single space, followed by a zip code (5 digits).

    Be sure not to include anything else; for example PO BOX 89909 should not count as a state OX, and NMLSR ID 399801 should not count for Idaho since that’s a 6-digit code.

  • Print all the state abbreviations that you find, one per line, with repeats, to a new text file cardstates.txt as you go.
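As a rough illustration of the pattern described above, here is a sketch run on a made-up sentence that includes both of the tricky cases. The regex is one possible choice, not the only correct one, and it accepts any two capital letters, so it does not check that the match is a real state.

```python
import re

# two capital letters as a whole "word", one space, then exactly five digits
# (the trailing \b rejects longer digit runs such as 399801)
STATE_ZIP_RE = re.compile(r'\b([A-Z]{2}) \d{5}\b')

def find_states(text):
    """Return the two-letter codes that are followed by a 5-digit zip code."""
    # with a single group in the pattern, findall returns just that group
    return STATE_ZIP_RE.findall(text)

sample = ('Mail to PO BOX 89909, Wilmington, DE 19801. '
          'NMLSR ID 399801. Portland, OR 97201-1234.')
print(find_states(sample))   # ['DE', 'OR']
```

The leading \b is what stops OX from matching inside BOX, since there is no word boundary between the B and the O.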

After running your states.py program, you should be able to easily use a single bash command-line to answer this question using the cardstates.txt file your program generated:

  1. For the largest dataset you have working so far, how many times does the state OR appear?

Aside: When you run this program on the largest dataset, you will see some extra messages printed to the command line such as “incorrect startxref pointer” and “Multiple definitions in dictionary”. These are reporting issues with the PDF files themselves which pypdf is able to ignore and move past, so you can ignore them too!

But of course this is Python so we can ultimately control everything! There are three lines of code you can add at the top of your program to prevent these messages altogether: read the documentation here.
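For reference, one approach that should work looks like the sketch below. It assumes that pypdf reports these messages through Python's standard logging module under the logger name "pypdf"; please check that assumption against the pypdf documentation for the version you have installed.

```python
import logging

# Assumption: pypdf sends its warnings to the logger named "pypdf"
logger = logging.getLogger("pypdf")
# Only show messages at ERROR level or above, hiding ordinary warnings
logger.setLevel(logging.ERROR)
```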

5.1 Submit what you have so far

Save your files and submit your work so far:

submit -c=sd212 -p=lab03 lab03.md fees.py states.py cardstates.txt

or

club -csd212 -plab03 lab03.md fees.py states.py cardstates.txt

or use the web interface

6 Population control (30 pts)

We want to compare the prevalence of each state in the credit card agreements with the population of that state. For that, we first need another piece of data, the population of each state.

Bottom line: The goal of this part is to write a program tally.py that produces (and prints) a Pandas DataFrame with at least three columns: the state abbreviation, state population, and count of how many times that state was mentioned in the credit card agreement PDFs.

Below I have some steps and suggestions of how to make this DataFrame, but you are encouraged to stop now and try to figure it out your own way!

6.1 Find population data

Find a free dataset from the web that has data on the population of each state. Searching for something like “us state populations csv” should lead you to what you need.

Important: You want the state populations to be listed by abbreviation, to match up with the cardstates.txt file you already have. Depending on where you get your data, you might need to download a separate CSV with just the abbreviations in order to get this right. Feel free to use your command-line skills or anything else if needed to fix up the data.

Eventually you should create a file called pops.csv that is a normal CSV file with at least two columns: state abbreviations (like DE or MD) and population numbers.

Fill in this question when you’re finished.

  1. What website did you get your state population data from? Was the csv file perfect as-is or did you have to do some “massaging” to get what we need?

6.2 Tally up

Create a new Python program tally.py. This program should read the big list of state abbreviations in your cardstates.txt from the previous part, and turn this into a single Pandas Series of total counts, where the Series is indexed by the state abbreviations.

Here are some Pandas functions and options that will be useful to get this done:

  • read_csv:

    You know this function well, but we haven’t used it like this before.

    The cardstates.txt file can be considered like a single-column csv file, since each line should just be a single state abbreviation with no commas.

    You will want to specify an option like names=['state'] or header=None to tell Pandas that there is no header line in your file.

  • Selecting a single pandas column using square brackets

    Remember that a DataFrame has multiple columns. Each column of a DataFrame is a Series. So when we use square brackets on a DataFrame and specify a column name, we get back a Series.

  • value_counts

    Call this function on a Pandas series to combine entries with the same name and get how many times each label appears in that series.

    What gets returned is a new, smaller series, which is indexed by the original series values, and where the series now contains integers counting up how many times each thing occurs.

Your goal here is to get a Pandas DataFrame or Series where the index is the name of each state and the value in the column is how many times that state appeared in cardstates.txt.
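Here is a small sketch of how those three pieces fit together. To keep it self-contained it reads from an in-memory string with a few made-up lines instead of your real cardstates.txt; with the real file you would pass the filename to read_csv instead.

```python
import io
import pandas as pd

# Stand-in for cardstates.txt: one abbreviation per line, no header (made-up data)
fake_file = io.StringIO('DE\nOR\nDE\nMD\nOR\nDE\n')

# names= supplies the column name, since the file has no header line
df = pd.read_csv(fake_file, names=['state'])

# square brackets pick out one column as a Series,
# and value_counts() tallies how often each value appears
counts = df['state'].value_counts()
print(counts)
```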

6.3 Combine

Now it’s time to add the logic to your tally.py program so that it combines the state population data with the counts from the credit card agreements.

As usual, there are many possible ways to do this! Here’s one way:

  • At this point you should have a DataFrame with state abbreviations and populations, and a completely separate Series that just has the state counts, indexed by the state abbreviations.

  • We want to combine these and “match up” the rows that correspond to the same state. The first task is to make sure both things are indexed in the same way.

    In this case, you want to use the set_index function so that your DataFrame is also indexed by the state abbreviations.

    So now you should have a DataFrame with the populations, and a Series with the counts (from credit card agreements), both indexed by state abbreviations.

  • Once the indexing is in place, just assign the Series as a new column in your DataFrame. You can use the same syntax as looking up a single column, like

    my_dataframe['new_column_name'] = my_series
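The steps above can be sketched as follows. All of the numbers here are invented for illustration, and the column names population and agreements are assumptions; use whatever names your own pops.csv has.

```python
import pandas as pd

# Made-up population table, as if it had been read from pops.csv
pops = pd.DataFrame({'state': ['DE', 'MD', 'OR'],
                     'population': [1000, 6000, 4000]})

# Made-up counts, indexed by state abbreviation (note: no entry for MD)
counts = pd.Series({'DE': 50, 'OR': 20})

# Index the DataFrame by state so the rows can be matched up with the Series
pops = pops.set_index('state')

# Pandas lines up the values by index label, not by position
pops['agreements'] = counts
print(pops)
```

Because the matching is done by label, a state that never shows up in the counts (MD in this made-up data) ends up with a missing value (NaN) rather than someone else's number.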

6.4 Question

Use your DataFrame to do a very simple analysis and answer this question:

  1. Which state has the highest ratio of credit card agreements per population? Write just the 2-letter abbreviation of the state.
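One possible way to get at this, sketched on invented numbers: divide one column by the other to make a new column, then ask Pandas which index label has the largest value. The column names below are assumptions and the data is made up, so the answer printed here says nothing about the real dataset.

```python
import pandas as pd

# Invented data, indexed by state abbreviation
table = pd.DataFrame(
    {'population': [1000, 6000, 4000], 'agreements': [50, 30, 20]},
    index=['DE', 'MD', 'OR'])

# Dividing two columns works row by row
table['per_person'] = table['agreements'] / table['population']

# idxmax() returns the index label of the largest value
best = table['per_person'].idxmax()
print(best)
```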

6.5 Submit what you have so far

Save your files and submit your work so far:

submit -c=sd212 -p=lab03 lab03.md fees.py states.py cardstates.txt tally.py

or

club -csd212 -plab03 lab03.md fees.py states.py cardstates.txt tally.py

or use the web interface

7 Graph it (10 pts)

Modify your tally.py program so that it displays a scatterplot with population on the x-axis and the credit card agreement count on the y-axis. So each state should appear as a single dot, and (for example) a state on the bottom-right would have a high population and a low count of credit card agreements.

Making the scatterplot using Plotly Express should be very simple if you have your DataFrame worked out from the last part. The second example on this page shows how to make a scatterplot from a dataframe and is a great starting point. A more complete description of the px.scatter() function is here.

Add a trendline to your scatterplot (look at the documentation pages above to find the right option). The trendline will show sort of the “average” relationship between state population and credit card agreements.

Once you are happy with how your graph looks, save it to a file called scatter.png. Then answer two final questions:

  1. What are the most significant outliers in the dataset? In your graph, these would be the states that are farthest from the trendline. What states are way above or below the “typical” population-scaled average?

  2. Choosing one outlier state you identified from the previous problem, try to do some quick research to make a plausible explanation of why that state has so many or so few credit card companies relative to its population.

    (For example, some states have different tax laws, lending regulations, or court systems that may make them more or less attractive to the credit card companies.)

7.1 Submit your work

Save your files and submit your work:

submit -c=sd212 -p=lab03 lab03.md fees.py states.py cardstates.txt tally.py scatter.png

or

club -csd212 -plab03 lab03.md fees.py states.py cardstates.txt tally.py scatter.png

or use the web interface