SD 212 Spring 2024 / Labs


Lab 7: Football ML

1 Overview

This sideline coach is the GOAT! (From 1942)

The goat needs your help! Imagine that Navy’s storied football team wants to know when they can expect to see big plays coming up, and is even worried (gasp!) that their choice of run/pass is becoming too predictable. Naturally, they are reaching out to data scientists to find out.

For today’s lab, you will use an API to download play-by-play statistics from college football games, and then apply machine learning techniques based on our current unit to train models and make predictions on yards gained and run vs pass plays.

The main goal is to practice using an API to download, clean, and process a large dataset, then to wrangle and form that data into suitable inputs for sklearn algorithms. If you get a little bit of accuracy in your predictions, that will be really impressive!

1.1 Deadlines

  • Milestone: 2359 on Monday, 29 April
  • Complete lab: 2359 on Thursday, 2 May

1.2 Ground rules

Refresh your memory on the course policy on collaboration and citing sources

For this lab in particular, we do not want you searching for help on how to do CFBD or football things in general, because we want you to get the practice with thinking through how to use an API and how to improve a machine learning task on your own.

For example, looking up general programming questions like “how to add a column in Pandas” or “best sklearn algorithm for binary classification” would be fair game if you cite any sources you end up using.

But it would not be allowed to search something like, “regression using cfbd data in python”, which would be too specific to this problem we are asking you to work on.

1.3 Learning goals

  • Experience using a key-based web API to download a large dataset in JSON format
  • Practice working with raw data which may contain omissions and inconsistencies
  • Apply domain-specific knowledge to identify potential insights in a dataset
  • Practice using standard machine learning libraries in Python on real data

2 Preliminaries

2.1 API Key

We will be using the CFBD API for this lab.

The first thing you need to do is go to this page and enter your email address in order to get a personal “API key” emailed to you. This is what you will use to access the data source.

BE CAREFUL: Your API key is just for you. Unlike many public APIs, there are no hard limits on how much data you can download with your key. But if you use it way too much (like, making lots of requests over a short time), your key might get throttled or blacklisted. So, be thoughtful and deliberate about when and how often you make API requests with your key.

2.2 Notebook file to fill in

Here is the notebook file for this lab.

If you want, you can use the terminal to download the blank file directly to your lab folder with this command:

wget "https://roche.work/212/lab/fb/lab07.ipynb"

Start by copying in your own API key to the code cell where api_key is defined near the top of the starter file.

2.3 Initial questions

  1. List any sources of help you used other than your instructor and links directly on the course website.

  2. This is our first time trying this lab. What did you think of it?

3 Get you some data (20 pts)

Create a new code cell in your jupyter notebook which does the following:

  • Uses the requests library to access the /plays endpoint of the CFBD API to download play-by-play data for every college football game in test_week for the two years specified by train_year and test_year.

  • Converts these two JSON downloads to Pandas DataFrames, and then cleans them up appropriately to feed data into sklearn

Specifics on how to do this are below, followed by a few questions for you to answer about your DataFrames once you have them.

3.1 Making API calls to CFBD

To make an API call to the CFBD API, you need to use requests to make a .get() request, where you specify (at least):

  • A headers dictionary which specifies your personal API key for authorization
  • A params dictionary specifying whatever specific options you are passing to that endpoint for the request.

For example (using a different API endpoint than the one you actually need to use), here is some Python code to download information about all of the Navy players in 2023:

response = requests.get(
    'https://api.collegefootballdata.com/roster',
    headers = {'Authorization': f'Bearer {api_key}'},   # your personal API key
    params = {                                          # options for this endpoint
        'team': 'Navy',
        'year': 2023,
    },
    verify = False                                      # skip SSL certificate verification
)
data = response.json()   # decode the JSON response into Python lists and dictionaries

For this lab, you need to access the /plays endpoint instead, and probably specify different params.
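
For instance, a request to /plays for one week of one season might look something like this. (This is only a sketch: the 'year' and 'week' parameter names are what we believe the /plays endpoint expects, but double-check them against the CFBD documentation; train_year and test_week come from the starter notebook, and train_plays is just a placeholder name.)

response = requests.get(
    'https://api.collegefootballdata.com/plays',
    headers = {'Authorization': f'Bearer {api_key}'},   # same authorization header as above
    params = {
        'year': train_year,   # season to download
        'week': test_week,    # single week within that season
    },
    verify = False
)
train_plays = response.json()   # a list of dictionaries, one per play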

3.2 Convert to DataFrames

What you should get back from each requests.get() call is a (JSON-encoded) list of dictionaries. Take a look and explore!

We would like to convert these to Pandas dataframes. The easiest way is to use the pd.DataFrame.from_records() function, which exists exactly for situations like this.

You should have two DataFrames, one for your training data and one for the testing dataset.
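
For example, assuming train_plays and test_plays hold the decoded JSON lists from your two API calls (those variable names are just placeholders):

import pandas as pd

# Each dictionary in the list becomes one row; its keys become the column names.
train_df = pd.DataFrame.from_records(train_plays)
test_df = pd.DataFrame.from_records(test_plays)
train_df.head()   # take a look and explore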

3.3 Remove un-useful columns

Remember that for sklearn everything we feed in should be numbers. So we need to eliminate columns in the DataFrames that have other kinds of data, or that have numbers which don’t have any numeric meaning.

Even after removing those non-numeric columns, there is at least one column that has missing entries. Decide whether you want to just remove those rows, or use fillna() to set the missing values to some default.

The ppa column is a specific stat that the CFBD people compute to make predictions on each play. We are trying to make our own predictions here! And anyway it’s missing for many plays. So remove that ppa column as well.

Using Pandas, make copies of your two dataframes that have those un-useful columns removed. (We want a copy because you should save the originals; later we might be able to use some of those non-numeric columns, and we don’t want to have to re-download them.)
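
Here is one rough way to do that cleanup, assuming the train_df and test_df names from the sketch above. (Which columns survive, other than ppa being dropped, will depend on what you actually find in your own data.)

# Keep only the numeric columns, as copies so the originals stay intact.
train_clean = train_df.select_dtypes(include='number').copy()
test_clean = test_df.select_dtypes(include='number').copy()

# We are making our own predictions, so drop CFBD's ppa stat.
train_clean = train_clean.drop(columns=['ppa'])
test_clean = test_clean.drop(columns=['ppa'])

# One option for the remaining missing entries; dropping those rows is another.
train_clean = train_clean.fillna(0)
test_clean = test_clean.fillna(0)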

3.4 Questions about the data

Once your code is working to create the two cleaned-up dataframes for training and testing, answer these specific questions:

  1. How many rows (observations, which in this case are plays) are in your training dataset?

  2. How many columns (features) are in your datasets after your cleaning?

  3. Looking at the original dataframes with all of the textual columns, what is the second-most common play_type that occurs?

3.5 Submit what you have so far

SAVE YOUR NOTEBOOK FILE and submit your work so far:

submit -c=sd212 -p=lab07 lab07.ipynb

or

club -csd212 -plab07 lab07.ipynb

or use the web interface

4 Regression to predict yards gained (25 pts)

Next you will use regression algorithms in sklearn to predict the numerical value in the yards_gained column.

To do this, we will follow some steps which should be familiar to you already:

4.1 Choose a regression algorithm and create the model

For now, just choose a simple (and fast) regression algorithm in sklearn such as LinearRegression.

In later parts of the lab, you will have chances to try out different algorithms to try and make more accurate predictions.

4.2 Fit the model to the training data

Take your training dataframe and split it into two pieces: a “known” features matrix which has everything except the yards_gained column, and then the “known labels” series which is just that single column from the training set.

Feed this matrix and vector into your model’s fit() function to fit the model to that training data.
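
As a minimal sketch, assuming the train_clean DataFrame from the cleaning step and LinearRegression as the simple starter algorithm:

from sklearn.linear_model import LinearRegression

# Known features: everything except the column we are trying to predict.
X_train = train_clean.drop(columns=['yards_gained'])
y_train = train_clean['yards_gained']   # known labels

model = LinearRegression()
model.fit(X_train, y_train)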

4.3 Predict labels for the testing data

Make a new dataframe which removes the yards_gained column from the testing data. Make double-sure that this has exactly the same columns as were in the known features matrix that you fit the model with above.

Now feed that testing matrix into the .predict() function of your trained model, and save the resulting series of predicted values as a new variable. These are the predicted yards_gained values for each row in your testing dataset.
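
Continuing the sketch from above:

# The testing features must line up exactly with the training features.
X_test = test_clean.drop(columns=['yards_gained'])
assert list(X_test.columns) == list(X_train.columns)

predicted_yards = model.predict(X_test)   # one predicted value per testing row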

4.4 Score the accuracy of the prediction

Now for the big question: how good were your predictions?

For this, we will calculate the root mean squared error (RMSE) of the predicted labels compared with the actual yards_gained column from the testing dataset.

You should be familiar with root mean squared error from linear algebra! It’s the square root of the mean of the squared differences between the actual values and the predicted values.

In other words, it’s the L2-norm of the difference between the right answers and the predicted answers, divided by the square root of the number of predictions.

Roughly speaking it gives us an idea of how far off your predictions were. So if you get a value like 50, that means that your typical prediction for the yards_gained was off by about 50 yards. (Hopefully not!)

Fortunately, sklearn provides a function which computes the mean squared error for you.
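
That function is mean_squared_error; taking its square root gives the RMSE. Continuing the sketch, with variable names from above:

import numpy as np
from sklearn.metrics import mean_squared_error

actual_yards = test_clean['yards_gained']
rmse = np.sqrt(mean_squared_error(actual_yards, predicted_yards))
print(rmse)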

4.5 Answer questions

  1. What took longer in Python - fitting the model to the training data, or predicting the labels on the testing data? Why do you think that is?

  2. What was the RMSE of your predictions?

  3. How good do you think that is? Would your predictions be useful information for a football coach?

4.6 Submit what you have so far

SAVE YOUR NOTEBOOK FILE and submit your work so far:

submit -c=sd212 -p=lab07 lab07.ipynb

or

club -csd212 -plab07 lab07.ipynb

or use the web interface

4.7 Milestone

For this lab, the milestone means everything up to this point.

Keep it up!

5 Classification to predict rushing plays (25 pts)

Now let’s turn to a second machine learning task: trying to predict whether each play will be some sort of rushing play.

5.1 Add a binary column for is_rush

First you will need to use Pandas to create a column is_rush, both in your training and testing dataframes, based on the play type from the original data. This should be a boolean column of True/False or 1/0 to indicate simply whether each play was a rushing play.

Note: There are two types of plays that correspond to a rush, which are a normal rushing play and a rushing touchdown. Make sure you include both of them.

(Also, be careful about where in your notebook these new columns are created and removed! You definitely do not want the is_rush data to feed into the regression model for the previous part; that would kind of be “cheating” since we don’t know if a play is a rush before it starts.)
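
For example, something along these lines might work; the exact play_type strings used here are an assumption, so check the values that actually appear in your original dataframes:

# Assumes the two rushing play types show up as 'Rush' and 'Rushing Touchdown'.
rush_types = ['Rush', 'Rushing Touchdown']
train_df['is_rush'] = train_df['play_type'].isin(rush_types).astype(int)
test_df['is_rush'] = test_df['play_type'].isin(rush_types).astype(int)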

5.2 Create model, fit, and predict

These steps will be the same as before, except this time you should choose an [sklearn algorithm](https://scikit-learn.org/stable/supervised_learning.html) for classification, not for regression.

(More specifically, this is a binary classification problem, since the label you are trying to predict has to be 0 or 1.)

As before, just choose a simple and fast algorithm for now; you will have the chance to try different things and improve on it later.
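
Here is a minimal sketch, using LogisticRegression as one simple option and reusing the feature matrices from the regression sketch. (This assumes your cleaned dataframes still have the same rows as the originals, for example because you used fillna() rather than dropping rows.)

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)   # any sklearn classifier follows this same pattern
clf.fit(X_train, train_df['is_rush'])     # same features as before, but is_rush labels
predicted_rush = clf.predict(X_test)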

5.3 Score your classifier

We used RMSE to score the regression task from before, but that doesn’t make sense for binary classification where the predictions are either right or wrong (and nothing in between).

Instead, we will use a confusion matrix to check the accuracy of your results, which is a standard tool that shows you how often your prediction was correct and incorrect, compared to the actual value being true or false. Basically, you would ideally want 100% in the top-left and bottom-right corners (true negatives and true positives, respectively).

Use the following lines in a Jupyter notebook cell to display the confusion matrix for your rushing play predictions, replacing ACTUAL_VALUES and PREDICTED_VALUES as needed:

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(ACTUAL_VALUES, PREDICTED_VALUES, normalize='true')

5.4 Answer questions

  1. What is your classifier’s accuracy on true negatives? (Top-left corner in the normalized confusion matrix, should be a number between 0 and 1)

  2. What is your classifier’s accuracy on true positives? (Bottom-right corner)

  3. Interpret the results you are seeing. Is your classifier useful? Where is it better or worse? What makes this an easy or challenging task for machine learning?

5.5 Submit what you have so far

SAVE YOUR NOTEBOOK FILE and submit your work so far:

submit -c=sd212 -p=lab07 lab07.ipynb

or

club -csd212 -plab07 lab07.ipynb

or use the web interface

6 Get more training data (15 pts)

Most machine learning algorithms get greater accuracy when you can feed them a larger volume of (high-quality) data.

Augment your training data by incorporating all weeks from train_year instead of just a single week. (Note, in this API there are 15 weeks numbered 1 through 15.)

You will need to go back to the beginning of the lab. Now instead of just doing a single requests.get() to download the training data JSON, you will need to do it in a loop for every week of that year. Then you will need to combine these into one huge training dataframe.
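
A sketch of that loop, again assuming the 'year' and 'week' parameter names for /plays (and remember to be thoughtful with your key: this makes 15 requests, so avoid re-running the cell more than you need to):

weekly_frames = []
for week in range(1, 16):   # weeks 1 through 15
    response = requests.get(
        'https://api.collegefootballdata.com/plays',
        headers = {'Authorization': f'Bearer {api_key}'},
        params = {'year': train_year, 'week': week},
        verify = False
    )
    weekly_frames.append(pd.DataFrame.from_records(response.json()))

train_df = pd.concat(weekly_frames, ignore_index=True)   # one huge training dataframe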

Then go back and re-run the regression and classification tasks from before using this larger training dataset. Go through and update your answers to the previous questions according to your new dataset.

6.1 Submit what you have so far

SAVE YOUR NOTEBOOK FILE and submit your work so far:

submit -c=sd212 -p=lab07 lab07.ipynb

or

club -csd212 -plab07 lab07.ipynb

or use the web interface

7 Make it better! (15 pts)

The last part of the lab is open-ended, but crucially important!

We want you to try different things to improve the accuracy of your regression and classification tasks above.

Here are some sorts of things you can try:

  • Use a different regression or classification algorithm from sklearn. (Hint: read the great documentation they have on the various methods for supervised learning!)

    For some algorithms, there are also some optional parameters you can specify when creating the model, that will affect how well it does.

  • Use some preprocessing such as StandardScaler in a pipeline so that numerically-larger numbers in the training data don’t get undue emphasis. (A minimal pipeline sketch follows this list.)

  • Add in some more features to each row that could be relevant to rushing yards and/or the chance of seeing a rush play. For example, could you make some 1/0 or other numerical features based on the string-valued columns you deleted earlier? Or can you incorporate the same team’s performance on recent previous plays?
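
As an example of the pipeline idea from the second bullet above, here is a minimal sketch. LogisticRegression is just a placeholder for whichever model you are experimenting with, and the feature/label names come from the earlier sketches:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale every feature to zero mean and unit variance before it reaches the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, train_df['is_rush'])
predicted_rush = pipe.predict(X_test)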

Once again, you should re-run your regression and classification and update your answers to those questions as needed.

7.1 Questions

Just two more questions:

  1. What change(s) did you make to improve the regression task? How much (if at all) did it improve the RMSE? Why did you think it would help?

  2. Same as the previous question, but for the classification task.

7.2 Submit your work

SAVE YOUR NOTEBOOK FILE and submit your work so far:

submit -c=sd212 -p=lab07 lab07.ipynb

or

club -csd212 -plab07 lab07.ipynb

or use the web interface