Lab 7: Football ML
1 Overview
The goat needs your help! Imagine that Navy’s storied football team wants to know when they can expect to see big plays coming up, and is even worried (gasp!) that their choice of run/pass is becoming too predictable. Naturally, they are reaching out to data scientists to find out.
For today’s lab, you will use an API to download play-by-play statistics from college football games, and then apply machine learning techniques based on our current unit to train models and make predictions on yards gained and run vs pass plays.
The main goal is to practice using an API to download, clean, and process a large dataset, then to wrangle and shape that data into suitable inputs for sklearn algorithms. If you get even a little bit of accuracy in your predictions, that will be really impressive!
1.1 Deadlines
- Milestone: 2359 on Monday, 29 April
- Complete lab: 2359 on Thursday, 2 May
1.2 Ground rules
Refresh your memory on the course policy on collaboration and citing sources.
For this lab in particular, we do not want you searching for help on how to use CFBD or on football-specific topics, because we want you to practice thinking through how to use an API and how to improve a machine learning task on your own.
For example, looking up general programming questions like “how to add a column in Pandas” or “best sklearn algorithm for binary classification” would be fair game if you cite any sources you end up using.
But it would not be allowed to search something like, “regression using cfbd data in python”, which would be too specific to this problem we are asking you to work on.
1.3 Learning goals
- Experience using a key-based web API to download a large dataset in
JSON format
- Practice working with raw data which may contain omissions and
inconsistencies
- Apply domain-specific knowledge to identify potential insights in a
dataset
- Practice using standard machine learning libraries in Python on real
data
2 Preliminaries
2.1 API Key
We will be using the CFBD API for this lab.
The first thing you need to do is go to this page and enter your email address in order to get a personal “API key” emailed to you. This is what you will use to access the data source.
BE CAREFUL: Your API key is just for you. Unlike many public APIs, there are no hard limits on how much data you can download with your key. But if you use it way too much (like, making lots of requests over a short time), your key might get throttled or blacklisted. So, be thoughtful and certain about when and how often you are making API requests with your key.
2.2 Notebook file to fill in
Here is the notebook file for this lab.
If you want, you can use the terminal to download the blank file directly to your lab folder with this command:
wget "https://roche.work/212/lab/fb/lab07.ipynb"
Start by copying in your own API key to the code cell where api_key
is defined near the top of the starter file.
2.3 Initial questions
List any sources of help you used other than your instructor and
links directly on the course website.
This is our first time trying this lab. What did you think of it?
3 Get you some data (20 pts)
Create a new code cell in your Jupyter notebook which does the following:
- Uses the requests library to access the /plays endpoint of the CFBD API to download play-by-play data for every college football game in test_week, for the two years specified by train_year and test_year.
- Converts these two JSON downloads to Pandas DataFrames, and then cleans them up appropriately to feed data into sklearn.
Specifics on how to do this are below, followed by a few questions for you to answer about your DataFrames once you have them.
3.1 Making API calls to CFBD
To make an API call to the CFBD API, you need to use requests to make a .get() request, which specifies (at least):
- A headers dictionary which specifies your personal API key for authorization
- A params dictionary specifying whatever specific options you are passing to that endpoint for the request.
For example (using a different API endpoint than the one you actually need to use), here is some Python code to download information about all of the Navy players in 2023:
response = requests.get(
'https://api.collegefootballdata.com/roster',
headers = {'Authorization': f'Bearer {api_key}'},
params = {
'team': 'Navy',
'year': 2023,
},
verify = False
)
data = response.json()
For this lab, you need to access the /plays endpoint instead, and probably specify different params.
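Putting that together for the /plays endpoint might look like the sketch below. The parameter names 'year' and 'week' are assumptions based on the roster example above, so check the endpoint documentation for the options your request actually needs. The code only defines the function; nothing is downloaded until you call it with your own key.

```python
import requests

def fetch_plays(api_key, year, week):
    """Download play-by-play JSON for one year/week from the CFBD /plays
    endpoint. The 'year' and 'week' parameter names are assumptions;
    check the API docs for the options you need."""
    response = requests.get(
        'https://api.collegefootballdata.com/plays',
        headers={'Authorization': f'Bearer {api_key}'},
        params={'year': year, 'week': week},
    )
    response.raise_for_status()   # fail loudly on a bad key or bad request
    return response.json()        # a list of dictionaries, one per play
```

Remember the warning above about your API key: call this sparingly, and save the result in a variable rather than re-requesting.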
3.2 Convert to DataFrames
What you should get back from each requests.get()
call is a
(JSON-encoded) list of dictionaries. Take a look and explore!
We would like to convert these to Pandas dataframes. The easiest way is
to use the
pd.DataFrame.from_records()
function,
which exists exactly for situations like this.
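As a tiny self-contained illustration of from_records, here is a hand-made list of dictionaries standing in for the API response (the field names are made up, not the real CFBD ones):

```python
import pandas as pd

# A small stand-in for the list-of-dictionaries the API returns;
# the field names here are invented for illustration only.
records = [
    {'offense': 'Navy', 'down': 1, 'distance': 10, 'yards_gained': 4},
    {'offense': 'Navy', 'down': 2, 'distance': 6, 'yards_gained': 23},
]

df = pd.DataFrame.from_records(records)
print(df.shape)   # (2, 4): one row per play, one column per dictionary key
```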
You should have two DataFrames, one for your training data and one for the testing dataset.
3.3 Remove un-useful columns
Remember that for sklearn
everything we feed in should be numbers. So
we need to eliminate columns in the DataFrames that have other kinds of
data, or that have numbers which don’t have any numeric meaning.
Even after removing those non-numeric columns, there is at least one
column that has missing entries. Decide whether you want to just remove
them, or use fillna()
to set them to some default value.
The ppa
column is a specific stat that the CFBD people compute to make
predictions on each play. We are trying to make our own predictions
here! And anyway it’s missing for many plays. So remove that ppa
column as well.
Using Pandas, make copies of your two dataframes that have those un-useful columns removed. (We want a copy because you should save the originals; later we might be able to use some of those non-numeric columns, and we don’t want to have to re-download them.)
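One way the cleanup could go, sketched here on a toy dataframe with hypothetical column names, is to keep only the numeric columns, drop ppa, and fill any remaining gaps:

```python
import pandas as pd

# Toy dataframe with made-up columns to illustrate the cleanup steps.
df = pd.DataFrame({
    'offense': ['Navy', 'Army'],     # text: not usable by sklearn
    'down': [1, 2],
    'yards_gained': [4.0, None],     # a missing entry
    'ppa': [0.3, None],              # drop per the instructions
})

clean = df.select_dtypes(include='number').copy()  # numeric columns only (a copy!)
clean = clean.drop(columns=['ppa'])                # remove the ppa column
clean = clean.fillna(0)                            # or .dropna(), your choice
print(list(clean.columns))   # ['down', 'yards_gained']
```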
3.4 Questions about the data
Once your code is working to create the two cleaned-up dataframes for training and testing, answer these specific questions:
How many rows (observations, which in this case are plays) are in your training dataset?
How many columns (features) are in your datasets after your cleaning?
Looking at the original dataframes with all of the textual columns, what is the second-most common play_type that occurs?
3.5 Submit what you have so far
SAVE YOUR NOTEBOOK FILE and submit your work so far:
submit -c=sd212 -p=lab07 lab07.ipynb
or
club -csd212 -plab07 lab07.ipynb
or use the web interface
4 Regression to predict yards gained (25 pts)
Next you will use regression algorithms in sklearn
to predict the numerical value in the yards_gained
column.
To do this, we will follow some steps which should be familiar to you already:
4.1 Choose a regression algorithm and create the model
For now, just choose a simple (and fast) regression algorithm in sklearn such as LinearRegression.
In later parts of the lab, you will have chances to try out different algorithms to try and make more accurate predictions.
4.2 Fit the model to the training data
Take your training dataframe and split it into two pieces:
a “known” features matrix which has everything except the
yards_gained
column, and then the “known labels” series which is just
that single column from the training set.
Feed this matrix and vector into your model’s fit()
function to fit
the model to that training data.
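The split-and-fit step might look like this sketch, using a tiny stand-in training dataframe (the feature column names are placeholders):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny stand-in training dataframe; column names are placeholders.
train = pd.DataFrame({
    'down': [1, 2, 3, 1],
    'distance': [10, 6, 3, 10],
    'yards_gained': [4, 23, 2, 7],
})

X_train = train.drop(columns=['yards_gained'])   # "known" features matrix
y_train = train['yards_gained']                  # "known labels" series

model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_.shape)   # (2,): one learned coefficient per feature
```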
4.3 Predict labels for the testing data
Make a new dataframe which removes the yards_gained
column from the
testing data. Make double-sure that this has exactly the same columns as
were in the known features matrix that you fit the model with above.
Now feed that testing matrix into the .predict()
function of your
trained model, and save the resulting series of predicted values as a
new variable. These are the predicted yards_gained
values for each row
in your testing dataset.
4.4 Score the accuracy of the prediction
Now for the big question: how good were your predictions?
For this, we will calculate the root mean squared error (RMSE) of the
predicted labels compared with the actual yards_gained
column from
the testing dataset.
You should be familiar with root mean squared error from linear algebra! It’s the square root of the mean of the squared differences between the actual values and the predicted values.
In other words, it’s the L2-norm of the difference between the right answers and the predicted answers, scaled down by the square root of the number of predictions.
Roughly speaking it gives us an idea of how far off your predictions were. So if you get a value like 50, that means that your typical prediction for the
yards_gained
was off by 50 yards. (Hopefully not!)
Fortunately, sklearn
provides a
function which computes the mean squared error for you.
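Here is a small sketch of the RMSE computation using sklearn's mean_squared_error, with made-up actual and predicted values standing in for your test labels and model output:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual    = np.array([4, 23, 2, 7])   # real yards_gained from the test set
predicted = np.array([5, 20, 3, 6])   # what the model's .predict() returned

# mean_squared_error gives the MSE; take the square root for RMSE.
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(round(rmse, 3))   # 1.732
```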
4.5 Answer questions
What took longer in Python - fitting the model to the training data,
or predicting the labels on the testing data? Why do you think that
is?
What was the RMSE of your predictions?
How good do you think that is? Would your predictions be useful
information for a football coach?
4.6 Submit what you have so far
SAVE YOUR NOTEBOOK FILE and submit your work so far:
submit -c=sd212 -p=lab07 lab07.ipynb
or
club -csd212 -plab07 lab07.ipynb
or use the web interface
4.7 Milestone
For this lab, the milestone means everything up to this point.
Keep it up!
5 Classification to predict rushing plays (25 pts)
Now let’s turn to a second machine learning task: trying to predict whether each play will be some sort of rushing play.
5.1 Add a binary column for is_rush
First you will need to use Pandas to
create a column is_rush
, both in your training
and testing dataframes, based on the play type from the original data.
This should be a boolean column of True/False or 1/0 to indicate simply
whether each play was a rushing play.
Note: There are two types of plays that correspond to a rush, which are a normal rushing play and a rushing touchdown. Make sure you include both of them.
(Also, be careful about where in your notebook these new columns are
created and removed! You definitely do not want the is_rush
data to
feed into the regression model for the previous part; that would kind of
be “cheating” since we don’t know if a play is a rush before it starts.)
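Creating the new column could look like the sketch below. The two rushing play_type strings here are assumptions; check what actually appears in your data (for example with value_counts()) and adjust.

```python
import pandas as pd

# Toy stand-in for the original dataframe's play_type column.
df = pd.DataFrame({
    'play_type': ['Rush', 'Pass Reception', 'Rushing Touchdown', 'Punt'],
})

# These two strings are guesses at the rushing play types; verify them
# against your own data before relying on this.
rush_types = {'Rush', 'Rushing Touchdown'}
df['is_rush'] = df['play_type'].isin(rush_types).astype(int)
print(df['is_rush'].tolist())   # [1, 0, 1, 0]
```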
5.2 Create model, fit, and predict
These steps will be the same as before, except this time you should choose an [sklearn algorithm](https://scikit-learn.org/stable/supervised_learning.html) for classification, not for regression.
(More specifically, this is a binary classification problem, since the label you are trying to predict has to be 0 or 1.)
As before, just choose a simple and fast algorithm for now; you will have the chance to try different things and improve on it later.
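As a minimal sketch, with placeholder feature names and LogisticRegression chosen as just one of many reasonable simple classifiers:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training data; column names are placeholders.
train = pd.DataFrame({
    'down':     [1, 2, 3, 4, 1, 3],
    'distance': [10, 6, 1, 1, 10, 8],
    'is_rush':  [0, 0, 1, 1, 0, 1],
})

X = train.drop(columns=['is_rush'])   # features
y = train['is_rush']                  # binary labels

clf = LogisticRegression()
clf.fit(X, y)
preds = clf.predict(X.head(2))   # each prediction is 0 or 1
print(preds.shape)   # (2,)
```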
5.3 Score your classifier
We used RMSE to score the regression task from before, but that doesn’t make sense for binary classification where the predictions are either right or wrong (and nothing in between).
Instead, we will use a confusion matrix to check the accuracy of your results, which is a standard tool that shows you how often your prediction was correct and incorrect, compared to the actual value being true or false. Basically, you would ideally want 100% in the top-left and bottom-right corners (true negatives and true positives, respectively).
Use the following lines in a Jupyter notebook cell to display the
confusion matrix for your rushing play predictions, replacing
ACTUAL_VALUES
and PREDICTED_VALUES
as needed:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(ACTUAL_VALUES, PREDICTED_VALUES, normalize='true')
5.4 Answer questions
What is your classifier’s accuracy on true negatives? (Top-left
corner in the normalized confusion matrix, should be a number
between 0 and 1)
What is your classifier’s accuracy on true positives?
(Bottom-right corner)
Interpret the results you are seeing. Is your classifier useful?
Where is it better or worse? What makes this an easy or challenging
task for machine learning?
5.5 Submit what you have so far
SAVE YOUR NOTEBOOK FILE and submit your work so far:
submit -c=sd212 -p=lab07 lab07.ipynb
or
club -csd212 -plab07 lab07.ipynb
or use the web interface
6 Get more training data (15 pts)
Most machine learning algorithms get greater accuracy when you can feed them a larger volume of (high-quality) data.
Augment your training data by incorporating all weeks from
train_year
instead of just a single week. (Note, in this API there are
15 weeks numbered 1 through 15.)
You will need to go back to the beginning of the lab. Now instead of
just doing a single requests.get()
to download the training data JSON,
you will need to do it in a loop for every week of that year. Then you
will need to combine these into one huge training dataframe.
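The loop-and-combine step might look like this sketch, where fetch_week is a hypothetical stand-in for your actual requests.get() call (here it returns fake records so the sketch runs offline):

```python
import pandas as pd

def fetch_week(year, week):
    """Stand-in for your requests.get() call to the /plays endpoint;
    it returns made-up records so this sketch works without the API."""
    return [{'week': week, 'yards_gained': week * 2}]

frames = []
for week in range(1, 16):                # weeks 1 through 15
    records = fetch_week(2023, week)     # one API call per week
    frames.append(pd.DataFrame.from_records(records))

# Stack all the per-week dataframes into one big training dataframe.
train_df = pd.concat(frames, ignore_index=True)
print(len(train_df))   # 15
```

Be careful here with the warning about your API key: run the download loop once, save the combined dataframe in a variable, and avoid re-running that cell.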
Then go back and re-run the regression and classification tasks from before using this larger training dataset. Go through and update your answers to the previous questions according to your new dataset.
6.1 Submit what you have so far
SAVE YOUR NOTEBOOK FILE and submit your work so far:
submit -c=sd212 -p=lab07 lab07.ipynb
or
club -csd212 -plab07 lab07.ipynb
or use the web interface
7 Make it better! (15 pts)
The last part of the lab is open-ended, but crucially important!
We want you to try different things to improve the accuracy of your regression and classification tasks above.
Here are some sorts of things you can try:
- Use a different regression or classification algorithm from sklearn. (Hint: read the great documentation they have on the various methods for supervised learning!) For some algorithms, there are also optional parameters you can specify when creating the model that will affect how well it does.
- Use some preprocessing such as StandardScaler in a pipeline so that numerically-larger columns in the training data don’t get undue emphasis.
- Add in some more features to each row that could be relevant to rushing yards and/or the chance of seeing a rush play. For example, could you make some 1/0 or other numerical features based on the string-valued columns you deleted earlier? Or can you incorporate the same team’s performance on recent previous plays?
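A minimal pipeline sketch, with placeholder feature names and LogisticRegression standing in for whichever model you choose:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with placeholder feature names.
X = pd.DataFrame({'down': [1, 2, 3, 4], 'distance': [10, 6, 1, 1]})
y = [0, 0, 1, 1]

# Scaling first keeps large-valued columns from dominating the model;
# the pipeline applies the same scaling automatically at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
preds = pipe.predict(X)
```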
Once again, you should re-run your regression and classification and update your answers to those questions as needed.
7.1 Questions
Just two more questions:
What change(s) did you make to improve the regression task? How much (if at all) did it improve the RMSE? Why did you think it would help?
Same question as the previous one, for the classification task.
7.2 Submit your work
SAVE YOUR NOTEBOOK FILE and submit your work so far:
submit -c=sd212 -p=lab07 lab07.ipynb
or
club -csd212 -plab07 lab07.ipynb
or use the web interface