This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 13: Machine learning with sklearn

1 Overview

In this unit, we will get a bird’s-eye view of what machine learning is and how Python’s scikit-learn library can be used to make predictions and fit models to data.

You should not expect to become well-versed in machine learning or sklearn in one week! In fact, as a Data Science major you will have two courses next year, Statistical Learning (SM317) and Machine Learning (SD312), that are entirely devoted to these topics.

Rather, the goal for our brief look at this now is to introduce some of the language and terminology of machine learning, and to see how our Pandas data wrangling skills can be used to set up the kinds of problems that powerful machine learning tools can solve.

At the end of this unit, you should:

  • Know the difference between supervised and unsupervised learning.
  • Understand what regression, classification, and clustering are.
  • Know how the terms model, fit, features, and labels are used in machine learning.
  • Be able to manipulate Pandas DataFrames to make them suitable as input to sklearn algorithms.
  • Know how to run some models for regression, classification, and clustering in relatively simple cases (without a deep understanding of how to choose models or model parameters).
  • Understand the role of testing and training data in fitting and evaluating machine learning models.

2 Resources

  • Python for Data Analysis, Chapter 13: Modeling Libraries in Python

    This chapter at the end of the P4DA book (by the original author of Pandas) looks at a few different statistical modeling and machine learning libraries available.

    Check out the first section on interfacing between Pandas and model code, and the short section introducing sklearn.

    (Skip the parts about patsy and statsmodels.)

  • Python Data Science Handbook

    This is a book that we haven’t used before, but it covers a lot of the material we have learned over the last two semesters about coding in Python, using Pandas, and making visualizations.

    For this unit, there are two chapters to focus on:

    • Chapter 37: What is Machine Learning?

      A very nice introduction to the terminology of machine learning, such as supervised vs. unsupervised learning, regression/classification/clustering, model, fit, features, labels, testing, and training. Lots of visuals help give some intuition for what these terms mean.

    • Chapter 38: Introducing Scikit-Learn

      Goes into how the sklearn library can actually be used, with a few concrete examples.

  • An introduction to learning with scikit-learn

    A short tutorial on the vocabulary of machine learning as used in the sklearn library.

  • Scikit-learn user guide

    This is the ultimate reference for how to use the sklearn library.

    Unlike with some other Python libraries, the authors have made tremendous efforts to make the documentation readable and accessible, with lots of examples and explanations alongside the code itself.

3 ML terminology

With machine learning, our data will always be in the form of a numerical matrix. The rows of the matrix are called observations, and they typically correspond to individual entries (maybe people, or foods, or animals, or days, or crimes, etc.) in the dataset.

The columns of the data matrix are called features, and they correspond to one attribute or measurable aspect of each observation. For example, if each observation is a person, then some features might be weight, birth year, annual income, number of times they have flown on an airplane, etc.

One special column is called the label, and this is the attribute or aspect that the ML algorithm is trying to guess or predict.

There are two main categories of ML algorithms, which determine how we feed in the data and what the algorithm is predicting or guessing.

  • Supervised learning

    In this case, we have some known labels for a lot of observations, and we want to predict that same label for some other observations. For both groups of observations, whether the label is known or not, we have to have measurements for all of the same features.

    For example, if the observations are movies, then we might know the IMDB rating, release date, cast size, number of filming days, etc., for a bunch of movies. Those are the features. Then for most movies, we also know how much money the movie made — that would be the label. We want to predict how much money the other movies will make based on all of those known features.

    There are two sub-categories of supervised learning:

    • Regression: When the labels represent numerical values.

      The “how much money will this movie make” problem just described is an example of a regression problem, because that label (how much money) is a numerical value that we want to predict.

      The simplest (and often very useful) kind of regression is linear regression, which can be expressed as a simple matrix-vector product: the ML algorithm tries to “fit” a vector of coefficients so that the (known observations) matrix times this vector is “close” to the vector of known labels. (See the sketch just after this list.)

    • Classification: When the labels represent categorical values.

      For example, with movies, we might label known movies according to their genre, like 1 for action, 2 for rom-coms, 3 for documentaries, etc. Then, given the same features for some new movies, we want to predict what genre they will have.

      This is different from regression because it wouldn’t make sense, for example, to predict a decimal label like 2.7.

  • Unsupervised learning

    In this case, we are asking the ML algorithm to make up a new label based on a single matrix of observations and known features. In other words, the ML algorithm is being asked to summarize or make a new “feature” based on the information we do know for a bunch of observations.

    There are multiple sub-categories of unsupervised learning, but we will focus on just one:

    • Clustering

      We ask the ML algorithm to divide the known observations into groups or “clusters”. This is the same as saying that we want the new label that the ML algorithm makes up to be a small integer like 0, 1, or 2. Each cluster or group is then the set of observations with the same label.

      For example, given a bunch of economic features (known, measurable values) about all the countries in the world, we can feed this into a clustering algorithm to group together countries that are somehow “similar” according to those features.
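
To make the linear regression idea from above concrete, here is a minimal sketch with made-up numbers, using numpy’s least-squares solver in place of a full ML library. It “fits” a coefficient vector c so that the matrix-vector product X @ c comes out close to the known labels y:

import numpy as np

# matrix of (observations) x (features), with made-up numbers
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 4.0]])
# vector of known labels, one per observation
y = np.array([5.0, 5.0, 8.0])

# fit: find the vector c minimizing the distance from X @ c to y
c, *_ = np.linalg.lstsq(X, y, rcond=None)

# predict: multiply observations by the fitted vector
print(X @ c)  # should be close to y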

You will learn much more about these types of machine learning (and other types like dimensionality reduction and reinforcement learning) in your classes next year! For now, we are not trying to become experts, but just to get our hands dirty trying out some of these techniques and to get a handle on the terminology.

4 Clustering example

Imagine we have a CSV file with information on Midshipmen, with lines like this:

name,company,aom,mom,oom,maj_qpr,qpr,run,pushups,plank,varsity
Prof Roche,11,100,1200,1000,4,3.5,11.5,35,2.1,0
Bill,2,1,1,1,4,4,8,100,4,1
Chad,17,1200,200,800,1.5,2.1,8,110,3.5,1

So each row (Prof Roche, Bill, Chad) is an observation, and each column (aom, qpr, etc.) is a feature. Notice that the features are all numerical; even the True/False feature varsity, representing “are you on a varsity sport”, has been converted to 1/0.

In reality we would need many more observations in order to do any useful machine learning on this data! In class we made up many more rows of fake information; one way to do that is sketched below.
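
For example, here is one possible way (just a sketch of one possibility; all of the column ranges below are guesses, not the actual class data) to generate a larger fakemids.csv using numpy and pandas:

import numpy as np
import pandas as pd

rng = np.random.default_rng(212)   # seeded so the fake data is reproducible
n = 100                            # how many fake mids to create

# all value ranges below are invented for illustration
fake = pd.DataFrame({
    'name':    [f'Mid {i}' for i in range(n)],
    'company': rng.integers(1, 31, size=n),
    'aom':     rng.permutation(n) + 1,            # rankings 1 through n
    'mom':     rng.permutation(n) + 1,
    'oom':     rng.permutation(n) + 1,
    'maj_qpr': rng.uniform(2.0, 4.0, size=n).round(2),
    'qpr':     rng.uniform(2.0, 4.0, size=n).round(2),
    'run':     rng.uniform(8.0, 14.0, size=n).round(1),
    'pushups': rng.integers(20, 121, size=n),
    'plank':   rng.uniform(1.0, 4.0, size=n).round(1),
    'varsity': rng.integers(0, 2, size=n),        # 0/1 for False/True
})
fake.to_csv('fakemids.csv', index=False)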

Here is a program that reads in a CSV file with fake MIDS data like above and tries to group them into clusters using sklearn’s SpectralClustering algorithm:

import pandas as pd
from sklearn.cluster import SpectralClustering

df = pd.read_csv('fakemids.csv')
data = df.iloc[:, 2:] # exclude name and company
print("data:")
print(data)
print()

model = SpectralClustering(n_clusters=3)
model.fit(data)                         # compute the clustering
df['learned_labels'] = model.labels_    # one cluster number per observation
print(df)

Notice a few things:

  • The name and company (first two columns) are removed, because those are not numerical features that can be used for machine learning.

    (Yes the company is a number, but it is still categorical data!)

  • For this algorithm, we have to specify how many groups or “clusters” we want, which we chose as 3. So this is pretty useless unless we have more than 3 observations in the dataset!

  • In this code, we just add the cluster numbers as a new column and print out the resulting dataframe. If you wanted to just see the names of everyone in a certain cluster, like cluster 2, you could do something like:

    print('cluster 2:', df[df['learned_labels'] == 2]['name'].to_list())
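
    Or, continuing from the same program, a quick sketch to print the members of every cluster in a loop:

    # show the names in each of the 3 clusters
    for c in range(3):
        members = df[df['learned_labels'] == c]['name'].to_list()
        print(f'cluster {c}:', members)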

5 Regression example

For supervised learning, remember that we really need three things:

  • A matrix of (known observations) x (known features)
  • A vector of known labels for each of the known observations
  • A matrix of (unknown observations) x (known features). This is important! Remember that the “unknown” here means we don’t know the labels, but we do have to know the features!

The goal of regression or classification will be to predict the labels of the unknown observations (which will be another vector).

As an example, suppose we have a file called foods.csv of information about foods.

The first few lines look like this:

Food,Measure,Grams,Protein,Fat,Sat.Fat,Fiber,Carbs,Calories,Category
Cows' milk,1 qt.,976,32,40,36.0,0.0,48.0,660.0,Dairy products
Milk skim,1 qt.,984,36,0,0.0,0.0,52.0,360.0,Dairy products
Buttermilk,1 cup,246,9,5,4.0,0.0,13.0,127.0,Dairy products
...

Here the “Calories” column will be our label. We want to predict how many calories some other foods have, based on the other features like the amount of protein, fat, etc.

Here is the file mystery.csv with the “mystery” foods whose calorie counts we don’t know:

Food,Measure,Grams,Protein,Fat,Sat.Fat,Fiber,Carbs
Kiwi,1 piece,100,1,0.44,0,3,14
Scrapple,1 slice,25,2.1,3.7,1.3,0.1,3.7

Now we can use sklearn’s LinearRegression algorithm to predict the number of calories in each “mystery” food:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# read in csvs
foods = pd.read_csv('foods.csv')
mystery = pd.read_csv('mystery.csv')

# cut down
fdata = foods.drop(['Food','Measure','Calories','Category'], axis=1)
fcals = foods['Calories']
mdata = mystery.drop(['Food','Measure'], axis=1)

# fit model
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(fdata, fcals)

# make predictions
mystery['predicted_calories'] = model.predict(mdata)

# print results
print(mystery[['Food','predicted_calories']])

Notice a few things here:

  • There are many other (usually better) algorithms for regression besides LinearRegression. But no matter which algorithm is chosen, the same steps of create model, fit, and predict apply.

  • Here we used a pipeline with sklearn’s StandardScaler to do the regression. This rescales each feature to a comparable range, which prevents problems in many algorithms where features with large numerical values would otherwise dominate.

    (Scaling is actually totally unnecessary here, but it can’t hurt and is important for many other regression problems that you might run.)

  • The predicted labels (calorie counts) represent numerical values, so we used regression. If we were trying to predict a category instead, we would want to use a classification algorithm, as in the sketch below.
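
For example, here is a minimal sketch of classification on the same data, predicting each mystery food’s Category instead of its calorie count. (It assumes the same foods.csv and mystery.csv as above; KNeighborsClassifier is just one reasonable choice of classification algorithm.)

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# read in csvs
foods = pd.read_csv('foods.csv')
mystery = pd.read_csv('mystery.csv')

# cut down: same numerical features as before, but now the
# label is the (categorical) Category column
fdata = foods.drop(['Food','Measure','Calories','Category'], axis=1)
fcats = foods['Category']
mdata = mystery.drop(['Food','Measure'], axis=1)

# same create model / fit / predict steps, just with a classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(fdata, fcats)
mystery['predicted_category'] = model.predict(mdata)

print(mystery[['Food','predicted_category']])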