SD 212 Spring 2024 / Notes


Unit 11: Machine learning with sklearn

1 Overview

In this unit, we will get a bird’s-eye overview of what machine learning is and how Python’s scikit-learn library can be used to make predictions and fit models to data.

Before we get into the machine learning part, we will first spend a little time thinking about how to organize the kinds of data (or features within data) that might arise in the life of a data scientist. This organization can help us when approaching a new dataset and thinking about what kinds of analysis or visualizations are most meaningful and appropriate.

You should not expect to become well-versed in machine learning or sklearn in one week! In fact, as a Data Science major you will have two courses next year on Statistical Learning (SM317) and Machine Learning (SD312) that will be entirely spent on these topics.

Rather, the goal for our brief look at this now is to introduce some of the language and terminology of machine learning, and to see how our Pandas data wrangling skills can be used to set up the kinds of problems that powerful machine learning tools can solve.

At the end of this unit, you should:

  • Recognize when a piece of information is quantitative (i.e., numerical) or qualitative (i.e., categorical).
  • Distinguish between continuous and discrete numerical data, and between ordinal and nominal categorical data.
  • Understand the limits of categorical data in terms of simple visualizations and machine learning.
  • Know the difference between supervised and unsupervised learning.
  • Understand what regression, classification, and clustering are.
  • Know how the terms model, fit, features, labels are used in machine learning.
  • Be able to manipulate Pandas DataFrames to make them suitable for input to sklearn algorithms.
  • Know how to run some models for regression, classification, and clustering in relatively simple cases (without a deep understanding of how to choose models or model parameters).
  • Understand the role of testing and training data in fitting and evaluating machine learning models.

2 Resources

  • Python for Data Analysis, Chapter 13: Modeling Libraries in Python

    This chapter at the end of the P4DA book (by the original author of Pandas) looks at a few different statistical modeling and machine learning libraries available.

    Check out the first section on interfacing between Pandas and model code, and the short section introducing sklearn.

    (Skip the parts about patsy and statsmodels.)

  • Python Data Science Handbook

    This is a book we haven’t looked at before, but it covers a lot of useful material related to what we have learned over the last two semesters: coding with Python, using Pandas, and making visualizations.

    For this unit, there are two chapters to focus on:

    • Chapter 37: What is Machine Learning?

      A very nice introduction to the terminology of machine learning such as supervised vs unsupervised, regression/classification/clustering, model, fit, features, labels, testing, and training. Lots of visuals help give some intuition for what these things mean.

    • Chapter 38: Introducing Scikit-Learn

      Goes into how the sklearn library can actually be used, with a few concrete examples.

  • An introduction to learning with scikit-learn

    A short tutorial on the vocabulary of machine learning as used in the sklearn library.

  • Scikit-learn user guide

    This is the ultimate reference for how to use the sklearn library.

    Unlike with some other Python libraries, the authors have made tremendous efforts to make the documentation readable and accessible, with lots of examples and explanations alongside the code itself.

3 Statistical data types

3.1 Python type vs statistical type

When studying Python programming, we frequently think about the types of variables and values: str, int, float, bool, list, dict, and so on. This very much controls what kind of operations we are able to perform on a given value. For example, the meaning of

x[3]

is very different if x is a string, list, dict, or Pandas DataFrame, and it will be an error if x is a number.
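
For example, here is a quick illustration of how the same x[3] syntax behaves for a few different types:

s = 'hello'
print(s[3])    # 'l', the character at index 3

lst = [10, 20, 30, 40]
print(lst[3])  # 40, the element at index 3

d = {3: 'three', 'a': 'alpha'}
print(d[3])    # 'three', the value stored under the key 3

n = 42
# n[3] raises a TypeError: 'int' object is not subscriptable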

While Python types govern how information is represented and what we can do with it in a Python program, we can also think about the type of information that is actually represented in the real world by that variable. These are sometimes called “statistical data types” or “feature types”.

For example, the Python string "April" could represent someone’s name, or it could represent the fourth month of the Gregorian calendar. How we treat that string in our data processing and analysis should be very different depending on this context.

As another example: the number 17 could mean a lot of things: it could be a company number in Bancroft Hall (which is useful for grouping but has no inherent numerical meaning), or it could refer to the year 2017 (or 1917?), or it could be an air temperature measurement. These distinctions are not captured by the Python type, but we need to be aware of them when performing data analysis and visualization.

3.2 Categorical, Ordinal, continuous, discrete

There are many different ways to define a hierarchy of statistical data types, but most commonly there are two groups with two sub-groups each:

  • Numerical data (a.k.a. quantitative)

    This kind of data is for things that can be measured in some way, and where the number itself actually has meaning. Doing math with these kinds of values makes sense, for example computing an average value.

    There are two sub-types:

    • Continuous data: When any real number makes sense as a fine-grained measurement. These often correspond to physical measurements like distance, weight, etc. Ratios and percentages also typically fit here.

    • Discrete-valued data: When the only values that make sense are integers. For example, a count of how many people attended a concert would be a discrete-valued numerical value. It’s certainly measuring something where an average or total could make sense, but you can’t have 3.25 people.

    A special case here is time series data (i.e., datetimes). These are always numerical and typically continuous. If the data only specifies the day or year, it may be discrete rather than continuous.

    (Interestingly, the internal representation of a datetime is typically something like the number of seconds elapsed since January 1, 1970, which makes it clear that this is a numerical quantity like any other; see the short snippet at the end of this list.)

  • Categorical data (a.k.a. qualitative)

    This is the type of data which is not numerical, or where the number doesn’t have any meaning as a number. Typical examples would be things like names, or colors, or cities.

    Note that data can be categorical even if it’s represented by a number; for example a phone number, zip code, or company number. Again, it can be helpful to think about whether math makes sense: if MIDN X has alpha 268136 and MIDN Y has alpha 259454, does the average of these numbers mean anything at all? Of course not, because alpha numbers are categorical and not numeric.

    Again, there are two sub-types:

    • Ordinal data: When the values have some natural ranking or ordering. A good example would be class rank.

    • Nominal data: The values are just values, with no inherent ranking or relationship. Names of people, places, or things certainly fit here, as do many other groupings like countries, species, colors, etc.

    A special case of categorical data is a unique identifier, which is some feature in a dataset which by definition will be different for every entry in the dataset. Midshipman alphas would be a classic example, or account numbers in a bank, or usernames on a website, or city names within a single state. Notice that unique identifiers are sometimes numbers and sometimes strings, but are (almost) never numeric because they aren’t measuring anything.
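
Going back to the note about datetimes above, here is a minimal snippet showing the internal numerical representation of a timestamp in Pandas. (Pandas happens to count in nanoseconds rather than seconds, but the idea is the same.)

import pandas as pd

t = pd.Timestamp('2024-04-01 12:00')
print(t.value)   # nanoseconds elapsed since January 1, 1970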

3.3 Recognizing statistical types

We can get some help identifying statistical types by looking at the data. Are all the entries numbers? Are some of them decimals, or only integers? If the data is strings, are they all the same, or do they divide into a small number of groups?

Answering some of these questions can be aided by tools you already know. For example, determining how many distinct values exist and how many times each is repeated can be accomplished on the command line with a pipeline ending in sort | uniq -c, and you can do the same on a Pandas series by calling .value_counts().
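
For example, here is a small made-up Series of survey responses and what .value_counts() reveals about it:

import pandas as pd

# made-up survey responses
responses = pd.Series(['sometimes', 'always', 'sometimes',
                       'never', 'sometimes', 'always'])

# how many distinct values are there, and how often is each repeated?
print(responses.value_counts())
# sometimes appears 3 times, always 2, never 1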

But ultimately we need to understand the meaning behind what the raw data is. Consider a few examples:

  • Cat species: lion, tiger, leopard, …

    This is clearly nominal, categorical data. There is no numerical value here, nor can we put these in any meaningful order.

    (Note, there still may be ways to compare some aspects of these categories, like whether there are more lions or tigers in the world, or which one can run faster. But that is not the same as the animal name itself!)

  • Survey responses: Do you eat cheese “always”, “sometimes”, or “never”?

    This is ordinal, categorical data. It’s easy to say which category is more or less than another, but the difference between “always” and “sometimes” is not well defined.

    Would you be able to say an “always” respondent eats twice as much cheese as a “sometimes” person? Not necessarily!

  • Finishing places in a race like 1st, 2nd, 3rd, etc.

    This is another example of ordinal data.

    It might seem that this data is numeric since there are numbers, but the numbers don’t tell us anything concrete except the ordering. The difference between 1st and 2nd place is not necessarily the same as the difference between 2nd and 3rd place. The person who finished 50th is probably not twice as slow as the person who finished 25th.

  • How many times you have been swimming in your life?

    This is discrete numerical data.

    It’s numeric because it makes sense to do math with it: saying you have been swimming 3x as many times as me means that the number itself is 3x larger.

    But the data is also discrete because fractional values aren’t possible. You can’t go swimming 3.7 times; that would never make sense.

    (By contrast, the number of minutes you have spent in a pool would be continuous.)

  • How many miles per hour were you driving?

    This is continuous numerical data. Notice that this might at first seem similar to the previous example: the question is “how many” and we usually measure the speed of cars in whole numbers. But you could imagine going, say, 25.6 mph. Even if the speedometer only displays integers, and even if the speed limits are always multiples of 5, that doesn’t mean those are the only possible speeds that could exist.

As with any way of categorizing things, there will be edge cases or things that seem to partially fit into more than one category, and that’s okay! This system of categorization is meant to be a useful starting point to help us think about what’s possible, not a precise test that demands a precise answer in all cases.

For example, consider:

  • Your QPR for a single class. (4.0 for A, 3.7 for A-, 3.3 for B+, etc.)

    This one is tricky! It may seem to be continuous/numerical because we have decimal points, but it is definitely not continuous since intermediate values such as 3.5 are just not possible (for a single class).

    In fact, you could argue that this is not even numerical. Does a C (2.0) student know twice as much as a D (1.0) student? Certainly we know the C student did better in the class, but can we say how much better based on the QPR? I’m not sure!

    On the other hand, we definitely do take averages of individual class QPRs, so it would seem that they have some numerical meaning, or at least are supposed to.

    Basically, QPR for a single class is at least ordinal, but is kind of in a gray area between numerical/discrete and categorical. Or it might be best to say it is an attempt at making (categorical) grades into numerical quantities.

3.4 Why does it matter?

Understanding statistical data types is useful to a data scientist in at least two ways: knowing what kind of analysis makes sense with your data, and knowing how best to visualize your data.

You will study both of those questions in much more detail in later courses, so for now we focus mostly on the big picture. The main pitfall is using categorical data as if it is numerical, both in analysis and visualizations. I’ll give two examples here; be on the lookout for more!

Bad Analysis Example

Imagine we are interested in the academic performance of varsity athletes. We find that the average class rank of a tennis player is 321.6 and the average class rank of a wrestler is 357.5. Can we say the tennis players are doing better academically on average than the wrestlers?

Not necessarily! It’s possible that, say, class ranks 100 through 400 all have almost the exact same QPR of 3.5, but then this drops off dramatically past 400.

Then the tennis players might be split between half having a high class rank around 120 with 3.5 QPR, and the other half having a low rank around 520 with around a 2.5. The overall average QPR of the tennis team would be 3.0.

But the wrestling team could all be between 350-360 class rank, all with a 3.5 QPR.

In other words, the better average class rank does not necessarily mean a higher average QPR. This is about treating ordinal data (rank) as if it is numeric.

As a rule, you should never be doing math with categorical data: taking averages, summing values up, comparing differences, etc. If you are doing that, it’s likely that you are mistreating categorical data as numeric.

Bad Visualization Example

This one actually comes to us from one of your DSITW submissions:

[Figure: a line graph with team names along the x-axis, jagged lines for 2016 and 2018, and dotted trend lines. Source: Nathan Piccini, Data Science Dojo]

Notice the striking erratic nature of the lines in this graph. What is happening?

We seem to have a confusion of nominal and numerical data. The x-axis is team names, which are obviously not numerical, not even ordinal. (OK, we could rank the teams by which place they finished in the season or something, but here they are just in alphabetical order.)

But the choice of drawing lines between the data points makes an implied continuous connection between the team names based on alphabetical order, which is meaningless and misleading.

The trend lines (dotted lines) are even worse. The author is trying to draw a distinction between the two categories (2016 vs 2018), but a trend line is about a relationship or correlation between the x and y axes. In this case, it shows a slight upward trend as the team names get closer to the end of the alphabet, which is totally meaningless.

Some kind of bar graph would have been a much better choice here. Line graphs like this should only be used when the axes are both numerical (some would argue, only if they are both continuous), with consistent scaling.
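
For instance, here is a minimal matplotlib sketch (with made-up team names and win counts) of a bar chart that avoids drawing meaningless lines between nominal categories:

import matplotlib.pyplot as plt

# hypothetical nominal data: one bar per team, no implied ordering or connection
teams = ['Aardvarks', 'Bobcats', 'Cheetahs', 'Dingos']
wins = [84, 103, 76, 90]

plt.bar(teams, wins)
plt.xlabel('Team')
plt.ylabel('Wins')
plt.show()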

4 ML terminology

With Machine Learning, our data will always be in the form of a numerical matrix. The rows of the matrix are called “observations”, and they typically correspond to individual entries (maybe people, or foods, or animals, or days, or crimes, etc.) in the dataset.

The columns of the data matrix are called features, and they correspond to one attribute or measurable aspect of each observation. For example, if each observation is a person, then some features might be weight, birth year, annual income, number of times they have flown on an airplane, etc.

One special column is called the label, and this is the attribute or aspect that the ML algorithm is trying to guess or predict.
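
To make this concrete, here is a tiny made-up example of such a data matrix in Pandas, with the label column split off from the features:

import pandas as pd

# three observations (people), with made-up values
df = pd.DataFrame({
    'weight':     [150, 180, 120],        # feature
    'birth_year': [1990, 1985, 2002],     # feature
    'flights':    [4, 12, 0],             # feature
    'income':     [55000, 72000, 31000],  # the label we want to predict
})

features = df[['weight', 'birth_year', 'flights']]  # the feature matrix
labels = df['income']                               # the label vector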

There are two main categories of ML algorithms that structure how we feed in data and what it is predicting/guessing.

  • Supervised learning

    In this case, we have some known labels for a lot of observations, and we want to predict that same label for some other observations. For both groups of observations, whether the label is known or not, we have to have measurements for all of the same features.

    For example, if the observations are movies, then we might know the IMDB rating, release date, cast size, number of filming days, etc., for a bunch of movies. Those are the features. Then for most movies, we also know how much money the movie made — that would be the label. We want to predict how much money the other movies will make based on all of those known features.

    There are two sub-categories of supervised learning:

    • Regression: When the labels represent numerical values.

      The “how much money will this movie make” problem just described is an example of a regression problem, because that label (how much money) is a numerical value that we want to predict.

      The simplest (and often most useful) kind of regression algorithm is linear regression, where the fitting can be expressed as a simple matrix-vector product: the ML algorithm tries to “fit” a vector so that the (known observations) matrix times this vector is “close” to the known labels vector. (A short numpy sketch of this appears after this list.)

    • Classification: When the labels represent categorical values.

      For example, with movies, we might label known movies according to their genre, like 1 for action, 2 for rom-coms, 3 for documentaries, etc. Then given the same observation for some new movies, we want to predict what genre they will have.

      This is different from regression because it wouldn’t make sense, for example, to get a decimal number like 2.7.

  • Unsupervised learning

    In this case, we are asking the ML algorithm to make up a new label based on a single matrix of observations and known features. In other words, the ML algorithm is being asked to summarize or make a new “feature” based on the information we do know for a bunch of observations.

    There are multiple sub-categories of unsupervised learning, but we will focus on just one:

    • Clustering

      We ask the ML algorithm to divide the known observations into groups or “clusters”. This is the same as saying that we want the new label that the ML algorithm makes up to be a small integer like 0, 1, or 2. So each cluster or group will be the set of observations with the same label.

      For example, given a bunch of economic features (known, measurable values) about all the countries in the world, we can feed this into a clustering algorithm to group them into countries that are somehow “similar” according to those features.
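
Here is the short linear regression sketch promised above, using plain numpy least squares on made-up numbers. (This is roughly what happens under the hood of sklearn’s LinearRegression, which adds conveniences like fitting an intercept.)

import numpy as np

# made-up data: 4 observations x 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([8.0, 4.0, 9.0, 17.0])   # known labels

# "fit" a vector w so that X @ w is as close as possible to y
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)       # the fitted coefficient vector, about [2. 3.] here
print(X @ w)   # predicted labels for the known observations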

You will learn much more about these types of machine learning (and other types like dimensionality reduction and reinforcement learning) in your classes next year! For now, we are not trying to become experts, but just to get our hands dirty trying out some of these techniques and to get a handle on the terminology.

5 Regression example

For supervised learning, remember that we really need three things:

  • A matrix of (known observations) x (known features)
  • A vector of known labels for each of the known observations
  • A matrix of (unknown observations) x (known features). This is important! Remember that the “unknown” here means we don’t know the labels, but we do have to know the features!

The goal of regression or classification will be to predict the labels of the unknown observations (which will be another vector).

As an example, suppose we have a file called foods.csv of information about foods.

The first few lines look like this:

Food,Measure,Grams,Protein,Fat,Sat.Fat,Fiber,Carbs,Calories,Category
Cows' milk,1 qt.,976,32,40,36.0,0.0,48.0,660.0,Dairy products
Milk skim,1 qt.,984,36,0,0.0,0.0,52.0,360.0,Dairy products
Buttermilk,1 cup,246,9,5,4.0,0.0,13.0,127.0,Dairy products
...

Here the “Calories” will be our label. We want to predict how many calories some other foods will have, based on the other features like the amount of protein, fat, etc.

Here is the file mystery.csv with the “mystery” foods where we don’t know their calorie counts:

Food,Measure,Grams,Protein,Fat,Sat.Fat,Fiber,Carbs
Kiwi,1 piece,100,1,0.44,0,3,14
Scrapple,1 slice,25,2.1,3.7,1.3,0.1,3.7

Now we can use sklearn’s LinearRegression algorithm to predict the number of calories in each “mystery” food:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# read in csvs
foods = pd.read_csv('foods.csv')
mystery = pd.read_csv('mystery.csv')

# cut down
fdata = foods.drop(['Food','Measure','Calories','Category'], axis=1)
fcals = foods['Calories']
mdata = mystery.drop(['Food','Measure'], axis=1)

# fit model
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(fdata, fcals)

# make predictions
mystery['predicted_calories'] = model.predict(mdata)

# print results
print(mystery[['Food','predicted_calories']])

Notice a few things here:

  • There are many other (usually better) algorithms for regression besides LinearRegression. But no matter what algorithm is used, the same steps of create model, fit, and predict will be used.

  • Here we used a pipeline with sklearn’s StandardScaler to do the regression. This rescales each feature to a comparable range (zero mean and unit variance), which prevents some kinds of outlier problems that can occur.

    (Scaling is actually totally unnecessary here, but it can’t hurt and is important for many other regression problems that you might run.)

  • The predicted labels (calorie counts) represent numerical values, so we used regression. If we were trying to predict a category instead, we would want to use a classification algorithm.
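
One last note: the learning objectives mention using separate training and testing data to evaluate a model. Here is a minimal sketch of how that could look, reusing fdata and fcals (and the imports) from the code above:

from sklearn.model_selection import train_test_split

# hold out 25% of the labeled foods as a test set
Xtrain, Xtest, ytrain, ytest = train_test_split(fdata, fcals,
                                                test_size=0.25, random_state=0)

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(Xtrain, ytrain)          # fit using only the training data
print(model.score(Xtest, ytest))   # R^2 score on the held-out testing data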

6 Clustering example

Imagine we have a CSV file with information on Midshipmen, with lines like this:

name,company,aom,mom,oom,maj_qpr,qpr,run,pushups,plank,varsity
Prof Roche,0,100,1200,1000,4,3.5,11.5,35,2.1,0
Bill,10,1,1,1,4,4,8,100,4,1
Chad,25,1200,200,800,1.5,2.1,8,110,3.5,1

So each row (Prof Roche, Bill, Chad) is an observation, and each column (aom, qpr, etc.) is a feature. Notice that the features are all numbers, and even the True/False feature varsity, representing “are you on a varsity sport”, has been converted to 1/0.

(Remember, being written as numbers does not mean these are numerical! We have to represent everything as numbers in order to feed it into sklearn.)

In reality we would need many more observations in order to do any useful machine learning on this data! In class we made up many more rows of fake information.
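
(As an aside, here is one quick, hypothetical way to fabricate random rows with numpy; the column ranges are guesses, and this is not necessarily how we did it in class.)

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=212)
n = 60
fake = pd.DataFrame({
    'name':    [f'MIDN{i:03d}' for i in range(n)],
    'company': rng.integers(1, 31, size=n),
    'aom':     rng.permutation(np.arange(1, n + 1)),
    'mom':     rng.permutation(np.arange(1, n + 1)),
    'oom':     rng.permutation(np.arange(1, n + 1)),
    'maj_qpr': rng.uniform(1.0, 4.0, size=n).round(2),
    'qpr':     rng.uniform(1.0, 4.0, size=n).round(2),
    'run':     rng.uniform(8.0, 14.0, size=n).round(1),
    'pushups': rng.integers(20, 121, size=n),
    'plank':   rng.uniform(1.0, 4.5, size=n).round(2),
    'varsity': rng.integers(0, 2, size=n),
})
fake.to_csv('fakemids.csv', index=False)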

Here is a program that reads in a CSV file with fake MIDS data like above and tries to group them into clusters using sklearn’s SpectralClustering algorithm:

import pandas as pd
from sklearn.cluster import SpectralClustering

df = pd.read_csv('fakemids.csv')
data = df.drop(columns=['name', 'company'])
print("data:")
print(data)
print()

model = SpectralClustering(n_clusters=3)
model.fit(data)
df['learned_labels'] = model.labels_
print(df)

Notice a few things:

  • The name and company (first two columns) are removed, because those are not numerical features that can be used for machine learning.

    (Yes the company is a number, but it is still categorical data!)

  • For this algorithm, we have to specify how many groups or “clusters” we want, which we chose as 3. So this is pretty useless unless we have more than 3 observations in the dataset!

  • In this code, we just add the cluster numbers as a new column and print out the resulting dataframe. If you wanted to just see the names of everyone in a certain cluster, like cluster 2, you could do something like:

    print('cluster 2:', df[df['learned_labels'] == 2]['name'].to_list())
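
  • To see how many MIDS ended up in each cluster, the .value_counts() method from earlier in these notes works nicely:

    print(df['learned_labels'].value_counts())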