SD 212 Spring 2024 / Homeworks


hw35: College groupings

  • Due before the beginning of class on Friday, April 26

Background

This homework is a continuation of the previous homework using data from the College scorecard.

The (cleaned) dataset is the same as before: schools.csv

Starter code

The following code uses K-Means clustering in sklearn to group all the colleges into 4 groups, and then prints out the names of the colleges in each group.

Download and save this code as groups.py:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

schools = pd.read_csv('schools.csv')
schooldata = schools.drop(columns = ['INSTNM']).fillna(0)

scaled = StandardScaler().fit(schooldata).transform(schooldata)
model = KMeans(n_clusters=4, random_state=12345)
model.fit(scaled)
schools['group'] = model.labels_

for g in range(4):
    ingroup = schools[schools['group'] == g]['INSTNM'].sort_values()
    print(f"Group {g} ({len(ingroup)} schools):")
    print(ingroup.to_string(index=False))
    print("=" * 50)

Run this code and make sure it works. There will be a lot of output because there are more than 1000 schools in the list! You will change that next…
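
To see what the scaling step in the starter code does before K-Means runs, here is a minimal sketch on made-up numbers (the array below is illustrative and not taken from schools.csv). It suggests that StandardScaler gives every column mean 0 and standard deviation 1, so a column with large values such as enrollment does not dominate the distance calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers: column 0 looks like SAT averages, column 1 like enrollments.
raw = np.array([[1200.0, 1500.0],
                [1350.0, 20000.0],
                [1500.0, 6000.0]])

# Same call pattern as the starter code.
scaled = StandardScaler().fit(raw).transform(raw)

# The same result by hand: subtract each column's mean, then divide by
# that column's population standard deviation (ddof=0, the NumPy default).
by_hand = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, by_hand))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that fit(...).transform(...) on the same data can also be written as fit_transform(...).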

Modify the code

Make the following changes to the starter code:

  1. Use your Pandas prowess to only consider “special” schools, which we will define as satisfying both of these conditions:

    • Average SAT score (SAT_AVG) is at least 1300
    • At least 2000 undergraduate students (UGDS)

    There should be exactly 106 such schools.

    (You need to remove some rows from the original DataFrame.)

  2. Make the clustering ignore location (latitude/longitude and locale).

    (You need to drop some additional columns from the DataFrame.)

  3. Instead of KMeans clustering with 4 clusters, use BisectingKMeans clustering with 5 clusters.

    (Make sure you print out the schools in all 5 clusters at the end!)

Run and analyze results

Run your code. It should work without errors on the original schools.csv file in your SD212 environment.

Look at the five clusters of colleges your code produces. Then answer these two questions in a markdown file. Download the file hw35.md to fill in and submit for this homework:

  1. Pick ONE of the five clusters of colleges your code produced, and copy down the names of the colleges in that group.

  2. For that ONE group, based on what you know about those colleges, give a possible (brief) explanation of what makes those schools “similar” to each other.

Submit command

To submit files for this homework, run one of these commands:

submit -c=sd212 -p=hw35 hw35.md groups.py
club -csd212 -phw35 hw35.md groups.py