hw35: College groupings
- Due before the beginning of class on Friday, April 26
Background
This homework is a continuation of the previous homework, using data from the College Scorecard.
The (cleaned) dataset is the same as before: schools.csv
Starter code
The following code uses K-Means clustering in sklearn to group all the colleges into 4 groups, and then prints out the names of the colleges in each group.
Download and save this code as groups.py:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
schools = pd.read_csv('schools.csv')
schooldata = schools.drop(columns = ['INSTNM']).fillna(0)
scaled = StandardScaler().fit(schooldata).transform(schooldata)
model = KMeans(n_clusters=4, random_state=12345)
model.fit(scaled)
schools['group'] = model.labels_
for g in range(4):
    ingroup = schools[schools['group'] == g]['INSTNM'].sort_values()
    print(f"Group {g} ({len(ingroup)} schools):")
    print(ingroup.to_string(index=False))
    print("=" * 50)
Run this code and make sure it works. There will be a lot of output because there are more than 1000 schools in the list! You will change that next…
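The starter code imports make_pipeline without using it. As a rough sketch of what that import is for, the example below runs the same scale-then-cluster steps on made-up toy data, once in the two-step form used above and once as a single pipeline. The toy array, the choice of 2 clusters, and the explicit n_init=10 are assumptions made for this illustration only; they are not part of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=[1000, 0.2], scale=[50, 0.05], size=(20, 2)),
    rng.normal(loc=[1400, 0.8], scale=[50, 0.05], size=(20, 2)),
])

# Two-step form, as in the starter code: scale first, then cluster.
scaled = StandardScaler().fit(toy).transform(toy)
manual = KMeans(n_clusters=2, n_init=10, random_state=12345).fit(scaled)

# Pipeline form: scaling and clustering are fit in one call.
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=12345),
)
pipe.fit(toy)
pipe_labels = pipe[-1].labels_  # the last pipeline step is the fitted KMeans

# Compare the cluster labels from the two approaches.
print((manual.labels_ == pipe_labels).all())
```

Either form should be acceptable for this homework; the pipeline version simply keeps the scaler and the model together in one object.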
Modify the code
Make the following changes to the starter code:
Use your Pandas prowess to only consider “special” schools, which we will define as satisfying both of these conditions:
- Average SAT score (SAT_AVG) is at least 1300
- At least 2000 undergraduate students (UGDS)
There should be exactly 106 such schools.
(You need to remove some rows from the original DataFrame.)
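One common way to keep only the rows that meet two conditions is a boolean mask. The sketch below shows the idea on a tiny made-up DataFrame; only the column names SAT_AVG and UGDS come from the assignment, while the school names and numbers are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names SAT_AVG and UGDS come from the assignment.
toy = pd.DataFrame({
    'INSTNM': ['Alpha College', 'Beta University', 'Gamma Institute', 'Delta State'],
    'SAT_AVG': [1350, 1280, 1420, None],
    'UGDS': [5000, 12000, 900, 30000],
})

# Each comparison yields a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds more tightly than >=.
mask = (toy['SAT_AVG'] >= 1300) & (toy['UGDS'] >= 2000)
special = toy[mask]

print(special['INSTNM'].tolist())
print(len(special))
```

Note that a comparison against a missing value is False, so rows with no SAT_AVG drop out of the filtered result. Checking len() on your filtered DataFrame is a quick way to confirm you get the expected 106 schools.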
Make the clustering ignore location (latitude/longitude and locale).
(You need to drop some additional columns from the DataFrame.)
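Dropping several columns at once works the same way as dropping INSTNM in the starter code. In the sketch below, the column names LATITUDE, LONGITUDE, and LOCALE are hypothetical stand-ins; check schools.columns to find the actual names in schools.csv.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    'INSTNM': ['Alpha College', 'Beta University'],
    'SAT_AVG': [1350, 1420],
    'LATITUDE': [38.98, 42.36],     # hypothetical column name
    'LONGITUDE': [-76.48, -71.09],  # hypothetical column name
    'LOCALE': [21, 12],             # hypothetical column name
})

# drop() returns a new DataFrame; the original is left unchanged.
location_cols = ['LATITUDE', 'LONGITUDE', 'LOCALE']
features = toy.drop(columns=['INSTNM'] + location_cols)

print(list(features.columns))
```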
Instead of KMeans clustering with 4 clusters, use BisectingKMeans clustering with 5 clusters.
(Make sure you print out the schools in all 5 clusters at the end!)
Run and analyze results
Run your code - it should work without errors on the original schools.csv file in your SD212 environment.
Look at the five clusters of colleges your code produces. Then answer these two questions in a markdown file.
Download the file hw35.md to fill in and submit for this homework.
Pick ONE of the five clusters of colleges your code produced, and copy down the names of the colleges in that group.
For that ONE group, based on what you know about those colleges, give a possible (brief) explanation of what makes those schools “similar” to each other.
Submit command
To submit files for this homework, run one of these commands:
submit -c=sd212 -p=hw35 hw35.md groups.py
club -csd212 -phw35 hw35.md groups.py