SD 212 Spring 2024 / Notes


Unit 1: Welcome back

1 Reference Reading

2 Course overview

Welcome to SD212! You are now a wise and experienced data scientist, having thrived in SD211, and ready to move on to bigger and better things.

Let’s take a moment to remember what we are calling the “data science lifecycle”:

  1. Data Acquisition
  2. Data Storage
  3. Data Processing and Cleaning
  4. Data Analysis
  5. Data Visualization, Interface, and Communication

You have touched on all of these already in SD211. Now, in this class, we will extend your skills by taking a wider view of data acquisition, storage, and cleaning, and a deeper dive into data processing and analysis.

Specifically, in this class, we will learn more about:

  • How to find and download datasets from a variety of sources
  • How to “fix” raw data that may have missing, ill-formatted, or erroneous parts
  • How to do all this on the command line as well as in Python
  • Using regular expressions to quickly search for specific patterns in text files
  • A bit about the main components of a computer and how they affect the performance of programs we write
  • How libraries work and how large projects are organized and maintained
  • …and more!

In short, the goal is to give you more independence and skills as programmers in order to handle all kinds of data science tasks, from start to finish.

3 Grading and logistics

You should have a look at the course policy.

Expect 2-3 homeworks per week and a new lab every 1 or 2 weeks. Because we are not only focused on programming, many of the homeworks will have you answering questions about the reading or concepts we learn about in class.

One special event we will take part in this semester is the Info Challenge hosted by UMD. Mark two Saturdays, February 24 and March 2, on your calendar. The first is the introduction to the Info Challenge, where your team will meet with a mentor and learn what you will be working on for the following week. On the second Saturday, we will travel to UMD and you will present your work to the judges. This is a great chance for you to practice your data science skills on real datasets and work on problems that matter. Class comp days will be given to compensate for your time, and we will work out the necessary excusals and MOs. If you think you can’t attend, please talk with your instructor about it.

4 Data Science in the Wild

At the beginning of every week, we will take just a few minutes to discuss something interesting you saw in a newspaper, in a magazine, or on social media that uses data science to try to make an interesting point. This will also count as a homework assignment.

This week your instructor will take the first turn and discuss an article they saw; we need you all to sign up for the rest of the semester. Look out for an email from your instructor with a sign-up link.

5 Python background

You learned a lot in SD211! Here is a quick reminder of some of the topics covered there:

  • Using VS Code to write and debug programs
  • Variables
  • Lists
  • Dictionaries
  • If and if/else
  • While and for loops
  • Reading files
  • Functions
  • Classes
  • Importing from other Python programs you write
  • Importing and installing libraries using conda
  • Helpful data science libraries: csv, pandas, plotly, etc.

We will spend some time in the first week reviewing some of this, but it is your responsibility to keep up, review your past work and notes as needed, and (as always) proactively seek help from MGSP and/or your instructor when needed.

6 Code you should understand

Each unit will end with some code examples that you should fully understand by the end of that unit.

6.1 States with large populations

Here is a CSV file, states.csv, with the populations and areas of all 50 states.

The file starts like this:

name,abbreviation,capital,population,area
Alabama,AL,Montgomery,4903185,52420
Alaska,AK,Juneau,731545,665384
Arizona,AZ,Phoenix,7278717,113990
Arkansas,AR,Little Rock,3017804,53179

First, let’s write a program that reads in a number and prints the names of all states that have at least that many residents.

Here’s how to do that using csv.DictReader:

from csv import DictReader

minpop = int(input('How many residents? '))

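# open the file in a with block so it is closed automatically when we finish
with open('states.csv', newline='') as fd:
    rdr = DictReader(fd)
    for row in rdr:
        # DictReader gives every field as a string, so convert before comparing
        if int(row['population']) >= minpop:
            print(row['name'])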

And here’s how we could accomplish the same thing using pandas:

import pandas as pd

minpop = int(input('How many residents? '))

states = pd.read_csv('states.csv')

bigstates = states[states['population'] >= minpop]
for name in bigstates['name']:
    print(name)
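
One detail worth noticing in the DictReader version: every field comes back as a string, which is why the population is wrapped in int() before the comparison, whereas pandas works out the numeric column types on its own. The sketch below illustrates this with the first few rows shown above, held in an in-memory string (via io.StringIO) instead of states.csv so that it runs on its own. The names sample, rows, and names_at_least are made up for this illustration.

```python
from csv import DictReader
from io import StringIO

# First rows of states.csv as shown above, held in memory for illustration
sample = """name,abbreviation,capital,population,area
Alabama,AL,Montgomery,4903185,52420
Alaska,AK,Juneau,731545,665384
Arizona,AZ,Phoenix,7278717,113990
Arkansas,AR,Little Rock,3017804,53179
"""

rows = list(DictReader(StringIO(sample)))

# Every value is a str, even the ones that look like numbers
print(type(rows[0]['population']))

def names_at_least(rows, minpop):
    """Return the names of states with at least minpop residents."""
    return [row['name'] for row in rows if int(row['population']) >= minpop]

print(names_at_least(rows, 4000000))
```

With only these four rows, a threshold of 4,000,000 keeps Alabama and Arizona.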

6.2 MLK Day

Martin Luther King Jr. was born on Tuesday, January 15, 1929 in Atlanta, Georgia.

The holiday held in his honor is always on the third Monday in January each year.

Write a program that, for each year starting with 2020 and ending with 2050, prints out the date of MLK day, like this:

January 20, 2020
January 18, 2021
January 17, 2022
...
January 17, 2050

(Remember, there are 365 days in a year, except for leap years when there are 366 days. Within this range, the leap years are those years that are divisible by 4.)
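
The divisible-by-4 shortcut works here only because the range 2020 to 2050 contains no century year. The full Gregorian rule also skips years divisible by 100 unless they are divisible by 400. Here is a quick sketch that checks the shortcut against the standard library's calendar.isleap (the helper name simple_leap is just for illustration):

```python
import calendar

def simple_leap(year):
    """Shortcut rule: treat every year divisible by 4 as a leap year."""
    return year % 4 == 0

# The shortcut agrees with the full rule for every year in our range
agrees_in_range = all(simple_leap(y) == calendar.isleap(y)
                      for y in range(2020, 2050 + 1))
print(agrees_in_range)

# ...but not in general: 2100 is divisible by 4 yet is not a leap year
print(simple_leap(2100), calendar.isleap(2100))
```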

You should design at least one function to help you write this program.

Here’s one way to do it without using any libraries:

year = 2020
mlk = 20 # MLK Day in 2020 is Jan 20

def days_in_year(year):
    if year % 4 == 0:
        # leap year
        return 366
    else:
        return 365

while year <= 2050:
    print(f"January {mlk}, {year}")
    # how many extra days beyond a full 52 weeks are in the year
    extra_days = days_in_year(year) % 7
    # Monday shifts back by that many days
    mlk -= extra_days
    # If needed, jump to next week to stick with 3rd Monday
    if mlk <= 14:
        mlk += 7
    year += 1

And here’s how we could do it using the datetime library:

from datetime import datetime

def mlk_day(year):
    # The earliest the 3rd Monday could be is the 15th
    # Note, datetime.weekday() returns 0 for Monday, 1 for Tuesday, etc.
    day = 15
    dt = datetime(day=day, month=1, year=year)
    while dt.weekday() != 0:
        day += 1
        dt = datetime(day=day, month=1, year=year)
    return day

for year in range(2020, 2050+1):
    mlk = mlk_day(year)
    print(f"January {mlk}, {year}")
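
One way to build confidence in the library-free version is to check it against datetime for every year in the range. The sketch below restates both approaches as functions so it can run on its own. The names mlk_days_by_arithmetic and mlk_day_by_datetime are hypothetical, and it uses date rather than datetime since no time of day is needed.

```python
from datetime import date

def mlk_days_by_arithmetic(first_year, first_day, last_year):
    """Step year by year from a known MLK Day, as in the library-free version."""
    result = {}
    day = first_day
    for year in range(first_year, last_year + 1):
        result[year] = day
        # Days beyond 52 full weeks (shortcut leap rule, valid for 2020-2050)
        extra_days = (366 if year % 4 == 0 else 365) % 7
        # The Monday shifts back by that many days
        day -= extra_days
        # If needed, jump to the next week to stay on the 3rd Monday
        if day <= 14:
            day += 7
    return result

def mlk_day_by_datetime(year):
    """Find the first Monday on or after January 15."""
    day = 15
    while date(year, 1, day).weekday() != 0:
        day += 1
    return day

by_arithmetic = mlk_days_by_arithmetic(2020, 20, 2050)
mismatches = [y for y in range(2020, 2050 + 1)
              if by_arithmetic[y] != mlk_day_by_datetime(y)]
print(mismatches)  # an empty list means the two approaches agree
```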

Just for fun, there is a way to do it in just a couple of lines using another built-in Python module (that we haven’t talked about) called calendar. The purpose of showing this isn’t to say you should have known about it, but just to point out that there are often many different ways to solve the same problem in Python:

import calendar

for year in range(2020, 2050+1):
    print(calendar.Calendar(1).monthdatescalendar(year, 1)[2][-1].strftime('%B %d, %Y'))
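
Since that single line packs in several steps, here is a sketch that spells them out one at a time. It relies on calendar.Calendar(1) starting each week on Tuesday, so that every week ends on a Monday. The function name third_monday_of_january is made up for this illustration.

```python
import calendar

def third_monday_of_january(year):
    """Return the date of the third Monday in January of the given year."""
    # firstweekday=1 means weeks run Tuesday through Monday
    cal = calendar.Calendar(firstweekday=1)
    # A list of weeks covering January; each week is a list of 7 date objects
    weeks = cal.monthdatescalendar(year, 1)
    # The first week contains January 1 and ends on the first Monday,
    # so the third week (index 2) ends on the third Monday
    third_week = weeks[2]
    return third_week[-1]

d = third_monday_of_january(2020)
print(d.strftime('%B %d, %Y'))
```

The first week may begin in late December, but because it ends on a Monday on or after January 1, its last day is always the first Monday of January.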

6.3 Superheroes

The file heroes.csv (from this source) contains information about superheroes. Here are the first few lines:

Name;Identity;Birth place;Publisher;Height;Weight;Gender;First appearance;Eye color;Hair color;Strength;Intelligence
A-Bomb;Richard Milhouse Jones;Scarsdale, Arizona;Marvel Comics;203.21000000000001;441.94999999999999;M;2008;Yellow;No Hair;100;moderate
Abraxas;Abraxas;Within Eternity ;Marvel Comics;;;M;;Blue;Black;100;high
Abomination;Emil Blonsky;Zagreb, Yugoslavia;Marvel Comics;203.03999999999999;441.98000000000002;M;;Green;No Hair;80;good
Adam Monroe;;;NBC - Heroes;;;M;;Blue;Blond;10;good

Write a Python program with a class Hero which holds some information about a given superhero. Your class should include an __init__ constructor (of course) and two other methods:

  • display(self): Prints out the superhero’s name and their true identity, like

    Superman (Clark Kent)
  • stronger(self, other_hero): Returns True or False depending on whether this superhero has a higher “strength” than the other one.

Write a Python program with the Hero class described above, as well as code to read the heroes.csv file into a dictionary called heroes that maps names to Hero objects.

After your code runs, we can test it with lines like:

heroes['Superman'].display()
# prints out: Superman (Clark Kent)
heroes['Wonder Woman'].stronger(heroes['Doctor Strange'])
# returns True

Here’s one way to solve it:

from csv import DictReader

class Hero:
    """A superhero with various properties like name and strength."""

    def __init__(self, name, identity, strength):
        self.name = name
        self.identity = identity
        self.strength = int(strength)

    def display(self):
        print(f"{self.name} ({self.identity})")

    def stronger(self, other_hero):
        return self.strength > other_hero.strength

heroes = dict()
with open("heroes.csv", newline='') as fd:
    # this file separates fields with semicolons rather than commas
    for row in DictReader(fd, delimiter=';'):
        if row['Strength'] == '':
            # skip heroes with no reported strength
            continue
        heroes[row['Name']] = Hero(row['Name'], row['Identity'], row['Strength'])
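
To try the class without the full file, the sketch below rebuilds the heroes dictionary from a trimmed-down copy of the first rows shown above, read from an in-memory string. Keeping only the Name, Identity, and Strength columns is an assumption made to keep the example short, and the variable name sample is made up for illustration.

```python
from csv import DictReader
from io import StringIO

class Hero:
    """A superhero with a name, a true identity, and a strength score."""

    def __init__(self, name, identity, strength):
        self.name = name
        self.identity = identity
        self.strength = int(strength)

    def display(self):
        print(f"{self.name} ({self.identity})")

    def stronger(self, other_hero):
        return self.strength > other_hero.strength

# Trimmed-down stand-in for heroes.csv (only three columns kept)
sample = """Name;Identity;Strength
A-Bomb;Richard Milhouse Jones;100
Abraxas;Abraxas;100
Abomination;Emil Blonsky;80
Adam Monroe;;10
"""

heroes = {}
for row in DictReader(StringIO(sample), delimiter=';'):
    heroes[row['Name']] = Hero(row['Name'], row['Identity'], row['Strength'])

heroes['Abomination'].display()
print(heroes['A-Bomb'].stronger(heroes['Abomination']))
# A tie does not count as "stronger", because the comparison uses > rather than >=
print(heroes['A-Bomb'].stronger(heroes['Abraxas']))
```

Notice that a hero with a missing identity, like Adam Monroe here, would display with empty parentheses. Whether to handle that case differently is a design choice left open by the problem statement.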