

This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 1: Crimes and football in Baltimore

1 Overview

Ray Rice (photo: Nick Wass/AP)

This lab will examine a large dataset of “crimes against persons” made available by the Baltimore Police Department’s Open Data initiative. The first part of the lab will ask you to download this dataset, do some basic exploration and discovery, and then make a few graphs relating crime rates to their type, time, and location. Then we will connect this with a second piece of data (which you must acquire) on NFL games, and see if there is any connection between Baltimore Ravens games and crime rates.

1.1 Deadlines

  • Milestone: 2359 on Tuesday, 24 January
  • Complete lab: 2359 on Sunday, 29 January

1.2 Collaboration

We want you to work together and help each other when you get stuck. But we also want you to struggle yourselves with each concept and work to figure it out on your own.

The formal rules are in the course policy here. We suggest you also consider the following guidelines when giving and receiving help.

The overall guidance is that the first time you see an attempted solution to a technical problem or something you are supposed to figure out for the lab, it should be your own attempt! You should never be getting direct answers or copying code from anyone else at any point.

Good ideas / allowed help:

  • Discussing ideas, strategies, tips, and sources of help without looking at any code
  • Helping with VS Code, installing WSL, using SSH to the lab machines, using mamba, and other set-up stuff.
  • Helping a classmate debug their code or get un-stuck on some part after you have completed that part yourself. This may involve looking at the part of their code where they are stuck.
  • Admitting if you can’t help and your classmate should ask someone else, like an MGSP leader or instructor. For example: if you are too busy to understand their issue, or if you can’t see any way to fix their code because you did yours in a different way.
  • Asking your instructor for clarification if you aren’t sure what is or isn’t allowed

Bad ideas / not allowed:

  • Looking at someone else’s code for a part of the assignment which you haven’t done yourself yet
  • Sending your code to a classmate
  • Copying code (even just a few lines) from a classmate and submitting as your own
  • Asking for help from other people or online tools not affiliated with this course
  • Copying code from a website or other source and submitting it as your own without specifically citing the website and the part of code you used

1.3 Learning goals

We want you to be aware of why we are asking you to do this work, and how it fits in with the rest of the class. Here are the ways that your knowledge and understanding of data science and programming will grow as a result of doing this lab:

  • Gain independence and confidence using data science libraries from SD211 to read and process CSV files and produce graphs
  • Slice datasets horizontally (focusing on a few columns) and vertically (focusing on a few rows)
  • Practice dealing with raw, unprocessed and uncleaned data
  • Think about data acquisition and find a simple dataset from the web
  • Correlate two datasets based on a single common key
  • Interpret correlation results in terms of significance and possible explanations

2 Preliminaries

2.1 Installs

If you haven’t already, or if you haven’t done this on the lab machines, you need to install mamba and create your sd212 folder.

Open a terminal on the lab machine and run these commands. Answer “yes” when asked if you want to initialize mamba.

mkdir -p ~/sd212
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh

Then close the terminal and re-open a new one, and run one more command to install packages:

mamba create -n sd212 numpy pandas ipykernel matplotlib plotly seaborn scikit-learn opencv bs4 lxml nltk easygui wordcloud openpyxl

2.2 Folder setup and dataset download

First, create a lab01 folder inside your sd212 folder.

Then, in VS Code, go to “Open folder” and open a fresh VS Code window with that folder.

The data we will use is in this csv file. You can get it with right-click/“Save link as”, but let’s download it directly in the terminal instead.

Now open a terminal in VS Code and run these two commands to first download and then decompress the crimes dataset. Make sure you are in the correct folder: in the terminal you should see ~/sd212/lab01$ at the end of your prompt.

wget "http://roche.work/courses/s23sd212/lab/crime/crimes.csv.xz"
unxz crimes.csv.xz

Now you should have a nice big CSV file to work with!

2.3 Markdown questions

Much like some of our homeworks, this lab will ask you to fill in a markdown file with your answers to certain specific questions.

Here is the file for today's lab.

If you want, you can use the terminal to download the blank file directly to your lab folder with this command:

wget "http://roche.work/courses/s23sd212/lab/md/lab01.md"

Remember to type your answers in the blank lines, below the lines that start with # [Q01] to indicate where each question starts, and below the lines that start with > blah blah, which just re-state the questions themselves.

3 Inspecting and cleaning the data

This part of the lab asks you to answer some questions about the crimes data crimes.csv.

Create a new Python program explore.py. Start by using pandas to read in the csv file into a dataframe object.
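As a minimal sketch of that first step, here is the read-and-inspect pattern on a tiny made-up CSV. The toy data and the Description column name are assumptions for illustration only; the real crimes.csv has its own columns and many more rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for crimes.csv (made-up rows and column names).
toy_csv = io.StringIO(
    "CrimeDateTime,Description\n"
    "2020/03/14 08:30:00,EXAMPLE A\n"
    "2020/03/15 21:00:00,EXAMPLE B\n"
)

# In explore.py you would pass the filename instead: pd.read_csv('crimes.csv')
crimes = pd.read_csv(toy_csv)

print(crimes.columns)  # the column headings
print(crimes.shape)    # (number of rows, number of columns)
```

Note that the header line is used for the column names and is not counted as a row of data.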

For each question, you should do two things: write the code to answer the question and print out the answer in your explore.py program, and write the actual answer in the markdown file you downloaded.

(The reason we want you to do both is so that we don’t have to specify precisely the formatting of the answers as they print out from your Python program. But if I run your Python program, it should display the correct answers, in order.)

3.1 Initial questions

First use your pandas skills to answer these questions.

  1. How many total rows are in this dataset (not counting the header row)?

  2. Which district (not counting the odd SD5 district) had the fewest number of car thefts?

    (Do not include car jackings, just regular thefts.)

  3. There is only one neighborhood which had over 100 homicides. What is this neighborhood called?

  4. Which district is that neighborhood in?

  5. There is a column for gender. Do you think this is the gender of the victim or the assailant? Explain how you figured this out. (Hint: Use the data! Some types of crimes disproportionately impact one gender or another.)

Some pandas hints and reminders:

  • Looking at the column names:

    my_dataframe.columns
  • Selecting a single column from a dataframe gives you a series:

    my_dataframe['Column Heading']
  • Selecting multiple columns is also possible, but you have to use a list of column headings:

    my_dataframe[['Column1', 'Column2']]
  • Comparing each entry in a series to a single value, to produce another series of True/False values:

    my_series == 'some value to compare with each row'
  • Comparing each entry in a series to a list of possible values also works and creates a series of True/False values, by using the .isin() function:

    my_series.isin(['Values', 'To', 'Test', 'Against'])
  • To combine multiple True/False series logically, you have to use Python’s “bitwise” operators instead of the normal ones. So instead of or, you write |; instead of and you write &, and instead of not you write ~.

    So for example, this test:

    fruits.isin(['apple', 'banana', 'orange'])

    can be written equivalently as:

    (fruits == 'apple') | (fruits == 'banana') | (fruits == 'orange')
  • Giving a series of True/False as the index selects only the rows where the condition was true:

    my_dataframe[my_series_of_true_false]
  • Finding all the distinct values in some series, sorted by how many times they occur (very useful!!):

    my_series.value_counts()
  • Adding a series as a new column in an existing dataframe:

    my_dataframe['New Column Name'] = my_series
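To see how these pieces fit together, here is a small sketch on a made-up dataframe of fruits. None of these names come from the crimes data; it only shows the hints above working in combination.

```python
import pandas as pd

# Made-up toy data, unrelated to crimes.csv.
basket = pd.DataFrame({
    'fruit': ['apple', 'kiwi', 'banana', 'apple', 'orange'],
    'color': ['red', 'green', 'yellow', 'green', 'orange'],
})

# Two True/False series, one from .isin() and one from ==
is_apple_or_orange = basket['fruit'].isin(['apple', 'orange'])
is_green = basket['color'] == 'green'

# Combine with & (and) and ~ (not); parentheses matter when the
# comparisons are written inline, e.g. (a == 1) & (b == 2).
green_apples = basket[is_apple_or_orange & is_green]
not_green = basket[~is_green]

# Distinct values, most frequent first.
counts = basket['fruit'].value_counts()

# Store a True/False series as a new column.
basket['is_green'] = is_green

print(green_apples)
print(not_green)
print(counts)
```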

3.2 Getting the dates right

Each crime in the dataset has a date and time recording when it occurred. However, unlike numbers, date/time columns are not automatically recognized by pandas; they are read in as plain strings. That makes answering questions about crimes that occurred within a certain date range very difficult!

There are a few ways to handle this with pandas, but we will use the pd.to_datetime function, which takes a series of strings and attempts to convert them to a series of datetime objects.

Try running this in Python, replacing my_dataframe with whatever you called your dataframe of crime data:

pd.to_datetime(my_dataframe['CrimeDateTime'])

It gives a long, gross error message! If you entered this command correctly, the error is coming from a very unusual date which is out of the range of dates which pandas knows how to represent.

Look carefully at the error message to see the string which caused the datetime conversion to fail. Then answer this question in your markdown file:

  1. What is the year (only) of the first date in crimes.csv that pandas could not convert to a datetime object?

Obviously this year is an error; the city’s namesake was not even born for a few hundred more years. Real datasets contain errors, and as data scientists we have to work around them!

In this case, the solution is to add the errors='coerce' option to the pd.to_datetime function call, which will convert those un-representable dates to the special null value NaT (“not a time”).
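Here is a small sketch of what errors='coerce' does, using a toy series. The date strings are made up, and an unparseable string stands in for the problem date so as not to give away the answer above; the idea is the same either way.

```python
import pandas as pd

# Made-up strings; the middle one cannot be converted to a datetime.
raw = pd.Series([
    '2020/03/14 08:30:00',
    'not a real date',
    '2021/07/04 12:00:00',
])

# Without errors='coerce' this call would raise an exception.
when = pd.to_datetime(raw, errors='coerce')

print(when)
print(when.isna().sum())  # how many entries became NaT
```

Counting the NaT values afterwards is a quick way to check how much of the data was affected.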

Now you should be able to answer some more questions. (Remember, write the code to answer each question in explore.py and the actual answers in the markdown file.)

  1. How many crimes were committed in 2015?

  2. How many crimes were committed between March 14 and December 14, 2020 (including both those dates)?

  3. How many robberies occurred on Fridays which were the 13th day of the month?

  4. Which hour of the day had the most “common assaults” occur?

Some hints about working with datetimes in pandas:

  • When you have a series of datetimes, you can use the .dt accessor to get at the various datetime pieces. So for example, if dates is a pandas series of datetimes, then dates.dt.minute would be a series of integers from 0 to 59, corresponding to the minute values of each date in the original series.
  • Days of the week are numbered from 0 up to 6. But what does 0 stand for? Go to the documentation to find the answer.
  • To compare to a given date, you can use date objects from the datetime package.
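The hints above might look like this on a toy series of made-up dates. This sketch uses datetime.datetime objects for the comparisons, which is one choice among several, and it uses an exclusive upper bound so that times late on the last day are still included.

```python
from datetime import datetime
import pandas as pd

# Made-up dates, not taken from crimes.csv.
dates = pd.to_datetime(pd.Series([
    '2019-12-31 23:15:00',
    '2020-01-01 00:05:00',
    '2020-06-13 14:40:00',
    '2021-02-13 09:00:00',
]))

# Pieces of each datetime via the .dt accessor.
hours = dates.dt.hour
in_2020 = dates.dt.year == 2020
on_the_13th = dates.dt.day == 13

# Everything from January 1 through June 30, 2020, inclusive:
# compare against the start of the day AFTER the last day wanted.
start = datetime(2020, 1, 1)
end = datetime(2020, 7, 1)
first_half_2020 = (dates >= start) & (dates < end)

print(hours.tolist())
print(in_2020.sum(), on_the_13th.sum(), first_half_2020.sum())
```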

3.3 Submit what you have so far

Submit your work so far:

submit -c=sd212 -p=lab01 explore.py lab01.md

or

club -csd212 -plab01 explore.py lab01.md

or use the web interface

3.4 Milestone

Because labs go over two weeks, we will have a “milestone” in the middle of the lab, which is due earlier. The purpose is to keep you on track and make sure you seek help early if needed.

It doesn’t mean to stop here though! Don’t assume the milestone is a half-way point necessarily. Instead, think of the milestone as the least amount you should have done by the intermediate deadline.

Grading-wise, failure to complete the milestone on time will result in up to 20% deduction from your grade on this lab.

For this lab, the milestone means everything up to this point, which includes the following auto-tests:

md_3.1
md_3.2
py_explore

4 Graphing crime data

For this part of the lab, you will make some visualizations of the crime data information.

We will use the plotly express library to create histograms. Read the documentation here!

Make a new file visualize.py which will contain your code for this part. You may want to copy some of the data input/cleaning you did from explore.py to get started. You will also want to import plotly express. Eventually, this Python program should contain the code to generate your two graphs for this part. You will also generate static png images of those graphs and submit them separately.

4.1 Crime types

For the seven years 2016–2022, we want to see the breakdown and trends of the most common types of crimes.

First, use your pandas skills to determine the overall three most common types of crime in the dataset.

Then use plotly to create a histogram of the crimes in this year range, broken down into those 3 most common types, where each type is a separate color. This is sometimes called a “grouped bar chart” and can be specified with the barmode='group' option in plotly.

(Hints: you should be able to pass an entire trimmed-down dataframe to plotly, specifying column names as the named arguments x and color.)

Your plot should look like this (except that the numbers in this plot are totally made up, and the types of crimes are not the same as what you will have):

Feel free to explore more plotly options to make your graph look pretty! Save your plot as an image file years.png to submit. Then answer a couple questions about what you see:

  1. Which year in the plotted range had the highest number of thefts? (The fancy term for theft is “larceny”.)

  2. Considering the years most affected by the COVID pandemic, what kind(s) of crime decreased during those years, and what kind(s) did not? Give a brief explanation of why that might be the case.

Submit now:

submit -c=sd212 -p=lab01 explore.py visualize.py years.png lab01.md

or

club -csd212 -plab01 explore.py visualize.py years.png lab01.md

or use the web interface

5 Ravens

Now we are finally ready to bring football into the picture. The question we want to answer is how (if at all) crime rates in Baltimore are affected by Ravens games.

To answer this, we have a few steps to undertake.

5.1 Find NFL games data

The first stage of the data science pipeline is data acquisition. So, it’s time to go acquire some data!

What we need to know are, for all Baltimore Ravens games between 2012 and 2022:

  • Game date
  • Was the game at home (in Baltimore) or away?
  • Who was the opponent
  • What was the final score

Fortunately, there are many free datasets online which have all this information. In fact, they probably have even more information (more teams, more details on each game, longer year range, …). That’s fine because you are now very good at trimming down large data sets using pandas!

(Hint: You really don’t want to be copy/pasting from a website or using beautiful soup to scrape the pages if you don’t have to. Sometimes it can help to add the word “csv” to your Google search to find data that’s already organized in a convenient way. I promise this data exists somewhere free, but you might have to “hunt” around a bit on the web to find it depending on your Google skills.)

Once you have your data downloaded, create a Python program nfl.py which contains the code you use to answer these questions:

  1. Give the website where you got your NFL data from.

  2. How many home games did the Ravens play in the calendar year 2013?

  3. How many total points did the Ravens score at home on Mondays between January 1, 2012 and December 31, 2022?

5.2 Games and crimes

Now it’s time to put it together! We want to find a correlation between the home games that the Ravens played and the incidence of crimes in Baltimore. More specifically, we want to compare crime rates on days when there wasn’t a Ravens home game, to days where there was one.

One big challenge here is that we can’t just count the total number of crimes, since there are many more days without football than with football. We could do this by averaging, but another method is sampling. The idea is to compare the days with a Ravens home game, to a random sample of the same number of days without a home game.

Your task is to create a grouped histogram (like before) comparing non-game days to game days between 2012 and 2022, where each group represents a type of crime (robbery, car theft, etc.). Initially, there will be two comparisons (colors in the graph):

  • Days between 2012 and 2022 with Ravens home games
  • The same number of random days in the same range without Ravens home games

We can focus on the seven most common types of crime for this comparison. So there should be seven groups of bars, and in each group there are two colors.

Create a file ravens.py that has your Python code to create this graph. At the top of the file, do

import random
random.seed(YOUR_ALPHA)

This way, the random sampling of days will be consistent each time you or I re-run the code, so the graph doesn’t keep changing with every run. (Be sure to only call random.seed once, at the top of your code.)
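One possible way to do the sampling, sketched on a single made-up month: the seed value and the game dates below are placeholders, and random.sample is just one of several functions that would work.

```python
import random
from datetime import date, timedelta

random.seed(123456)  # placeholder; use your own alpha here

# Every day in a made-up range (January 2022).
all_days = [date(2022, 1, 1) + timedelta(days=i) for i in range(31)]

# Made-up "game days", NOT real Ravens dates.
game_days = [date(2022, 1, 2), date(2022, 1, 9), date(2022, 1, 16)]

# Candidate days are all the days without a game.
game_set = set(game_days)
non_game_days = [d for d in all_days if d not in game_set]

# Pick as many non-game days as there are game days, without repeats.
sampled_days = random.sample(non_game_days, k=len(game_days))

print(sampled_days)
```

Because random.sample picks without replacement, no day can appear twice in the sample.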

Answer this question:

  1. Based on your data, what kinds of crimes (if any) significantly increased or decreased during Ravens home games compared to all other days of the year?

Save your graph as ravens.png and submit the code and the graph along with all the other parts of the lab.

5.3 Better comparison

When doing data science and trying to make correlations or connections, an important (but very difficult!) challenge is to ensure that we are really measuring the connection between the things we want to, and not some other unrelated issues.

In the case of Ravens games, we want to ask: Are there any other aspects of game days that may affect crime besides the actual game?

In this case, there are at least two such aspects:

  • NFL games are usually on Sundays
  • NFL games are usually in the months between September and January

For the last part of your lab, add a third bar to each group of comparisons, which accounts for these aspects.

That is, instead of randomly sampling any day of the year without a Ravens home game, sample only from Sundays between September and January. (You will also want to remove any Ravens home games that were not held on Sunday or not between September and January.) Your grouped histogram should be similar to before, but with a third bar in each group for “random fall Sundays”.
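A sketch of building the pool of candidate days, using a made-up date range that covers a single season. Using day_name() is one option; the numeric day-of-week values from the earlier hints would work just as well.

```python
import pandas as pd

# Every day in a made-up range covering one fall season.
days = pd.Series(pd.date_range('2021-08-01', '2022-02-28'))

# Keep only Sundays in the months September through January.
is_sunday = days.dt.day_name() == 'Sunday'
in_season = days.dt.month.isin([9, 10, 11, 12, 1])
fall_sundays = days[is_sunday & in_season]

print(fall_sundays.head())
print(len(fall_sundays))
```

You would still need to remove the actual home game days from this pool before sampling from it.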

  1. Do the crime rates during Ravens games look more or less significant compared to the “random fall Sundays” group or the “random day of the year” group? What is your final conclusion on the effect of Ravens games (if any) on crime in the city of Baltimore?

Update your ravens.py code and ravens.png image accordingly.

5.4 Submit

submit -c=sd212 -p=lab01 explore.py visualize.py ravens.py years.png ravens.png lab01.md

or

club -csd212 -plab01 explore.py visualize.py ravens.py years.png ravens.png lab01.md

or use the web interface