SD 212 Spring 2024 / Labs


Lab 1: Crimes in Baltimore

1 Overview

[Image: Omar Little (played by Michael K. Williams)]

This lab will examine a large dataset of “crimes against persons” made available by the Baltimore Police Department’s Open Data initiative. We will ask you to download this dataset, do some basic exploration and discovery, and then make a few graphs relating crime rates to their type, time, and location.

1.1 Deadlines

  • Milestone: 2359 on Monday, 22 January
  • Complete lab: 2359 on Friday, 26 January

1.2 Collaboration

We want you to work together and help each other when you get stuck. But we also want you to struggle yourselves with each concept and work to figure it out on your own.

The formal rules are in the course policy here. We suggest you also consider the following guidelines when giving and receiving help.

The overall guidance is that the first time you see an attempted solution to a technical problem or something you are supposed to figure out for the lab, it should be your own attempt! You should never be getting direct answers or copying code from anyone else at any point.

Good ideas / allowed help:

  • Discussing ideas, strategies, tips, and sources of help without looking at any code
  • Helping with VS Code, installing WSL, using SSH to the lab machines, using mamba, and other set-up stuff.
  • Helping a classmate debug their code or get un-stuck on some part after you have completed that part yourself. This may involve looking at the part of their code where they are stuck.
  • Admitting if you can’t help and your classmate should ask someone else, like an MGSP leader or instructor. For example: if you are too busy to understand their issue, or if you can’t see any way to fix their code because you did yours in a different way.
  • Asking your instructor for clarification if you aren’t sure what is or isn’t allowed

Bad ideas / not allowed:

  • Looking at someone else’s code for a part of the assignment which you haven’t done yourself yet
  • Sending your code to a classmate
  • Copying code (even just a few lines) from a classmate and submitting as your own
  • Asking for help from other people or online tools not affiliated with this course
  • Copying code from a website or other source and submitting it as your own without specifically citing the website and the part of code you used

1.3 Learning goals

We want you to be aware of why we are asking you to do this work, and how it fits in with the rest of the class. Here are the ways that your knowledge and understanding of data science and programming will grow as a result of doing this lab:

  • Gain independence and confidence using data science libraries from SD211 to read and process CSV files and produce graphs
  • Slice datasets horizontally (focusing on a few columns) and vertically (focusing on a few rows)
  • Practice dealing with raw, unprocessed and uncleaned data
  • Interpret results in terms of significance and possible explanations

2 Preliminaries

2.1 Installs

If you haven’t already, or if you haven’t done this for ssh or the lab machines, you need to install mamba and create your sd212 folder.

One additional package is needed for this lab to work. Open a terminal and run these commands to install in your sd212 environment:

mamba activate sd212
mamba install -y nbformat

2.2 Folder setup and dataset download

First, create a lab01 folder inside your sd212 folder.

Then, in VS Code, go to “Open folder” and open a fresh VS Code window with that folder.

The data we will use is in this CSV file. You can get it with right-click/“Save link as”, but let’s download it directly in the terminal instead.

Now open a terminal in VS Code and run these two commands to first download and then decompress the crimes dataset. Make sure you are in the correct folder: in the terminal you should see ~/sd212/lab01$ at the end of your prompt.

wget "http://roche.work/courses/s24sd212/lab/crime/crimes.csv.xz"
unxz crimes.csv.xz

Now you should have a nice big CSV file to work with!
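
Once the file is unpacked, pandas can load it with pd.read_csv. The sketch below is only an illustration: it reads a tiny made-up CSV from a string so that it runs anywhere, and the Label column and all of the values are invented for the example (CrimeDateTime is the only column name taken from this lab). For the real data you would pass the filename 'crimes.csv' instead.

```python
import io
import pandas as pd

# Tiny made-up stand-in for crimes.csv; the 'Label' column is hypothetical.
fake_csv = io.StringIO(
    "CrimeDateTime,Label\n"
    "2021-06-01 12:00:00+00,first\n"
    "2021-06-02 13:30:00+00,second\n"
    "2021-06-03 09:15:00+00,third\n"
)

# For the real lab data, this would be pd.read_csv('crimes.csv')
crimes = pd.read_csv(fake_csv)

print(crimes.head())         # first few rows
print(list(crimes.columns))  # column headings
print(crimes.shape)          # (number of rows, number of columns)
```

Looking at head() and the column headings right after loading is a quick sanity check that the download and decompression worked.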

2.3 Notebook

Much like some of our homeworks, this lab will ask you to fill in a file with your answers to certain specific questions.

But the difference is, for this lab, you will actually fill in a Jupyter notebook with your answers to questions as well as the code that you used to get those answers.

Here is the notebook file for this lab.

If you want, you can use the terminal to download the blank file directly to your lab folder with this command:

wget "http://roche.work/courses/s24sd212/lab/crime/lab01.ipynb"

Once you have that notebook downloaded and copied to your lab directory, open it in VS Code. Follow the instructions there on where to put your answers to questions as well as your code. Remember that you just double-click on any markdown cell to start editing it.

Important: Don’t forget to save your notebook file frequently, and certainly whenever you go to submit it. VS Code does not automatically save the file when you run your code.

3 Exploring the data

This part of the lab asks you to answer some questions about the crimes data crimes.csv.

For each question, you should do two things: write the code to answer in a new code cell that you create in the notebook below the question header, and then fill in the actual answer in the markdown cell that contains the question header itself.

First use your pandas skills to answer these questions.

  1. How many total rows are in this dataset (not counting the header row)?

  2. How many car thefts occurred in the dataset?

    (Do not include carjackings, just regular thefts.)

  3. There is only one neighborhood which had over 100 homicides. What is this neighborhood called?

  4. Which district is that neighborhood in?

  5. There is a column for gender. Do you think this is the gender of the victim or the assailant? Explain how you figured this out.

    (Hint: Use the data! Some types of crimes disproportionately impact one gender or another.)

Some pandas hints and reminders:

  • Looking at the column names:

    my_dataframe.columns
  • Getting the number of rows in a dataframe:

    len(my_dataframe)
  • Selecting a single column from a dataframe gives you a series:

    my_dataframe['Column Heading']
  • Selecting multiple columns is also possible, but you have to use a list of column headings. And the return value will then be a dataframe (not a series), since it has multiple columns:

    my_dataframe[['Column1', 'Column2']]
  • Finding all the distinct values in some series, sorted by how many times they occur (very useful!!):

    my_series.value_counts()
  • Comparing each entry in a series to a single value, to produce another series of True/False values:

    my_series == some_value_to_compare_with_each_row
  • Comparing each entry in a series to a list of possible values also works and creates a series of True/False values, by using the .isin() function:

    my_series.isin(['Values', 'To', 'Test', 'Against'])
  • To combine multiple True/False series logically, you have to use Python’s “bitwise” operators instead of the normal ones: instead of or, you write |; instead of and, you write &; and instead of not, you write ~. Put parentheses around each comparison, because these operators bind more tightly than == does.

    So for example, this test:

    fruits.isin(['apple', 'banana', 'orange'])

    can be written equivalently as:

    (fruits == 'apple') | (fruits == 'banana') | (fruits == 'orange')
  • Giving a series of True/False as the index selects only the rows where the condition was true:

    my_dataframe[my_series_of_true_false]
  • Adding a series as a new column in an existing dataframe:

    my_dataframe['New Column Name'] = my_series
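
To see several of these hints working together, here is a small sketch on made-up data. The dataframe, its column names (fruit and color), and its values are all invented for illustration and have nothing to do with the crimes dataset.

```python
import pandas as pd

# Made-up data; the column names 'fruit' and 'color' are hypothetical.
basket = pd.DataFrame({
    'fruit': ['apple', 'banana', 'kiwi', 'apple', 'orange', 'apple'],
    'color': ['red', 'yellow', 'green', 'green', 'orange', 'red'],
})

# How often does each distinct value occur?
counts = basket['fruit'].value_counts()

# True/False series: one from ==, one from .isin()
is_apple = basket['fruit'] == 'apple'
is_orange_or_kiwi = basket['fruit'].isin(['orange', 'kiwi'])

# Combine conditions with & and ~ (note the parentheses around the comparison)
red_apples = basket[is_apple & (basket['color'] == 'red')]
not_apples = basket[~is_apple]

# Store a True/False series as a new column
basket['is_apple'] = is_apple

print(counts)
print(len(red_apples), len(not_apples))
```

Giving each True/False series its own name, as done here, tends to make a long combined condition easier to read and debug than one giant expression.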

3.1 Submit what you have so far

Save your notebook and submit your work so far:

submit -c=sd212 -p=lab01 lab01.ipynb

or

club -csd212 -plab01 lab01.ipynb

or use the web interface

3.2 Milestone

Because labs go over two weeks, we will have a “milestone” in the middle of the lab, which is due earlier. The purpose is to keep you on track and make sure you seek help early if needed.

It doesn’t mean to stop here though! Don’t assume the milestone is a half-way point. Instead, think of the milestone as the very least that you should have done by the intermediate deadline.

Grading-wise, failure to complete the milestone on time will result in a deduction of up to 20% from your grade on this lab.

For this lab, the milestone means everything up to this point, which includes the following auto-tests:

nb_3

4 Getting the dates right

Each crime in the dataset has a date and time recording when that crime occurred. However, unlike numbers, date/time columns are not automatically recognized by pandas; they are read in as plain strings. Left that way, they make it very difficult to answer questions about crimes that occurred within a certain date range!

There are a few ways to handle this with pandas; for today we will use the pd.to_datetime function, which takes a series of strings and attempts to convert them to a series of datetime objects.

Try running this in a new code cell, replacing my_dataframe with whatever you called your dataframe of crime data:

pd.to_datetime(my_dataframe['CrimeDateTime'])

It gives a long, gross error message! If you entered this command correctly, the error is coming from a very unusual date which is out of the range of dates which pandas knows how to represent.

Look carefully at the error message to see the string which caused the datetime conversion to fail. Then answer this question in your notebook:

  1. What is the year (only) of the first date in crimes.csv that pandas could not convert to a datetime object?

Obviously this year is an error; the city’s namesake was not even born for a few hundred more years. Real datasets contain errors, and as data scientists we have to work around them!

In this case, the solution is to add the errors='coerce' option to the pd.to_datetime function call, which will convert those un-representable dates to the special null value NaT (“not a time”).

One more tweak which will probably be helpful is to call .dt.tz_localize(None) on the series of datetimes. What that does is wipe out the time zone information in the column, which makes it easier to compare to other dates. (For this data set, we can be pretty confident that everything occurs in the same time zone, so eliminating that makes the job much simpler!)

Finally, I recommend adding these datetimes as a new column in your dataframe. So all in all you will want one or more lines of code that look something like:

your_dataframe['new column name'] = pd.to_datetime(your_dataframe['CrimeDateTime'], errors='coerce').dt.tz_localize(None)
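
To see what those two steps do in isolation, here is a minimal sketch on three made-up strings. As a stand-in for the problem date in the real file, the middle string is an impossible calendar date (February 30); the values and the +00:00 offsets are assumptions for illustration, not copied from crimes.csv.

```python
import pandas as pd

# Made-up strings; the middle one is not a valid calendar date (February 30).
raw = pd.Series([
    '2021-06-01 12:00:00+00:00',
    '2021-02-30 12:00:00+00:00',
    '2021-06-03 09:15:00+00:00',
])

# Without errors='coerce' this call would raise an exception.
converted = pd.to_datetime(raw, errors='coerce')

# Drop the time zone so the values compare easily with plain datetimes.
cleaned = converted.dt.tz_localize(None)

print(cleaned)
print(cleaned.isna().sum())  # number of entries that became NaT
```

Counting the NaT entries afterwards is a reasonable habit: it tells you how many rows were silently turned into nulls by the coercion.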

Now you should be able to answer some more questions.

  1. How many crimes were committed in 2015?

  2. How many crimes were committed between March 14 and December 14, 2020 (including both those dates)?

    (Hint: You can create normal datetime objects from the datetime python library to compare against.)

  3. How many robberies occurred on Fridays which were the 13th day of the month?

    Note: there are three kinds of robberies in the data set - be sure to include all three kinds in your count!

  4. Which hour of the day had the fewest “common assaults” occur?

  5. What were the four most common types of crimes in the seven years from 2016 to 2022 (inclusive)?

Some hints about working with datetimes in pandas:

  • Remember that a datetime object in Python or pandas generally is a specific date and time (hence the name). If you create a datetime object but only specify year/month/day, then the time defaults to 00:00, i.e. midnight at the beginning of that day. Keep this in mind when comparing datetime objects to each other!
  • When you have a series of datetimes, you can use the .dt accessor to get at the various datetime pieces. So for example, if dates is a pandas series of datetimes, then dates.dt.minute would be a series of integers between 0 and 59, corresponding to the minute values of each date in the original series.
  • Days of the week are numbered from 0 up to 6. But what does 0 stand for? Go to the documentation to find the answer.
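
The midnight default and the .dt accessor can be sketched on a few made-up timestamps, as below. The dates are invented for illustration, and comparing against the following midnight is just one possible way to make an end date inclusive.

```python
from datetime import datetime
import pandas as pd

# Made-up timestamps for illustration only.
dates = pd.Series(pd.to_datetime([
    '2020-03-14 00:00:00',
    '2020-12-14 18:45:00',
    '2020-12-15 00:00:00',
]))

start = datetime(2020, 3, 14)  # means 2020-03-14 00:00
end = datetime(2020, 12, 14)   # means 2020-12-14 00:00, NOT the end of that day

# Pitfall: <= end leaves out the 18:45 entry on December 14
too_strict = (dates >= start) & (dates <= end)

# One way to include all of December 14: stop just before the next midnight
whole_range = (dates >= start) & (dates < datetime(2020, 12, 15))

print(too_strict.tolist())
print(whole_range.tolist())

# Pieces of each datetime via the .dt accessor
print(dates.dt.year.tolist(), dates.dt.hour.tolist())
print(dates.dt.day_name().tolist())
```

The day_name() method returns the weekday as a word rather than a number, which can be handy for labels.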

4.1 Submit what you have so far

Save your notebook and submit your work so far:

submit -c=sd212 -p=lab01 lab01.ipynb

or

club -csd212 -plab01 lab01.ipynb

or use the web interface

5 Graphing crime data

For this part of the lab, you will make some visualizations of the crime data information.

We will use the plotly express library to create histograms. Read the documentation here!

5.1 Crime types

For the seven years 2016–2022, we want to see the breakdown and trends of the most common types of crimes.

First trim your dataframe down to the crimes in this year range that belong to the 4 most common types you identified in the previous section. Then use plotly to create a histogram of those crimes, where each type is a separate color. This is sometimes called a “grouped bar chart” and can be specified with the barmode='group' option to the histogram function in plotly express.

(Hint: you should be able to pass an entire trimmed-down dataframe to plotly, specifying column names as the named arguments x and color.)

Your plot should look like this (except that you will have 4 bars in each group, and the numbers in this plot are totally made up, and the types of crimes are not the same as what you will have):

Feel free to explore more plotly options to make your graph look pretty!

  1. Create your histogram so it displays in the notebook.

    When plotting, use options width=1000, height=600 to give a consistent size for your instructor to grade.

    Click on the little camera icon on the graph to save your graph as a file called years.png

  2. Which year in the plotted range had the highest number of larcenies?

    (Hover your mouse over the bars in your graph to help answer the question accurately.)

  3. Considering the years most affected by the COVID pandemic, what kind(s) of crime decreased during those years, and what kind(s) did not decrease as much?

    Give a brief explanation of why that might be the case.

Save your notebook and submit now:

submit -c=sd212 -p=lab01 years.png lab01.ipynb

or

club -csd212 -plab01 years.png lab01.ipynb

or use the web interface

5.2 Days of the week (OPTIONAL)

(This part of the lab is optional enrichment if you finish everything else and want to go a bit further.)

We would like to understand how days of the week play into crime statistics as well.

Make another histogram of the crime data, over the ten-year range from 2013 to 2022. But now put the day of the week (Monday, Tuesday, etc.) along the x-axis, and use the color to show the different districts rather than the type of crime.

Caution: More data cleaning might be needed! Some of the crimes don’t have any district entered in the raw dataset. In pandas those entries likely show up as NaN (“not a number”). Sometimes you will see such entries notated as “N/A”. In pandas, there are a few functions on series which can be useful for eliminating these values: isnull(), isna(), notna(), and dropna(). Look up the documentation to learn how each one works and figure out what you want to use.
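
Here is a small sketch of how those functions behave on made-up data. The Region and Count columns are hypothetical and are not the column names in the crimes dataset. (isnull() is simply another name for isna(), so it is not shown separately.)

```python
import numpy as np
import pandas as pd

# Made-up data; 'Region' is a hypothetical column with two missing entries.
toy = pd.DataFrame({
    'Region': ['North', np.nan, 'South', None, 'North'],
    'Count': [3, 5, 2, 7, 1],
})

missing = toy['Region'].isna()    # True where the value is missing
present = toy['Region'].notna()   # the opposite of isna()

kept_by_mask = toy[present]                     # boolean indexing
kept_by_dropna = toy.dropna(subset=['Region'])  # same rows, different spelling

print(missing.tolist())
print(len(toy), len(kept_by_mask), len(kept_by_dropna))
```

Both approaches keep the same rows here; which one reads better is largely a matter of taste.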

  1. Create your histogram (with width 1000 and height 600 like before).

    Save it as a file days.png

  2. Which district sees the most crimes during weekdays (M-F)?

  3. In most districts the crime level stays the same or goes down on the weekends (Friday–Sunday). But one district sees a slight but noticeable increase in crime on weekends. Which district, and why?

    (Hint: Try to find a map of the city and look at what is located in this district.)

5.3 Submit!

Submit your hard work:

submit -c=sd212 -p=lab01 years.png days.png lab01.ipynb

or

club -csd212 -plab01 years.png days.png lab01.ipynb

or use the web interface