SD 212 Spring 2024 / Notes


Unit 6: Data cleaning

1 Overview

This unit touches on important data science skills that are sometimes overlooked and that go by names such as “data cleaning”, “data scrubbing”, and “data wrangling”.

This is an intermediate step: it comes after a dataset is obtained and before we are ready to start processing and analyzing the data. In the data cleaning/wrangling phase, we are mostly concerned with formats and with restructuring the data so that it fits nicely in something like a Pandas DataFrame.

The good news is that you have already been doing data cleaning and wrangling: a bit in SD211 and even more in the first few weeks of SD212. You have already encountered data with formatting errors, missing entries, mistakes, and multiple formats, and you have had to deal with it, particularly in labs.

The goal of this unit is to discuss the different ways data can be dirty, mis-formatted, or inconsistently represented, and some of the tools that we use to wrestle and twist it into shape.
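As a small illustration of the kinds of problems mentioned above (stray whitespace, inconsistent capitalization, two date formats, missing entries, and an outright typo), here is a minimal sketch using only the Python standard library. The data in `RAW`, the column names, and the helper functions `parse_date`, `parse_score`, and `clean` are all made up for this example; real datasets will need their own rules.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data, invented for illustration only.
RAW = """name,joined,score
 Alice ,2024-01-15,91
bob,01/22/2024,N/A
Carol,2024-02-03,
dave,02/10/2024,8x
"""

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # the two formats that appear in RAW
MISSING = {"", "N/A"}                    # markers we treat as "no value"


def parse_date(text):
    """Try each known date format; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse_score(text):
    """Return an int, or None for missing or malformed entries."""
    if text in MISSING:
        return None
    try:
        return int(text)
    except ValueError:
        return None  # e.g. the typo "8x"


def clean(raw_text):
    """Read CSV text and return a list of dicts with normalized fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        rows.append({
            "name": rec["name"].strip().title(),
            "joined": parse_date(rec["joined"].strip()),
            "score": parse_score(rec["score"].strip()),
        })
    return rows


cleaned = clean(RAW)
for row in cleaned:
    print(row)
```

Here every value that cannot be interpreted becomes `None` rather than being silently guessed, which is one possible design choice among several. A list of dictionaries like `cleaned` can then be handed to `pandas.DataFrame(cleaned)` if a DataFrame is the end goal.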

2 Resources

  • Python for Data Analysis

    • Chapter 7: Data Cleaning and Preparation

      A good overview of what can go wrong with data we receive and how to deal with it in Python with Pandas. The initial discussion on missing and outlier data is great.

    • Chapter 8: Data Wrangling

      A more in-depth study of how to reshape and merge Pandas datasets. This chapter goes a little deeper than we will in this class; skip the hierarchical indexing for now and focus mostly on section 8.2 on how to combine and merge DataFrames.

  • Data Science at the Command Line

    • Chapter 5: Scrubbing Data

      Much of this is a review of command-line tools that we have already been using, such as sed and grep, but specifically in the context of data cleaning.

      The author also mentions, a few times, some command-line tools that we won’t focus on in SD212, such as awk. The chapter isn’t really about the tools themselves, though, so you can gloss over the awk examples. The point is to focus on the tools we have learned and on the data “scrubbing”/cleaning task itself.