Unit 6: Data cleaning
1 Overview
This unit will touch on important data science skills which are sometimes overlooked and variously go by names like “data cleaning”, “data scrubbing”, or “data wrangling”.
This is an intermediate step after some dataset is obtained, and before we are ready to start processing and analyzing the data. In the data cleaning/wrangling phase, we are often concerned about formats and restructuring data so it fits nicely in something like a Pandas dataframe.
The good news is, you’ve already been doing data cleaning and wrangling, a bit in SD211 and even more in the first few weeks of SD212. You have already experienced data with formatting errors, missing entries, mistakes, and multiple formats, and had to deal with those situations, particularly in labs.
The goal of this unit is to discuss the different ways data can be dirty, mis-formatted, or inconsistently represented, and some of the tools that we use to wrestle and twist it into shape.
2 Resources
- Chapter 7: Data Cleaning and Preparation
  A good overview of what can go wrong with data we receive and how to deal with it in Python with Pandas. The initial discussion on missing and outlier data is great.
- Chapter 8: Data Wrangling
  A more in-depth study of how to reshape and merge Pandas datasets. The discussion in this chapter is a little more in-depth than what we will get into in this class; skip the hierarchical indexing for now and focus mostly on section 8.2 on how to combine and merge DataFrames.
- Data Science at the Command Line
  Much of this is a review of a number of command-line tools which we have already been using, such as sed and grep, but specifically in the context of data cleaning. The author also mentions a few times some command-line tools which we won’t focus on in SD212, such as awk, but the chapter isn’t really focused on the tools themselves, so you should gloss over the awk examples as well. The point is to focus on the tools which we have learned and the data “scrubbing”/cleaning task in itself.
3 Overview
Welcome to the real world, where data is never as clean as the textbook examples. In fact, a common joke is that data scientists spend 80% of their time cleaning data, and the other 20% complaining about cleaning data.
Up until now, we’ve given you nicely formatted CSVs and perfectly structured text files. Now, it’s time to take off the training wheels. In this unit, we will look at what can go wrong with raw data, why your files sometimes refuse to open, and how to stitch disparate datasets together using Pandas.
4 The Messy Reality of Data
What exactly do we mean by data “cleaning” or “wrangling”? Usually, it involves dealing with one of these main culprits:
- Encoding Issues: The file won’t even open, or it looks like it’s written in alien hieroglyphics. Are we looking at a plain-text or binary format?
- Mis-formatted Data: The data breaks the rules of its own format. Think of extra trailing spaces after names, or numbers written with commas (e.g., "1,234,567") that Pandas misinterprets as strings instead of integers.
- Missing Data: Blanks in the dataset. Maybe a row is missing a column, or a poll didn’t collect data for a certain state on a certain week. Ethically, it is never OK to just “guess” or fill in missing data with made-up numbers—that’s falsifying data! Usually, we have to figure out how to safely ignore missing values without crashing our analysis.
- Erroneous Data: Data that is obviously mis-typed but follows the formatting rules. For example, entering a date as being in the 13th century, or mixing up 2-digit and 4-digit years.
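As a small illustration, Pandas can handle several of these culprits in just a few lines. (The city data here is completely made up for the example.)

```python
import io
import pandas as pd

# Hypothetical messy data: a stray trailing space, a number written with
# thousands separators, and one missing value.
raw = 'city,population\n"Annapolis  ","40,812"\nBaltimore,\n'

# thousands=',' tells Pandas to parse "40,812" as the number 40812
# instead of leaving it as a string
df = pd.read_csv(io.StringIO(raw), thousands=',')

# Strip the stray whitespace from the text column
df['city'] = df['city'].str.strip()

# Baltimore's population is missing, so it becomes NaN; aggregations
# like mean() skip NaN by default rather than crashing.
print(df)
print(df['population'].mean())
```

Note that we aren’t “filling in” the missing population with a guess; we just make sure our analysis safely ignores it.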
5 Character Encodings
All digital data is ultimately just a clump of 1s and 0s (bits, grouped into bytes). But we don’t read bytes; we rely on an interpretation of those bytes.
For plain-text files (like CSV, TSV, JSON, or TXT), the computer uses a “character set” or “encoding” to map those bytes to human-readable letters.
- UTF-8 is the modern gold standard. Python (and Pandas) expects your files to be UTF-8 by default.
- Older files or files generated by certain Microsoft software might use encodings like ISO-8859-1 (Latin-1) or Extended ASCII.
If you try to pd.read_csv() a file containing Latin-1 characters but
Python tries to read it as UTF-8, you will get a nasty
UnicodeDecodeError. When different encodings get crammed into one
file, it’s a massive headache.
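Here is a small sketch of both the failure and the fix. (The filename and file contents are made up for this example.)

```python
import pandas as pd

# Write a tiny CSV encoded as Latin-1; the 'é' becomes the single
# byte 0xE9, which is not valid on its own in UTF-8.
with open('cafe.csv', 'wb') as f:
    f.write('item,price\ncafé,3.50\n'.encode('latin-1'))

# Reading with the default UTF-8 decoder blows up:
try:
    pd.read_csv('cafe.csv')
except UnicodeDecodeError as err:
    print('UTF-8 decoding failed:', err)

# Telling Pandas the actual encoding fixes it:
df = pd.read_csv('cafe.csv', encoding='latin-1')
print(df)
```

The hard part in practice is usually figuring out *which* encoding a mystery file uses; the error message only tells you which byte made UTF-8 give up.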
Pro-Tip: Line endings can also cause chaos! Unix systems end lines
with \n, while older DOS/Windows systems use \r\n. If your text
processing gets weirdly mangled at the end of lines, a hidden \r is
usually the culprit.
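A quick sketch of the hidden-\r problem when splitting Windows-style lines by hand:

```python
# A Windows-style line ends in '\r\n'; stripping only the '\n'
# leaves an invisible carriage return stuck to the last field.
line = 'Navy,34\r\n'

fields = line.rstrip('\n').split(',')
print(fields)   # the last field still carries a hidden '\r'

# rstrip('\r\n') (or plain .strip()) removes both line-ending characters
clean = line.rstrip('\r\n').split(',')
print(clean)
```

Comparing the two printed lists makes the stowaway `'\r'` easy to spot.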
6 Merging Disparate Datasets (pd.merge)
We frequently need to combine information from completely different sources. Imagine you have one dataset with NFL game stats, and another with city crime rates, and you want to see if crime spikes when the local team loses.
Pandas provides a powerful function called pd.merge to combine two
DataFrames row-wise based on shared columns.
(Note, mastering this topic conceptually will help you next year too, when you learn about database JOIN operations.)
6.1 Controlling the Merge
Merging is incredibly useful but requires patience. By default, Pandas looks for columns with the same exact name in both dataframes and pairs up rows that have the exact same values.
You control the merge using two main parameters:
- on: Which column(s) should act as the “glue” holding the rows together?
- how: What do we do about missing entries?
  - how='inner': The strict bouncer. Only keeps rows that have matching values in both dataframes.
  - how='left': VIP treatment for the first (left) dataframe. Keeps all rows from the left dataframe, and pulls in data from the right dataframe where it matches.
  - how='outer': Everyone is invited. Keeps all rows from both dataframes, filling in the blanks with NaN (Not a Number).
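Here is a minimal sketch, using two made-up toy dataframes, of how the three how options differ:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# inner: only 'b' appears in both dataframes
inner = pd.merge(left, right, on='key', how='inner')

# left: keep 'a' and 'b'; 'a' has no match, so its y value is NaN
left_join = pd.merge(left, right, on='key', how='left')

# outer: keep everything ('a', 'b', and 'c'), with NaN in the gaps
outer = pd.merge(left, right, on='key', how='outer')

print(inner)
print(left_join)
print(outer)
```

Running this and comparing the three printed dataframes is the fastest way to internalize which rows survive each kind of merge.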
6.2 The Pizza Shop Example
Let’s look at a toy example. You run a pizza shop and have three different CSV files.
menu.csv
pizza_id,pizza_name,category
10,Pepperoni,Classic
20,Veggie Supreme,Veggie
30,BBQ Chicken,Specialty
nutrition.csv (Notice the column is product_name, not pizza_name!)
product_name,calories
Pepperoni,300
Veggie Supreme,250
Breadsticks,150
BBQ Chicken,320
inventory.csv
pizza_id,slices_available
10,5
30,2
40,7
Our goal is to make a single dataframe showing what’s on sale today, how many slices we have, and the calorie counts (if known). Because the column names and available items don’t perfectly align, we have to wrangle them into compatibility.
import pandas as pd
# Load the data
men = pd.read_csv('menu.csv')
nut = pd.read_csv('nutrition.csv')
inv = pd.read_csv('inventory.csv')
# Step 1: Merge Menu and Nutrition.
# We must rename 'product_name' to 'pizza_name' so Pandas knows how to match them!
# We use how='left' to keep all our menu pizzas, even if nutrition info is missing.
m1 = pd.merge(men, nut.rename(columns={'product_name': 'pizza_name'}),
on='pizza_name', how='left')
# Step 2: Merge the result with Inventory.
# We use how='inner' because we ONLY want pizzas that we actually have slices for today.
m2 = pd.merge(m1, inv, on='pizza_id', how='inner')
# Clean up the final view
final = m2[['pizza_name', 'slices_available', 'calories']]
print(final)
Output:
pizza_name slices_available calories
0 Pepperoni 5 300.0
1 BBQ Chicken 2 320.0
(Notice how Veggie Supreme was dropped because it wasn’t in the inventory, and Breadsticks were dropped because they weren’t on the pizza menu!)