hw18: Cleaning and wrangling in Pandas
- Due before the beginning of class on Monday, March 4
For this HW you will read about cleaning and wrangling data in Pandas and then fill in and submit a markdown file to check your understanding.
Reading
The readings come from the book “Python for Data Analysis”, which we haven’t looked into yet.
This book was written by the person who created the pandas library, so he is well placed to explain how best to use pandas!
One downside of this book, however, is that it can get “into the weeds” with detail really quickly. I find that the beginnings of chapters are useful to read, and I stop reading when the material gets too complicated, perhaps coming back later when I have a specific issue.
So, we’re asking you to read the beginning of just two sections.
Chapter 7: Read the beginning of the chapter and section 7.1, stopping at 7.2 “Data Transformation”.
Chapter 8: Skip section 8.1 (hierarchical indexing) and go to Section 8.2 (Combining and merging datasets). Just read the beginning of section 8.2, stopping when it starts talking about “many-to-many merges”.
Questions
Did you read the required part from Chapter 7?
(Answer yes or no)
By default, what does Pandas do with missing data when you ask for a “descriptive statistic” such as the mean or median?
a. An error is thrown if any values are missing
b. Missing values are automatically treated as zeros
c. Missing values are copied from the previous row in the dataframe
d. Missing values are excluded from the computation
(Enter just the letter of the correct choice.)
Imagine you have a DataFrame df which contains some missing entries. Which of these Python commands would replace all missing entries with 42?
a. df.set_missing(42)
b. df.ffill(42)
c. df.fillna(42)
d. df.isnull() = 42
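To try the candidate commands above for yourself, it helps to have a small DataFrame with some missing entries. The sketch below builds one and counts the gaps with isna(). The column names and values are made up for illustration and do not come from the reading.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; np.nan marks a missing entry.
df = pd.DataFrame({
    "height": [1.5, np.nan, 1.8],
    "weight": [60.0, 72.5, np.nan],
})

# isna() returns a same-shaped DataFrame of booleans:
# True wherever an entry is missing.
missing_mask = df.isna()

# Summing the booleans counts the missing entries in each column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

From here you can run each candidate command on df in a Python session and see which ones work and what they return.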
Did you read the required part from Chapter 8?
(Answer yes or no)
Imagine you have two DataFrames about food. cost is:

            food  price
    0     burger      5
    1       tots      3
    2  hot sauce      1

And nutrition is:

         food  calories
    0    tots       250
    1   fries       300
    2  burger       550
    3  nachos       650

Which of the following commands will give us a DataFrame which looks like this, with the price and calories of only the foods where we have both pieces of information?

         food  price  calories
    0  burger      5       550
    1    tots      3       250

a. cost + nutrition
b. pd.merge(cost, nutrition, on='food')
c. cost.join(nutrition, on='food')
d. cost.combine(nutrition, how='outer')
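If you want to experiment with the options above, the two DataFrames from the question can be rebuilt by hand, as in the minimal sketch below. The variable foods_in_both is a hypothetical helper added here only as a sanity check; it uses plain Python sets rather than any of the candidate commands.

```python
import pandas as pd

# The two DataFrames from the question, rebuilt by hand.
cost = pd.DataFrame({
    "food": ["burger", "tots", "hot sauce"],
    "price": [5, 3, 1],
})
nutrition = pd.DataFrame({
    "food": ["tots", "fries", "burger", "nachos"],
    "calories": [250, 300, 550, 650],
})

# Foods that appear in both tables, found with a plain set intersection
# (deliberately not using any of the candidate commands).
foods_in_both = set(cost["food"]) & set(nutrition["food"])

print(cost)
print(nutrition)
print(sorted(foods_in_both))
```

Running each candidate command against cost and nutrition shows which ones raise an error and which one gives a result containing exactly the foods listed in foods_in_both.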
Submit command
To submit files for this homework, run one of these commands:
submit -c=sd212 -p=hw18 hw18.md
club -csd212 -phw18 hw18.md
Download the file hw18.md to fill in and submit for this homework