SD 212 Spring 2024 / Homeworks


hw18: Cleaning and wrangling in Pandas

  • Due before the beginning of class on Monday, March 4

For this HW you will read about cleaning and wrangling data in Pandas and then fill in and submit a markdown file to check your understanding.

Reading

The readings come from the book “Python for Data Analysis”, which we haven’t looked into yet.

This book is written by Wes McKinney, the person who created the pandas library, so he is a pretty good person to explain how best to use pandas!

One downside of this book, however, is that it can get “into the weeds” with detail really quickly. I find that the beginnings of chapters are useful to read, and then I stop reading when it gets too complicated, perhaps coming back later when I have a specific issue.

So, we’re asking you to read just the beginning parts of two chapters.

  • Chapter 7: Read the beginning of the chapter and section 7.1, stopping at 7.2 “Data Transformation”.

  • Chapter 8: Skip section 8.1 (hierarchical indexing) and go to Section 8.2 (Combining and merging datasets). Just read the beginning of section 8.2, stopping when it starts talking about “many-to-many merges”.

Questions

  1. Did you read the required part from Chapter 7?

    (Answer yes or no)

  2. By default, what does Pandas do with missing data when you ask for a “descriptive statistic” such as the mean or median?

    a. An error is thrown if any values are missing
    b. Missing values are automatically treated as zeros
    c. Missing values are copied from the previous row in the dataframe
    d. Missing values are excluded from the computation

    (Enter just the letter of the correct choice.)

  3. Imagine you have a DataFrame df which contains some missing entries. Which of these Python commands would replace all missing entries with 42?

    a. df.set_missing(42)
    b. df.ffill(42)
    c. df.fillna(42)
    d. df.isnull() = 42

  4. Did you read the required part from Chapter 8?

    (Answer yes or no)

  5. Imagine you have two DataFrames about food.

    cost is:

            food  price
    0     burger      5
    1       tots      3
    2  hot sauce      1

    And nutrition is:

         food  calories
    0    tots       250
    1   fries       300
    2  burger       550
    3  nachos       650

    Which of the following commands will give us a DataFrame that looks like this, with the price and calories of only the foods where we have both pieces of information?

         food  price  calories
    0  burger      5       550
    1    tots      3       250

    a. cost + nutrition
    b. pd.merge(cost, nutrition, on='food')
    c. cost.join(nutrition, on='food')
    d. cost.combine(nutrition, how='outer')
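
If you want to try the candidate commands from the questions yourself rather than just guess, the sketch below sets up some data to experiment with. The cost and nutrition DataFrames copy the tables from the last question. The DataFrame df, with its column names a and b and its values, is made up here purely for illustration and does not come from the reading.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing entries (np.nan marks each gap);
# the column names and values are invented for experimenting.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

# Count how many entries are missing in each column
missing_per_column = df.isna().sum()
print(df)
print(missing_per_column)

# The two food DataFrames, copied from the tables in the question
cost = pd.DataFrame({
    "food": ["burger", "tots", "hot sauce"],
    "price": [5, 3, 1],
})
nutrition = pd.DataFrame({
    "food": ["tots", "fries", "burger", "nachos"],
    "calories": [250, 300, 550, 650],
})
print(cost)
print(nutrition)
```

After running this in a notebook or Python session, you can type each candidate command and compare what it produces (or whether it runs at all) with what the question asks for.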

Submit command

To submit files for this homework, run one of these commands:

submit -c=sd212 -p=hw18 hw18.md
club -csd212 -phw18 hw18.md

Download the file hw18.md to fill in and submit for this homework.