This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 6: Names names names

1 Overview

Today’s lab is all about names — specifically, baby names! The U.S. government has kept pretty good statistics about what first names people have been giving their babies for more than 100 years, so it’s a fun data source to get into.

Besides continuing to exercise and develop your general comfort and confidence handling real data sources with Python and pandas, you will also get good practice with parallel programming in today’s lab, like in the recent unit from class.

As usual, you will fill in a markdown file answering specific questions as you go, as well as turn in your working code. This lab has multiple parts, so be sure to pace yourself and reach out for help if you get stuck.

1.1 Deadlines

  • Milestone: 2359 on Tuesday, 18 April
  • Complete lab: 2359 on Sunday, 23 April

1.2 Learning goals

  • Gain independence and practice with data download and wrangling using Python, pandas, and numpy
  • Write multithreaded code to handle IO-bound workloads
  • Write multiprocessing code to handle CPU-bound workloads

2 Data acquisition

2.1 Markdown file to fill in

Here is the file with questions to fill in and submit for today’s lab: lab06.md

You can run this wget command to download the blank md file directly from the command line:

wget "https://roche.work/courses/s23sd212/lab/md/lab06.md"

2.2 Baby names

The U.S. Social Security Administration runs a project that collects baby name statistics going back to 1910 on a state-by-state basis.

Go to this page and download the zip file marked for “state-specific data”.

That zip file has a bunch of txt files that will clutter up your directory. Instead do the following so that your txt files are in their own data subdirectory:

  • Create a new directory for this lab, like sd212/lab06
  • Make a subdirectory there called data
  • Extract the zip file in the sd212/lab06/data directory

Use your command line skills to look around at the data files and understand the general layout. Look at the documentation included within the zip file or on the website so you understand what the numbers mean.

Pick a (first) name, birth year, gender, and state that will count as “your name”, “your state”, “your birth gender”, etc., for the purposes of this lab.

(It can be your actual name, but doesn’t need to be. It just needs to be in the dataset! Do something like

grep 'Daniel,' data/de.txt

and make sure you get at least a dozen or so entries for your chosen name and state.)

Answer a few questions:

  1. What will be “your” name, birth year, gender, and state, for the purposes of this lab?

  2. Find yourself in the dataset. How many people with your name (and any gender) were born in your state in your birth year?

  3. Which year had the most births with “your” name in your state?

  4. Which state had the most births with your name in your birth year?

2.3 U.S. regions by state

Go to this page to find a nice table which shows all the U.S. states organized into six regions.

Download the table and save it as a plain-text file. Then use Python and/or bash to convert it to a nice CSV file that is comma-separated with a header row, etc. This should not be difficult! Save your file as regions.csv.
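As a rough illustration of the conversion step, here is a minimal sketch using Python's csv module. It assumes the saved plain-text table has one state per line with tab-separated fields; that layout, the helper names table_to_csv and write_csv, and the header names are all hypothetical, so adjust them to match what your saved file actually looks like.

```python
import csv


def table_to_csv(lines, header):
    """Turn tab-separated text lines into a list of rows under a header row.

    Assumption: fields are separated by tabs. Blank lines are skipped.
    """
    rows = [header]
    for line in lines:
        fields = [field.strip() for field in line.split("\t")]
        if any(fields):
            rows.append(fields)
    return rows


def write_csv(rows, out_path):
    """Write the rows as a comma-separated file."""
    with open(out_path, "w", newline="") as out:
        csv.writer(out).writerows(rows)


# Hypothetical usage, assuming the table was saved as regions.txt:
# with open("regions.txt") as src:
#     write_csv(table_to_csv(src, ["State Name", "state", "Region"]), "regions.csv")
```

Whatever header names you choose, you will need a column of state abbreviations later, since that is what the baby name files use to identify each state.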

  1. Which region is “your” state in, and how many states in total are part of that region?

2.4 Submit

Get the ball rolling with your initial submission:

submit -c=sd212 -p=lab06 lab06.md regions.csv

or

club -csd212 -plab06 lab06.md regions.csv

or use the web interface

3 Get the data into pandas

Your first substantial goal is to gather all the data from the data directory and regions.csv into a single pandas dataframe that looks something like this:

        state gender  year      name  count  State Name Region
0          AK      F  1910      Mary     14      Alaska   West
1          AK      F  1910     Annie     12      Alaska   West
2          AK      F  1910      Anna     10      Alaska   West
3          AK      F  1910  Margaret      8      Alaska   West
4          AK      F  1910     Helen      7      Alaska   West
...       ...    ...   ...       ...    ...         ...    ...
6254234    CA      M  2021      Zyan      5  California   West
6254235    CA      M  2021     Zyion      5  California   West
6254236    CA      M  2021     Zyire      5  California   West
6254237    CA      M  2021      Zylo      5  California   West
6254238    CA      M  2021     Zyrus      5  California   West

[6254239 rows x 7 columns]

Create a program ingest.py that reads the .txt files from the data/ subdirectory, as well as the regions.csv file, and creates a single DataFrame with columns like shown above (perhaps in a different order).

Put your code in a function get_names() that takes no arguments and just returns the dataframe. Design it well so that if I create a separate file and do something like

from ingest import get_names

names = get_names()
print(names[names['year'] == 1985].sort_values(by=['count']).iloc[-20:])

then it should work and print the 20 rows showing the most popular baby names in 1985.

Your get_names() function must use multiprocessing to read each file in parallel before combining them into a single dataframe.

I’m not going to tell you exactly how to do this! This kind of “data ingest” can feel tedious but it’s something you should be well prepared to handle by now. Here are some hints to get going in the right direction:

  • You’ll need to iterate over all the txt files in the data directory. Look back at the credit cards lab where we saw how to use the pathlib module to do something similar.

  • For each file, you will want to create a new Process to read in just that file into a new dataframe, and send that dataframe back to the parent process. Look back at your notes from class on how to do that.

  • Note that these files don’t have a header line so you’ll need to specify the headers yourself when you call read_csv.

  • Collect all the individual dataframes into a list, and then use pd.concat to combine them into one big dataframe

  • Separately, read in the regions.csv file you created earlier and use pd.merge to get the final big result as shown above.

    (Your rows will probably be in a different order every time you run it (do you know why?), but check the number of rows and the column headers to see that you have it working correctly.)
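To make the overall shape of these hints concrete, here is one possible sketch. Note that it swaps in multiprocessing.Pool.map rather than creating one Process per file as the hints describe, so treat it as an alternative illustration and not the approach from class. It also assumes that regions.csv has a column named state holding the two-letter abbreviation; the helper names read_state_file and add_regions are made up for this example.

```python
from multiprocessing import Pool
from pathlib import Path

import pandas as pd

# Column names for the headerless state files, matching the table above.
COLUMNS = ["state", "gender", "year", "name", "count"]


def read_state_file(path):
    """Read one headerless state file into a DataFrame."""
    return pd.read_csv(path, names=COLUMNS)


def add_regions(names, regions):
    """Attach the region columns. Assumes both frames have a 'state' column."""
    return pd.merge(names, regions, on="state")


def get_names(data_dir="data", regions_file="regions.csv"):
    """Read every .txt file in data_dir in parallel and merge in the regions."""
    # Adjust the pattern if your extracted files use a different extension.
    paths = sorted(Path(data_dir).glob("*.txt"))
    with Pool() as pool:
        # Each worker process reads one file and sends its DataFrame back.
        frames = pool.map(read_state_file, paths)
    names = pd.concat(frames, ignore_index=True)
    return add_regions(names, pd.read_csv(regions_file))
```

Because get_names() starts worker processes, any script that calls it should do so under an `if __name__ == "__main__":` guard so that it behaves the same regardless of how the operating system starts new processes.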

3.1 Submit

Save your files and submit everything:

submit -c=sd212 -p=lab06 lab06.md regions.csv ingest.py

or

club -csd212 -plab06 lab06.md regions.csv ingest.py

or use the web interface

3.2 Milestone

For this lab, the milestone means everything up to this point.

Remember: this milestone is not necessarily the half-way point. Keep going!

4 Who am I?

Now let’s do some data analysis! The goal of this part is to create a program who.py that uses demographic data to make a “guess” of a person’s gender, region, and age based on their first name only.

Here are some example runs:

roche@ubuntu$ python3 who.py
Name: Karen
Karen is most likely Female from the Midwest between 56 and 71 years old
roche@ubuntu$ python3 who.py
Name: Lebron
Lebron is most likely Male from the Southeast between 11 and 18 years old

There is a lot that goes into making this guess! Let’s break it down. I strongly suggest you work carefully through the gender and region first before thinking about the age (which is more difficult).

4.1 Read the name

This part seems simple — you just need to make an input() call to get the name that the user types in.

But the twist is that you need to do this in a multithreaded way. Specifically, your program should have one thread reading in the name from the terminal, while a second thread is simultaneously calling your get_names() function from the previous part to read in all the data files.

Why does this make sense? We know that I/O is slow, and the very slowest kind of I/O is one where the computer has to wait for a human to type something in! By using two threads (one of which will be launching a bunch of processes within your get_names() function!), the application will feel more “snappy” because your code is doing a lot of prep work in the background while waiting for the human to type something in.

For an initial version, just read in the name using input() in one thread, while calling get_names() in the other thread, and do a single print statement at the end that uses them both, like

print(name, df) # where name is the string you read in and df is a DataFrame

If you did it correctly, then if you take your sweet time to type in the name, the dataframe should print out immediately when you hit enter. Try it!
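One way to picture the two-thread arrangement is the sketch below, which uses only the standard threading module. The helper name read_and_load is hypothetical, and here the main thread does the reading while a single background thread does the loading; other arrangements (such as two explicit threads) would work just as well.

```python
import threading


def read_and_load(read_name, load_data):
    """Call load_data() in a background thread while read_name() runs here.

    Returns (name, data) once both calls have finished.
    """
    results = {}

    def worker():
        # Runs in the background thread and stashes its result.
        results["data"] = load_data()

    loader = threading.Thread(target=worker)
    loader.start()       # the data load begins right away
    name = read_name()   # meanwhile, wait on the (slow) human
    loader.join()        # make sure the load is done before using it
    return name, results["data"]


# Hypothetical usage in who.py:
# name, df = read_and_load(lambda: input("Name: "), get_names)
# print(name, df)
```

Passing the two jobs in as functions keeps the threading logic separate from the details of input() and get_names(), which also makes it easy to try out with stand-in functions first.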

4.2 Gender

Now let’s try and guess the birth gender based on the name.

To start out with, you have a massive pandas dataframe with each state/year/gender/name combination listed separately.

You will want to:

  • Select only the rows of the dataframe that match the given name
  • Group those rows according to the gender column
  • Add up the counts for each gender
  • Sort so that you can extract the gender with the larger count.

These are all the same kinds of things we have seen before with pandas at various times.

The trickiest step is probably the grouping and adding up by group. Rather than telling you how to do that myself, I just searched the web for how to add one column according to another column and found this short and sweet StackOverflow answer. Go read it! StackOverflow is a great resource and definitely OK to use as long as you add a short comment with # to give the citation of where you got your information.
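For reference, the select, group, add up, and sort steps listed above might be chained together roughly as follows. This is only a sketch on a tiny made-up dataframe (the counts are invented for illustration), and the function name most_likely_gender is hypothetical.

```python
import pandas as pd


def most_likely_gender(names, name):
    """Return the gender label with the largest total count for `name`.

    Raises IndexError if the name does not appear in the data at all.
    """
    matches = names[names["name"] == name]             # select matching rows
    totals = matches.groupby("gender")["count"].sum()  # group and add up
    return totals.sort_values(ascending=False).index[0]  # sort, take the top


# Tiny made-up frame in the same layout as the full dataset.
toy = pd.DataFrame({
    "state": ["AK", "AK", "CA", "CA"],
    "gender": ["F", "M", "F", "M"],
    "year": [1990, 1990, 1991, 1991],
    "name": ["Riley", "Riley", "Riley", "Mary"],
    "count": [7, 9, 6, 5],
})
print(most_likely_gender(toy, "Riley"))  # prints F (13 vs 9)
```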

After this, try modifying your who.py program so that it asks for a name and then gives just a gender prediction.

  1. Which of the following names have (slightly!) more female births?

    a. Kodi
    b. Tristyn
    c. Kimani
    d. Kris
    e. Arin
    f. Briar
    g. Daylin

    (Enter just the letters of your choices.)

4.3 Region

Figuring out the most likely region will be very similar to discovering the most likely gender, but be careful that you first filter down to the most likely gender before finding the most likely region.

For example, the name “Avery” is overall most popular in the Southeast. But this name is more commonly applied to Females overall, and among female babies, “Avery” occurred more frequently in the Midwest. So for that example, you would want your program to predict Female from the Midwest.
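The sketch below shows why the order of the two steps matters, using invented counts that mimic the “Avery” situation (these are not the real numbers). The helper names top_category and gender_then_region are hypothetical, and the column names follow the dataframe layout shown earlier.

```python
import pandas as pd


def top_category(frame, column):
    """Return the value of `column` with the largest total count."""
    return frame.groupby(column)["count"].sum().idxmax()


def gender_then_region(names, name):
    """Pick the most likely gender first, then the top region within it."""
    matches = names[names["name"] == name]
    gender = top_category(matches, "gender")
    same_gender = matches[matches["gender"] == gender]
    return gender, top_category(same_gender, "Region")


# Made-up counts: the Southeast wins overall, but not among female births.
toy = pd.DataFrame({
    "name": ["Avery"] * 4,
    "gender": ["F", "F", "M", "M"],
    "Region": ["Midwest", "Southeast", "Southeast", "Midwest"],
    "count": [60, 50, 40, 10],
})
print(top_category(toy, "Region"))       # Southeast (90 vs 70 overall)
print(gender_then_region(toy, "Avery"))  # ('F', 'Midwest')
```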

Get your who.py program working to give gender and region predictions.

  1. What region is “Oleg” most likely from?

4.4 Age range

Now you are ready for the toughest part, calculating the age range.

The goal here is to find the smallest age range that covers at least 51% of the births for the given name and calculated gender and region.

A few details on what we are looking for here:

  • Assume no one ever dies. (So for example, we’ll suppose rather optimistically that all 80 Helens born in Wyoming in 1924 are still alive.)

  • Don’t worry about birthdays; estimate age as simply (2023 - birthyear). Those 80 Wyoming Helens from 1924 are all 99 years old.

  • By “smallest age range”, we mean the smallest span of years to cover at least 51% of the total.

  • If there are multiple possibilities with the same smallest span, return the one for the youngest people.

    In the “Karen” example above, the shortest span is 15 years. Both (ages 63 to 78) and (ages 56 to 71) pass the 51% mark, but your program should print ages 56 to 71 since that’s the youngest/most recent.

    (The reason to prefer younger ages here is because we aren’t really properly accounting for deaths.)

The approach I recommend to figure this out in code is something like this:

  • Use pandas to isolate the series that you want: the birth counts for the given name, gender, and region, organized by year.

  • First count the total number of births for the given name, gender, and region, as a single number.

  • Multiply that by 51 percent to get your target number

  • Now make a loop to try out each starting year between the earliest year in the data set and the current year.

  • Within this outer loop, make an inner loop for the ending year, from that starting year until the current year

  • Within the inner loop, actually add up all the births in that (starting,ending) year range.

  • If the total matches or exceeds your target number, break the inner loop and print out the (starting,ending) year range as a potential answer.

  • Once that works, use a few variables to just keep track of the shortest (starting,ending) year range rather than printing it out.

  • Finally, convert the best (starting,ending) year pair to ages and pat yourself on the back. You got it!
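The nested-loop search described above can be sketched in plain Python as follows. To keep the example self-contained it takes a dictionary mapping birth year to births instead of a pandas series, which is an assumption of this sketch; the function name smallest_age_range and its parameters are hypothetical too.

```python
def smallest_age_range(births_by_year, current_year=2023, share=0.51):
    """Find the narrowest span of years covering at least `share` of births.

    Ties go to the most recent span (the youngest people).
    Returns (youngest_age, oldest_age), or None if there is nothing to search.
    """
    total = sum(births_by_year.values())
    if total == 0:
        return None
    target = share * total
    best = None  # best (start_year, end_year) pair found so far
    for start in range(min(births_by_year), current_year + 1):
        running = 0
        for end in range(start, current_year + 1):
            running += births_by_year.get(end, 0)  # missing years count as 0
            if running >= target:
                # "<=" lets a later start replace an equally short span.
                if best is None or end - start <= best[1] - best[0]:
                    best = (start, end)
                break
    if best is None:
        return None
    start, end = best
    return current_year - end, current_year - start


print(smallest_age_range({2000: 10, 2001: 80, 2002: 10}))  # (22, 22)
```

Keeping a running total inside the inner loop avoids re-adding the whole range for every ending year, but the simpler version that sums each range from scratch gives the same answer.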

4.5 Check yourself

Be sure to test your who.py program on plenty of names. Some less popular names have missing entries for various states and years, maybe even for entire regions — and your code should handle that!

Double-check that the formatting of what your program prints is also accurate.

And of course you should answer some questions:

  1. What is the most likely gender, region, and age range for “your” name (for the purposes of this lab)?

  2. If we released this tool on a website for public use, what potential ethical issues can you imagine might arise? Are there any groups or individuals who could be harmed by the use of this tool? Do you have any ideas on how to try and reduce that potential harm?

4.6 Submit

Save your files and submit everything so far:

submit -c=sd212 -p=lab06 lab06.md regions.csv ingest.py who.py

or

club -csd212 -plab06 lab06.md regions.csv ingest.py who.py

or use the web interface

5 Visualize

You probably knew this was coming! So far your who.py program is really cool in its attempt to predict gender, region, and age, but it is only giving a small glimpse of the overall data.

For example, “Riley” and “Maria” are both most likely to be female names, but the female/male ratio for “Riley” is close to even, whereas for “Maria” it’s closer to 100:1.

Similarly, “Elijah” and “James” are both most likely to be born in the Southeast, but James is more evenly distributed around the country, whereas the prevalence of Elijah in the southeast is almost double what it is in any other region.

Your task is to write a program whoviz.py which reads in a name (like before) and creates a visualization that gives a richer picture than just the most likely categories.

It is open-ended exactly what this should look like, so be creative and come up with something cool! It’s okay if your visualization only covers some part of the information about that name (gender, location, and age range), but it would be really impressive if you could incorporate multiple aspects together!

Either way, your visualization must give a clear, easy to understand picture of the full distribution of that name along one or more categories.

Some ideas of what you could try:

  • Shading a map of the 50 U.S. states according to name distribution
  • A bar graph showing the relative frequency of the name among different age groups or generations (Silent gen, Boomers, Gen X, Millennials, Gen Z, etc)
  • Maybe use shading to show the male/female ratio for a name

Submit your code as whoviz.py which should read in a name from the user, then generate and pop up a visualization.

Run your own code for “your” name and save that image as me.png to submit as well.

  1. Describe briefly what your visualization is showing.

  2. Tell me another name I should try where your visualization is interesting or nice-looking in some way. (And explain your choice briefly.)

5.1 Submit

Save your files and submit everything so far:

submit -c=sd212 -p=lab06 lab06.md regions.csv ingest.py who.py whoviz.py me.png

or

club -csd212 -plab06 lab06.md regions.csv ingest.py who.py whoviz.py me.png

or use the web interface