Lab 6: Names names names
1 Overview
Today’s lab is all about names — specifically, baby names! The U.S. government has kept pretty good statistics about what first names people have been giving their babies for more than 100 years, so it’s a fun data source to get into.
Besides continuing to exercise and develop your general comfort and confidence handling real data sources with Python and pandas, you will also get good practice with parallel programming in today’s lab, like in the recent unit from class.
As usual, you will fill in a markdown file answering specific questions as you go, as well as turn in your working code. This lab has multiple parts, so be sure to pace yourself and reach out for help if you get stuck.
1.1 Deadlines
- Milestone: 2359 on Tuesday, 18 April
- Complete lab: 2359 on Sunday, 23 April
1.2 Learning goals
- Gain independence and practice with data download and wrangling using Python, pandas, and numpy
- Write multithreaded code to handle IO-bound workloads
- Write multiprocessing code to handle CPU-bound workloads
2 Data acquisition
2.1 Markdown file to fill in
Here is the file with questions to fill in and submit for today’s lab: lab06.md
You can run this wget command to download the blank md file directly from the command line:
wget "https://roche.work/courses/s23sd212/lab/md/lab06.md"
2.2 Baby names
The U.S. Social Security Administration runs a project that collects baby name statistics going back to 1910 on a state by state basis.
Go to this page and download the zip file marked for “state-specific data”.
That zip file has a bunch of txt files that will clutter up your directory. Instead do the following so that your txt files are in their own data subdirectory:
- Create a new directory for this lab, like sd212/lab06
- Make a subdirectory there called data
- Extract the zip file in the sd212/lab06/data directory
Use your command line skills to look around at the data files and understand the general layout. Look at the documentation included within the zip file or on the website so you understand what the numbers mean.
Pick a (first) name, birth year, gender, and state that will count as “your name”, “your state”, “your birth gender”, etc., for the purposes of this lab.
(It can be your actual name, but doesn’t need to be. It just needs to be in the dataset! Do something like
grep 'Daniel,' data/de.txt
and make sure you get at least a dozen or so entries for your chosen name and state.)
Answer a few questions:
What will be “your” name, birth year, gender, and state, for the purposes of this lab?
Find yourself in the dataset. How many people with your name (and any gender) were born in your state in your birth year?
Which year had the most births with “your” name in your state?
Which state had the largest number of births with your name in your birth year?
2.3 U.S. regions by state
Go to this page to find a nice table which shows all the U.S. states organized into six regions.
Download the table and save it as a plain-text file. Then use Python and/or bash to convert it to a nice CSV file that is comma-separated with a header row, etc. This should not be difficult! Save your file as regions.csv.
- Which region is “your” state in, and how many states in total are part of that region?
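One possible way to do the conversion in Python is sketched below. It assumes the table was saved with one state per line and tab characters between the columns, which may not match the file you actually end up with. The function name table_text_to_csv and the header names in the usage comment are made up for illustration.

```python
import csv
import io


def table_text_to_csv(text, header):
    """Turn tab-separated lines into CSV text with a header row.

    Assumption: every non-blank line of `text` is one table row whose
    columns are separated by tab characters.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if any(fields):  # skip blank lines
            writer.writerow(fields)
    return out.getvalue()


# Hypothetical usage, with an assumed column order (name, code, region):
# with open("regions.txt") as src, open("regions.csv", "w") as dst:
#     dst.write(table_text_to_csv(src.read(), ["State Name", "state", "Region"]))
```

Whatever header names you pick, remember them: you will need the state-code column later to match these rows against the baby name data.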
2.4 Submit
Get the ball rolling with your initial submission:
submit -c=sd212 -p=lab06 lab06.md regions.csv
or
club -csd212 -plab06 lab06.md regions.csv
or use the web interface
3 Get the data into pandas
Your first substantial goal is to gather all the data from the data directory and regions.csv into a single Pandas dataframe that looks something like this:
state gender year name count State Name Region
0 AK F 1910 Mary 14 Alaska West
1 AK F 1910 Annie 12 Alaska West
2 AK F 1910 Anna 10 Alaska West
3 AK F 1910 Margaret 8 Alaska West
4 AK F 1910 Helen 7 Alaska West
... ... ... ... ... ... ... ...
6254234 CA M 2021 Zyan 5 California West
6254235 CA M 2021 Zyion 5 California West
6254236 CA M 2021 Zyire 5 California West
6254237 CA M 2021 Zylo 5 California West
6254238 CA M 2021 Zyrus 5 California West
[6254239 rows x 7 columns]
Create a program ingest.py that reads the .txt files from the data/ subdirectory, as well as the regions.csv file, and creates a single DataFrame with columns like those shown above (perhaps in a different order).
Put your code in a function get_names() that takes no arguments and just returns the dataframe. Design it well so that if I create a separate file and do something like
from ingest import get_names
names = get_names()
print(names[names['year'] == 1985].sort_values(by=['count']).iloc[-20:])
then it should work and print the 20 rows showing the most popular baby names in 1985.
Your get_names() function must use multiprocessing to read each file in parallel before combining them into a single dataframe.
I’m not going to tell you exactly how to do this! This kind of “data ingest” can feel tedious but it’s something you should be well prepared to handle by now. Here are some hints to get going in the right direction:
- You’ll need to iterate over all the txt files in the data directory. Look back at the credit cards lab where we saw how to use the pathlib module to do something similar.
- For each file, you will want to create a new Process to read in just that file into a new dataframe, and send that dataframe back to the parent process. Look back at your notes from class on how to do that.
- Note that these files don’t have a header line so you’ll need to specify the headers yourself when you call read_csv.
- Collect all the individual dataframes into a list, and then use pd.concat to combine them into one big dataframe.
- Separately, read in the regions.csv file you created earlier and use pd.merge to get the final big result as shown above.
(Your rows will probably be in a different order every time you run it (do you know why?), but check the number of rows and the column headers to see that you have it working correctly.)
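The hints above could fit together roughly as follows. This is only a sketch under some assumptions: it uses multiprocessing.Pool as a shortcut instead of creating each Process by hand (so it may differ from what your class notes show), the helper names read_one, read_all, and add_regions are invented here, and it assumes your regions.csv has a column named state holding the two-letter code.

```python
from multiprocessing import Pool
from pathlib import Path

import pandas as pd

# Column names chosen to match the example dataframe shown earlier.
COLUMNS = ["state", "gender", "year", "name", "count"]


def read_one(path):
    """Read a single headerless state file into a dataframe."""
    return pd.read_csv(path, names=COLUMNS)


def read_all(data_dir="data", workers=None):
    """Read every .txt file under data_dir in parallel, then combine them."""
    paths = sorted(Path(data_dir).glob("*.txt"))
    with Pool(workers) as pool:
        # Each worker process parses some of the files and sends the
        # resulting dataframes back to the parent process.
        frames = pool.map(read_one, paths)
    return pd.concat(frames, ignore_index=True)


def add_regions(names, regions):
    """Attach region columns, assuming both frames share a 'state' column."""
    return names.merge(regions, on="state", how="left")
```

On systems that start worker processes by launching a fresh Python interpreter, a call to read_all from a script should sit under an if __name__ == "__main__": guard so the workers do not re-run it when they import the file.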
3.1 Submit
Save your files and submit everything:
submit -c=sd212 -p=lab06 lab06.md regions.csv ingest.py
or
club -csd212 -plab06 lab06.md regions.csv ingest.py
or use the web interface
3.2 Milestone
For this lab, the milestone means everything up to this point.
Remember: this milestone is not necessarily the half-way point. Keep going!
4 Who am I?
Now let’s do some data analysis! The goal of this part is to create
a program who.py
that uses demographic data to make a “guess” of a
person’s gender, region, and age based on their first name only.
Here are some example runs:
roche@ubuntu$ python3 who.py
Name: Karen
Karen is most likely Female from the Midwest between 56 and 71 years old
roche@ubuntu$ python3 who.py
Name: Lebron
Lebron is most likely Male from the Southeast between 11 and 18 years old
There is a lot that goes into making this guess! Let’s break it down. I strongly suggest you work carefully through the gender and region first before thinking about the age (which is more difficult).
4.1 Read the name
This part seems simple: you just need to make an input() call to get the name that the user types in.
But the twist is that you need to do this in a multithreaded way. Specifically, your program should have one thread reading in the name from the terminal, while a second thread simultaneously is calling your get_names() function from the previous part to read in all the data files.
Why does this make sense? We know that I/O is slow, and the very slowest kind of I/O is one where the computer has to wait for a human to type something in! By using two threads (one of which will be launching a bunch of processes within your get_names() function!), the application will feel more “snappy” because your code is doing a lot of prep work in the background while waiting for the human to type something in.
For an initial version, just read in the name using input() in one thread, while calling get_names() in the other thread, and do a single print statement at the end that uses them both, like
print(name, df) # where name is the string you read in and df is a DataFrame
If you did it correctly, then if you take your sweet time to type in the name, the dataframe should print out immediately when you hit enter. Try it!
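Here is a minimal sketch of the two-thread idea, using only the standard threading module. The helper names start_in_thread and read_name_while_loading are hypothetical, and the loader is passed in as a parameter so that the sketch does not depend on your ingest.py.

```python
import threading


def start_in_thread(func):
    """Start func() in a new thread.

    Returns (thread, box); box["value"] holds func's return value once
    the thread has finished.
    """
    box = {}

    def worker():
        box["value"] = func()

    thread = threading.Thread(target=worker)
    thread.start()
    return thread, box


def read_name_while_loading(load_data, read_name=lambda: input("Name: ")):
    """Run the slow data load and the (even slower) human input together."""
    loader, loaded = start_in_thread(load_data)
    reader, typed = start_in_thread(read_name)
    reader.join()  # wait for the human
    loader.join()  # the data is usually ready by now
    return typed["value"], loaded["value"]


# Hypothetical usage inside who.py:
# from ingest import get_names
# name, df = read_name_while_loading(get_names)
# print(name, df)
```

Storing each result in a small dictionary is just one way to get a value back out of a thread; your class notes may show a different one.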
4.2 Gender
Now let’s try and guess the birth gender based on the name.
To start out with, you have a massive Pandas dataframe with each state/year/gender/name combination listed separately.
You will want to:
- Select only the rows of the dataframe that match the given name
- Group those rows according to the gender column
- Add up the counts for each gender
- Sort so that you can extract the gender with the larger count.
These are all the same kinds of things we have seen before with pandas at various times.
The trickiest step is probably the grouping and adding up by group. Rather than telling you how to do that myself, I just searched the web for how to add one column according to another column and found this short and sweet StackOverflow answer.
Go read it! StackOverflow is a great resource and definitely OK to use as long as you add a short comment with # to give the citation of where you got your information.
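To make the four steps concrete, here is a small sketch on a made-up toy dataframe (the counts are invented, not taken from the real data). The function name likely_gender is hypothetical, and the column names are assumed to match the ones from the ingest part.

```python
import pandas as pd

# Toy data with invented counts, only to show the mechanics.
toy = pd.DataFrame({
    "name":   ["Kris", "Kris", "Kris", "Mary"],
    "gender": ["F",    "M",    "M",    "F"],
    "count":  [10,     7,      6,      50],
})


def likely_gender(df, name):
    """Return the gender with the larger total count for the given name."""
    rows = df[df["name"] == name]                   # select matching rows
    totals = rows.groupby("gender")["count"].sum()  # group and add up
    return totals.sort_values().index[-1]           # largest total is last


print(likely_gender(toy, "Kris"))
```

Note that this sketch raises an IndexError for a name that does not appear at all, which is something your real program would need to think about.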
After this, try modifying your who.py program so that it asks for a name and then gives just a gender prediction.
Which of the following names have (slightly!) more female births?
a. Kodi
b. Tristyn
c. Kimani
d. Kris
e. Arin
f. Briar
g. Daylin
(Enter just the letters of your choices.)
4.3 Region
Figuring out the most likely region will be very similar to discovering the most likely gender, but be careful that you first filter down to the most likely gender before finding the most likely region.
For example, the name “Avery” is overall most popular in the Southeast. But this name is more commonly applied to Females overall, and among female babies, “Avery” occurred more frequently in the Midwest. So for that example, you would want your program to predict Female from the Midwest.
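The difference between the two orders of filtering can be seen in a small sketch. The counts below are invented purely to reproduce the “Avery” effect described above, the function name likely_gender_and_region is hypothetical, and the Region column name is assumed to match the merged dataframe from the ingest part.

```python
import pandas as pd

# Invented counts: Southeast wins overall, Midwest wins among females.
toy = pd.DataFrame({
    "name":   ["Avery"] * 4,
    "gender": ["F", "F", "M", "M"],
    "Region": ["Midwest", "Southeast", "Southeast", "Midwest"],
    "count":  [30, 25, 20, 5],
})


def likely_gender_and_region(df, name):
    """Pick the most common gender first, then the top region within it."""
    rows = df[df["name"] == name]
    gender = rows.groupby("gender")["count"].sum().idxmax()
    same_gender = rows[rows["gender"] == gender]
    region = same_gender.groupby("Region")["count"].sum().idxmax()
    return gender, region


overall = toy.groupby("Region")["count"].sum().idxmax()
print(overall)                                 # region ignoring gender
print(likely_gender_and_region(toy, "Avery"))  # gender first, then region
```

This version uses idxmax instead of sorting; either approach should give the same answer when there are no ties.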
Get your who.py program working to give gender and region predictions.
- What region is “Oleg” most likely from?
4.4 Age range
Now you are ready for the toughest part, calculating the age range.
The goal here is to find the smallest age range that covers at least 51% of the births for the given name and calculated gender and region.
A few details on what we are looking for here:
Assume no one ever dies. (So for example, we’ll suppose rather optimistically that all 80 Helens born in Wyoming in 1924 are still alive.)
Don’t worry about birthdays; estimate age as simply (2023 - birthyear). Those 80 Wyoming Helens from 1924 are all 99 years old.
By “smallest age range”, we mean the smallest span of years to cover at least 51% of the total.
If there are multiple possibilities with the same smallest span, return the one for the youngest people.
In the “Karen” example above, the shortest span is 15 years. Both (ages 63 to 78) and (ages 56 to 71) pass the 51% mark, but your program should print ages 56 to 71 since that’s the youngest/most recent.
(The reason to prefer younger ages here is because we aren’t really properly accounting for deaths.)
The approach I recommend to figure this out in code is something like this:
Use pandas to isolate the series that you want: the birth counts for the given name, gender, and region, organized by year.
First count the total number of births for the given name, gender, and region, as a single number.
Multiply that by 51 percent to get your target number.
Now make a loop to try out each starting year between the earliest year in the data set and the current year.
Within this outer loop, make an inner loop for the ending year, from that starting year until the current year
Within the inner loop, actually add up all the births in that (starting,ending) year range.
If the total matches or exceeds your target number, break the inner loop and print out the (starting,ending) year range as a potential answer.
Once that works, use a few variables to just keep track of the shortest (starting,ending) year range rather than printing it out.
Finally, convert the best (starting,ending) year pair to ages and pat yourself on the back. You got it!
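The recommended nested-loop approach might look roughly like the sketch below. It makes a few assumptions: the births are passed in as a plain dictionary from year to count rather than a pandas Series, the function name smallest_age_range is invented, and the inner loop keeps a running total instead of re-adding the whole range each time (which gives the same sums). The 51% check is done with whole numbers to avoid floating-point rounding.

```python
def smallest_age_range(births_by_year, current_year=2023, percent=51):
    """Return (youngest_age, oldest_age) for the shortest span of birth
    years covering at least `percent` of all births.

    births_by_year: dict mapping year -> number of births (non-empty).
    Ties on span length go to the most recent years (youngest people).
    """
    grand_total = sum(births_by_year.values())
    first_year = min(births_by_year)
    best = None  # (start_year, end_year) of the best range so far

    for start in range(first_year, current_year + 1):
        total = 0
        for end in range(start, current_year + 1):
            total += births_by_year.get(end, 0)  # missing years count as 0
            if 100 * total >= percent * grand_total:
                # start only increases, so "<=" lets a later (younger)
                # range replace an earlier one of the same length.
                if best is None or end - start <= best[1] - best[0]:
                    best = (start, end)
                break

    start, end = best
    return current_year - end, current_year - start


# Invented counts: 2001-2002 covers 80 of 100 births, so ages 21 to 22.
print(smallest_age_range({2000: 10, 2001: 40, 2002: 40, 2003: 10}))
```

If you work from a pandas Series instead, converting it with its to_dict() method would let you reuse the same loops.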
4.5 Check yourself
Be sure to test your who.py program on plenty of names. Some less popular names have missing entries for various states and years, maybe even for entire regions, and your code should handle that!
Double-check that the formatting of what your program prints is also accurate.
And of course you should answer some questions:
What is the most likely gender, region, and age range for “your” name (for the purposes of this lab)?
If we released this tool on a website for public use, what potential ethical issues can you imagine might arise? Are there any groups or individuals who could be harmed by the use of this tool? Do you have any ideas on how to try and reduce that potential harm?
4.6 Submit
Save your files and submit everything so far:
submit -c=sd212 -p=lab06 lab06.md regions.csv ingest.py who.py
or
club -csd212 -plab06 lab06.md regions.csv ingest.py who.py
or use the web interface
5 Visualize
You probably knew this was coming! So far your who.py program is really cool in its attempt to predict gender, region, and age, but it is only giving a small glimpse of the overall data.
For example, “Riley” and “Maria” are both most likely to be female names, but the female/male ratio for “Riley” is close to even whereas for “Maria” it’s closer to 100:1.
Similarly, “Elijah” and “James” are both most likely to be born in the Southeast, but James is more evenly distributed around the country, whereas the prevalence of Elijah in the southeast is almost double what it is in any other region.
Your task is to write a program whoviz.py which reads in a name (like before) and creates a visualization that gives a richer picture than just the most likely categories.
It is open-ended exactly what this should look like, so be creative and come up with something cool! It’s okay if your visualization only covers some part of the information about that name (gender, location, and age range), but it would be really impressive if you could incorporate multiple aspects together!
Either way, your visualization must give a clear, easy to understand picture of the full distribution of that name along one or more categories.
Some ideas of what you could try:
- Shading a map of the 50 U.S. states according to name distribution
- A bar graph showing the relative frequency of the name among different age groups or generations (Silent gen, Boomers, Gen X, Millennials, Gen Z, etc)
- Maybe use shading to show the male/female ratio for a name
Submit your code as whoviz.py which should read in a name from the user, then generate and pop up a visualization.
Run your own code for “your” name and save that image as me.png to submit as well.
Describe briefly what your visualization is showing.
Tell me another name I should try where your visualization is interesting or nice-looking in some way. (And explain your choice briefly.)
5.1 Submit
Save your files and submit everything so far:
submit -c=sd212 -p=lab06 lab06.md regions.csv ingest.py who.py whoviz.py me.png
or
club -csd212 -plab06 lab06.md regions.csv ingest.py who.py whoviz.py me.png
or use the web interface