SD 212 Spring 2023 / Notes


This is the archived website of SD 212 from the Spring 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 3: Statistical data types

1 Overview

In this short unit, we will spend a little time thinking about how to organize the kinds of data (or features within data) that might arise in the life of a data scientist. This organization can help us when approaching a new dataset and deciding what kinds of analysis or visualizations are most meaningful and appropriate.

This concept is not really covered in a single place in our textbooks, so we are providing some notes here to supplement your own notes from class.

2 Python type vs statistical type

When studying Python programming, we frequently think about the types of variables and values: str, int, float, bool, list, dict, and so on. This very much controls what kind of operations we are able to perform on a given value. For example, the meaning of

x[3]

is very different if x is a string, list, dict, or pandas dataframe, and it will be an error if x is a number.
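As a small illustration, here is how that same indexing expression behaves for a few built-in types. The sample values are made up for this sketch.

```python
# The same expression x[3] means something different for each Python type.
samples = {
    "str": "midshipman",                                 # character at position 3
    "list": [10, 20, 30, 40, 50],                        # element at position 3
    "dict": {3: "third company", 4: "fourth company"},   # 3 is a key, not a position
    "int": 17,                                           # numbers cannot be indexed
}

results = {}
for type_name, x in samples.items():
    try:
        results[type_name] = x[3]
    except TypeError as err:
        # Indexing a plain number raises a TypeError.
        results[type_name] = f"TypeError: {err}"

for type_name, value in results.items():
    print(f"{type_name:>4}: {value!r}")
```

The string gives back a single character, the list gives back its fourth element, the dict looks up the key 3, and the integer produces an error.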

While Python types govern how information is represented and what we can do with it in a Python program, we can also think about the type of information that is actually represented in the real world by that variable. These are sometimes called “statistical data types” or “feature types”.

For example, the Python string "April" could represent someone’s name, or it could represent the fourth month of the Gregorian calendar. How we treat that string in our data processing and analysis should be very different depending on this context.

As another example, the number 17 could mean a lot of things: it could be a company number in Bancroft Hall (which is useful for grouping but has no inherent numerical meaning), or it could refer to the year 2017 (or 1917?), or it could be an air temperature measurement. These distinctions are not captured by the Python type, but we need to be aware of them when performing data analysis and visualization.

3 Categorical, ordinal, continuous, discrete

There are many different ways to define a hierarchy of statistical data types, but most commonly there are two groups with two sub-groups each:

  • Numerical data (a.k.a. quantitative)

    This kind of data is for things that can be measured in some way, and where the number itself actually has meaning. Doing math with these kinds of values makes sense; for example, computing an average value.

    There are two sub-types:

    • Continuous data: When any real number makes sense as a fine-grained measurement. These often correspond to physical measurements like distance, weight, etc. Ratios and percentages also typically fit here.

    • Discrete-valued data: When the only values that make sense are integers. For example, a count of how many people attended a concert would be a discrete-valued numerical value. It’s certainly measuring something where an average or total could make sense, but you can’t have 3.25 people.

    A special case here is time series data (i.e., datetimes). These are always numerical and typically continuous. If the data only specifies the day or year, it may be discrete rather than continuous.

    (Interestingly, the internal representation of a datetime is typically something like the number of seconds elapsed since January 1 1970, which makes it clear that this is a numerical quantity like any other.)

  • Categorical data (a.k.a. qualitative)

    This is the type of data which is not numerical, or where the number doesn’t have any meaning as a number. Typical examples would be things like names, or colors, or cities.

    Note that data can be categorical even if it’s represented by a number; for example a phone number, zip code, or company number. Again, it can be helpful to think about whether math makes sense: if MIDN X has alpha 268136 and MIDN Y has alpha 259454, does the average of these numbers mean anything at all? Of course not, because alpha numbers are categorical and not numeric.

    Again, there are two sub-types:

    • Ordinal data: When the values have some natural ranking or ordering. A good example would be class rank.

    • Nominal data: The values are just values, with no inherent ranking or relationship. Names of people, places, or things certainly fit here, as do many other groupings like countries, species, colors, etc.

    A special case of categorical data is a unique identifier, which is some feature in a dataset which by definition will be different for every entry in the dataset. Midshipman alphas would be a classic example, or account numbers in a bank, or usernames on a website, or city names within a single state. Notice that unique identifiers are sometimes numbers and sometimes strings, but are (almost) never numeric because they aren’t measuring anything.
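To make the list above a little more concrete, here is a minimal sketch of how some of these statistical types could be represented in Python. The choice of pandas categorical dtypes is just one option, and the shirt sizes and colors are hypothetical values invented for the example.

```python
from datetime import datetime, timezone

import pandas as pd

# Ordinal: an ordered categorical records the ranking of the labels,
# but says nothing about how far apart they are.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "medium"], dtype=size_type)
largest = sizes.max()                        # uses the declared order, not the alphabet
bigger_than_small = (sizes > "small").sum()  # comparisons respect the order too

# Nominal: a plain (unordered) categorical has labels but no ranking.
colors = pd.Series(["red", "blue", "red"], dtype="category")

# Datetime: one day after the Unix epoch is 86400 seconds.
one_day = datetime(1970, 1, 2, tzinfo=timezone.utc).timestamp()

print(largest, bigger_than_small, colors.cat.ordered, one_day)
```

Notice that the largest size is "large" even though "small" would come last alphabetically; the ordering comes from what the data means, not from the strings themselves.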

4 Recognizing statistical types

We can get some help identifying statistical types by looking at the data. Are all the entries numbers? Are some of them decimals, or only integers? If the data is strings, are they all the same, or do they divide into a small number of groups?

Some of these questions can be aided by the tools you know. For example, determining how many distinct values exist and how many times they are repeated can be accomplished on the command line with a pipeline ending with sort | uniq -c, and you can do the same on a Pandas series by calling .value_counts().
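For instance, a quick look at a column might go something like this. The series of company numbers is a made-up example, and the checks shown are only a starting point, not a way to decide the statistical type automatically.

```python
import pandas as pd

# Hypothetical column of company numbers, stored as integers.
companies = pd.Series([17, 3, 17, 22, 3, 17])

counts = companies.value_counts()   # each distinct value with how often it appears
n_distinct = companies.nunique()    # how many distinct values there are
all_integers = pd.api.types.is_integer_dtype(companies)  # integers only, no decimals?

print(counts)
print(n_distinct, all_integers)
```

Here the values are all integers and fall into a few repeated groups, which hints at categorical data, but only knowing that these are company numbers tells us for sure.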

But ultimately we need to understand the meaning behind what the raw data is. Consider a few examples:

  • Cat species: lion, tiger, leopard, …

    This is clearly nominal, categorical data. There is no numerical value here, nor can we put these in any meaningful order.

    (Note, there still may be ways to compare some aspects of these categories, like whether there are more lions or tigers in the world, or which one can run faster. But that is not the same as the animal name itself!)

  • Survey responses: Do you eat cheese “always”, “sometimes”, or “never”?

    This is ordinal, categorical data. It’s easy to say which category is more or less than another, but the difference between “always” and “sometimes” is not well defined.

    Would you be able to say an “always” respondent eats twice as much cheese as a “sometimes” person? Not necessarily!

  • Finishing places in a race like 1st, 2nd, 3rd, etc.

    This is another example of ordinal data.

    It might seem that this data is numeric since there are numbers, but the numbers don’t tell us anything concrete except the ordering. The difference between 1st and 2nd place is not necessarily the same as the difference between 2nd and 3rd place. The person who finished 50th is probably not twice as slow as the person who finished 25th.

  • How many times have you been swimming in your life?

    This is discrete numerical data.

    It’s numeric because it makes sense to do math: saying you have been swimming 3x as many times as me means that your number is 3x larger than mine.

    But the data is also discrete because fractional values aren’t possible. You can’t go swimming 3.7 times; that would never make sense.

    (By contrast, the number of minutes you have spent in a pool would be continuous.)

  • How many miles per hour were you driving?

    This is continuous numerical data. Notice that this at first might seem similar to the previous data: the question is “how many” and we usually measure speed (of cars) in whole numbers. But you could imagine going, say, 25.6 mph. Even if the speedometer or whatever measures speed only displays integers, and even if posted speed limits are always multiples of 5, that doesn’t mean those are the only possible speeds that could exist.

As with any way of categorizing things, there will be edge cases or things that seem to partially fit into more than one category, and that’s okay! This system of categorization is meant to be a useful starting point to help us think about what’s possible, not a precise question that demands a precise answer in all cases.

For example, consider:

  • Your QPR for a single class. (4.0 for A, 3.7 for A-, 3.3 for B+, etc.)

    This one is tricky! It may seem to be continuous/numerical because we have decimal points, but it is definitely not continuous since intermediate values such as 3.5 are just not possible (for a single class).

    In fact, you could argue that this is not even numerical. Does a C (2.0) student know twice as much as a D (1.0) student? Certainly we know the C student did better in the class, but can we say how much better based on the QPR? I’m not sure!

    On the other hand, we definitely do take averages of individual class QPRs, so it would seem that they have some numerical meaning, or at least are supposed to.

    Basically, QPR for a single class is at least ordinal, but is kind of in a gray area between numerical/discrete and categorical. Or it might be best to say it is an attempt at making (categorical) grades into numerical quantities.

5 Why does it matter?

Understanding statistical data type is useful in at least two ways to a data scientist: knowing what kind of analysis makes sense with your data, and knowing how best to visualize your data.

You will study both of those questions in much more detail in later courses, so for now we focus mostly on the big picture. The main pitfall is using categorical data as if it is numerical, both in analysis and visualizations. I’ll give two examples here; be on the lookout for more!

Bad Analysis Example

Imagine we are interested in the academic performance of varsity athletes. We find that the average class rank of a tennis player is 321.6 and the average class rank of a wrestler is 357.5. Can we say the tennis players are doing better academically on average than the wrestlers?

Not necessarily! It’s possible that, say, class ranks 100 through 400 all have almost the exact same QPR of 3.5, but then this drops off dramatically past 400.

Then the tennis players might be split between half having a high class rank around 120 with 3.5 QPR, and the other half having a low rank around 520 with around a 2.5. The overall average QPR of the tennis team would be 3.0.

But the wrestling team could all be between 350-360 class rank, all with a 3.5 QPR.

In other words, the better average class rank does not necessarily mean a higher average QPR. This is about treating ordinal data (rank) as if it is numeric.
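The arithmetic in this example can be checked with a short sketch. The (rank, QPR) pairs below are hypothetical, chosen only to roughly match the scenario described above.

```python
from statistics import mean

# Hypothetical (class rank, QPR) pairs for each team.
tennis = [(120, 3.5), (122, 3.5), (518, 2.5), (520, 2.5)]
wrestling = [(350, 3.5), (354, 3.5), (356, 3.5), (360, 3.5)]

def averages(team):
    """Return (average rank, average QPR) for a list of (rank, QPR) pairs."""
    ranks = [rank for rank, _ in team]
    qprs = [qpr for _, qpr in team]
    return mean(ranks), mean(qprs)

tennis_rank, tennis_qpr = averages(tennis)
wrestling_rank, wrestling_qpr = averages(wrestling)

print("tennis:   ", tennis_rank, tennis_qpr)
print("wrestling:", wrestling_rank, wrestling_qpr)
```

With these numbers the tennis team has the better (lower) average rank, yet the wrestling team has the higher average QPR.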

As a rule, you should never be doing math with categorical data, like taking averages, summing them up, comparing differences, etc. If you are doing that, it’s likely that you are mistakenly treating categorical data as numeric.

Bad Visualization Example

This one actually comes to us from one of your DSITW submissions:

Source: Nathan Piccini, Data Science Dojo

Notice the striking erratic nature of the lines in this graph. What is happening?

We seem to have a confusion of nominal and numerical data. The x-axis is team names, which are obviously not numerical, not even ordinal. (OK, we could rank the teams by which place they finished in the season or something, but here they are just in alphabetical order.)

But the choice of drawing lines between the data points makes an implied continuous connection between the team names based on alphabetical order, which is meaningless and misleading.

The trend lines (dotted lines) are even worse. The author is trying to draw a distinction between the two categories (2016 vs 2018), but trend lines are about a relationship or correlation between the X and Y axes. In this case, each one shows a slight upward trend as the team names go closer to the end of the alphabet, which is totally meaningless.

Some kind of bar graph would have been a much better choice here. Line graphs like this should only be used when the axes are both numerical (some would argue, only if they are both continuous), with consistent scaling.
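As a rough sketch of what such a bar graph could look like, here is a grouped bar chart drawn with matplotlib. The team names and values are invented for illustration and are not taken from the graph discussed above.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

teams = ["Team A", "Team B", "Team C"]   # hypothetical nominal categories
values_2016 = [24, 31, 18]               # made-up numbers for illustration
values_2018 = [27, 29, 22]

positions = list(range(len(teams)))
width = 0.4

fig, ax = plt.subplots()
# One pair of side-by-side bars per team; nothing connects one team to the next.
bars_2016 = ax.bar([p - width / 2 for p in positions], values_2016, width, label="2016")
bars_2018 = ax.bar([p + width / 2 for p in positions], values_2018, width, label="2018")
ax.set_xticks(positions)
ax.set_xticklabels(teams)
ax.set_ylabel("Value (hypothetical)")
ax.legend()
```

Each team gets its own separate pair of bars, so the two years can be compared without implying any connection between neighboring team names. Calling fig.savefig with a filename of your choice would write the picture to a file.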