SD 212 Spring 2024 / Notes


Unit 7: Hardware and OS

1 Overview

So far we have been focusing on programming and other skills needed to solve data science problems. In this unit, we will take a small step back from skill-building and instead build up our knowledge of how the computer works from the ground up.

You could take an entire course or even a whole degree in computer architecture or computer engineering, and we obviously are not going to do that in one week. Instead, we will focus on a high-level understanding of the major components of a computer and how they interact.

The reason we are covering this material is still the big goal of making better data scientists. Some of the topics we will cover later in this class, as well as in future classes, require an understanding of how computers work which you might not have yet. So, now is the time to learn about it! Hopefully the short break from programming is a bit of a relief.

2 Resources

  • These notes

  • Dive into Systems

    This book by Prof. Suzanne Matthews at that other elite military academy is a great and very accessible introduction to everything about how modern computers and operating systems work. I recommend any part of the book if you are interested, but jumping to the first few parts of these sections should be helpful to reinforce the material of this unit:

    • Chapter 5: What von Neumann knew

      The beginning of this chapter up until section 5.2 (Logic Gates) gives an abbreviated history of computers and a good overview of the main hardware components and how they interact.

    • Chapter 11: Storage and the Memory Hierarchy

      The beginning up until section 11.3 (Locality) discusses primary and secondary storage (what we are calling memory and disk in the notes below). There’s even a Grace Hopper shout-out.

    • Chapter 13: The Operating System

      The beginning of this chapter up until 13.2.3 (Creating and destroying processes) explains the role of the OS as a manager and how a machine with even a single processor can be running hundreds or thousands of processes at the same time.

3 Hardware and Software

A computer consists of hardware and software. The hardware is the physical equipment, i.e., the silicon, plastic, copper, and aluminum that someone probably had to build for you. The software is the actual code that governs how the computer works.

So far we have focused mostly on the software side, namely Python and bash (command line) programming. That is where you will spend basically all of your time working as a data scientist.

But just like a race car driver should have some basic understanding of how an engine works, a data scientist also benefits from some understanding of how their primary tools (computers) work. That’s what we are trying to do here.

3.1 Hardware components

For our purposes, we will say computer hardware can be broken into four kinds of components:

  • Processor(s) (a.k.a. CPU)

    The “brain” of the computer, which performs all the computations and logic and ultimately controls everything that the machine does. It’s also the smallest of these components — the CPU in your cell phone is less than 1 cm² in size.

    In modern devices, CPUs are multi-core, which means that a single CPU chip actually contains multiple processors. For example, as of 2023 the PCs in the Hopper Hall labs have 6-core CPUs. (The short sketch after this list shows one way to check your own machine.)

  • (Primary) Memory

    Comprising RAM, cache, and CPU registers, this is the temporary storage that a program can use while it is running.

    In terms of Python programs, memory stores the complete state of the program as it runs, including notably the values of all variables as well as the program location, basically which line of the code is currently executing.

  • Disk (a.k.a. secondary memory)

    This is the long-term storage of your computer, usually in the form of a hard disk or SSD, but also SD cards and thumb drives. When we talk about files and folders, they are on disk!

  • Input/Output Devices (I/O)

    Everything else fits in this category: keyboard, mouse, (touch)screen, speakers, microphone, ethernet (wifi or wired), bluetooth, etc.
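
As a small aside, Python’s standard library can report a few of these hardware details. Here is a minimal sketch (the output will of course depend on your machine):

import os
import shutil

# Number of processors the OS reports; this counts hardware threads,
# so it may be 2x the physical core count on some CPUs.
print('CPU cores:', os.cpu_count())

# Total size of the disk holding the current directory, in GB.
total, used, free = shutil.disk_usage('.')
print('Disk size: %.0f GB' % (total / 1e9))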

3.2 Software components

The computer software is the actual code that runs on and controls all of this equipment. The “soft” part refers to the idea that we’re not talking about physical things here but rather information/ideas.

To give an analogy, your brain, heart, and lungs are hardware, whereas the concepts and things you learn are more like software. (Of course the analogy is not perfect because your brain and other organs actually do change and get stronger or weaker over time, unlike electronic components which usually either work at 100% or are totally broken.)

When thinking about software, there are broadly three categories of programs that run on your computer:

  • Operating System (OS)

    The operating system is the manager of all hardware and programs running on your computer. You can think of this as the main program that actually runs when you turn on your computer. It is responsible for directly interacting with the hardware and launching all other processes or programs.

  • System processes or daemons

    These programs (called “daemons” because they are there even if you can’t see them…) are running in the background at all times. Many of these are started by the OS when your computer boots up and they do things like help manage internet connections, filesystems, or other such background tasks. A typical modern computer has hundreds of system processes running at any given time.

  • User applications

    These are the normal programs that you use and — as a programmer — create. They range from massive and complicated (think of your web browser or VS Code) to very small and simple (like the command line utilities cat or tr).

    Notably, when a normal application wants to access some piece of hardware (access the internet, say, or get the mouse pointer location), it cannot do so directly. Instead, any hardware access is handled as a request to the Operating System.
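
To make this concrete, here is a minimal sketch showing that even reading a file from Python is really a sequence of requests to the OS. The functions in the os module below are thin wrappers around operating system calls (this example assumes a Linux or Mac machine, where the file /etc/hosts exists):

import os

fd = os.open('/etc/hosts', os.O_RDONLY)  # ask the OS to open the file
data = os.read(fd, 50)                   # ask the OS for 50 bytes
os.close(fd)                             # ask the OS to close the file
print(data)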

4 Memory Hierarchy

We are data scientists here, so let’s start with the data! In terms of computer hardware, data storage happens in a wide variety of ways depending on capacity (how much data can be held) and speed (how long the data takes to access).

Imagine going on a hike. Your hands are the main “manipulators” that can do things, but each hand can probably only hold one thing at a time: a hiking pole, a water bottle, a phone, an energy bar, etc. For things you need to access frequently (phone perhaps, or a map), you might put those in your pockets for easy access. Notice, your pockets can hold more things than your hands, but they take a little longer to access.

Going further, you probably have even more stuff (food, flashlight) in your backpack. The pack holds more than your pockets, but again takes longer to get things in and out of. Even further away but larger, you can think about all the stuff back in the car or at the campsite. Or we could go even larger, like the REI store you would have to drive to that has everything.

Computer storage works very similarly, organized into levels in what is commonly called the memory hierarchy. The main idea is that every storage component is a tradeoff between size and speed. We know how to build very small storage systems that are extremely fast, or very large storage systems that are slower. Any modern computer has a range of different storage systems at different points along this tradeoff, just like a hiker has their hands, pockets, backpack, and campsite.

Here is a pretty good diagram of the memory hierarchy from Chapter 11 of the Dive into Systems textbook:

There is a lot of important detail in this image! Let’s go through each level in a little more detail:

  • Registers: the “hands” in our hiking analogy.

    Registers hold the values your processor can directly operate on, but the total size is very very small. A typical CPU has only a small number of registers (maybe a few dozen), and each will have a fixed size like 64 or 128 bits.

  • Cache: “pockets”

    The CPU has a temporary holding space called cache which temporarily stores values from main memory (RAM) as they are loaded into registers. The general idea is that when your processor needs some value from RAM, it is likely to need that same value again in the near future, or to need other values stored at nearby addresses. The fancy terms for these two kinds of relationships are temporal locality and spatial locality.

    Actually, modern CPUs have multiple levels of cache and even separate caches for different kinds of values (instructions and data). We don’t need to get into the details of that here. But you should know that cache is always on the CPU, which makes it faster than RAM, but also less flexible; you can’t change or upgrade the cache without buying a new processor.

    Modern CPU caches range in size from a few KB (thousands of bytes) to a few MB (millions of bytes).

  • Main memory (RAM): “backpack”

    RAM is outside of the CPU, but usually right next to it. In any desktop and even in many laptops, RAM comes in “sticks” and can be added to or upgraded to get more temporary storage. Your cell phone or laptop probably has between 8GB (that’s 8 billion bytes) and 32GB of RAM.

    Every value in main memory or cache is specified by a single numerical address, and the address gives no indication of whether the value is currently in cache or in RAM. Instead, whenever your processor wants to access any value in temporary storage (memory), it first looks in cache, and only gets the value from RAM if it’s not in cache, saving it there for the next time the same value might be needed.

  • Disk: “campsite”

    Now we get into what is called “secondary storage”, which is not directly addressed by the CPU. Your computer’s disk (sometimes called a hard disk) is manufactured with one of two technologies: spinning magnetic platters (“traditional” in the diagram above) or solid state drives (SSD or “flash” memory in the diagram above). Nowadays, SSDs are often faster and have become cheap enough that spinning-platter disks are being phased out.

    You use disk storage all the time when you open or save files on your computer. Rather than numerical addresses, space on disk is organized into named directories and files.

    The main distinction is that disk is not temporary. Whenever you close a program (or the computer loses power), all the memory that program used in registers, cache, and RAM is gone. But the files are still there!

    Being further down in the memory hierarchy, disk storage is also larger and slower than RAM. A typical PC might have 1TB (that’s 1 trillion bytes) of disk, but accessing data from disk is also about 1000x slower than RAM. (The sketch after this list gives a rough way to see this gap on your own machine.)

  • Remote storage: “REI store”

    We can also think of going beyond a single device to remote or internet-based storage, such as cloud storage. The situation becomes pretty fragmented here because there is so much variety in what could be happening. But it should be no surprise that internet-based storage can be much larger than what would fit on a single disk or inside a single device, while fetching or sending data over the web is also an order of magnitude slower than using local storage.

    In the same way that disk is more permanent than memory since file contents “survive” even after an application quits or the machine restarts, cloud storage can be even more permanent, surviving not just the death of a single app on your PC or phone, but the loss of that device itself.

    We won’t really discuss cloud storage in any more detail here, but you will see a lot more of it in later data science major courses.
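
Before moving on, here is a rough sketch of the RAM-versus-disk speed gap mentioned above. The numbers will vary enormously from machine to machine, and the OS caches recently written files in RAM, so the “disk read” here may look deceptively fast; a truly cold read (say, right after a reboot) would be far slower. The file name bigfile.bin is just made up for the example.

import os
import time

data = os.urandom(100_000_000)          # 100 MB of random bytes in RAM

start = time.time()
copy = bytearray(data)                  # copy 100 MB within RAM
print('RAM copy:  %.3f s' % (time.time() - start))

with open('bigfile.bin', 'wb') as f:    # write the same bytes to disk
    f.write(data)

start = time.time()
with open('bigfile.bin', 'rb') as f:    # read them back from disk
    ondisk = f.read()
print('disk read: %.3f s' % (time.time() - start))

os.remove('bigfile.bin')                # clean up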

5 Processors

The processor or central processing unit (CPU) is the center of any computer’s operation. It is this device that actually processes data, performs computations, makes decisions, and generally controls everything that happens on your machine.

5.1 Instructions

But the processor is not “magical”! Processors are only able to understand and execute specific instructions in machine language which is particular to that hardware device. While many variations exist depending on new features or updates, the two most popular instruction set families are ARM (used on some laptops and most cell phones) and Intel x86 (used on most laptops, PCs, and servers).

A single program is basically just a list of machine instructions. Whichever architecture is used (ARM or x86 or something else), these instructions can be grouped into three categories:

  • Arithmetic: A single mathematical or logical operation. For example, adding two (not too large) numbers, or computing the boolean “and” of two true/false values, would be arithmetic instructions.

    Arithmetic operations typically work on a fixed number of inputs and outputs stored in registers. There are many variations depending on how large the values are in bytes, whether they are always positive or can be negative, whether they are integers or floating-point, and so on.

  • Load and store: Copying values between registers and memory addresses.

    A “load” means copying from memory to registers, and a “store” is copying from registers to memory. Remember that memory usually stores the values of variables in a running program. So executing a seemingly simple line of Python code like

    x = x + 3

    might require three instructions:

    1. Load the value of x from memory into a register
    2. Add 3 to that register
    3. Store the register back to the location of x in memory

    (Remember that recently-used or nearby values might be stored in cache, but they use the same addresses as main memory so from the instruction level there is no distinction.)

  • Control flow: Jumping to a different instruction in the list, or conditionally jumping or not depending on the value in some register.

    Normally, executing each instruction causes the processor to move down the list and execute the next instruction, and the next one, etc. Control flow instructions (usually called branch instructions) allow execution to instead “jump” to another instruction in the list.

    Notice there are no if statements, loops, or function calls in machine code! All of that program logic has to be encoded with branching instructions.

    Function calls are an especially interesting case to think about for a moment. Given that a program is just a list of instructions, how do we do something like a function call

    y = sqrt(10)

    A few things need to happen:

    1. The argument values (10 in this case) are stored in specific registers so that the function knows where to find them.
    2. The position of the next instruction in the list is saved in another register so that the function knows where to jump back to when it is finished.
    3. The program does a jump to the first instruction of the function itself.
    4. The function does its thing, saving any return value in another specific register.
    5. The function executes a jump back to the location that was saved in a register on step 2.

    This is really complicated! We don’t expect you to remember or even fully understand all the details here. The point is just to emphasize that all those “nice” things we see in programs like function calls and loops don’t really exist at the machine-code level; it’s just jumps and registers at the bottom.
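
We can actually peek at something analogous from Python. Python compiles source code to bytecode for its interpreter rather than to machine instructions, but the bytecode shows the same load/store, arithmetic, and jump structure described above. Here is a minimal sketch using the standard dis module (the exact opcode names vary between CPython versions):

import dis

def demo(x):
    x = x + 3           # load x, load 3, add, store x
    if x > 0:           # compare, then conditionally jump
        return 'positive'
    return 'not positive'

dis.dis(demo)
# In the output, look for LOAD_FAST / LOAD_CONST and STORE_FAST
# (the loads and stores), an add operation (arithmetic), and a jump
# opcode such as POP_JUMP_IF_FALSE (control flow).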

5.2 Clock cycles

The CPU runs on a very high-speed but consistent timer called the clock. You can think of it as an electronic pulse that goes through the entire CPU about 3 billion times per second. (The precise speed is usually reported as the “clock rate” of the processor; 3 billion cycles per second is written as 3GHz.)

Ideally, a single instruction is executed every clock cycle. So a program that runs through 1000 instructions will take at least 1000 clock cycles.
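
To get a feel for these numbers, here is a quick back-of-the-envelope calculation:

clock_rate = 3e9     # 3GHz means 3 billion clock cycles per second
instructions = 1e9   # imagine a program with a billion instructions

# Best case, at one instruction per cycle: about a third of a second.
print(instructions / clock_rate, 'seconds')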

Why “at least”? Two important things can happen to cause a single instruction to use more than 1 clock cycle:

  • Cache miss

    Remember that a load/store instruction goes between a register and a memory address. But that memory address could be in CPU cache or in main memory. (And within the CPU there are multiple caches of different sizes and speeds.)

    Copying a single value from the fastest (“L1”) cache might just take a single clock cycle, but anything which is beyond that in slower cache levels or in RAM will take 10x or 100x more clock cycles.

    A “cache miss” specifically refers to when a program load/store instruction requests a memory address which is not in cache, which may cause the program to stall or idle while waiting to retrieve that value from RAM.

  • Branch misprediction (a.k.a. a pipeline “flush”)

    This one is a little more complicated to understand but can be summarized as: sometimes control flow instructions can be slow.

    In any modern CPU, the processor achieves its high speed (like 3GHz) by essentially preparing each instruction ahead of time, while executing the previous instructions. This works fine with only arithmetic and load/store instructions, but when a conditional jump or “branch” instruction is coming up, the processor doesn’t know which future instructions to prepare for.

    What the processor does in such cases is essentially guess whether the conditional branch will be taken or not. If the guess is right, then everything is great and no cycles are wasted. But if the guess is wrong, the CPU has to throw away the work it prepared and start over down the correct path, and this costs anywhere from a few up to a few dozen extra cycles.

    (In reality, the “guess” is made based on statistics of how that same conditional branch instruction has behaved so far in the program. Your processor is basically doing data science at the lowest level, billions of times per second!)
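
Here is a sketch of the classic branch prediction experiment, adapted to Python: count the values above a threshold in shuffled versus sorted data. Sorted data makes the comparison branch very predictable. (Interpreter overhead hides most of the effect in Python, so expect a much smaller gap than the same experiment would show in C.)

import random
import timeit

data = [random.randrange(256) for _ in range(1_000_000)]

def count_big(values):
    total = 0
    for v in values:
        if v >= 128:     # this test becomes a conditional branch
            total += 1
    return total

shuffled_time = timeit.timeit(lambda: count_big(data), number=5)
data.sort()
sorted_time = timeit.timeit(lambda: count_big(data), number=5)
print('shuffled: %.2f s   sorted: %.2f s' % (shuffled_time, sorted_time))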

6 Compiled vs Interpreted

First let’s get some terminology straight:

  • A computer program is the written source code, as in your .py file (or a .c or .java file in some other languages).
  • A process is a program that is running on your computer.
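
A quick way to see the distinction: the same program, run twice at the same time, becomes two different processes, and the OS gives each one its own process ID. Save this sketch as (say) pid_demo.py and run it in two terminals at once:

import os
import time

print('My process ID is', os.getpid())
time.sleep(30)   # keep this process alive so you can start a second one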

Let’s think about how a program (source code) actually runs and becomes a process. We know from the previous section that, somehow, a line of source code like

mylist = [1, 2, 3, 4]

can’t directly be understood by your CPU. Instead, some other tool has to do the job of translating the language of your program (source code in Python, or Java, or some other programming language) into the language of your CPU (machine code instructions).

This process crucially depends on what kind of programming language you use (or more accurately, what kind of tool is used with that source code).

6.1 Interpreted languages

Python is an example of an interpreted language, which means that it is run by a special program called an interpreter.

Say you write a Python program and save it to a file myprogram.py. There are two ways to run it: you can click the “play button” icon in your IDE, or you can run it from the command line by typing:

python3 myprogram.py

Actually these are really the same way; if you look carefully at the terminal when you ask VS Code to run your program, you’ll see that VS Code opens a bash terminal and actually runs the same python command listed above (probably using the full path to the python3 executable according to your conda/mamba environment).

Remembering what we know about the command line, we can infer that:

  • python3 is the name of an executable program on your computer
  • myprogram.py is an input file that is read by the python3 program

We say that the python3 program is acting as an interpreter here; it is translating between the Python programming language and machine instructions “on the fly”, reading each statement and executing it. It can be helpful to think of the analogy of a human doing a live translation of a speech.

There are many other languages that work similarly: JavaScript, Scheme, PHP, and Lua to name a few. In all cases, there are some advantages and disadvantages to having an interpreted language. On the plus side, such languages tend to be very flexible and adaptive for the ease of the programmer; this is actually one of the main reasons why Python has become so popular for use in data science.

On the other hand, interpreted languages tend to lose out a bit on efficiency because each Python statement has to be read in from a file and interpreted to machine instructions every time we run that program. (Well, there are certain ways that Python tries to improve on this, but the main point still remains.)

Another aspect of interpreted languages like Python is that most errors are not noticed until run-time. For example, consider this program:

num = int(input('Enter a number please: '))
if num > 0:
    print("Your number is positive")
elif num < 0:
    print("Your number is negative")
elif nun == 0:
    print("Your number is zero")

Do you see the error? In the last elif case, we mistyped the variable name as nun instead of num. But even though we can spot this error as humans, the python3 interpreter won’t notice it until someone actually runs the program and enters 0.
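
If we save the program above as (say) sign.py and run it with input 0, only then does the interpreter notice the problem, with output something like the following (the exact message depends on your Python version):

python3 sign.py
Enter a number please: 0
Traceback (most recent call last):
  File "sign.py", line 6, in <module>
    elif nun == 0:
NameError: name 'nun' is not defined. Did you mean: 'num'?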

The same is true in general for interpreted languages: they tend to be more flexible and easy for the programmer to get stuff done, but also less efficient and harder to spot bugs in.

Less efficient and harder to debug than what? Well, there is another choice…

6.2 Compiled languages

A natural question to ask after the previous discussion is, “so who wrote the program python3”? We know it can’t really be written in Python, since then we would have to use another interpreter to run it!

In this case, the python3 interpreter is actually written in another programming language called C; you can browse the source code online in the CPython project on GitHub.

C (and many other languages like C++, Rust, and Go) is compiled rather than interpreted. This means that, if we have a C program saved in a file myprog.c, we can’t directly run it. Instead, running the program is a two-step process:

  1. Compile the C code to machine instructions with a C compiler, like

    gcc myprog.c -o myprog
  2. The compiler creates an executable file which is a translation into processor instructions. So then we can just run that file:

    ./myprog

Notice that the program only has to be compiled once, and then it can be run many times. Furthermore, you don’t need the compiler to actually run the program; the myprog executable that the compiler created is now completely self-contained.

As you can probably guess, the advantages and disadvantages of compiled languages are exactly the opposite of interpreted ones:

  • Compiled languages tend to be less flexible than interpreted ones.

    (For example, in Python we can assign any value we like to a variable name, but in C and most compiled languages, the type of a variable like integer or string can never be changed.)

  • Compiled languages tend to be more efficient at run-time, after compilation.

    (This is because the compiler can perform many optimizations when translating the source code to machine code. Compare the experience of listening to a live-translated speech originally in a different language, versus reading an essay or book that has been carefully translated ahead of time by an expert in both languages.)

  • Compiled languages can help spot errors sooner, at the initial compilation stage, before the program is actually run.

    (This can actually be an advantage or a disadvantage, depending on your perspective. Compiler errors are the bane of many a C programmer!)

6.3 Third way: Using compiled libraries

An in-between option which may get some of the “best of both worlds” is to use high-performance compiled libraries inside an interpreted language.

For example, we have already been using Pandas quite a bit, which in turn is based on Numpy. If you look through the source code, you’ll discover that they are both written with a mix of Python and C source code, and the most important performance-critical functions are all written in C.

Although we haven’t directly used this feature, Python is designed to fairly easily interface with C language libraries, so that if you compile a few functions in C, it is possible to call those functions directly from Python code.
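
For example, here is a minimal sketch using the standard library’s ctypes module (one of several ways to bridge Python and C; numpy uses a different, lower-level mechanism). The file names and the square function here are made up for illustration:

import ctypes

# Suppose square.c contains:  double square(double x) { return x * x; }
# compiled to a shared library with:  gcc -shared -fPIC -o libsquare.so square.c
lib = ctypes.CDLL('./libsquare.so')        # load the compiled library
lib.square.argtypes = [ctypes.c_double]    # declare the C signature
lib.square.restype = ctypes.c_double
print(lib.square(7.0))                     # 49.0, computed by compiled C code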

This means that, when a good library like Pandas is used well, we can get almost the same performance in Python as we would with a well-written compiled C program to do the same thing. We can have the flexibility and ease of programming of an interpreted language along with the good efficiency of a compiled one. Great!

In class, we looked at a specific example where the same function was written three ways:

  • In Python, using a loop
  • In Python, using a function call into the numpy library
  • In C

We saw that the optimized, compiled C version was fastest, something like 20x faster than the Python loop version. But the Python+numpy version is almost as fast as the C version.
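
Here is a sketch in the same spirit (an illustrative reconstruction, not the exact code from class), comparing a plain Python loop against numpy’s compiled summation:

import timeit
import numpy as np

data = list(range(1_000_000))
arr = np.array(data)

def loop_sum(values):
    total = 0
    for v in values:    # one interpreted Python step per element
        total += v
    return total

loop_time = timeit.timeit(lambda: loop_sum(data), number=10)
numpy_time = timeit.timeit(lambda: arr.sum(), number=10)  # compiled C loop
print('Python loop: %.3f s   numpy: %.3f s' % (loop_time, numpy_time))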

Understanding this is crucial to appreciating why libraries like Pandas have become so popular for data science applications.