SD 212 Spring 2024 / Notes


Unit 8: Concurrency

1 Overview

So far all of the programs and shell scripts we have written are inherently serial: do one thing, then another thing, then another thing, etc. until the job is done.

But this is a wasteful use of the powerful computing resources at our disposal! As we learned in the previous unit, the operating system on a typical laptop or desktop manages a few hundred processes running at the same time — why should just one of them be doing useful “work” on some data science task?

In this unit, we will get a small glimpse into the world of concurrent programming, where we write code to execute more than one thing at the same time. Starting with bash, we will first learn how multiple processes can be managed and interact on the command line, and the “familial” relationships between them. Then we will see how to do the same thing with Python’s multiprocessing library. Finally, we will learn about multi-threading and see that Python’s threading library, while superficially similar to multiprocessing, has a few key advantages and disadvantages.

As with many things we cover in your intro classes, this is not an exhaustive view of the subject of parallel programming! The art of designing efficient concurrent programs is something you could take multiple whole courses on. Our goal here is to introduce the main concepts of multiprocessing and multithreading, and to see some of the advantages (and dangers) they can provide.

2 Resources

  • The Linux Command Line Chapter 10: Processes (Required)

    This chapter of TLCL goes over how multiple processes coexist and how to get information about them in Linux. You can gloss over the details in the section on Signals, but otherwise this is all good and useful info for us.

    Pay special attention to job control: how to start, stop (pause) and kill a process, and how to execute processes in the background.

  • Python in a Nutshell Chapter 15: Concurrency

    This is not one of our “usual” Python textbooks, but it’s still a good one for the level we are at.

    Here’s what to focus on within this (long) chapter: the sections covering the threading and multiprocessing modules.

    The other sections of this chapter are all good info, but more detail than we have time to fully cover in SD212.

3 Bash

3.1 Processes

A process is a running program. When you write code in Python or another programming language, you’ve created a program. Actually executing/running that program creates a process.

We use a lot of human/family terminology when talking about processes:

  • Processes have a lifetime. They are created when they start running, then they are alive while running, and then they might either terminate normally or be killed by another command.

  • Each process has a parent, which is the process that started it. For example, when you open a terminal in VS Code, the code process is the parent of the bash process in the terminal. If you run a command like grep in the terminal, then that bash process will be the parent of grep.

  • Similarly, if a running program starts another one, that new process is a child of the parent process.

  • This goes further to things like “orphans” and “widows”, but we won’t get into that for this course.

How can we refer to a process? We could use the command that the process is running, but that’s not enough, since (for example) there could be multiple instances of the same program running at once.

  • Every process has a unique process identifier or PID. It is a number, usually 6 or 7 digits on modern versions of Linux.
  • Every process also has a parent, which is the process that launched it. The parent process ID is called the PPID.
  • The only exception is the process with PID 1, which is the first process started by the OS kernel. On modern Linux, this process is called systemd. It does not have a parent, and its PPID is listed as 0. If you follow the parents-of-parents of any Linux process, you will eventually get back to PID 1.
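
For example, bash stores its own PID in the special variable $$, and ps can report the PID and PPID of any process. Here is a quick sketch of tracing one step of the chain (the actual numbers on your machine will be different):

roche@ubuntu$ echo $$
52173
roche@ubuntu$ ps -o pid,ppid,cmd -p 52173
    PID    PPID CMD
  52173   52089 bash

Repeating the ps command with the PPID (52089, then its parent, and so on) walks up the family tree until you reach systemd at PID 1.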

3.2 Backgrounding

By default, when you run a command in bash like

roche@ubuntu$ wget 'https://www.theinspiration.com/wp-content/uploads/016-3-1074x1074.webp'

bash runs that program as a new process in the foreground. That means the program runs in the terminal for as long as it takes, and bash waits patiently for it to finish. When the process is finished, bash steps in again and gives you a new prompt.

Instead, we can add the ampersand & at the end of any command line or after a line in a bash script to run that process in the background. A background process runs just like a foreground process. The difference is with bash itself: instead of waiting for the process to finish, bash instead lets the process go on in the background and immediately gives you a new prompt.

This is our first foray into multiprocessing, where we can get the computer to do multiple things at the same time. When you execute a program in the background in bash (with the &), you can then immediately start another process in the background, and another, etc.

The special command wait will pause the current bash terminal or bash script until all background commands are finished.
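
For example, here are two sleep commands run in the background, followed by wait (an illustrative transcript; the job numbers and PIDs on your machine will differ):

roche@ubuntu$ sleep 5 &
[1] 4011
roche@ubuntu$ sleep 10 &
[2] 4012
roche@ubuntu$ wait
[1]-  Done                    sleep 5
[2]+  Done                    sleep 10

Both sleeps run simultaneously, so the whole thing takes about 10 seconds rather than 15.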

3.3 Signals (killing)

Normally, each of the many processes running at a given time is blissfully unaware of the hundreds of other processes that are also competing for limited hardware resources. The operating system partitions memory and schedules the CPU and I/O devices so that each process gets what it needs without interference.

But of course sometimes we need to communicate between processes! One way to do this, which we have already seen, is a bash pipeline, where the output stream of one process is hooked into the input stream of another one.

In Linux, signals are used to “interrupt” a process, usually triggered by another process or by the OS. There are only a small number of possible signals, with specific meanings, and each with a numeric code. You can type man 7 signal to learn about all of them.

For us, there are three signals we should know about:

  • SIGINT (2): “Interrupt” a process. This is the signal that gets sent when you type Ctrl-C while a process is running.

  • SIGTERM (15): Nicely ask to terminate a process. This is the default signal sent with the kill command.

  • SIGKILL (9): Forcibly terminate (“kill”) a process.

The first signal SIGINT (Ctrl-C) can be ignored by a program if it wants to. For example, if you type Ctrl-C from the bash terminal without any program currently running, bash just ignores it and gives you another prompt.

The last one SIGKILL cannot be ignored. If you send a SIGKILL it will end that process immediately, even if it’s in the middle of doing something super-important.

The kill command can be used to send a signal to a specific process. Here is an example of running a wget download command in the background, then using ps to see the PID of that wget command as it runs, and then kill to send a Ctrl-C signal to make wget quit early:
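
(This transcript is just illustrative; the PIDs will be different every time.)

roche@ubuntu$ wget 'https://www.theinspiration.com/wp-content/uploads/016-3-1074x1074.webp' &
[1] 123456
roche@ubuntu$ ps
    PID TTY          TIME CMD
 123400 pts/0    00:00:00 bash
 123456 pts/0    00:00:00 wget
 123457 pts/0    00:00:00 ps
roche@ubuntu$ kill -2 123456
[1]+  Interrupt               wget 'https://www.theinspiration.com/wp-content/uploads/016-3-1074x1074.webp'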

3.4 Useful commands

  • ps: Get info about currently running processes in the current terminal. Some variants/options:

    • ps -A: Get info about all processes, not just the ones in the current terminal
    • ps -f: Display a full listing that includes the PPID
    • ps 123456: Show info about the process with PID 123456
  • top: Show a live update of all processes running, sorted by default by CPU usage. (Similar to Windows Task Manager, but more nerdy.)

  • htop: Better, more colorful version of top

  • &: Not a command, but can be added to the end of any command or pipeline to make it run in the background

  • wait: Pause execution until all background processes terminate

  • sleep: Pause for a given number of seconds

  • kill: Send a signal (default is SIGTERM) to a process with the given PID

    • kill -2 123456: Ahem hello process 123456, I’d like to get your attention. Whoops, did I startle you and cause your death?
    • kill -15 123456: Process 123456, please die when you get a free moment.
    • kill -9 123456: I’m not asking anymore. You’re dead.
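
Here is a short session tying several of these together: start a long sleep in the background, inspect it with ps -f, then terminate it (again, the PIDs shown are just for illustration):

roche@ubuntu$ sleep 600 &
[1] 5123
roche@ubuntu$ ps -f
UID          PID    PPID  C STIME TTY          TIME CMD
roche       5100    5099  0 10:02 pts/0    00:00:00 bash
roche       5123    5100  0 10:03 pts/0    00:00:00 sleep 600
roche       5124    5100  0 10:03 pts/0    00:00:00 ps -f
roche@ubuntu$ kill 5123
[1]+  Terminated              sleep 600

Notice the PPID column: both sleep and ps have our bash shell (PID 5100) as their parent.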

4 Python

In Python we can do something very similar to running a command in the background in bash, by executing a Python function in one of two ways: either in a separate Python process (multiprocessing) or in a new thread that is part of the same process (multithreading).

We will see how to do both, and what are the pros/cons of each approach.

4.1 Multiprocessing

Consider the following program, which finds and prints out the smallest prime numbers after 1,000,000, after 2,000,000, etc.:

def is_prime(p):
    """Returns whether or not p is a prime number."""
    # (Note, this function is intentionally slow, for demo purposes!)
    d = 2
    while d < p:
        if p % d == 0:
            # p is divisible by d, so not prime
            return False
        d += 1
    # p has no divisors between 1 and p
    return True

def next_prime(n):
    """Computes and prints out the next largest prime after n."""
    p = n
    while not is_prime(p):
        p += 1
    print(p)

if __name__ == '__main__':
    for x in range(1000000, 10000000, 1000000):
        next_prime(x)

Don’t worry too much about the details of primality checking. But notice the general structure: we are calling a function next_prime repeatedly, with different input arguments, and that function will print out the results of each call.

This is an ideal candidate for multiprocessing! Here is what that same program looks like when each of the function calls to next_prime is run in its own process:

from primes import next_prime
from multiprocessing import Process

if __name__ == '__main__':
    children = []
    for x in range(1000000, 10000000, 1000000):
        child = Process(target=next_prime, args=[x])
        child.start()
        children.append(child)

    for child in children:
        child.join()

    print("All done")

Notice a few components, which will be common for just about every time you want to use multiprocessing in Python:

  • Need to import the Process class from the multiprocessing library

  • Always use an if __name__ == '__main__' guard to run multiprocessing stuff. (Otherwise, you risk each of the children spawning more and more children on some operating systems.)

  • Use the Process constructor with two named arguments:

    • target is the name of the function you want to call
    • args is a list or tuple of the arguments that will be passed to that function
  • Save your Process objects in a list (for later use)

  • Call the .start() method on each Process. (If you forget this step, then they won’t actually run and your program will do nothing!)

  • Later, in a separate loop, call the .join() method on each process. This causes the parent process to wait for each child to finish.

    (Make sure you understand why it needs to be a separate loop rather than calling .join() right after .start().)

You can run the program above on the command line with the time built-in to test it for yourself. Here it is on my 4-core laptop. First the original version without multi-processing:

roche@ubuntu$ time python3 primes.py
1000003
2000003
3000017
4000037
5000011
6000011
7000003
8000009
9000011

real    0m3.281s
user    0m3.269s
sys     0m0.012s

What we see here is first the actual output (the 9 prime numbers), in order as expected. Then the timing results say that this took about 3.3 seconds of “real” time, almost all of which was spent with the CPU working (“user” time measures normal CPU usage).

Now look carefully at what happens when we run the multi-processing version instead:

roche@ubuntu$ time python3 primes-mp.py
1000003
2000003
3000017
4000037
5000011
7000003
6000011
8000009
9000011
All done

real    0m1.151s
user    0m5.936s
sys     0m0.047s

There are some key differences here:

  • It got faster! The real time was cut to about a third, down to roughly 1.2 seconds.

  • The output is not in order. In fact, if I run this again, the order will be slightly different each time. That is because each process is running simultaneously, so they can finish (and print their results) in different orders on different runs.

  • The CPU time went up. The “user” time measures the TOTAL CPU usage across all processes. Because there is some “overhead” cost for starting a new process, this can be higher than the single-process version. But because the CPU time is much larger than the “real” time, that’s a great indication that our program is taking good advantage of the 4-core CPU in my laptop.

4.2 Multithreading

There is an alternative to multiprocessing in Python called multi-threading. It follows a nearly-identical syntax to multiprocessing. Here is how we can write the same example as above using threads instead of processes:

from primes import next_prime
from threading import Thread

if __name__ == '__main__':
    children = []
    for x in range(1000000, 10000000, 1000000):
        child = Thread(target=next_prime, args=[x])
        child.start()
        children.append(child)

    for child in children:
        child.join()

    print("All done")

Notice there are only two differences compared with the multiprocessing version:

  • import Thread from the threading library
  • Use the Thread class instead of the Process class

That’s it! The Python multiprocessing and threading library authors intentionally designed the Process and Thread classes so that they can be used identically in code, which makes it easy to try each one.

But there are very important differences in how multithreading and multiprocessing work — don’t be fooled by the similar syntax! Here is a summary:

  • In multiprocessing, each Process runs separately with its own copies of all global variables. So if one process changes a global variable, it won’t be seen by the other processes.

    In practice, this means that processes either have to print their results to the terminal, or write to files, or use some other mechanism (we will talk about pipes later) to communicate results back to the parent process.

  • Multithreading instead uses a shared memory, which means that all global variables are shared between all threads. This can make it much easier to send back data to the parent thread, just by appending to a global list or something like that.

  • Separate processes are truly independent and can take advantage of multi-core CPUs.

  • Separate threads are more inter-dependent, and something called the Global Interpreter Lock (GIL) in Python effectively prevents a multi-threaded program from using more than one CPU core at a time.

    This is essentially a consequence of the shared memory model, but it obviously limits how much we can do with threading in Python.

    What this means in practice is that multi-threading is most useful for I/O-based operations such as reading from disk or the Internet.

To see the unfortunate downside of the GIL in action, here is what happens when we run the multithreaded version:

roche@ubuntu$ time python3 primes-t.py
1000003
2000003
3000017
5000011
4000037
7000003
6000011
8000009
9000011
All done

real    0m3.408s
user    0m3.402s
sys     0m0.032s

Once again, the output shows up out of order (and might change on each run of the program), which indicates multiple things are running at once.

But the time is about the same as (actually a little worse than) the original version that didn’t do any multiprocessing or multithreading! That is because this program is highly CPU-dependent: it is doing a whole bunch of arithmetic to try to find prime numbers. And because the Python GIL effectively prevents more than one CPU core from working at once, we won’t get much timing benefit from multithreading in this case.

4.3 Sharing values between threads

Let’s look at an example where multi-threading is actually useful! Remember that because of the GIL, multi-threading in Python will not be helpful for CPU-intensive programs. But for I/O-intensive ones, we can see some nice benefits. In these kinds of programs, the “slowness” comes not from the CPU working hard, but just from waiting for data to come back from the web or from disk.

Here’s a program that uses multi-threading to retrieve 20 random cat facts from the cat-fact API (https://cat-fact.herokuapp.com), then sorts them in alphabetical order and prints them out:

import requests
from threading import Thread
requests.packages.urllib3.disable_warnings()  # silence the warnings triggered by verify=False below

facts = []

def cat_fact():
    global facts
    resp = requests.get('https://cat-fact.herokuapp.com/facts/random', verify=False)
    fact = resp.json()['text']
    facts.append(fact)

if __name__ == '__main__':
    children = []

    for _ in range(20):
        child = Thread(target=cat_fact, args=[])
        child.start()
        children.append(child)

    for child in children:
        child.join()

    facts.sort()
    for fact in facts:
        print(fact)

Notice a few important things about this program:

  • Each function call appends its fact to the list facts. We declare global facts at the beginning of the function to make clear that it updates the global variable. (Strictly speaking, the declaration is only required when reassigning the variable, not when calling a method like .append() on it, but it makes the intent obvious here.)

  • When creating each Thread, we leave the args field as an empty list because the function we are calling, cat_fact, doesn’t take any arguments.

  • Sorting the facts can only be done after all the responses are in. So inside the main part, we do facts.sort() only after the join() loop completes.

Because this program is mostly IO-bound (waiting on web requests) rather than CPU-bound (doing a lot of computation), multi-threading is very effective, resulting in about a 3x speedup compared to doing the requests one at a time.

Multi-threading is also the right choice here because of the shared memory used when each thread appends to a global list. If we ran exactly the same program but replaced Thread with Process, it would run without error but nothing would print at the end. That’s because in multiprocessing there is no shared memory, so each process would be appending to its own copy of the list, and the parent process at the end would be sorting and printing out an empty list.

4.4 Sharing values between processes using SimpleQueue

Let’s summarize what we know so far about multiprocessing vs multithreading:

                    Multiprocessing                            Multithreading
Shared memory       Each process has its own copy of           Each thread has shared read/write
                    global variables                           access to the same global variables
GIL                 Not affected by the global                 GIL prevents multiple threads from
                    interpreter lock                           executing CPU instructions simultaneously
CPU-bound tasks     Can effectively use multiple CPU cores     Not effective because of the GIL
IO-bound tasks      Works well                                 Works well

So basically, we know that for tasks which need to have some shared memory we have to use threads, and for tasks which are computing-based (CPU bound) we need to use multiprocessing.

But what about when we have a CPU-bound task which really needs some communication between the processes besides print statements? In bash, we can achieve this kind of inter-process communication (IPC) using pipes.

It turns out, we can do something equivalent in Python using special multiprocessing data structures that can be shared between multiple processes.

There are a few mechanisms for this in Python, but the only one we will look at is the SimpleQueue class in the multiprocessing library.

Think of a SimpleQueue as a kind of “Python pipe” that can be used to write and read Python values between processes. (In fact, if you look at the documentation you will see that SimpleQueue is implemented using a multiprocessing Pipe, which is not a coincidence!)

Here is a summary of how to use a SimpleQueue in a multiprocessing Python program:

  • from multiprocessing import SimpleQueue
  • Only once in the parent process, create a new queue object with SimpleQueue()
  • Pass this queue as an argument to the function calls that each Process runs
  • Use the .put(x) method to add some value to the queue in one process
  • Use the .get() method to remove and return a value from the queue in a different process

Usually, the queue is only used for one-way communication: either the child processes are putting things in and the parent is getting them out, or vice-versa.
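
Before the complete example below, here is a minimal sketch of that pattern (the function name child_work and the value 42 are just placeholders):

from multiprocessing import Process, SimpleQueue

def child_work(queue):
    # ... do some computation in the child, then send back the result
    queue.put(42)

if __name__ == '__main__':
    queue = SimpleQueue()   # created ONCE, in the parent
    child = Process(target=child_work, args=[queue])
    child.start()
    print(queue.get())      # parent removes and prints the child's value: 42
    child.join()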

Here is a complete example program which is based on finding prime numbers like before, but with the difference that all of the child processes add their individual counts to a shared SimpleQueue, and the parent process removes these counts and adds them up:

from primes import is_prime
from multiprocessing import Process, SimpleQueue

def count_primes(start, end, queue):
    """Returns the number of primes p with start <= p < end."""
    print(f"Counting primes beween {start} and {end}...")
    count = 0
    for p in range(start, end):
        if is_prime(p):
            count += 1
    print(f"There are {count} primes between {start} and {end}")
    queue.put(count)

if __name__ == '__main__':
    # This will count primes in the ranges (10000,20000) and
    # then (20000,30000), etc., up to the range (90000,100000).
    start    =  10000
    interval =  10000
    nprocs = 9

    # This object is only created ONCE but shared with all processes.
    # It is a very special data structure from the multiprocessing
    # library that can be used to communicate between processes.
    queue = SimpleQueue()

    # current_end is updated with how far we have gone in
    # starting the processes for each range.
    current_end = start

    # Start up each process.
    # Notice that we will use the queue to know when all the processes
    # are finished, so we don't have to save the Process objects into
    # a list.
    for _ in range(nprocs):
        p = Process(target=count_primes, args=[current_end, current_end+interval, queue])
        p.start()
        current_end += interval

    # Now wait for each process to finish and add up the results.
    total = 0
    for _ in range(nprocs):
        partial = queue.get()
        total += partial
        print(f"Adding {partial}; running sum is {total}")

    # Now everything is done, so print the grand total.
    print(f"The TOTAL number of primes between {start} and {current_end} is {total}")

Try running it! You should notice that as each process reports its count and adds it to the queue, the parent process immediately adds that to a running total. Crucially, because we know each child process will only add a single value to the queue and exit, the parent process can just have a single loop at the end to get() the correct number of values from the queue corresponding to how many processes it launched at the beginning.

On my 6-core desktop, this cuts the running time down from 33.5 seconds in a single-process version, down to 8.1 seconds using multiprocessing. Great!

Remember: it would be easier to just use a shared global variable for the counter instead of the SimpleQueue, but shared global variables don’t work with multiprocessing. We could instead implement this with multithreading, which works and gets the right answer, but doesn’t run any faster than a simple non-parallel version because this is a CPU-intensive computation that is hampered by the GIL.

(Try it! If you change Process to Thread in the code above, it actually takes about twice as long as not using threads at all.)

In summary, we can often avoid the need for special inter-process communication data structures like SimpleQueue either by:

  • Printing to the screen or writing to files in each process rather than sending Python variables back to the parent process; or
  • Using multithreading and global variables instead of multiprocessing, if the computation is IO-bound

But when we have a CPU-intensive process which requires variable values to be sent back to the parent process (numbers in the prime counting example), then we can still do it by using multiprocessing and the SimpleQueue (or some other mechanisms we won’t cover in SD212).