This is the archived website of SI 486I from the Spring 2022 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 3: Caching blocks for a read-only node

Part 0: Get your static servers running from Lab 2
Background
Part 1: Caching blocks during show_chat
- Your tasks
Part 2: Flask node to serve cached blocks
- Your task
Going further: Concurrency

Google Form for this lab

So far, our programs have been constantly fetching blocks, often the same blocks, over and over again. This week, we will add some caching for both fetching and serving blockchains, using the filesystem as a kind of basic database.

Part 0: Get your static servers running from Lab 2

Go back to lab 2 and make sure you complete Part 3, running a static “good” and “bad” blockchain server on ports 5000 and 5001.

Background

Reading and writing entire files in Python

You will need to save and retrieve blocks from files. Because these blocks need to have consistent hashes based on their exact string contents, it is very important to read and write the exact block contents as they are downloaded.

Fortunately, this is really easy in Python! Mostly, you just have to avoid using the line-based functions like print or readline.

Here is a small Python program that reads the entire contents of file1.txt into a string s, and then writes that entire string to file2.txt.

with open('file1.txt') as fin:
    s = fin.read()

with open('file2.txt', 'w') as fout:
    fout.write(s)
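
To see why this matters, here is a small sketch (not part of the lab requirements) that saves the same string once with write and once with print, then compares hashes. The block contents are made up, and SHA-256 from hashlib is used only for illustration; use whatever hash your blocks actually require.

```python
import hashlib
import os
import tempfile

# Made-up contents, standing in for a downloaded block.
block = '{"example": "block contents"}'

def digest(text):
    """Hash a string (SHA-256 is an assumption, chosen for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

with tempfile.TemporaryDirectory() as folder:
    exact_path = os.path.join(folder, "exact.txt")
    printed_path = os.path.join(folder, "printed.txt")

    # write() stores exactly the characters in the string.
    with open(exact_path, "w") as fout:
        fout.write(block)

    # print() adds a newline at the end, which changes the contents.
    with open(printed_path, "w") as fout:
        print(block, file=fout)

    with open(exact_path) as fin:
        exact = fin.read()
    with open(printed_path) as fin:
        printed = fin.read()

print(digest(exact) == digest(block))    # True
print(digest(printed) == digest(block))  # False
```

The extra newline from print is invisible when you look at the file, but it is enough to make the hash come out completely different.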

Telling git to ignore files and directories

Usually, we want our git repos to contain only the source code and other necessary files, but not other stuff like temporary caches, editor savefiles, compiled binaries, etc.

You can tell git to ignore certain files (never add them to the repo, just let them sit on the filesystem) by means of a special file called .gitignore. Notice the . in front of the name!

I recommend you create a .gitignore file in your git repo for this class, with the following contents:

.*.swp
*~
__pycache__/
cache/

The first two lines ignore the temporary files created by vim and emacs. The third line is for the temporary folder that python3 creates sometimes when you run your code, and the last is for the blockchain cache files you will create in this lab.

Once you create your .gitignore file and save it at the top folder of your git repository, be sure to do git add .gitignore and then commit and push. That is, the .gitignore file itself should be part of your repository!

Returning HTTP error codes in Flask

In a Flask web server application, you write a Python function to handle an HTTP request such as a GET request. Normally, we have been returning strings from such requests, like

return "my response"

This actually is shorthand for returning that string as the content of an HTTP 200 “OK” response; you could also do the exact same thing with

return "my response", 200

So 200 is the default response and doesn’t need to be specified, but there are many more HTTP response codes available that might make sense in different situations. For example, if someone tries to fetch a block that doesn’t exist, you might do

return "no block with that hash", 404

Part 1: Caching blocks during show_chat

So far, your show_chat.py program should:

Take a hostname and port as command-line arguments
Make a GET request to /head on that server
Iteratively download the entire blockchain using GET requests to /fetch, verifying each block as it’s fetched
If everything verifies, print out all the messages

Now you will modify this program so that it also caches the blocks it downloads in a folder called cache.

Importantly, you should only save blocks after you have verified them in the blockchain! Your saved blocks will be kept around indefinitely and will be assumed verified when you read them later. So be sure that a block is valid before you save it.

Your tasks

Modify show_chat.py as follows. (Tip: Add each piece one step at a time, then test and debug before moving to the next requirement.)

Create a folder called cache if it doesn’t already exist. (You should have added this folder already to your gitignore; see background above.)
When fetching blocks, after the block is verified, save it to the file cache/<hash>.json. That is, the name of the file should come from the hash of that block.
Once the entire chain is verified, save information about it in a dictionary with two entries: "head" should be the hash value of the head block, and "length" should be the total number of blocks in the chain. Save this dictionary in JSON format to a file cache/current.json.
Before trying to fetch a block, see if the corresponding file already exists in the cache/ folder, and if so, read it from the file instead of fetching from the server.
At the beginning of your program, read the head hash from cache/current.json, if it exists. When fetching and verifying blocks, use the fact that every block from that saved "head" block back to the genesis block has already been verified, and save time by not re-verifying those blocks.
When printing the messages from a verified chain at the end of the program, only show the chat messages more recent than the head block recorded in the previous cache/current.json file.
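
The block-caching steps above could be sketched roughly like this. It is only one possible shape, not the required design: the function names and the fetch_from_server parameter are made up for illustration, and verification is left out entirely (you must still verify a block before calling save_block).

```python
import os

CACHE_DIR = "cache"

def cache_path(block_hash, cache_dir=CACHE_DIR):
    """File name for a block, built from its hash."""
    return os.path.join(cache_dir, block_hash + ".json")

def save_block(block_hash, contents, cache_dir=CACHE_DIR):
    """Save an already-verified block, creating the folder if needed."""
    os.makedirs(cache_dir, exist_ok=True)
    # newline="" turns off newline translation so the text is stored as-is.
    with open(cache_path(block_hash, cache_dir), "w",
              encoding="utf-8", newline="") as fout:
        fout.write(contents)

def load_block(block_hash, cache_dir=CACHE_DIR):
    """Return the cached block contents, or None if it is not cached."""
    try:
        with open(cache_path(block_hash, cache_dir),
                  encoding="utf-8", newline="") as fin:
            return fin.read()
    except FileNotFoundError:
        return None

def get_block(block_hash, fetch_from_server, cache_dir=CACHE_DIR):
    """Try the cache first; otherwise call fetch_from_server(block_hash).

    fetch_from_server is a hypothetical stand-in for your own code that
    makes the GET request to /fetch. Returns (contents, was_cached).
    """
    cached = load_block(block_hash, cache_dir)
    if cached is not None:
        return cached, True
    return fetch_from_server(block_hash), False
```

Returning the was_cached flag lets the caller decide whether the block still needs to be verified and saved.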

(Note, to see all messages you should be able to just delete cache/current.json and run your program again. You will still get the benefit of any cached blocks, but the chain should be re-verified and all messages printed out.)
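
Here is a rough sketch of how the current.json bookkeeping might look. All of the function names are made up, and prev_of is a hypothetical callable standing in for your own code that looks up the previous block's hash (returning None at the genesis block); how you actually get that hash depends on your block format from Lab 2.

```python
import json
import os

def save_current(head_hash, length, cache_dir="cache"):
    """Record the head hash and chain length of a fully verified chain."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, "current.json"), "w") as fout:
        json.dump({"head": head_hash, "length": length}, fout)

def load_current(cache_dir="cache"):
    """Return the saved dictionary, or None if there is no current.json."""
    try:
        with open(os.path.join(cache_dir, "current.json")) as fin:
            return json.load(fin)
    except FileNotFoundError:
        return None

def hashes_to_verify(new_head, trusted_head, prev_of):
    """Walk back from new_head, stopping at the previously verified head.

    Returns the hashes that still need verifying, newest first. If
    trusted_head is None (no current.json) or is never reached, the walk
    continues to the genesis block, so the whole chain gets verified.
    """
    pending = []
    current = new_head
    while current is not None and current != trusted_head:
        pending.append(current)
        current = prev_of(current)
    return pending
```

With this shape, the blocks returned by hashes_to_verify are also exactly the ones whose messages are new since the last run, and deleting current.json makes load_current return None, so everything is verified and printed again.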

Check your understanding: Why is it OK to use the same cache/ folder for blocks, even when checking the blockchain on completely different servers?

Part 2: Flask node to serve cached blocks

Now you will create a new Flask server, similar to static_good.py that you wrote last week, but returning data from information in the cache/ directory instead of hard-coded blocks in the code itself.

That is, your server should respond to two kinds of GET requests:

/head: On getting this request, read the file cache/current.json and return the head hash that is saved in there.

If the current.json file doesn’t exist, you should return HTTP code 503 to indicate a server error.

/fetch/<hash>: Try to read the file cache/<hash>.json, and if it exists, return its contents as a string.

If the file is not found, return an HTTP code 404 - this is a client error for requesting an invalid hash, not a server error.
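
One way to organize the two handlers is as plain functions that return a (body, status code) pair, which is a form a Flask view function is allowed to return directly. The sketch below leaves out the Flask app and routes themselves; the function names and error messages are made up, and the alphanumeric check assumes your hashes contain only letters and digits (adjust it if yours look different).

```python
import json
import os

def head_response(cache_dir="cache"):
    """Body and status code for a GET request to /head."""
    try:
        with open(os.path.join(cache_dir, "current.json")) as fin:
            return json.load(fin)["head"], 200
    except FileNotFoundError:
        # Nothing has been cached yet: this is the server's problem.
        return "no cached chain yet", 503

def fetch_response(block_hash, cache_dir="cache"):
    """Body and status code for a GET request to /fetch/<hash>."""
    # Assumption: hashes are letters and digits only. Rejecting anything
    # else keeps odd requests from naming files outside the cache folder.
    if not block_hash.isalnum():
        return "no block with that hash", 404
    path = os.path.join(cache_dir, block_hash + ".json")
    try:
        # newline="" returns the file contents without newline translation.
        with open(path, encoding="utf-8", newline="") as fin:
            return fin.read(), 200
    except FileNotFoundError:
        # The client asked for a block we don't have: a client error.
        return "no block with that hash", 404
```

Because the files are opened fresh on every request, the responses change as soon as show_chat.py updates the cache, with no server restart needed.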

Notice two things about how this works. First, your server will not work until you finish part 1 and actually run your show_chat.py to connect to some server and download its blocks. After that, your server will essentially be a mirror of the one you just downloaded blocks from.

The second thing to notice is that your server should now be dynamic, based on the most recent time show_chat.py was run. That is, running show_chat.py should update the blocks without the need to restart your server.

Your task

Write a new program server.py that is a Flask app, running on port 5002, responding to /head and /fetch GET requests based on files created in the cache directory from show_chat.py.

Going further: Concurrency

(This part is interesting, useful, and important, but not required.)

Right now, there are probably some race conditions between your show_chat.py program that updates the cache, and your server.py that reads from it. What would happen if you were simultaneously updating the cache and responding to a head or fetch request? Worse yet, what if you were running show_chat.py twice simultaneously?

While these concurrency bugs are very unlikely to happen now, they will become more likely in later labs, and anyway it’s good programming practice to avoid even bugs that are very unlikely to happen.

There are two ways to solve this:

File locks

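Linux supports file locks, which let cooperating programs rely on the OS to ensure that the same file can’t be written twice at the same time, or written at the same time as it’s being read. (These locks are advisory: they only protect against other programs that also ask for the lock.)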

To use this, you want to import the fcntl library in Python, and use fcntl.lockf() after opening a file and before reading or writing its contents. This function will “block” your program until it’s safe to read or write that file. For reading, you want to get a shared lock, and for writing, it should be an exclusive lock.

(You might remember some of this from your systems programming class…)
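
As a starting point, here is a minimal sketch of locked reading and writing with fcntl.lockf(). The function names are made up, and this only works on Unix-like systems such as your Linux VM. One detail to watch: opening a file with mode "w" erases it before you hold the lock, so this sketch opens in append mode and truncates only after the exclusive lock is granted.

```python
import fcntl

def locked_read(path):
    """Read a whole file while holding a shared (read) lock."""
    with open(path, encoding="utf-8", newline="") as fin:
        fcntl.lockf(fin, fcntl.LOCK_SH)   # blocks while a writer has the lock
        try:
            return fin.read()
        finally:
            fcntl.lockf(fin, fcntl.LOCK_UN)

def locked_write(path, contents):
    """Replace a file's contents while holding an exclusive (write) lock."""
    # Mode "a" creates the file if needed but does not erase it on open.
    with open(path, "a", encoding="utf-8", newline="") as fout:
        fcntl.lockf(fout, fcntl.LOCK_EX)  # blocks until no one else holds a lock
        try:
            fout.truncate(0)              # now it is safe to erase old contents
            fout.write(contents)
            fout.flush()                  # push data out before unlocking
        finally:
            fcntl.lockf(fout, fcntl.LOCK_UN)
```

Both show_chat.py and server.py would need to go through functions like these for the locks to have any effect.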

I strongly recommend looking at the Python documentation on fcntl for help on how this works!

Use an actual database

We are using files to sort of act like a database here. But actual database software (among other things) will already do a great job with handling concurrency.

(Note, if you haven’t taken a web or databases class, I recommend sticking with file locks unless you’re very eager to learn.)

So you want to create a new database and use that to store previous blocks and current info instead of storing and reading files in the cache/ directory. You will have to modify both your show_chat.py program and your server.py program accordingly.

I don’t really care what DB software you use, as long as it’s freely available. Note that you have sudo access on your VM so you can install anything you want with sudo apt install <package-name>.

Probably the easiest thing to use is sqlite, which is supported by the built-in sqlite3 package in Python.
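
If you go the sqlite route, the storage layer might look something like the sketch below. The table layout, function names, and the cache.db file name are all just one possible choice, not a required design.

```python
import sqlite3

def open_db(path="cache.db"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    with conn:  # commits automatically if no exception is raised
        conn.execute(
            "CREATE TABLE IF NOT EXISTS blocks ("
            "hash TEXT PRIMARY KEY, contents TEXT NOT NULL)")
        # A single-row table playing the role of cache/current.json.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS current ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "head TEXT NOT NULL, length INTEGER NOT NULL)")
    return conn

def save_block(conn, block_hash, contents):
    """Store a verified block; a block already stored is left alone."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?)",
                     (block_hash, contents))

def load_block(conn, block_hash):
    """Return the stored contents for a hash, or None if not present."""
    row = conn.execute("SELECT contents FROM blocks WHERE hash = ?",
                       (block_hash,)).fetchone()
    return row[0] if row else None

def save_current(conn, head_hash, length):
    """Record the head hash and length of the latest verified chain."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO current (id, head, length) "
            "VALUES (1, ?, ?)", (head_hash, length))

def load_current(conn):
    """Return {"head": ..., "length": ...}, or None if nothing is saved."""
    row = conn.execute(
        "SELECT head, length FROM current WHERE id = 1").fetchone()
    return {"head": row[0], "length": row[1]} if row else None
```

Each change runs inside a transaction, so a reader sees either the old data or the new data, never a half-written block.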
