Lab 3: Caching blocks for a read-only node
So far, we have been constantly fetching blocks, often the same blocks, over and over again. This week, we will add some caching for both fetching and serving blockchains, using the filesystem as a kind of basic database.
Part 0: Get your static servers running from Lab 2
Go back to lab 2 and make sure you complete Part 3, running a static “good” and “bad” blockchain server on ports 5000 and 5001.
Background
Reading and writing entire files in Python
You will need to save and retrieve blocks from files. Because these blocks need to have consistent hashes based on their exact string contents, it is very important to read and write the exact block contents as they are downloaded.
Fortunately, this is really easy in Python! Mostly, you just have to avoid using the line-based functions like print or readline.
Here is a small Python program that reads the entire contents of file1.txt into a string s, and then writes that entire string to file2.txt.
with open('file1.txt') as fin:
    s = fin.read()
with open('file2.txt', 'w') as fout:
    fout.write(s)
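One caveat worth knowing: in its default text mode, Python translates line endings while reading, so a block containing \r\n would come back with plain \n and its hash would change. Below is a minimal sketch that avoids this, assuming blocks are UTF-8 text. The helper names read_exact and write_exact are made up for illustration.

```python
def read_exact(path):
    """Return the file's text with line endings left untouched."""
    # newline='' turns off newline translation; the default would
    # silently turn '\r\n' into '\n' while reading.
    with open(path, encoding='utf-8', newline='') as fin:
        return fin.read()


def write_exact(path, s):
    """Write s to path without translating line endings."""
    with open(path, 'w', encoding='utf-8', newline='') as fout:
        fout.write(s)
```

If your blocks never contain carriage returns, the simpler program above behaves the same way on Linux.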
Telling git to ignore files and directories
Usually, we want our git repos to contain only the source code and other necessary files, but not other stuff like temporary caches, editor savefiles, compiled binaries, etc.
You can tell git to ignore certain files (never add them to the repo, just let them sit on the filesystem) by means of a special file called .gitignore. Notice the . in front of the name!
I recommend you create a .gitignore file in your git repo for this class, with the following contents:
.*.swp
*~
__pycache__/
cache/
The first two lines ignore the temporary files created by vim and emacs. The third line is for the temporary folder that python3 creates sometimes when you run your code, and the last is for the blockchain cache files you will create in this lab.
Once you create your .gitignore file and save it at the top folder of your git repository, be sure to do git add .gitignore and then commit and push. That is, the .gitignore file itself should be part of your repository!
Returning HTTP error codes in Flask
In a Flask web server application, you write a Python function to handle an HTTP request such as a GET request. Normally, we have been returning strings from such requests, like
return "my response"
This actually is shorthand for returning that string as the content of an HTTP 200 “OK” response; you could also do the exact same thing with
return "my response", 200
So 200 is the default response and doesn’t need to be specified, but there are many more HTTP response codes available that might make sense in different situations. For example, if someone tries to fetch a block that doesn’t exist, you might do
return "no block with that hash", 404
Part 1: Caching blocks during show_chat
So far, your show_chat.py program should:
- Take a hostname and port as command-line arguments
- Make a GET request to /head on that server
- Iteratively download the entire blockchain using GET requests to /fetch, verifying each block as it’s fetched
- If everything verifies, print out all the messages
Now you will modify this program so that it also caches the blocks it downloads in a folder called cache.
Importantly, you should only save blocks after you have verified them in the blockchain! Your saved blocks will be kept around indefinitely and will be assumed verified when you read them later. So be sure that a block is valid before you save it.
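The verify-then-save rule can be sketched as a small read-through cache. This is only an illustration under assumptions: fetch and verify are hypothetical callables standing in for your own Lab 2 download and verification code, and get_block and cache_path are made-up names.

```python
import os

CACHE_DIR = 'cache'  # folder name used in this lab


def cache_path(block_hash, cache_dir=CACHE_DIR):
    """File name for a block, derived from its hash."""
    return os.path.join(cache_dir, block_hash + '.json')


def get_block(block_hash, fetch, verify, cache_dir=CACHE_DIR):
    """Return the block text for block_hash, using the cache when possible.

    fetch(block_hash) -> str and verify(block_hash, text) -> bool are
    hypothetical stand-ins for your own code.
    """
    path = cache_path(block_hash, cache_dir)
    if os.path.exists(path):
        # Cached blocks were verified before being saved, so trust them.
        with open(path, encoding='utf-8', newline='') as fin:
            return fin.read()
    text = fetch(block_hash)
    if not verify(block_hash, text):
        raise ValueError('block failed verification: ' + block_hash)
    # Only now, after verification, is it safe to write to the cache.
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, 'w', encoding='utf-8', newline='') as fout:
        fout.write(text)
    return text
```

Your real verification may need more context than a single block (for example, the block that follows it in the chain), so treat the verify signature here as a placeholder.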
Your tasks
Modify show_chat.py as follows. (Tip: Add each piece one step at a time, then test and debug before moving to the next requirement.)
- Create a folder called cache if it doesn’t already exist. (You should have added this folder already to your gitignore; see background above.)
- When fetching blocks, after the block is verified, save it to the file cache/<hash>.json. That is, the name of the file should come from the hash of that block.
- Once the entire chain is verified, save information about it in a dictionary, with two entries: "head" should be the hash value of the head node, and "length" should be the total number of blocks in the chain. Save this dictionary in JSON format to a file cache/current.json.
- Before trying to fetch a block, see if the corresponding file already exists in the cache/ folder, and if so, read it from the file instead of fetching from the server.
- At the beginning of your program, read the hash in cache/current.json, if it exists. When fetching and verifying blocks, use the fact that every block from the specified "head" block back to the genesis block is already verified, and save time by not re-verifying unnecessarily.
- When printing the messages from a verified chain at the end of the program, only show the chat messages more recent than the block specified from the previous cache/current.json file.
  (Note, to see all messages you should be able to just delete cache/current.json and run your program again. You will still get the benefit of any cached blocks, but the chain should be re-verified and all messages printed out.)
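As a rough sketch of the bookkeeping around cache/current.json, the helpers below save and load the dictionary and work out which blocks are new since the last run. All function names are made up, and prev_of is a hypothetical callable that returns the previous block's hash (or None at the genesis block), since the exact block layout depends on your own code.

```python
import json
import os


def load_current(cache_dir='cache'):
    """Return the saved {"head": ..., "length": ...} dict, or None if absent."""
    path = os.path.join(cache_dir, 'current.json')
    if not os.path.exists(path):
        return None
    with open(path, encoding='utf-8') as fin:
        return json.load(fin)


def save_current(head, length, cache_dir='cache'):
    """Record the verified head hash and chain length as JSON."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, 'current.json')
    with open(path, 'w', encoding='utf-8') as fout:
        json.dump({'head': head, 'length': length}, fout)


def unverified_hashes(head, known_head, prev_of):
    """Walk back from head, stopping at known_head or the genesis block.

    Returns (hashes, reached_known), with hashes ordered newest first.
    prev_of(h) is a hypothetical callable giving the previous block's
    hash, or None for the genesis block.
    """
    hashes = []
    h = head
    while h is not None and h != known_head:
        hashes.append(h)
        h = prev_of(h)
    return hashes, h is not None and h == known_head
```

If reached_known is true, only the returned hashes need verifying, the new length is the old length plus the number of returned hashes, and those same blocks hold the messages to print. If it is false, the new chain does not build on the old head and must be checked all the way back to the genesis block.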
Check your understanding: Why is it OK to use the same cache/ folder for blocks, even when checking the blockchain on completely different servers?
Part 2: Flask node to serve cached blocks
Now you will create a new Flask server, similar to static_good.py that you wrote last week, but returning data from information in the cache/ directory instead of hard-coded blocks in the code itself.
That is, your server should respond to two kinds of GET requests:
- /head: On getting this request, read the file cache/current.json and return the head hash that is saved in there. If the current.json file doesn’t exist, you should return HTTP code 503 to indicate a server error.
- /fetch/<hash>: Try to read the file cache/<hash>.json, and if it exists, return its contents as a string. If the file is not found, return an HTTP code 404; this is a client error for requesting an invalid hash, not a server error.
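The two request types above can be sketched as plain functions that return the same (body, status) pairs a Flask view function would return. Flask itself is left out so the sketch runs without it installed. The names head_response and fetch_response are made up, and the alphanumeric check assumes hex-style hashes.

```python
import json
import os

CACHE_DIR = 'cache'


def head_response(cache_dir=CACHE_DIR):
    """Body and status code for GET /head."""
    path = os.path.join(cache_dir, 'current.json')
    try:
        with open(path, encoding='utf-8') as fin:
            return json.load(fin)['head'], 200
    except FileNotFoundError:
        # Nothing has been cached yet: a server-side problem.
        return 'no chain cached yet', 503


def fetch_response(block_hash, cache_dir=CACHE_DIR):
    """Body and status code for GET /fetch/<hash>."""
    # Reject anything that is not plain letters and digits, so a request
    # cannot name a file outside the cache folder (assumes hex-style hashes).
    if not block_hash.isalnum():
        return 'no block with that hash', 404
    path = os.path.join(cache_dir, block_hash + '.json')
    try:
        with open(path, encoding='utf-8', newline='') as fin:
            return fin.read(), 200
    except FileNotFoundError:
        return 'no block with that hash', 404


# In a Flask app these would be called from the route functions, e.g.
#   @app.route('/fetch/<block_hash>')
#   def fetch(block_hash):
#       return fetch_response(block_hash)
```

Because the files are opened on every request, whatever show_chat.py wrote most recently is what gets served, with no restart needed.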
Notice two things about how this works. First, your server will not work until you finish part 1 and actually run your show_chat.py to connect to some server and download its blocks. After that, your server will essentially be a mirror of the one you just downloaded blocks from.
The second thing to notice is that your server should now be dynamic, based on the most recent time show_chat.py was run. That is, running show_chat.py should update the blocks without the need to restart your server.
Your task
Write a new program server.py that is a Flask app, running on port 5002, responding to /head and /fetch GET requests based on files created in the cache directory from show_chat.py.
Going further: Concurrency
(This part is interesting, useful, and important, but not required.)
Right now, there are probably some race conditions between your show_chat.py program that updates the cache, and your server.py that reads from it. What would happen if you were simultaneously updating the cache and responding to a head or fetch request? Worse yet, what if you were running show_chat.py twice simultaneously?
While these concurrency bugs are very unlikely to happen now, they will become more likely in later labs, and anyway it’s good programming practice to avoid even bugs that are very unlikely to happen.
There are two ways to solve this:
File locks
Linux supports file locks, which means you rely on the OS to ensure that the same file can’t be written twice at the same time, or written at the same time as it’s being read.
To use this, you want to import the fcntl library in Python, and use fcntl.lockf() after opening a file and before reading or writing its contents. This function will “block” your program until it’s safe to read or write that file. For reading, you want to get a shared lock, and for writing, it should be an exclusive lock.
(You might remember some of this from your systems programming class…)
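Here is one possible sketch of locked reads and writes with fcntl.lockf(), for Linux only. The helper names are made up. One design choice to note: the writer opens the file in append mode and truncates only after it holds the lock, because opening with 'w' would empty the file before the lock is acquired.

```python
import fcntl


def write_locked(path, text):
    """Replace the contents of path while holding an exclusive lock."""
    # Append mode creates the file if needed but does not truncate it,
    # so a reader holding a shared lock never sees it emptied early.
    with open(path, 'a', encoding='utf-8', newline='') as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until no other process holds a lock
        try:
            f.truncate(0)
            f.write(text)
            f.flush()  # push the data out before releasing the lock
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def read_locked(path):
    """Read the contents of path while holding a shared lock."""
    with open(path, encoding='utf-8', newline='') as f:
        fcntl.lockf(f, fcntl.LOCK_SH)  # several readers may hold this at once
        try:
            return f.read()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

These locks are advisory: they only protect you if every program touching the cache (both show_chat.py and server.py) takes the lock before reading or writing.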
I strongly recommend looking at the Python documentation on fcntl for help on how this works!
Use an actual database
We are using files to sort of act like a database here. But actual database software (among other things) will already do a great job with handling concurrency.
(Note, if you haven’t taken a web or databases class, I recommend sticking with file locks unless you’re very eager to learn.)
So you want to create a new database and use that to store previous blocks and current info instead of storing and reading files in the cache/ directory. You will have to modify both your show_chat.py program and your server.py program accordingly.
I don’t really care what DB software you use, as long as it’s freely available. Note that you have sudo access on your VM so you can install anything you want with sudo apt install <package-name>.
Probably the easiest thing to use is sqlite, which is supported by the built-in sqlite3 package in Python.
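To give a feel for the sqlite option, here is a minimal sketch using the built-in sqlite3 package. The table layout and function names are assumptions made up for illustration, not a required design.

```python
import sqlite3


def open_db(path):
    """Open (and if needed create) the block database at path."""
    db = sqlite3.connect(path)
    db.execute('CREATE TABLE IF NOT EXISTS blocks '
               '(hash TEXT PRIMARY KEY, content TEXT NOT NULL)')
    # A single-row table playing the role of cache/current.json.
    db.execute('CREATE TABLE IF NOT EXISTS current '
               '(id INTEGER PRIMARY KEY CHECK (id = 1), '
               'head TEXT NOT NULL, length INTEGER NOT NULL)')
    db.commit()
    return db


def save_block(db, block_hash, content):
    """Store a verified block; an already-stored hash is left unchanged."""
    with db:  # commits on success, rolls back on error
        db.execute('INSERT OR IGNORE INTO blocks VALUES (?, ?)',
                   (block_hash, content))


def load_block(db, block_hash):
    """Return the stored block text, or None if the hash is unknown."""
    row = db.execute('SELECT content FROM blocks WHERE hash = ?',
                     (block_hash,)).fetchone()
    return None if row is None else row[0]


def save_current(db, head, length):
    """Record the verified head hash and chain length."""
    with db:
        db.execute('INSERT OR REPLACE INTO current VALUES (1, ?, ?)',
                   (head, length))


def load_current(db):
    """Return (head, length), or None if no chain has been recorded."""
    return db.execute(
        'SELECT head, length FROM current WHERE id = 1').fetchone()
```

Both show_chat.py and server.py would call open_db with the same file path. Each write happens inside a transaction, so a reader sees either the old state or the new one, never a half-written block.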