SI 413 Fall 2023 / Labs


This is the archived website of SI 413 from the Fall 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 4: Recursive-Descent Parsing

This lab is due at 2359 on Wednesday, 27 September. It should be submitted to the course SI413 as Lab04, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.

1 Starter Code

This lab starts a new group of labs. This means two things. First, you need to choose a new lab partner and make sure to notate that on the shared Google sheet.

Second, you need to create a new repo for your shared work running the following command. (One lab partner should run this command first to create the shared repo, and then the other lab partner should run it next to clone the repo.)

roche@ubuntu$ 413repo parsing

If that command doesn't work, check the initial setup instructions.

You should now have a folder like si413/parsing/lab04 with the starter code for this lab, the lab04.md file you will need to fill out, and a handy script to submit your code.

You can also browse the starter code from the Gitlab page (USNA intranet only).

2 A different kind of language

In this lab you will implement a recursive descent parser and interpreter for a special language called "pat", described below. You will first implement just the parser part, that parses valid programs and only tells us if a parse error occurs or if the input parsed successfully. Then you will add some actual functionality.

The pat language is a simple language for defining sequences. Here it is in action:

roche@ubuntu$ java PatInterpreter
Enter Pat language input below, followed by Ctrl-D.
pat> a b c;
a b c
pat> [one two three]_r;
three two one
pat> [damn the torpedoes] : X;
damn the torpedoes
pat> hmm X_r;
hmm torpedoes the damn
pat> X * [one two three four five];
damn one the two torpedoes three four five
pat> [lets go]:Var [Var * [a little]]:Var crazy Var_r;
lets go lets a go little crazy little go a lets
pat> goodbye

"Symbols" are alpha-numeric strings beginning with lower-case letters (such as 'a', 'b', or 'cat'). Pattern variables are alpha-numeric strings beginning with upper-case letters. Square brackets are used for grouping. A sequence followed by : NAME is assigned to a variable as a side effect. That variable is in scope from that moment on until the interpreter is exited (with Ctrl-d). The _r operator reverses a sequence. Its precedence (and variable assignment) are higher than concatenation, so a b [c d]_r gives a b d c, not d c b a. Finally, there's an operator * that interleaves two sequences, like

pat> [a b c] * [x y z];
a x b y c z
pat> [a b c d e] * [x y];
a x b y c d e

This operator has lowest precedence, so the [ ]'s above are unnecessary. If the interleaved sequences have different lengths, the unmatched extra characters in the longer one are just written out sequentially at the end.

Here are the tokens for the pat language:

SYM:     [a-z][a-zA-Z0-9]*
FOLD:    "*"
STOP:    ";"
COLON:   ":"
NAME:    [A-Z][a-zA-Z0-9]*
REV:     "_r"
LB:      "["
RB:      "]"

And here is the grammar for a pat program, which is just a sequence of ;-terminated pat expressions:

progseq STOP prog | EOF
seqseq FOLD catseq | catseq
catseqcatseq opseq | opseq
opseqopseq COLON NAME | opseq REV | atom
atomSYM | NAME | LB seq RB

NOTE: The grammar above is not LL(1), meaning it cannot be used for the recursive-descent parsers you are writing in this lab! You can easily see that it's not LL(1) because for example it has left recursion in the non-terminals seq, catseq, and opseq.

Instead, for this lab, use the following rewritten grammar which is LL(1) and suitable for top-down parsing. (Notice it has no left recursion or common prefixes!)

progseq STOP prog | EOF
seqcatseq seqtail
seqtailFOLD catseq seqtail | ε
catseqopseq cattail
cattailopseq cattail | ε
opseqatom optail
optailCOLON NAME optail | REV optail | ε
atomSYM | NAME | LB seq RB

3 Getting started

3.1 Installing Java 17

First things first: for the Java code in this course, we need Java 17 or later. To check what version you have, try running these commands; you should see similar output to this:

roche@ubuntu$ javac -version
javac 17.0.8.1

roche@ubuntu$ java -version
openjdk version "17.0.8.1" 2023-08-24
OpenJDK Runtime Environment (build 17.0.8.1+1-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 17.0.8.1+1-Ubuntu-0ubuntu122.04, mixed mode, sharing)

On the lab machines, this version should already be installed. In WSL or your VM, you can upgrade to Java 17 by running

sudo apt update; sudo apt install openjdk-17-jdk

3.2 Starter code: what we provide

Now go into your git repo for this lab. The starter code we provide should have these 4 classes which are already completely written:

  • Lines.java: Helper class to read input lines with a nice prompt string like "pat>".
    You don't need to worry about how this works.
  • PatToken.java: A single token in the Pat language, which consists of a token type and the literal token "text". For example, if the input is one two_r : VAR;, the first token has type PatToken.Type.SYM and has text "one".
    Take a look at this class. you will need to use (but not modify) it.
  • PatLexer.java: A lexer (scanner) for Pat langauge tokens using regular expressions, similar to the RegexCalcLexer.java example from the Unit 4 notes
    Take a look at this class. For this lab, you need to be able to use the public methods such as peek() and match().
  • PatError.java: A simple exception class to represent scanning, parsing, or interpretation errors in the Pat langauge that your code identifies.
    Take a look at this class. you will need to use (but not modify) it.
  • Makefile: Makefiles aren't a big thing for Java, but here is one anyway. If you type make, it will use javac to compile your code. If you type make run, it will try to run your PatInterpreter program which you have to fill-in for this lab.

4 A recursive descent parser for pat: Part I

In this part of the lab, you will use the PatLexer program we provide, and fill in the PatParser.java program to create a recursive descent parser for the Pat language.

In the starter code, the main method for PatParser is already filled in for you (take a look!).

The method you need to fill in is prog(), which is supposed to parse the prog non-terminal in the grammar, which is a series of zero or more seqs each ending with a semicolon STOP token, terminated by EOF.

Your completed prog() method should print "Parse OK" for each complete sequence-followed-by-semicolon, and throw an instance of PatError if anything goes wrong.

To make prog() work, you will need helper methods for every non-terminal in the grammar. Many of these will be recursive or mutually-recursive --- that's why it's called a recursive-descent parser!

Now you are write a recursive descent top-down parser for this grammar. You should refer to the recursive descent calculator from Unit 4 to see how this works. The provided starter files above should help get you started.

Hints:

  • Look back at the code for the recursive descent calculator parser RdCalcParser.java from the Unit 4 notes.
  • Your code for this part will actually much simpler than that from the recursive-descent calculator example, because right now all you're really doing is making (recursive helper) function calls, calling match, and never returning or printing anything except "Parse OK".
  • You should only have to modify the PatParser.java file to get this part working.
  • Top-down parsers are based on the PREDICT sets for each production rule of a nonterminal. You can use the Scheme program predfol.scm from the Unit 4 notes (which should be in your starter code) to compute the PREDICT sets for any grammar you wish. Use that information!
  • To give you a more concrete idea of what you need to do and get the ball rolling, here is my complete solution for the seqtail helper method:
    void seqtail() throws PatError {
        switch (lexer.peek().getType()) {
            case FOLD:
                lexer.match(PatToken.Type.FOLD);
                catseq();
                seqtail();
                break;
            case STOP: case RB:
                break;
            default:
                error("seqtail");
        }
    }

Exercises

  1. Write a recursive descent parser for the pat language and implement it in the file PatParser.java. When a complete sequence followed by semicolon is parsed correctly, your parser just needs to print "Parse OK". If there is an error, you should throw an instance of the PatError class. For example:

    roche@ubuntu$ make run_PatParser
    Parser test program - will parse and recognize complete statements.
    Enter Pat language input below, followed by Ctrl-D.
    pat> a b c;
    Parse OK
    pat> [a b c]:X X_r;
    Parse OK
    pat> [ a b : c d ];
    PatError: PatLexer mismatch: expected NAME, got SYM 'c'
    Important: For testing error cases, we won't check the exact wording of your error messages, because there are different valid ways to write the parser which cause syntax errors in different methods. But we will check that your program actually throws PatError instead of just crashing. You need to exit gracefully in case of errors!

5 Stop, copy, and roll

When you finish Part I, copy all of your recursive rule methods from PatParser.java into the starter code for PatInterpreter.java. And if you get here during the lab time, flag down the instructor and show off your Part I!

6 A recursive descent paser for pat: Part II

Now it's time to build a functioning interpreter for the pat mini-language. I suggest you look at the recursive-descent parser and interpreter for the calculator language RdCalcParser.java from the Unit 4 notes. Make sure you understand how it works. See how we evaluate across those awkward "tail" rules?

In order to get this to work, I recommend you modify the signatures all of your helper functions as follows:

  • Every non-terminal function except prog will return a List<String>, corresponding to the sequence of symbols that it just evaluated.
  • All of the tail rules should also take a List<String> argument, corresponding to the sequence to the left of that tail rule that may need to be combined or operated on in some way.

Hints:

  • You should still only need to mess with PatInterpreter.java for this part.
  • Notice that the PatToken class includes a getText() function to return the literal source text that matched the token type. You didn't have to use this in Part I because all that matters for checking syntax is whether the right token types appear in the right places. But now you will have to (in some cases) use getText() to exctract the actual strings, like the symbol that a SYM token corresponds to.
  • Your code for everything can still be pretty short! We kindly provide a fold() helper method in the starter code, and Java has a lot of built-in functions that can be really useful for this lab, such as:
  • Instead of implementing the whole thing at once, pretend that the "tail" rules all just go to ε, and ignore variables. Just return empty List<String>'s for the cases you want to ignore. Get that working then add the non-ε cattail rule. Get that working then add the rest. Leave variables for last.
  • Be careful about modifying lists that are passed into your helper functions. You might need to make a copy by calling passing the existing list to an ArrayList constructor.
  • You'll need a symbol table for pat-language variables. I suggest adding a field to store the variable values with type Map<String, List<String>> symbolTable;
  • As with the first part, your interpreter must throw a PatError in any error that occurs with scanning, parsing, or interpreting the input. Crucially, this includes not only syntax errors now, but also if someone uses a variable before it is ever defined.
  • Again to be nice and help you get the ball rolling, here is a sample solution for one way to write the seqtail() method:
    List<String> seqtail(List<String> lhs) throws PatError {
        switch (lexer.peek().getType()) {
            case FOLD:
                lexer.match(PatToken.Type.FOLD);
                List<String> rhs = catseq();
                List<String> combined = fold(lhs, rhs);
                return seqtail(combined);
            case STOP: case RB:
                return lhs;
            default:
                error("seqtail");
                return null;
        }
    }

Exercises

  1. In the PatInterpreter.java file, extend your parser from Part I to actually interpret the language.

7 OPTIONAL: Efficiency

A hallmark feature of any good compiler or interpreter is how fast programs will run. The interpreter you created in Part II is good, but will probably run out of stack space if you run it on too big of a program.

For example, try this bash one-liner which just repeats the same Pat symbol a 10 thousand times in a single statement, and feeds that into your interpreter. You will probably get a StackOverflowException:

echo "$(printf 'a %.0s' {1..10000});" | java PatInterpreter >/dev/null

The issue is that (unlike Scheme!!) Java compilers do not optimize tail-recursive calls. So in this case, all of those recursive calls to catseq-cattail-cattail-cattail-cattail... all have to live in stack frames until the last one returns, and there is only so much memory allocated for that before Java gives up and crashes your program.

To make this more efficient, rewrite your interpreter to avoid making any calls to the "tail" functions, and instead do the same thing with loops in your other helper functions. So for instance, insted of catseq() calling opseq() and then cattail(), it will instead have a loop to keep calling opseq() repeatedly until it gets to the end.

If you do this, add a comment line like

// NOTAIL
to any function which you fix in this way. (And of course, test your code to make sure you didn't break it!)