Lab 5: Recursive-Descent LL Parsing

This is the archived website of SI 413 from the Fall 2012 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

This lab is due at 0900 next Tuesday, October 2. It should contain (at least) the two files pat1.cpp and pat2.cpp, as well as two subfolders called tests1 and tests2 with your tests. See the submit page for more info.

Overview

In this lab you will implement a recursive descent parser and interpreter for a special language called "pat", described below. You will first implement just the parser part, that parses valid programs and does nothing. Then you will add some actual functionality.

Here are the files you will need to get started:

You can get all these files at once by downloading the file lab05.tar.gz and running tar xzvf lab05.tar.gz

A different kind of language

The pat language is a simple language for defining sequences. Here are some examples:

$ ./pat2
> a b c;
a b c 
> [a b c]_r;
c b a 
> [a b b d]:X;
a b b d 
> X X_r;
a b b d d b b a 
> [a b c [d e]:X f g h] X_r;
a b c d e f g h e d

"Symbols" are alpha-numeric strings beginning with lower-case letters (such as 'a', 'b', or 'cat'). Pattern variables are alpha-numeric strings beginning with upper-case letters. Square brackets are used for grouping. A sequence followed by : NAME is assigned to a variable as a side effect. That variable is in scope from that moment on until the interpreter is exited (with Ctrl-d). The _r operator reverses a sequence. Its precedence (and variable assignment) are higher than concatenation, so a b [c d]_r gives a b d c, not d c b a. Finally, there's an operator * that interleaves two sequences, like

$ ./pat2
> [a b c] * [x y z];
a x b y c z
> [a b c d e] * [x y];
a x b y c d e

This operator has lowest precedence, so the [ ]'s above are unnecessary. If the interleaved sequences have different lengths, the unmatched extra characters in the longer one are just written out sequentially at the end.

Here are the tokens for the pat language:

SYM:     [a-z][a-zA-Z0-9]*
FOLD:    "*"
STOP:    ";"
COLON:   ":"
NAME:    [A-Z][a-zA-Z0-9]*
REV:     "_r"
LB:      "["
RB:      "]"

And here is the grammar for a pat program, which is just a sequence of ;-terminated pat expressions:

S → seq STOP
seq → seq FOLD catseq | catseq
catseq → catseq opseq | opseq
opseq → opseq COLON NAME | opseq REV | atom
atom → SYM | NAME | LB seq RB

Testing

Testing for this lab will be a little different, for two reasons. First, we need a way to test for errors, without having to worry about what the specific error message is. Second, you are writing two programs for this lab, and we need a way to test them separately.

Testing for errors is easy. If any test file ends with .err, then it's an error test. What that means is that this input will be fed to the program being tested, and if the result is any kind of error (i.e., exit value not equal to zero), then the test passes. Because we don't care about how the error message (if any) is printed, these tests shouldn't contain any semicolon split line or output part.

For example, the following file pub08.err is part of the sanity checks for part 1:

[ a b : c d ];

If the interpreter gives any kind of error on this input, then the test passes.

The second difference with testing for this lab is that there are two parts below, Part I and Part II, and each part needs to have separate tests. This is easy: your tests for part I go in the tests1 folder, and your tests for part II go in the tests2 folder.

A recursive descent parser for pat: Part I

In this part of the lab, I provide a flex scanner file and an abstract grammar for the language, and you create the recursive descent parser. For the moment, your parser should simply accept valid program strings and print out an error message and exit in the presence of a syntax error. This will be similar to the basic recursive-descent parser from Unit 4: Look at the file rdcalc.cpp and make sure you understand how it works.

The grammar above is probably the natural grammar for this language, at least if you want to specify the associativities and precedences of the language's operators. However, it is not appropriate for LL (top-down) parsing. Please stop for a moment, look at the grammar and make sure you understand why. Which rules are troublesome?

For purposes of this lab, I'm going to give you a rewriting of the grammar in a form that is amenable to top-down parsing (though I'd like you to be able to do it yourselves!):

S → seq STOP
seq → catseq seqtail
seqtail → FOLD catseq seqtail | ε
catseq → opseq cattail
cattail → opseq cattail | ε
opseq → atom optail
optail → COLON NAME optail | REV optail | ε
atom → SYM | NAME | LB seq RB

Now you are write a recursive descent top-down parser for this grammar. You should refer to the recursive descent calculator from Unit 4 to see how this works. The provided starter files above should help get you started.

Hints:

You should only have to modify the pat1.cpp file to get this part working. That is, the scanner should work as-is.
The enum in pat.hpp defines constant integer values for the token types, counting up from 1. On end-of-file, the scanner returns 0 instead of a valid token type.
Your code for this part will actually much simpler than that from the recursive-descent calculator, because right now all you're really doing is making (recursive helper) function calls, calling match, and never returning or printing anything.

Exercises

Write a recursive descent parser for the pat language and implement it in the file pat1.cpp. When your recursive descent parser works, you should be able to enter statement after statement with no feedback, until a syntax error creates a message and aborts the program, or you insert EOF manually with Ctrl-D. For example:
```
$ ./pat1
> a b c;
(Parse OK)
> [a b c]:X X_r;
(Parse OK)
> [ a b : c d ];
Token match error
```
You should write at least 4 tests for this part, that are not the same as the public tests. Since you are just testing whether the input is accepted by the parser or not, roughly half of your tests should be error tests. Important: the "(Parse OK)" printouts are not part of the output; they are just there for your convenience. All of your non-error tests should have the output space be completely blank. See the public tests for examples.

Stop, copy, and roll

When you finish Part I, copy your do-nothing recursive-descent parser into the pat2.cpp file. And if you get here during the lab time, flag down the instructor and show off your Part I!

A recursive descent paser for pat: Part II

Now it's time to build a functioning interpreter for the pat mini-language. I suggest you look at the recursive-descent parser and interpreter for the calculator language from Unit 4 (rdcalc.cpp) and make sure you understand how it works. See how we evaluate across those awkward "tail" rules?

Now, for your interpreter I suggest that the tokens have C++ string objects as semantic values, and that non-terminals have vector<string> objects for semantic values, i.e. that's what the grammar rule functions return. (Since every token will have the same type, you don't need to bother with any union types.)

Specifically, every non-terminal function will take no arguments and return a vector of strings, except that stmt won't have any return value, and all the tail rules will also take an argument that is a vector of strings.

Because I like you, I've implemented a few helful helper functions, to perform the fold (*) operation, concatenation and the reverse (_r) operations. You can copy these right into the top of your pat2.cpp file:

#include <map>
#include <vector>
 
// Prints out a vector of strings with spaces in between
// You can call this just like resout << some_vector << endl;
ostream& operator<< (ostream& out, const vector<string>& vec) {
  if (vec.empty()) return out;
  out << vec[0];
  for (unsigned long i=1; i<vec.size(); ++i)
    out << ' ' << vec[i];
  return out;
}
 
// Computes the "fold" or interleaving of two vectors of strings
vector<string> fold(const vector<string> &A, const vector<string> &B) {
  vector<string> res;
  unsigned long int i = 0;
  for (; i < A.size() && i < B.size(); ++i) {
    res.push_back(A[i]);
    res.push_back(B[i]);
  }
  for (; i < A.size(); ++i)
    res.push_back(A[i]);
  for (; i < B.size(); ++i)
    res.push_back(B[i]);
  return res;
}
 
// Concatenates two vectors of strings
vector<string> concat(const vector<string> &A, const vector<string> &B) {
  vector<string> res;
  unsigned long int i;
  for(i = 0; i < A.size(); ++i) res.push_back(A[i]);
  for(i = 0; i < B.size(); ++i) res.push_back(B[i]);
  return res;
}
 
// Reverses a vector of strings
vector<string> rev(const vector<string> &A) {
  vector<string> res;
  long int i;
  for(i = A.size() - 1; i >= 0; --i) res.push_back(A[i]);
  return res;
}

Hints:

You should still only need to mess with pat2.cpp for this part. You might notice that certain tokens already return semantic values in the pat.lpp file provided, and the match function I provided returns this string value to your recursive descent parser functions. (The basic parser from part I should just ignore these return types.)
Instead of implementing the whole thing at once, pretend that the "tail" rules all just go to ε, and ignore variables. Just return empty vector<string>'s for the cases you want to ignore. Get that working then add the non-ε cattail rule. Get that working then add the rest. Leave variables for last.
Pass vectors by reference ... make them const too, it'll just make life easier.
Don't define variables inside a switch/case statement. There are potential problems that are best simply avoided. Instead, define them before the switch, and assign values inside the switch/case.
You'll need a symbol table for pat-language variables. Once again we'll use maps, but this time we're mapping strings to vectors of strings, i.e.:
```
map<string, vector<string> > symTable;
```
Make sure you use the spaces in between > and >. Why? Because C++'s scanner treats >> as a single token, not as two >'s.

Exercises

In the pat2.cpp file, extend your parser from Part I to actually interpret the language. You should have at least 4 more tests for this part, in the folder tests2/.

Enrichment: Efficiency

A hallmark feature of any good compiler or interpreter is how fast programs will run. The interpreter you created in Part II is probably quite inefficient in the way it handles memory. In particular, passing around vector<string>'s by value and returning them from functions involves a lot of memory copying. Much of this is unnecessary.

Re-write your interpreter from Part II to use memory more efficiently. In particular, your recursive descent functions should all return void, and they should store their results in the first argument, which should be non-const and passed by reference. For instance, the signature for cattail might be:

void cattail(vector<string>& vec);

With this, you should be able to make your recursive descent functions tail recursive. gcc will actually optimize for tail recursion in your programs (just like DrScheme would), but only if you tell it to with an optimization flag like -O2. You can insert this into the Makefile.

(Nothing to submit for this part. You can feel free to include your amped-up interpreter in what you submit for Part II, but make sure it doesn't break what you already had working!)