Lab 4: Parse Lab I

You may work in pairs for this lab. If you choose to do so, only have one partner submit the required files electronically. Be sure to include both partners' names at the top of every submitted file.

In this lab you will implement a recursive descent parser and interpreter for a special language called pat. The lab is due at the beginning of your next lab period. You should submit a folder called lab04 containing all of the following files:

features.txt: Describes how far you got in a computer-readable format. This format is basically the same as in Lab 3, but here there are only two "features", Part I and Part II. For each "feature" (part), you just need to indicate how far you got. If you think you completed that part perfectly, you can leave the description blank.
pat.h: The header file describing the scanner-parser interface.
pat.lpp: Flex input file describing the scanner.
pat.cpp: The recursive descent parser from Part I. This should just read a program in the "pat" language and do nothing unless a scanner/parser error occurs.
pat2.cpp: The recursive descent parser and interpreter for the "pat" language from Part II. This should scan, parse, and execute the actions specified in the language, and print the result of each expression.
Makefile: Should have rules to compile the pat and pat2 programs. The rules in the included Makefile will probably suffice.

Starter code for all these files has been generously provided for you: download and extract lab04.tar.gz. (On any unix/linux machine, type tar xzvf lab04.tar.gz to create a subdirectory lab04 and extract the files into it.) You can also click on the links in the list above to get the files individually.

A different kind of language

The pat language is a simple language for defining sequences. Here are some examples:

> ./pat2
a b c;
a b c 
[a b c]_r;
c b a 
[a b b d]:X;
a b b d 
X X_r;
a b b d d b b a 
[a b c [d e]:X f g h] X_r;
a b c d e f g h e d

"Symbols" are alpha-numeric strings beginning with lower-case letters (such as 'a', 'b', or 'cat'). Pattern variables are alpha-numeric strings beginning with upper-case letters. Square brackets are used for grouping. A sequence followed by : NAME is assigned to a variable as a side effect. That variable is in scope from that moment on until the interpreter is exited (with ctrl-d). The _r operator reverses a sequence. Both _r and variable assignment have higher precedence than concatenation, so a b [c d]_r gives a b d c, not d c b a. Finally, there's an operator * that interleaves two sequences, for example:

[a b c] * [x y z];
a x b y c z
[a b c d e] * [x y];
a x b y c d e

This operator has lowest precedence, so the [ ]'s above are unnecessary. If the interleaved sequences have different lengths, the unmatched extra symbols in the longer one are just written out sequentially at the end.
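
The interleaving rule can be sketched in C++ as follows. This is only an illustration of the semantics described above: the name interleave is hypothetical, and the helper provided in the starter pat2.cpp may be written differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration of the * operator; not the starter-code helper.
// Alternates elements of a and b, then appends whatever is left over
// from the longer sequence.
std::vector<std::string> interleave(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b) {
  std::vector<std::string> out;
  out.reserve(a.size() + b.size());
  std::size_t i = 0;
  // Take one element from each while both still have elements.
  for (; i < a.size() && i < b.size(); ++i) {
    out.push_back(a[i]);
    out.push_back(b[i]);
  }
  // At most one of these loops does any work: copy the unmatched tail.
  for (std::size_t j = i; j < a.size(); ++j) out.push_back(a[j]);
  for (std::size_t j = i; j < b.size(); ++j) out.push_back(b[j]);
  return out;
}
```

Calling interleave on the sequences a b c d e and x y yields a x b y c d e, matching the second example above.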

Here are the tokens for the pat language:

SYM:     [a-z][a-zA-Z0-9]*
FOLD:    "*"
STOP:    ";"
COLON:   ":"
NAME:    [A-Z][a-zA-Z0-9]*
REV:     "_r"
LB:      "["
RB:      "]"

And here is the grammar for a pat program, which is just a sequence of ;-terminated pat expressions:

S → seq STOP S | ε
seq → seq FOLD catseq | catseq
catseq → catseq opseq | opseq
opseq → opseq COLON NAME | opseq REV | atom
atom → SYM | NAME | LB seq RB

A recursive descent parser for pat: Part I

In this part of the lab, I provide a flex scanner file and an abstract grammar for the language, and you create the recursive descent parser. For the moment, your parser should simply accept valid program strings and print out an error message and exit in the presence of a syntax error. This will be similar to the basic recursive-descent parser from Class 9: Look at the file rdcalc.cpp and make sure you understand how it works.

The grammar above is probably the natural grammar for this language, at least if you want to specify the associativities and precedences of the language's operators. However, it is not appropriate for LL (top-down) parsing. Please stop for a moment, look at the grammar and make sure you understand why. Which rules are troublesome?

For purposes of this lab, I'm going to give you a rewriting of the grammar in a form that is amenable to top-down parsing (though I'd like you to be able to do it yourselves!):

S → seq STOP S | ε
seq → catseq seqtail
seqtail → FOLD catseq seqtail | ε
catseq → opseq cattail
cattail → opseq cattail | ε
opseq → atom optail
optail → COLON NAME optail | REV optail | ε
atom → SYM | NAME | LB seq RB
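
As a rough illustration of how one of these "tail" rules turns into a function, here is a sketch of a recognizer for optail alone. Everything except the rule itself is an assumption made for the sketch: the Tok enum and the Stream struct are hypothetical stand-ins for the enum in pat.h and the flex scanner, and a parse error is reported by throwing rather than by printing a message and exiting.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical token codes, used only for this sketch;
// the real codes come from the enum in pat.h.
enum Tok { END = 0, SYM, FOLD, STOP, COLON, NAME, REV, LB, RB };

// Hypothetical token stream standing in for the flex scanner.
struct Stream {
  std::vector<Tok> toks;
  std::size_t pos = 0;
  // Look at the next token without consuming it.
  Tok peek() const { return pos < toks.size() ? toks[pos] : END; }
  // Consume the next token, which must be t.
  void match(Tok t) {
    if (peek() != t) throw std::runtime_error("Parse error!");
    ++pos;
  }
};

// optail -> COLON NAME optail | REV optail | epsilon
void optail(Stream& s) {
  switch (s.peek()) {
    case COLON:
      s.match(COLON);
      s.match(NAME);
      optail(s);
      break;
    case REV:
      s.match(REV);
      optail(s);
      break;
    default:
      // epsilon: consume nothing and leave the next token for the caller
      break;
  }
}
```

The point to notice is the default case: when the next token cannot start either non-empty alternative, the function simply returns without consuming anything.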

Now you are to write a recursive descent top-down parser for this grammar. You should refer to Class 9 to see how this works. The provided files pat.h, pat.lpp, pat.cpp, and Makefile should help get you started.

Hints:

You should only have to modify the pat.cpp file to get this part working. That is, the scanner should work as-is.
The enum in pat.h defines constant integer values for the token types, counting up from 1. On end-of-file, the scanner returns 0 instead of a valid token type.
The grammar here defines the start symbol as a sequence of seq STOP pairs. Note that this is different from the calculator language, where the start symbol is only a single statement. What this means for you is that your main function will be a little simpler than in the calculator parser, but the function for the start symbol will be a little more complicated.

When your recursive descent parser works, you should be able to enter statement after statement with no feedback, until a syntax error causes a message and aborts the program. For example:

> ./pat
a b c;
[a b c]:X X_r;
[ a b : c d ];
Parse error!

Stop, copy, and roll

When you finish Part I, copy your do-nothing recursive-descent parser into the pat2.cpp file. And if you get here during the lab time, flag down the instructor and show off your Part I!

A recursive descent parser for pat: Part II

Now it's time to build a functioning interpreter for the pat mini-language. I suggest you look at the recursive-descent parser and interpreter for the calculator language from Class 9 (calc.h, calc.lpp, rdcalc2.cpp) and make sure you understand how it works. See how semantic values (the union type TokenSemantic for the calculator) are handled? See how we evaluate across those awkward "tail" productions?

Now, for your interpreter I suggest that the tokens have C++ string objects as semantic values, and that non-terminals have vector<string> objects for semantic values, i.e. that's what the grammar rule functions return. Since every token will have the same type, you don't need to bother with any union types. And because I like you, I've implemented a few helpful functions to perform the fold (*), concatenation, and reverse (_r) operations. These are already included in pat2.cpp, the starter file for this part.
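
One common way to evaluate across a "tail" production is to pass the value computed so far into the tail function. The sketch below shows that idea on a cut-down version of the seq and seqtail rules. It is not the calculator code from Class 9 and not the starter code: Input is a hypothetical stand-in for the scanner, foldSketch is a stand-in for the provided fold helper, and catseq is simplified to a plain run of symbols with no error checking.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Seq;

// Hypothetical stand-in for the scanner: a list of lexemes and a cursor.
struct Input {
  std::vector<std::string> lexemes;
  std::size_t pos = 0;
  bool atEnd() const { return pos >= lexemes.size(); }
  bool atFold() const { return !atEnd() && lexemes[pos] == "*"; }
};

// Stand-in for the provided fold helper (the name is an assumption).
Seq foldSketch(const Seq& a, const Seq& b) {
  Seq out;
  for (std::size_t i = 0; i < a.size() || i < b.size(); ++i) {
    if (i < a.size()) out.push_back(a[i]);
    if (i < b.size()) out.push_back(b[i]);
  }
  return out;
}

// Simplified catseq: the run of symbols up to the next "*" or end of input.
Seq catseq(Input& in) {
  Seq out;
  while (!in.atEnd() && !in.atFold()) out.push_back(in.lexemes[in.pos++]);
  return out;
}

// seqtail -> FOLD catseq seqtail | epsilon
// 'left' is the value of everything parsed so far.
Seq seqtail(Input& in, const Seq& left) {
  if (!in.atFold()) return left;  // epsilon: nothing more to fold in
  ++in.pos;                       // consume FOLD
  Seq right = catseq(in);
  // Folding before recursing keeps * left-associative.
  return seqtail(in, foldSketch(left, right));
}

// seq -> catseq seqtail
Seq seq(Input& in) {
  Seq left = catseq(in);
  return seqtail(in, left);
}
```

Because the fold is applied before the recursive call, an input such as a b * x * p q r is evaluated as (a b * x) * p q r, which is the grouping the original left-recursive grammar specifies.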

Hints:

In order to add semantic values to be returned with tokens, you will have to modify the files pat.h and pat.lpp. However, your modifications should not affect the original pat program from Part I. (The basic parser will just ignore these return types.)
The included helper functions will be useful - so use them! You will still have to write your own code to print out an array of strings, though.
Instead of implementing the whole thing at once, pretend that the "tail" rules all just go to ε, and ignore variables. Just return empty vector<string>'s for the cases you want to ignore. Get that working then add the non-ε cattail rule. Get that working then add the rest. Leave variables for last.
Pass vectors by reference, and make them const too; it'll just make life easier. Remember to #include <string> and <vector>.
Don't define variables inside a switch/case statement. There are potential problems that are best simply avoided. Instead, define them before the switch, and assign values inside the switch/case.
You'll need a symbol table for pat-language variables. Once again we'll use maps, but this time we're mapping strings to vectors of strings, i.e.:
```
map<string, vector<string> > symTable;
```
Make sure you use the spaces in between > and >. Why? Because C++'s scanner treats >> as a single token, not as two >'s. (Compilers following C++11 or later accept >> in this position, but the spaces are always safe.)
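
Here is a minimal sketch of how such a symbol table might be used. The function names assignVar and lookupVar are hypothetical, and returning an empty sequence for an undefined variable is an assumption: the lab does not say what should happen in that case.

```cpp
#include <map>
#include <string>
#include <vector>

using std::map;
using std::string;
using std::vector;

map<string, vector<string> > symTable;

// Record the value for "seq : NAME" and hand the same value back,
// since the assignment is only a side effect of the expression.
vector<string> assignVar(const string& name, const vector<string>& value) {
  symTable[name] = value;
  return value;
}

// Look up NAME. Uses find() rather than operator[] so that reading an
// undefined variable does not silently create an entry for it.
// Assumption: an undefined variable evaluates to the empty sequence.
vector<string> lookupVar(const string& name) {
  map<string, vector<string> >::const_iterator it = symTable.find(name);
  if (it == symTable.end()) return vector<string>();
  return it->second;
}
```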

Enrichment: Efficiency

A hallmark feature of any good compiler or interpreter is how fast programs will run. The interpreter you created in Part II is probably quite inefficient in the way it handles memory. In particular, passing around vector<string>'s by value and returning them from functions involves a lot of memory copying. Much of this is unnecessary.

Re-write your interpreter from Part II to use memory more efficiently. In particular, your recursive descent functions should all return void, and they should store their results in the first argument, which should be non-const and passed by reference. For instance, the signature for cattail might be:

void cattail(vector<string>& vec);

With this, you should be able to make your recursive descent functions tail recursive. gcc will actually optimize for tail recursion in your programs (just like DrScheme would), but only if you tell it to with an optimization flag like -O2. You can insert this into the Makefile.
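
As a rough sketch of the out-parameter style, here is a cut-down cattail that handles plain symbols only. The Input struct is a hypothetical stand-in for the scanner, which is why this version takes a second argument that the signature above does not have.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using std::string;
using std::vector;

// Hypothetical stand-in for the scanner: lexemes plus a cursor.
struct Input {
  vector<string> lexemes;
  std::size_t pos = 0;
  // True if the next lexeme exists and is not the statement terminator.
  bool atSymbol() const {
    return pos < lexemes.size() && lexemes[pos] != ";";
  }
};

// Simplified cattail -> SYM cattail | epsilon
// Appends directly into vec instead of building and returning a copy.
void cattail(vector<string>& vec, Input& in) {
  if (!in.atSymbol()) return;  // epsilon
  vec.push_back(in.lexemes[in.pos++]);
  cattail(vec, in);            // the recursive call is the last action
}
```

Since nothing remains to be done after the recursive call returns, the call is in tail position, which is the shape the optimizer needs.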

(Nothing to submit for this part. You can feel free to include your amped-up interpreter in what you submit for Part II, but make sure it doesn't break what you already had working!)