Lab 5: Recursive-Descent LL Parsing
This lab is due at 2359 on Wednesday, 6 October. It should be submitted to the course SI413 as Lab05, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.
1 Starter Code
You should already have a shared git repository parsing
.
To get the starter code for today's lab, cd to your
si413/parsing
directory and run these commands:
git pull starter main
git pull
You should now have a folder like
si413/parsing/lab05
with the starter code for this lab and a handy script to submit your code.
You can also browse the starter code from the Gitlab page (USNA intranet only).
2 A different kind of language
In this lab you will implement a recursive descent parser and interpreter for a special language called "pat", described below. You will first implement just the parser part, that parses valid programs and only tells us if a parse error occurs or if the input parsed successfully. Then you will add some actual functionality.
The pat
language is a
simple language for defining sequences. Here are some
examples:
roche@ubuntu$
./pat2
>
a b c;
a b c
>
[a b c]_r;
c b a
>
[a b b d] : X;
a b b d
>
X X_r;
a b b d d b b a
>
[a b c [d e]:X f g h] X_r;
a b c d e f g h e d
"Symbols" are alpha-numeric
strings beginning with lower-case letters (such as 'a', 'b', or 'cat').
Pattern variables are alpha-numeric strings beginning with upper-case
letters. Square brackets are used for grouping. A sequence
followed by : NAME
is assigned to a variable
as a side effect. That variable is in scope from that moment
on until the interpreter is exited (with Ctrl-d). The
_r
operator reverses a sequence. Its precedence
(and variable assignment) are higher than concatenation, so
a b [c d]_r
gives a b d c
,
not d c b a
. Finally, there's an
operator *
that interleaves two sequences, like
roche@ubuntu$
./pat2
>
[a b c] * [x y z];
a x b y c z
>
[a b c d e] * [x y];
a x b y c d e
This operator has lowest precedence, so the [ ]'s above are unnecessary. If the interleaved sequences have different lengths, the unmatched extra characters in the longer one are just written out sequentially at the end.
Here are the tokens for the pat
language:
SYM: [a-z][a-zA-Z0-9]* FOLD: "*" STOP: ";" COLON: ":" NAME: [A-Z][a-zA-Z0-9]* REV: "_r" LB: "[" RB: "]"
And here is the grammar for a pat
program,
which is just a sequence of ;
-terminated pat
expressions:
S → seq STOP
seq → seq FOLD catseq | catseq
catseq → catseq opseq | opseq
opseq → opseq COLON NAME | opseq REV | atom
atom → SYM | NAME | LB seq RB
3 A recursive descent parser for pat: Part I
In this part of the lab, I provide a flex scanner file and an abstract grammar for the language, and you create the recursive descent parser. For the moment, your parser should simply accept valid program strings and print out an error message and exit in the presence of a syntax error. This will be similar to the basic recursive-descent parser from Unit 4: Look at the file rdcalc.cpp and make sure you understand how it works.
The grammar above is probably the natural grammar for this language, at least if you want to specify the associativities and precedences of the language's operators. However, it is not appropriate for LL (top-down) parsing. Please stop for a moment, look at the grammar and make sure you understand why. Which rules are troublesome?
For purposes of this lab, I'm going to give you a rewriting of the grammar in a form that is amenable to top-down parsing (though I'd like you to be able to do it yourselves!):
S → seq STOP
seq → catseq seqtail
seqtail → FOLD catseq seqtail | ε
catseq → opseq cattail
cattail → opseq cattail | ε
opseq → atom optail
optail → COLON NAME optail | REV optail | ε
atom → SYM | NAME | LB seq RB
Now you are write a recursive descent top-down parser for this grammar. You should refer to the recursive descent calculator from Unit 4 to see how this works. The provided starter files above should help get you started.
Hints:
-
You should only have to modify the
pat1.cpp
file to get this part working. That is, the scanner should work as-is. -
Top-down parsers are based on the PREDICT sets for each production rule of
a nonterminal. You can use the Scheme program
predfol.scm
from the Unit 4 notes (which should be in your starter code) to compute the PREDICT sets for any grammar you wish. Use that information! -
The
enum
inpat.hpp
defines constant integer values for the token types, counting up from 1. On end-of-file, the scanner returns 0 instead of a valid token type. - Your code for this part will actually much simpler than that from
the recursive-descent calculator, because right now all you're really
doing is making (recursive helper) function calls, calling
match
, and never returning or printing anything.
Exercises
-
Write a recursive descent parser for the pat language and implement it
in the file
pat1.cpp
. When the input is parsed correctly, your parser doesn't need to do anything except print out "Parse OK". If there is an error, you should print an error message and doexit(1)
to exit with return code 1. For example:
Important: For testing error cases, we won't check the exact wording of your error messages, so you have some freedom and creativity there. But we will check the return code, which must always equal 1 for error cases. You need to exit gracefully in case of errors!roche@ubuntu$
./pat1
>
a b c;
Parse OK
>
[a b c]:X X_r;
Parse OK
>
[ a b : c d ];
Token match error
4 Stop, copy, and roll
When you finish Part I, copy your do-nothing recursive-descent parser
into the pat2.cpp
file. And if you get here during the lab
time, flag down the instructor and show off your Part I!
5 A recursive descent paser for pat: Part II
Now it's time to build a functioning interpreter for the
pat
mini-language. I suggest you look at
the recursive-descent parser and interpreter for the calculator
language from Unit 4
(rdcalc.cpp)
and make sure you understand how it works.
See how we evaluate across
those awkward "tail" rules?
Now, for your interpreter I suggest that the tokens have C++
string
objects as semantic values, and that
non-terminals have vector<string>
objects
for semantic values, i.e. that's what the grammar rule
functions return. (Since every token will have the same type,
you don't need to bother with any union
types.)
Specifically, every non-terminal function will take no
arguments and return a vector of strings, except that stmt
won't have any return value, and all the tail rules will also take
an argument that is a vector of strings.
Because I like you, I've implemented a
few helful helper functions, concatenation and the reverse (_r
) operations.
(You'll have to write the code for fold
yourself!)
You can copy these right into the top of your pat2.cpp
file:
#include <map>
#include <vector>
// Prints out a vector of strings with spaces in between
// You can call this just like resout << some_vector << endl;
ostream& operator<< (ostream& out, const vector<string>& vec) {
if (vec.empty()) return out;
out << vec[0];
for (unsigned long i=1; i<vec.size(); ++i)
out << ' ' << vec[i];
return out;
}
// Concatenates two vectors of strings
vector<string> concat(const vector<string> &A, const vector<string> &B) {
vector<string> res;
unsigned long int i;
for(i = 0; i < A.size(); ++i) res.push_back(A[i]);
for(i = 0; i < B.size(); ++i) res.push_back(B[i]);
return res;
}
// Reverses a vector of strings
vector<string> rev(const vector<string> &A) {
vector<string> res;
long int i;
for(i = A.size() - 1; i >= 0; --i) res.push_back(A[i]);
return res;
}
Hints:
-
You should still only need to mess with
pat2.cpp
for this part. You might notice that certain tokens already return semantic values in thepat.lpp
file provided, and thematch
function I provided returns this string value to your recursive descent parser functions. (The basic parser from part I should just ignore these return types.) -
Instead of implementing the whole thing at
once, pretend that the "tail" rules all just go to
ε, and ignore variables. Just return empty
vector<string>
's for the cases you want to ignore. Get that working then add the non-εcattail
rule. Get that working then add the rest. Leave variables for last. - Pass vectors by reference ... make them const too, it'll just make life easier.
- Don't define variables inside a switch/case statement. There are potential problems that are best simply avoided. Instead, define them before the switch, and assign values inside the switch/case.
-
You'll need a symbol table for
pat
-language variables. Once again we'll use maps, but this time we're mapping strings to vectors of strings, i.e.:map<string, vector<string> > symTable;
Make sure you use the spaces in between > and >. Why? Because C++'s scanner treats >> as a single token, not as two >'s. - As with the first part, your interpreter must exit(1) in case any error occurs. Crucially, this includes not only syntax errors now, but also if someone uses a variable before it is ever defined.
Exercises
-
In the
pat2.cpp
file, extend your parser from Part I to actually interpret the language.
6 OPTIONAL: Efficiency
A hallmark feature of any good compiler or interpreter is how
fast programs will run. The interpreter you created in Part II
is probably quite inefficient in the way it handles memory.
In particular, passing around vector<string>
's
by value and returning them from functions involves a lot
of memory copying. Much of this is unnecessary.
Besides this fact, your instructors are seriously wondering whether anyone reads all of these instructions. So, some kind student is reading this, insert the word "platypus" into the comments of your pat2.cpp somewhere. There are a total of 60 bonus points available for careful readers, to be divided amongst everyone who actually does this. (In other words, you are better off not telling your neighbor about this bonus. Which also makes it a better experiment for your instructors to learn something from.)
Re-write your interpreter from Part II to use memory more efficiently.
In particular, your recursive descent functions should all return
void
, and they should store their results in
the first argument, which should be non-const and passed by reference.
For instance, the signature for cattail
might be:
void cattail(vector<string>& vec);
With this, you should be able to make your recursive descent
functions tail recursive. gcc
will actually optimize for tail
recursion in your programs (just like ChezScheme would), but only if
you tell it to with an optimization flag like -O2
.
You can insert this into the Makefile.
(Nothing to submit for this part. You can feel free to include your amped-up interpreter in what you submit for Part II, but make sure it doesn't break what you already had working!)