SI 413 Fall 2021 / Labs


This is the archived website of SI 413 from the Fall 2021 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 6: LR Parsing & Assoc/Prec

This lab is due at 2359 on Wednesday, 13 October. It should be submitted to the course SI413 as Lab06, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.

1 Starter Code

You should already have a shared git repository parsing. To get the starter code for today's lab, cd to your si413/parsing directory and run these commands:

git pull starter main
git pull

You should now have a folder like si413/parsing/lab06 with the starter code for this lab and a handy script to submit your code.

You can also browse the starter code from the Gitlab page (USNA intranet only).

2 Overview

This lab is going to be pretty similar in scope to last week's lab: you're going to start by making a parser for the pat language, and then you're going to add functionality to make your parser actually interpret the language itself.

The big difference is that this week's work should be much easier because you don't have to write the parser yourself by hand. We will be using the bison tool instead to do it for us!

In this lab, we are going to look at how to specify the associativity and precedence in ways that parser generators like bison can understand. We'll see that there's a hard way and an easy way to do this. Don't worry - the "hard way" is done for you already. This "easy way" will also make it easier to generate ASTs in next week's lab.

3 Background: Precedence and Associativity the "easy way"

What's wrong with this grammar for arithmetic expressions?
expexp OPA exp
expexp OPM exp
expNUM
expLP exp RP
It's ambiguous, which means that there's more than one parse tree for the same string. The unambiguous grammar for this language that we've used in the past is more complex, but it's unambiguous, and the unique parse tree it yields respects our associativity and precedence requirements. However, not only is the grammar complex, but the parse trees are huge (see the above) with lots of subtrees that are just "reinterpretation" steps, e.g. in x := 5 ; we have
expnexpcexpaexptermsfactorfactorNUM
just to interpret 5 as the right-hand side of an assignment. Not only are the parse trees huge, but the parser takes a lot of steps simply to make all those reinterpretations.

What happens with an LR parser if we use the ambiguous grammar above? What happens, of course, is lots of shift/reduce and reduce/reduce conflicts. But why don't we keep the CFSM produced from this grammar, which is nice and small, and augment the machine with some rules it can use to resolve these conflicts; rules that stem from our associativity and precedence expectations. This works! We get a simpler grammar, a smaller CFSM, a faster parser (since it's not making all those extra moves), and a simpler parse tree. Everyone's happy!

The question is, can we generalize this? Can we augment parser generators like bison with a mechanism by which the input tells the tool how to diabiguate? The answer is yes (of course). The yacc/bison input file can include specifications of associativity and precedence for some or all tokens. Each rule gets a precedence which it inherits from the the right-most token in the rule (note: this looks only at the terminals. Non-terminals don't count here). Additionally, rules can be assigned a precedence level explicitly. Section 5.3.4 of the bison manual explains how this works in bison:

...the resolution of conflicts works by comparing the precedence of the rule being considered with that of the lookahead token. If the token's precedence is higher, the choice is to shift. If the rule's precedence is higher, the choice is to reduce. If they have equal precedence, the choice is made based on the associativity of that precedence level.

Associativity of tokens are assigned by "%left token-name", "%right token-name", and "%nonassoc token-name" statements in the bison file. Use these statements instead of "%token token-name" which we had before. These come before the grammar rules, and the relative precedence of these tokens is defined by the order in which the statements appear: first in the file has lowest precedence, last in the file has highest precedence. To assign a rule a precedence explicitly, you put "%prec token-name" after the rule's right-hand side. Sometimes you use "dummy" token names just to make a precedence level to assign a rule. This is also spelled out in Section 5.3.2 of the manual.

For basic arithmetic we'd say:

%left OPA
%left OPM
%right UPLUSMINUS

%%

exp: exp OPA exp
|    exp OPM exp
|    OPA exp        %prec UPLUSMINUS
|    NUM
|    LP exp RP

%%

Notice how we used the dummy token UPLUSMINUS to get unary minus's as in 3 + (-5*3 - 8) to be handled properly. The scanner never returns such a token; its sole purpose is to create the proper precedence level.

4 Part I: Parsing pat bottom-up

Take a look at the pat1.ypp file from today's starter code. You'll see that the grammar for pat is split into four levels (seq, catseq, opseq, atom), like in last week's lab. Here's what the grammar looks like by itself:

S: seq STOP
|

seq: seq FOLD catseq
|    catseq

catseq: catseq opseq
|       opseq

opseq: opseq COLON NAME
|      opseq REV
|      atom

atom: SYM
|     NAME
|     LB seq RB

If you download, make, and run the pat1 program, it will show you the full parse tree, according to this grammar, for any valid "pat" program you type in. Specifically, there are two lines in the main function that use standard Linux tools to create and display the parse tree as a pdf:

system("dot -Tpdf pat.dot > pat.pdf");
system("evince pat.pdf > /dev/null 2>&1 &");

Go ahead and compile pat1 and try it out. How do you like them parse trees?

Now I would describe this as the "hard way" of doing precedence and associativity. With a more complex language, we would have to make more and more nonterminals in the grammar for every kind of operator we have. This results in a big grammar, huge parse trees, and ultimately slow and buggy compilers.

Exercises

  1. Rewrite the grammar in pat1.ypp to encode the precedence and associativity the "easy way", as described above. Specifically, you should have only TWO nonterminals in your grammar, S and seq. Basically, everything becomes a seq!

Hints

  • Concatenation doesn't have any operator associated with it, so you'll have to use a dummy token and assign the concatenation rule its precedence with the %prec statement.
  • You'll probably end up with some shift/reduce errors. You need to resolve these by looking at the bison output file pat.output. You'll see the states of the CFSM, along with the LR items that label those states and the actions for various lookahead symbols. If there's more than one action listed for a lookahead symbol, there's a shift-reduce error, i.e. an ambiguity as to what action to take. You can resolve those ambiguities by assigning the appropriate precedence (or associativity) to to the tokens involved in the shift-reduce errors. For example, if state 14 has this conflict :
    state 14
    
        2 seq: seq . FOLD seq
        2    | seq FOLD seq .
        3    | seq . seq
        4    | seq . COLON NAME
        5    | seq . REV
    
        REV    shift, and go to state 9
        COLON  shift, and go to state 10
        ATOM   shift, and go to state 1
        NAME   shift, and go to state 2
        LB     shift, and go to state 3
    
        ATOM      [reduce using rule 2 (seq)]
        NAME      [reduce using rule 2 (seq)]
        LB        [reduce using rule 2 (seq)]
        $default  reduce using rule 2 (seq)
    
        seq  go to state 12
    
    This state is saying that if there is a ATOM or NAME or LB token next, it doesn't know what to do: either shift or reduce (using rule 2). I want to make sure that I do *not* reduce by "2 seq: seq FOLD seq •", since FOLD is supposed to have lowest precedence. Instead I want to shift an ATOM or NAME or LB so that I build up the seq on the right as well as the left before I finally reduce by the FOLD rule. If these tokens had some associativity assigned already, then that bison info that is quoted above (from the manual) tells me that associativity would resolve the conflict -- but not necessarily in the way I wanted. Instead, I should give ATOM, NAME and LB higher precedence than FOLD, and then bison will know to shift instead of reduce.

5 Part II: Interpreting pat bottom-up

The file pat2.ypp contains the skeleton of an interpreter for the pat language, built in bison. The main key difference is the line

#define YYSTYPE vector<string>*
which says that the type associated with all the tokens and nonterminals in this parser is a POINTER TO a vector of strings. (It has to be a pointer so that the memory for it can be reused.)

Besides this, pat2.lpp and pat2.ypp also have cut out all the stuff with the ParseTree class and calling system(). That was just to display the parse tree in part 1. Now we're good with the parsing part and want to actually interpret the language!

Your task in this part is going to be a lot like part 2 from last week's lab, except that it should be a lot easier in Bison! Now the grammar that is there is no good - I want you to start by copying your grammar from part 1 into the end of pat2.ypp where the comments specify to do it. That is, the grammar for Part II can only contain 2 non-terminals as well.

Be sure to make use of the helper functions that I've already provided for you. The trickiest part will be handling variable names, so save that for last. The second-trickiest part will be handling atoms. Remember that everything in this parser has to be a vector of strings, and this includes whatever gets returned by the scanner itself. This means that you will have to edit the pat2.lpp scanner file so that certain tokens return a vector of strings (containing only one string!) in the yylval variable. Ask for help if you need it.

Exercises

  1. Re-write the grammar in pat2.ypp to have only 2 non-terminals (like Part I), and then extend it into a fully-functioning interpreter for the pat language.