Lab 6: LR Parsing & Assoc/Prec

This lab contains both written and electronic parts, to be submitted separately. Each exercise is marked according to whether it should be part of the electronic or written submission. Note: we won't be doing peer testing for this lab, so you don't need to submit a features.txt file.

You can (and should) work with a partner for this lab, and if so, only one of you should submit each part. And of course include both names on everything you submit, electronically or on paper.

Your electronic submission should come from a folder called lab06 and include the following files. You can download all the starter code at once with the file lab06.tar.gz.

parsetree.hpp: C++ header file for the ParseTree class. (You shouldn't need to make any changes here.)
spl1.lpp: Flex file for the SPL language in Part I.
spl1.ypp: Bison file for the SPL language in Part I.
spl2.lpp: Flex file for the SPL language in Part II.
spl2.ypp: Bison file for the SPL language in Part II.
pat.lpp: Flex file for the pat language in Part III.
pat.ypp: Bison file for the pat language in Part III.
Makefile: Makefile with targets for spl1, spl2, and pat.

Introduction

In this lab, we are going to look at how to specify the associativity and precedence in ways that parser generators like bison can understand. We'll see that there's a hard way and an easy way to do this. Don't worry - you only have to do it the easy way. This "easy way" will also make it easier to generate ASTs in next week's lab.

Parts I and II of this lab will also serve as your initial introduction to SPL ("Simple Programming Language"). This language we will be the basis for many labs in the coming weeks.

Part I: SPL the hard way

SPL is an extension of the calculator language we have been using to include a few control structures as well as simple calculator instructions. Specifically, SPL supports:

Basic arithmetic (+, -, *, /)
Numeric comparisons (=,!=, >, >=, <, <=)
Boolean operations (and, or, !)
Input/output (read and write)
Basic control structures (if, if/else, and while)
User-defined functions. These must be unary (single-argument), and are defined with lambda statements.
Variables with dynamic scope.

This is an intimidating list of features! But don't worry, we'll go slowly. Today's lab is just about parsing this language and producing parse trees.

Here is a grammar for SPL, written in the bison style that you should be familiar with by now:

res: block | stmt

block: LC stmtlist RC

stmtlist: stmtlist stmt | stmt

stmt: NEW ID ASN exp STOP
|     ID ASN exp STOP
|     READ ID STOP
|     WRITE exp STOP
|     IF LP exp RP block
|     IF LP exp RP block ELSE block
|     WHILE LP exp RP block
|     exp STOP

exp: exp BOP nexp | nexp

nexp: NOT cexp | cexp

cexp: cexp COMP aexp | aexp

aexp: aexp OPA term | term

term: term OPM sfactor | sfactor

sfactor: OPA factor | factor

factor: ID 
|       NUM 
|       BOOL
|       LP exp RP
|       ID LP exp RP
|       LAMBDA ID block

You can see that this is exactly the grammar specified in spl1.ypp.

And here is a small SPL program; call it fact:

# Computes the factorial of 8.
{
  new fact := lambda n {
    new f := 1;
    while(n > 0)
    {
       f := f * n;
       n := n - 1;
    }
    write f;
  };
  fact(8);
}

Notice that this grammar encodes the precedence and associativity rules for all the operators such as or, !, and +. This is done in the usual way, which forces a certain amount of complexity in the grammar.

Download the starter code for today's lab, make the spl1 program, and run it. Type in the fact program above and take a look at the parse tree for that "small" program. You should be able to see the precedences in the parse tree as well.

Exercises:

(written) Write all the possible operators in an "exp", in order from highest to lowest precedence. Remember that operators with higher precedence get applied first in an evaluation.
(written) Add the parentheses implicit in the expression
```
!  3  +  4  *  5  <  6  -  x  and  y  <  0  or  false
```
due to operator precedences and associativity.
(electronic) Consider the program fragment 3 < 4 < 5. This is a syntactically valid "cexp", but the semantic meaning is probably not what the programmer expects. Modify the grammar in the spl1.ypp file so that this kind of expression is no longer syntactically valid. (Of course, something like 3 < 4; should still work!)
(electronic) Those of you who implemented Feature H from Lab 3 had to add a modulo operator to the calc language. Let's add one to SPL. Modify spl1.lpp to include a new token called MOD, which should just match the string "%". Then incorporate this token into the grammar in spl1.ypp. In C/C++/Java, % has the same precedence as multiplication and division, but for this lab I want its precedence to be higher than comparison, but lower than addition and subtraction. (Hint: you will have to add a new non-terminal to the grammar.)

Part II: Precedence and associativity the easy way

What's wrong with the grammmar

exp: exp OPA exp | exp OPM exp | NUM | LP exp RP

for arithmetic expressions? It's ambiguous, which means that there's more than one parse tree for the same string. The unambiguous grammar for this language that we've used in the past is more complex, but it's unambiguous, and the unique parse tree it yields respects our associativity and precedence requirements. However, not only is the grammar complex, but the parse trees are huge (see the above) with lots of subtrees that are just "reinterpretation" steps, e.g. in x := 5 ; we have

exp → nexp → cexp → aexp → term → sfactor → factor → NUM

just to interpret 5 as the right-hand side of an assignment. Not only are the parse trees huge, but the parser takes a lot of steps simply to make all those reinterpretations.

What happens with an LR parser if we use the ambiguous grammar above? What happens, of course, is lots of shift/reduce and reduce/reduce conflicts. But why don't we keep the CFSM produced from this grammar, which is nice and small, and augment the machine with some rules it can use to resolve these conflicts; rules that stem from our associativity and precedence expectations. This works! We get a simpler grammar, a smaller CFSM, a faster parser (since it's not making all those extra moves), and a simpler parse tree. Everyone's happy!

The question is, can we generalize this? Can we augment parser generators like bison with a mechanism by which the input tells the tool how to diabiguate? The answer is yes (of course). The yacc/bison input file can include specifications of associativity and precedence for some or all tokens. Each rule gets a precedence which it inherits from the the right-most token in the rule (note: this looks only at the terminals. Non-terminals don't count here). Additionally, rules can be assigned a precedence level explicitly. Section 5.3.4 of the bison manual explains how this works in bison:

...the resolution of conflicts works by comparing the precedence of the rule being considered with that of the lookahead token. If the token's precedence is higher, the choice is to shift. If the rule's precedence is higher, the choice is to reduce. If they have equal precedence, the choice is made based on the associativity of that precedence level.

Associativity of tokens are assigned by "%left token-name", "%right token-name", and "%nonassoc token-name" statements in the bison file. Use these statements instead of "%token token-name" which we had before. These come before the grammar rules, and the relative precedence of these tokens is defined by the order in which the statements appear: first in the file has lowest precedence, last in the file has highest precedence. To assign a rule a precedence explicitly, you put "%prec token-name" after the rule's right-hand side. Sometimes you use "dummy" token names just to make a precedence level to assign a rule. This is also spelled out in Section 5.3.2 of the manual.

For basic arithmetic we'd say:

%left OPA
%left OPM
%right UPLUSMINUS

%%

exp: exp OPA exp
|    exp OPM exp
|    OPA exp        %prec UPLUSMINUS
|    NUM
|    LP exp RP

%%

Notice how we used the dummy token UPLUSMINUS to get unary minus's as in 3 + (-5*3 - 8) to be handled properly. The scanner never returns such a token; its sole purpose is to create the proper precedence level.

The file spl2.ypp has a modified grammar that uses the associativity and precedence specifications rather than doing it "the hard way" with a bunch of new non-terminals as in Part I. Notice that there are now only 5 non-terminals in the grammar! make and then run the spl2 parser on the fact SPL program above. Compare the parse tree to what you got from spl1. Much better, huh?

Exercises:

(electronic) Repeat exercise #4 for the modified spl2 program. That is, add a new token to the scanner in spl2.lpp, and modify the grammar in spl2.ypp to include the % operator. Again, make its precedence between that of comparison and addition. Now you should be able to do this without adding any new non-terminals to the grammar!

Part III: Associativity and precedence in the pat language

Make sure you understand this: Adding precedence and associativity declarations to an ambiguous grammar allows LR parsers to produce the unique parse tree we want. But it only works for some grammars/languages. Fortunately, this includes a lot of what we want for programming languages.

The pat program for this lab (specified with pat.lpp and pat.ypp is an implementation of the pat language from Lab 4 that reads valid expressions and displays the parse tree. make the pat and play around with it. Re-familiarize yourself with the language and get a feel for what the parse trees look like.

In this part of the lab, your job is to modify pat.ypp so that S and seq are the only non-terminals! You'll need to figure out the proper associativities and precedences.

Hints

Concatenation doesn't have any operator associated with it, so you'll have to use a dummy token and assign the concatenation rule its precedence with the %prec statement.
You'll probably end up with some shift/reduce errors. You need to resolve these by looking at the bison output file pat.output. You'll see the states of the CFSM, along with the LR items that label those states and the actions for various lookahead symbols. If there's more than one action listed for a lookahead symbol, there's a shift-reduce error, i.e. an ambiguity as to what action to take. You can resolve those ambiguities by assigning the appropriate precedence (or associativity) to to the tokens involved in the shift-reduce errors. For example, if state 14 has this conflict :
```
state 14

    2 seq: seq . FOLD seq
    2    | seq FOLD seq .
    3    | seq . seq
    4    | seq . COLON NAME
    5    | seq . POP

    POP    shift, and go to state 9
    COLON  shift, and go to state 10
    ATOM   shift, and go to state 1
    NAME   shift, and go to state 2
    LB     shift, and go to state 3

    ATOM      [reduce using rule 2 (seq)]
    NAME      [reduce using rule 2 (seq)]
    LB        [reduce using rule 2 (seq)]
    $default  reduce using rule 2 (seq)

    seq  go to state 12
```
This state is saying that if there is a ATOM or NAME or LB token next, it doesn't know what to do: either shift or reduce (using rule 2). I want to make sure that I do *not* reduce by "2 seq: seq FOLD seq •", since FOLD is supposed to have lowest precedence. Instead I want to shift an ATOM or NAME or LB so that I build up the seq on the right as well as the left before I finally reduce by the FOLD rule. If these tokens had some associativity assigned already, then that bison info that is quoted above (from the manual) tells me that associativity would resolve the conflict -- but not necessarily in the way I wanted. Instead, I should give ATOM, NAME and LB higher precedence than FOLD, and then bison will know to shift instead of reduce.

Exercises:

(electronic) Modify the pat.ypp so that there are only two non-terminals, as described above.
(written) What are the advantages to re-writing the grammar in this way?