Summary

Parsing overview

Remember that parsing is the second half of syntax analysis when interpreting or compiling a program. (The first half is scanning, a.k.a. lexing or tokenization.)

The goal of a parser is to turn a token stream into a parse tree according to the production rules in a grammar.

Today we will look at how this happens. We are still doing the parsing “by hand”, but we want to think about how a computer program could do parsing automatically.

Running example

For these notes, we will use a simple “calculator language” with the following tokens:

PRINT:   print
SAVE:    save
X:       x
LP:      \(
RP:      \)
ADDOP:   [-+]
MULOP:   [*/]
INT:     [0-9]+
ignore:  #.*
ignore:  [ \t\n]

Here is a grammar for the language:

prog -> stmt prog
     -> ε

stmt -> PRINT LP expr RP
     -> SAVE LP expr RP

expr -> INT
     -> X
     -> ADDOP expr
     -> expr MULOP expr
     -> expr ADDOP expr

Top-down parsing

In top-down parsing, we start at the root of the entire parse tree, and build the tree down by expanding it according to the grammar production rules. At each step, the left-most nonterminal leaf node gets expanded, until eventually the only leaf nodes are tokens and the parse tree is complete.

Consider the following small program in the calculator langauge:

save(10)
print(2*x)

First this input will get transformed into a token sequence by the scanner, like we learned about last week. Here is the token stream:

SAVE  LP  INT  RP  PRINT  LP  INT  MULOP  X  RP
save  (   10   )   print  (    2     *    x  )

Now it’s time to parse! For a top-down parse, we start with the start symbol as the root of the parse tree:

prog

This is a nonterminal symbol, so we need to expand it according to the grammar. Immediately there is a decision to make: do we expand it according to the first production rule prog -> stmt prog or the second one prog -> ε?

The decision is straightforward: the input isn’t empty, so we will go with the first rule. The parse tree now has three nodes:

   prog
  /    \
stmt   prog

Now we continue with the left-most nonterminal leaf, which is stmt. There are two expansions for stmt in the grammar - how do we decide which one to take?

The answer here is to look ahead at the next token on the token stream. It’s a SAVE token, so that tells us we want to take the second production of stmt from the grammar, namely stmt -> SAVE LP expr RP. Let’s fill in the parse tree accordingly:

            __prog__
           /        \
     __stmt____      prog
    /  |   \   \
SAVE  LP  expr  RP

Notice that the first two tokens from the token stream are now matched, SAVE and LP. The next nonterminal to expand is expr, which has four possibilities. We can see here that we want expr -> INT, so let’s do that:

            __prog__
           /        \
     __stmt____      prog
    /  |   \   \
SAVE  LP  expr  RP
save  (    |
          INT

After two more token matches (of INT and RP), and two more nonterminal expansions (prog and then stmt), we have:

            ________prog________
           /                    \
     __stmt____                __prog__
    /  |   \   \              /        \
SAVE  LP  expr  RP      __stmt____      prog
save  (    |    )      /  |  |    \
          INT     PRINT  LP  expr  RP
          10

The next nonterminal to expand is expr, and now (by looking ahead) we can see the correct expansion will be expr -> expr MULOP expr. Let’s work that out with a few more matches and expansions:

            ____________prog____________
           /                            \
     __stmt____                       ___prog____
    /  |   \   \                     /           \
SAVE  LP  expr  RP      _________stmt________     prog
save  (    |    )      /   /       |         \
          INT     PRINT  LP      _expr__      RP
          10      print  (      /  |    \      )
                            expr  MULOP  expr
                             |     *      |
                            INT           X
                             2            x

Finally, we want to expand the last prog and notice that there are no more tokens left on the input stream, so this one gets the epsilon expansion and the parse tree is complete:

            ____________prog____________
           /                            \
     __stmt____                       ___prog____
    /  |   \   \                     /           \
SAVE  LP  expr  RP      _________stmt________     prog
save  (    |    )      /   /       |         \     |
          INT     PRINT  LP      _expr__      RP   ε
          10      print  (      /  |    \      )
                            expr  MULOP  expr
                             |     *      |
                            INT           X
                             2            x

Notice this is a proper and valid parse tree from the input token sequence: the tokens are in order at the leaf level, all the internal nodes are nonterminals whose children correspond to some grammar rule for that nonterminal, and the root node is the start symbol in the grammar.

How a top-down parser makes decisions

A top-down parser constantly has to make decisions about which grammar rule to expand according to the current nonterminal. This is called prediction because it’s like predicting what should come next in the parse. Real top-down parsers perform prediction by look-ahead: they scan in some upcoming tokens, and use those to decide what to do.

In the above example, predicting prog is easy: just see if there are any tokens left, and if so, do the first expansion to prog -> stmt prog. Predicting stmt is also easy, by looking to see if the upcoming token is PRINT or SAVE.

But predicting expr is not so straightforward. Just knowing the upcoming token is an INT, for example, is not enough to know which expansion to take. The parser might need even more look-ahead to decide.

(Better yet, in this grammar, we could re-write the grammar in order to make predicting an expr easier. Take this as a puzzle/challenge for you!)

Bottom-up parsing

As you might guess, bottom-up parsing works the opposite to top-down parsing. Rather than starting at the root of the parse tree (the start symbol), we start at the leaves (tokens) and build our way up.

Specifically, the bottom-up parser will maintain a list of partial parse trees. (Well, technically it’s just as a stack, but it’s easier to think of as a list since we visualize it left to right.) Initially this list of partial trees is empty.

At each step of the parse, one of two things happens:

Notice that a reduce is essentially going “backwards” in the grammar, starting with the right-hand side and moving left.

Let’s look at the same example from before. As a reminder, here is the grammar:

prog -> stmt prog
     -> ε

stmt -> PRINT LP expr RP
     -> SAVE LP expr RP

expr -> INT
     -> X
     -> ADDOP expr
     -> expr MULOP expr
     -> expr ADDOP expr

and here is the token stream we are trying to parse:

SAVE  LP  INT  RP  PRINT  LP  INT  MULOP  X  RP
save  (   10   )   print  (    2     *    x  )

For a bottom-up parse, we start with an empty list. There is actually one reduction we could take here, prog -> ε, but we can see that’s not the right choice yet. Instead, let’s shift on the first token:

SAVE
save

Now this doesn’t match the right-hand side of any grammar rule, so we shift again:

SAVE  LP
save  (

Now the list has two partial parse trees, each with just a single node. Still no reductions are possible, so we shift a third time:

SAVE  LP  INT
save  (   10

Now a reduction is possible! We can’t reduce the whole list yet, but remember we are allowed to reduce whenever the end of the list matches the right-hand side of a grammar rule. In this case, the relevant rule is expr -> INT. So we run this rule backwards and build up the tree with our first reduce:

          expr
           |
SAVE  LP  INT
save  (   10

Now the list has three trees, with roots SAVE, LP, and expr. No reductions are possible, so we shift again.

          expr
           |
SAVE  LP  INT  RP
save  (   10   )

At this point, we can do a big reduction, according to the rule stmt -> SAVE LP expr RP. This will reduce the current list to just a single tree with root stmt.

    ___stmt___
   /   /   |  \
   |   |  expr |
   |   |   |   |
SAVE  LP  INT  RP
save  (   10   )

No more reductions can happen here, so we continue the process with three shifts and a reduce, similar to what just happened:

    ___stmt___
   /   /   |  \
   |   |  expr |              expr
   |   |   |   |               |
SAVE  LP  INT  RP  PRINT  LP  INT
save  (   10   )   print  (    2

At this point the list has 4 partial trees. Recall from top-down parsing that look-ahead was needed to figure out what to do here: we had to predict whether it was going to just be a single integer or some kind of arithmetic operation. But bottom-up parsing doesn’t have to do that! At this stage, we would be ready if the print statement ended with an RP, but we can also keep going for an arithmetic operation; no look-ahead was needed.

The next steps for a bottom-up parse are two more shifts and a reduce according to expr -> X:

    ___stmt___
   /   /   |  \
   |   |  expr |              expr        expr
   |   |   |   |               |           |
SAVE  LP  INT  RP  PRINT  LP  INT  MULOP   X
save  (   10   )   print  (    2     *     x

And now, after the fact, we see that in fact it was an arithmetic expression, and we reduce the last three partial trees in the list:

    ___stmt___                     _expr_
   /   /   |  \                   /  |   \
   |   |  expr |              expr   |    expr
   |   |   |   |               |     |     |
SAVE  LP  INT  RP  PRINT  LP  INT  MULOP   X
save  (   10   )   print  (    2     *     x

Now we shift the last token and reduce the print statement:

                        ______ __stmt_________
                       /   /         |        \
    ___stmt___         |   |       _expr_      |
   /   /   |  \        |   |      /  |   \     |
   |   |  expr |       |   |  expr   |    expr |
   |   |   |   |       |   |   |     |     |   |
SAVE  LP  INT  RP  PRINT  LP  INT  MULOP   X   RP
save  (   10   )   print  (    2     *     x   )

What now? We still have two partial trees, so the parse isn’t finished. But we’ve reached the end of the token sequence so we can’t shift again.

But technically one reduction is possible, the epsilon reduction for prog. In a bottom-up parse, an epsilon reduction is always allowed, and just adds an extra node at the end with the relevant non-terminal:

                        ______ __stmt_________
                       /   /         |        \
    ___stmt___         |   |       _expr_      |
   /   /   |  \        |   |      /  |   \     |
   |   |  expr |       |   |  expr   |    expr |   prog
   |   |   |   |       |   |   |     |     |   |    |
SAVE  LP  INT  RP  PRINT  LP  INT  MULOP   X   RP   ε
save  (   10   )   print  (    2     *     x   )

This kicks off a series of two final reductions to finish off the parse:

            ___________prog______________
           /                             \
          |                           ____prog____
          |                          /            \
          |             ______ __stmt_________     |
          |            /   /         |        \    |
    ___stmt___         |   |       _expr_      |   |
   /   /   |  \        |   |      /  |   \     |   |
   |   |  expr |       |   |  expr   |    expr |   prog
   |   |   |   |       |   |   |     |     |   |    |
SAVE  LP  INT  RP  PRINT  LP  INT  MULOP   X   RP   ε
save  (   10   )   print  (    2     *     x   )

Even thought it’s drawn kind of funny, make sure you notice that this is the exact same tree as we got from the top-down parse.

How a bottom-up parser makes decisions

As we have seen, the crucial decision at each step of a bottom-up parse is shift or reduce (and, if multiple reductions are possible, which rule should we reduce by).

But generally speaking, this decision is much easier to make for bottom-up parsers than for top-down predictive parsers. In the above parse, we were able to follow the simple strategy of “do every reduction as soon as you can”.

This doesn’t always work. For example, if you consider the expression 2 + 3 * 4, in order to follow order of operations and perform the multiplication first, the parser needs to know not to reduce 2 + 3 to an expression (as it could do), but instead to shift on the * and 4 , and then perform the two expr reductions.

In many cases, a bottom-up parser can resolve these shift/reduce questions with zero or one token of look-ahead. Because the decision-making is a bit simpler than with top-down parsers, it is a bit easier and more efficient to generate a bottom-up parser automatically from a grammar specification. Many early compilers, especially ones designed for speed, took this approach.