Summary

Abstract Syntax Trees (ASTs): what they are, and why they are useful for writing interpreters and compilers
AST generation from parse tree: Semantic Analysis
Stages of compilation

The problems with parse trees

For these notes, let’s consider a simple imperative language in the style of basic C++, defined with the following tokens:

STOP:    ;
LP:      \(
RP:      \)
LC:      \{
RC:      \}
COMMA:   ,
EQ:      =
OP:      [-+*/<>]
IF:      if
ELSE:    else
TYPE:    int|string
INTLIT:  -?[0-9]+
STRLIT:  "([^"\\]|\\.)*"
ID:      [a-zA-Z_][a-zA-Z0-9_]*
ignore:  [ \t\n]
ignore:  //.*$
~

and the following grammar (written in ANTLR style):

prog : stmtlist EOF ;

stmtlist : stmt stmtlist
         |
         ;

stmt : expr STOP
     | TYPE ID STOP
     | TYPE ID EQ expr STOP
     | LC stmtlist RC
     | IF LP expr RP stmt
     | IF LP expr RP stmt ELSE stmt
     ;

expr : INTLIT
     | STRLIT
     | ID
     | ID EQ expr
     | expr OP expr
     | LP expr RP
     | expr LP args RP
     ;

args : nonempty
     |
     ;

nonempty : expr
         | expr COMMA nonempty
         ;

Now consider a relatively small program, like

printf("How old are you? ");
age = read_num();
if (age < 21) {
  printf("You are just %d. Clean living for another %d years\n", age, (21-age));
}

This will have a massive parse tree! Let’s consider the last ID “age” in this tree. See if you can come up with the path from the parse tree root (prog) down to that one node, along with all the “sibling” nodes on the path:

Click here to see the path

     prog
    /    \
stmtlist  EOF
 |     \
stmt  stmtlilst
       |       \
     stmt     stmtlist
             /       \
        __stmt_____   stmtlist
       / /  |   |  \
     IF LP expr RP stmt____
                  /  |     \
                LC stmtlist RC
                    |     \
                __stmt_   stmtlist
               / /  |  \
           expr LP args RP
                    |
                 nonempty
                /  |    \
            expr COMMA nonempty
                      /  |    \
                  expr COMMA nonempty
                               |
                              expr
                             /  | \
                           LP expr RP
                             /  | \
                           expr OP expr
                                    |
                                   ID

That’s quite a lot, and it’s just one path in the tree for this 4-line program!

This complication has a few negative effects. First and foremost, it slows down the compiler or interpreter and makes it use more memory. This is especially important for interpreters, where for example in a loop or a function call, the same part of the code might be executed multiple times. Each node in the parse tree probably corresponds to a single (nested) function call, which has some non-trivial execution overhead which will add up for larger programs.

Second, the parse tree reflects aspects of the grammar which might have been needed to properly analyze the source code structure, but are irrelevant for the program’s meaning. One example are the small tokens like parentheses and semicolons which are only there to help the parser group things together. Once parsing is done, there’s no reason to even think about these parts of the source code anymore.

For another example, take a look at the args and nonempty nonterminals in the grammar above. The grammar was written like that to account for an empty argument list foo(), or a single argument foo(x), or multiple foo(x,y,z). In the parse tree, these three versions will look very different, whereas logically they are all just function calls with a different size arguments list.

Abstract Syntax Trees (ASTs)

Abstract Syntax Trees (ASTs) provide a way of representing the semantic meaning of our program in a simple way that is independent of the actual syntax. This is a little confusing, so I’ll emphasize again that the AST has nothing to do with the syntax — it’s an abstraction away from the syntax of the language!

Unlike with a parse tree, which is uniquely defined according to the grammar (as long as the grammar is not ambiguous), there isn’t just one definition of exactly how the AST for a particular program should look. Think of the AST as a useful internal representation for the interpreter or compiler, which may be slightly different depending on the language or capabilities of the implementation.

In this class we will try to be fairly consistent in how we draw ASTs and what they mean. Every node in the AST is either a statement or an expression. We’ve already discussed in previous units how these are different from each other; basically an expression is any self-contained program fragment which has a type and evaluates to some value, whereas a statement is a self-contained program fragment which does something, but doesn’t return any value. A while loop or a function definition are good examples of statements, and a variable name or arithmetic operation are good examples of expressions.

AST example

Here’s an example of an AST for the same 4-line program from before (just putting the two statements inside curly braces so they form a single Block statement):

                        ____________________________BlockStmt_______________________________
                       /                               |                                    \
                ExprStmt                      AssignStmt("age")                          ____IfElseStmt____
                   |                                 |                                  /          |       \
              FunCallExpr                        FunCallExpr            BinaryOpExpr("<")       BlockStmt  BlockStmt
             /           \                           |                    /         \               |
IdExpr("printf")  StrLitExpr("How old...")  IdExpr("read_num")  IdExpr("age")  IntLitExpr(21)  FunCallExpr
                                                                            __________________/  |      | \___________
                                                                           /                     |       \            \
                                                            IdExpr("printf")  StrLitExpr("You are...")  IdExpr("age")  BinaryOpExpr("-")
                                                                                                                       /             \
                                                                                                                IntLitExpr(21)   IdExpr("age")

The first thing you should notice is that this is a lot simpler than the previous parse tree. This AST represents the actual meaning of our small program - what that program is actually supposed to do.

And there’s not really any wasted information here. Every part of the AST has a specific meaning, and it’s not redundant.

The nesting also matches the logical nesting, not just the grammar/syntax. For example, there are three top-level statements in the program, so that’s what you see in the top-level BlockStmt.

General properties of ASTs

A few more details to notice about this AST (and ASTs in general):

Anything like semicolons, parentheses, etc. are definitely wiped away. They were useful to get the logical structure, but now that we have a tree they are meaningless.
There is a subtle distinction between what ends up being the children of a node, and what is just a part of that node itself. For example, in an assignment statement, the variable name being assigned (such as "age) goes in the node itself, whereas the right-hand side of the assignment becomes a child node. The rule of thumb is that children should only be expressions or statements that could be further nested.

For example, the left-hand side of an assignment is always just a single ID, so we might as well just store that string in the AssignStmt node itself. But the right-hand side of an assignment could be any kind of nested expression, so it should be a child node.
Most AST nodes will have a fixed number of children nodes. For example, an assignment statement always has one child (the expression on the right-hand side), and of course a BinaryOpExpr always has two children.

But there are exceptions to this. In the AST above, we see that funcall has 1 or more children (since function calls in C++ can take an arbitrary number of arguments), and BlockStmt has 0 or more children (since that’s basically what a block is).
Sometimes the AST actually adds more than the parse tree, in the interest of keeping things simple and consistent. An example above is the empty BlockStmt as the last child of the IfElseStmt. In the code, there is no “else” clause there. We could capture this in the AST by having a separate IfStmt and IfElseStmt type of AST node, but it’s simpler and easier to just have one type of AST node for IfElseStmt, and then adapt the two different syntaxes to meet that one definition.

Or to put it another way, the AST here is revealing the semantics of what it means to have an if statement with no “else” clause — it’s equivalent to having an empty block for the “else”.

ASTs are not (necessarily) language-specific!

Remember, an AST is supposed to abstract away from the grammar of a particular language. That means that the AST above could just as easily have come from a C++ program, or a Java program, or even Python or something else.

But that only goes so far in practice, because different languages usually have different semantics and capabilities, besides just having different syntax. For example, I don’t think the AST above would ever be generated from a Scheme program, since in Scheme an if/else is a type of expression, not a statement.

To emphasize again, there are no hard rules about how to draw an AST. It will depend on the capabilities of the language itself to some extent. But we’ll try to be more or less consistent with what we draw and use in this class.

In practice (as you will see in your next lab), each type of AST node corresponds to a different (Java) class in your interpreter/compiler source code. The properties and children nodes that you see in the tree as we drew it, correspond to fields in those AST node classes.

Stepping back: Stages of compilation

Reviewing somewhat from a few weeks ago, we are not actually ready to say what every one of the traditional “stages” of compilation or interpretation are:

Scanning reads in raw source code and produces a token stream. This breaks the raw source code into discrete “chunks”, and discards comments and (usually) whitespace.

Scanning is the first half of syntax analysis. Scanners are usually auto-generated from a token spec using regular expressions.
Parsing takes in a token stream and organizes those tokens as leaves of a parse tree.

Parsing is the second half of syntax analysis. Parsers are also often auto-generated (but not always!), based on a grammar with tokens as the terminal symbols.
Semantic analysis takes the parse tree and turns it into an abstract syntax tree (AST).

This is where we move away from the concrete source code and try to get at the simplest representation of the program’s meaning.

There may also be some compile-time checks done here (which we haven’t focused on yet), like for undeclared variables or static typing.
Execution (in an interpreter) or Code generation (in a compiler).

In a compiler, there may be many more stages after this of intermediate code represenations and optimization before the final output machine code.