Summary

Scanner DFAs

Last class we talked about the first step of syntax analysis, which is scanning or tokenization. As a quick reminder/summary:

In your CS theory class, you learned some of the theory behind regular expressions and regular languages, and that any regex can be turned into a deterministic finite automaton (DFA). Besides their theoretical interest, DFAs can be efficiently implemented in software to recognize languages in linear time with no memory.

So the question is, how can we take a prioritized list of tokens and generate a single scanner for the entire DFA?

  1. Generate a DFA for each token type based on each regex.
  2. Use the algorithm for the union of regular languages to create a single DFA for all of the tokens.
  3. Label each accepting state in the combined DFA with the highest-priority token type which accepts on that state.
  4. When reading in from source, keep going until the first transition that would go from an accepting to a non-accepting state. (This is how maximal munch is implemented.)

For example, consider a scanner with just the if keyword, add/subtract operators, (signed) integers, and variable names (identifiers):

Here’s how the combined scanner for these four token types might look.

Scanner DFA

Take special note of two more things which are also part of the Scanner DFA, and really “should” be in the picture above, but are left out because it would get too cluttered:

More about comments and whitespace

We can think of whitespace and comments as having no real role in the formal syntax except to break up other tokens. For example, this line is obviously valid C code:

int x = 1234;

But this one will give you a syntax error:

int x = 12/* a comment breaks this int token into two */34;

Here, the comment is indeed recognized and ignored by the scanner, so it doesn’t correspond to an actual token. But because of how the scanner DFA works, it will interpret 12 as one integer literal, and 34 as a separate, disconnected integer literal.

Overarching principle: do as much as you can as soon as you can

There is an important principle of compiler (or interpreter) design on display here. Compared to all the other tools that come later in the compilation pipeline, regular expressions and DFA are by far the simplest and fastest (from the standpoint of the computer at least). So we want to take care of as much as possible at this initial stage.

Would it be possible to leave whitespace and comments until the next stage (parsing)? Yes, we could do that. But it would make the grammar and everything else much more complicated to have to account for all the places where comments and whitespace could potentially occur.

Instead, if we let the scanner take care of these things (and essentially “hide” them from the next steps), it will make those next steps easier and more efficient. So that’s what we do.

Or to put it more succinctly: if the scanner can do it, the scanner should do it.

Parsing

The first step of any interpreter or compiler is syntax analysis. And as we have been learning for the last two classes, the first step of syntax analysis is scanning, stripping out comments and whitespace, and breaking the raw source code into a token stream based on regexes.

The second half of syntax analysis is called parsing. This is about how the tokens are allowed to be combined with each other according to the syntax of the language. We specify these “ways of combining tokens” with formal grammars.

As an analogy, we can think about the written English language: The scanner identifies individual words and punctuation characters. The grammar specifies how those words and punctuation marks can be put together to form sentences and paragraphs.

In programming languages, parsing is also where the nesting of language structures comes into play.

Context-free grammars (CFGs)

Just like scanner/token rules are specified using regexes, parsing rules are specified using a context-free grammar. You learned about these in your CS theory class (honestly - here are the lecture notes), but let’s briefly review:

Let’s think about a simple calculator language with the INT, OP, and ID tokens from before, plus three more tokens:

`PRINT`: `print`
`LP`: `(`
`RP`: `)`

Here’s an example of a program in this calculator language.

print(10)
print(x + 4)
print(100 - (var2 + 20))

And here’s how we might write the grammar for this language.

program -> stmt program
program -> ε
stmt -> PRINT LP expr RP
expr -> ID
expr -> INT
expr -> expr OP expr
expr -> LP expr RP

This is a much more succinct and precise way of saying something like this in plain English:

A program is a sequence of zero or more statements.

A statement is the PRINT token, followed by a left parentheses, followed by any expression, and ending with a right parentheses.

An expression can be one of four things: a single identifier, an integer literal, an operation between two sub-expressions, or any sub-expression surrounded by left and right parentheses.

Complete example

Putting this all together, let’s look at how our sample program in this calculator language is transformed through both steps of syntax analysis.

Here is the original program:

print(10)
print(3 + 4)
print(100 - (10 + 20))

Which the scanner transforms to the following sequence of tokens. Each token has the token type like LP or INT, along with the actual string which matched (sometimes called the token “spelling”):

Token type   Spelling
----------   --------
PRINT        print
LP           (
INT          10
RP           )
PRINT        print
LP           (
ID           x
OP           +
INT          4
RP           )
PRINT        print
LP           (
INT          100
OP           -
LP           (
ID           var2
OP           +
INT          20
RP           )
RP           )

And now, based on the token types and the grammar shown above, these tokens are formed into a parse tree. We will talk much more about how the parse tree is created in the coming week, but now just keep in mind that:

Here is the complete tree showing how all of those tokens are put together (it’s big!):

Parse tree