Summary
- Scanner DFA generation from regular expressions and token names
- Defining how tokens are put together using context-free grammars (CFGs)
Scanner DFAs
Last class we talked about the first step of syntax analysis, which is scanning or tokenization. As a quick reminder/summary:
- Each token type has a name and a defining regular expression
- A scanner is specified by a priority-ordered list of token types and corresponding regexes
- The scanner partitions the entire input source code into a single sequence of non-overlapping tokens.
- When two tokens could match, prefer the longest token reading left to right (the maximal munch rule)
- When two token types could match with the same length, prefer the one with the higher priority
In your CS theory class, you learned some of the theory behind regular expressions and regular languages, including that any regex can be turned into a deterministic finite automaton (DFA). Besides their theoretical interest, DFAs can be implemented efficiently in software to recognize languages in linear time, using only constant memory (just the current state).
So the question is: how can we take a prioritized list of token types and generate a single DFA for the entire scanner?
- Generate a DFA for each token type based on each regex.
- Use the algorithm for the union of regular languages to create a single DFA for all of the tokens.
- Label each accepting state in the combined DFA with the highest-priority token type which accepts on that state.
- When reading from the source, keep consuming characters as long as the DFA has a valid transition; once it gets stuck, emit the token for the last accepting state it passed through, and resume scanning from the next character after that token. (This is how maximal munch is implemented.)
For example, consider a scanner with just the if keyword, add/subtract
operators, (signed) integers, and variable names (identifiers):
`IF`: `if`
`OP`: `[-+]`
`INT`: `-?[0-9]+`
`ID`: `[a-z0-9]+`
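Here is a minimal sketch of how maximal munch and priority could be implemented for these four token types. Rather than building the combined DFA explicitly, it tries each token's regex at the current position using Python's `re` module; the function name `tokenize` and the tuple-based token representation are just illustrative choices, not part of any standard API.

```python
import re

# Priority-ordered token types: earlier entries win length ties.
TOKEN_TYPES = [
    ("IF",  r"if"),
    ("OP",  r"[-+]"),
    ("INT", r"-?[0-9]+"),
    ("ID",  r"[a-z0-9]+"),
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        best_type, best_len = None, 0
        for name, regex in TOKEN_TYPES:
            m = re.match(regex, source[pos:])
            # Only a strictly longer match replaces the current best, so on
            # a length tie the earlier (higher-priority) type is kept.
            if m and len(m.group()) > best_len:
                best_type, best_len = name, len(m.group())
        if best_type is None:
            raise SyntaxError(f"no token matches at position {pos}")
        tokens.append((best_type, source[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("if-5"))   # [('IF', 'if'), ('INT', '-5')]
```

Note how `if` scans as `IF` rather than `ID` (same length, higher priority), and `-5` scans as one `INT` rather than `OP` then `INT` (maximal munch).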
Here’s how the combined scanner for these four token types might look.
Take special note of two more things which are also part of the Scanner DFA, and really “should” be in the picture above, but are left out because it would get too cluttered:
- Trap state: There is another non-accepting “trap” state with transitions on unspecified characters from every other state. (Remember, in a DFA, every state must have an outgoing transition for every possible character in the alphabet. For source code, we should consider the alphabet to be all ASCII characters.)
- Whitespace: In most (but not all!) programming languages, whitespace characters (spaces, tabs, maybe also newlines) are ignored and skipped by the scanner. In the scanner, these correspond to a “special” token type that is always ignored, rather than creating a new token in the token stream.
If we wanted to allow whitespace with the four token types above, in the picture it would correspond to another state, labeled “ignore” or “skip”, with a transition from the start state on any space character.
Comments are treated similarly in almost all programming languages. They are defined by a regex for the scanner, but like whitespace they are ignored and never appear in the token stream.
More about comments and whitespace
We can think of whitespace and comments as having no real role in the formal syntax except to break up other tokens. For example, this line is obviously valid C code:
int x = 1234;
But this one will give you a syntax error:
int x = 12/* a comment breaks this int token into two */34;
Here, the comment is indeed recognized and ignored by the scanner, so it
doesn’t correspond to an actual token. But because of how the scanner
DFA works, it will interpret 12 as one integer literal, and 34 as a
separate, disconnected integer literal.
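Continuing the scanner sketch from above, whitespace and comments could be handled by giving them token types that are matched normally but filtered out of the output. This is just one way to do it; the names `COMMENT`, `WHITESPACE`, and the `scan` wrapper are made up for illustration.

```python
# Two new "special" token types, placed at the top of the priority list:
TOKEN_TYPES = [
    ("COMMENT",    r"/\*.*?\*/"),    # C-style block comment (non-greedy)
    ("WHITESPACE", r"[ \t\n]+"),
] + TOKEN_TYPES

SKIP_TYPES = {"COMMENT", "WHITESPACE"}

def scan(source):
    """Tokenize, then hide the skipped types from all later stages."""
    return [tok for tok in tokenize(source) if tok[0] not in SKIP_TYPES]

print(scan("12/* a comment breaks this int token into two */34"))
# -> [('INT', '12'), ('INT', '34')]
```

The comment is recognized and consumed, but the two halves of the number still come out as separate `INT` tokens, exactly as described above.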
Overarching principle: do as much as you can as soon as you can
There is an important principle of compiler (or interpreter) design on display here. Compared to all the other tools that come later in the compilation pipeline, regular expressions and DFAs are by far the simplest and fastest (from the computer's standpoint, at least). So we want to take care of as much as possible at this initial stage.
Would it be possible to leave whitespace and comments until the next stage (parsing)? Yes, we could do that. But it would make the grammar and everything else much more complicated to have to account for all the places where comments and whitespace could potentially occur.
Instead, if we let the scanner take care of these things (and essentially “hide” them from the next steps), it will make those next steps easier and more efficient. So that’s what we do.
Or to put it more succinctly: if the scanner can do it, the scanner should do it.
Parsing
The first step of any interpreter or compiler is syntax analysis. And as we have been learning for the last two classes, the first step of syntax analysis is scanning, stripping out comments and whitespace, and breaking the raw source code into a token stream based on regexes.
The second half of syntax analysis is called parsing. This is about how the tokens are allowed to be combined with each other according to the syntax of the language. We specify these “ways of combining tokens” with formal grammars.
As an analogy, we can think about the written English language: The scanner identifies individual words and punctuation characters. The grammar specifies how those words and punctuation marks can be put together to form sentences and paragraphs.
In programming languages, parsing is also where the nesting of language structures comes into play.
Context-free grammars (CFGs)
Just like scanner/token rules are specified using regexes, parsing rules are specified using a context-free grammar. You learned about these in your CS theory class (honestly - here are the lecture notes), but let’s briefly review:
- Each token type from scanning is a terminal symbol in the grammar. By convention, we will write these in all caps like INT.
- The grammar also defines new nonterminal symbols, which we will write in lowercase like expr.
- A production rule (or just “rule”) has a nonterminal symbol on the left-hand side, then an arrow, and then a sequence of symbols (terminals or nonterminals) on the right-hand side.
- If the sequence of symbols on the right-hand side is empty, we write ε (the Greek letter epsilon).
- The same nonterminal can appear on the left-hand side of multiple production rules.
- By convention, the nonterminal on the left-hand side of the first production rule is called the “start symbol”.
Let’s think about a simple calculator language with the INT, OP, and
ID tokens from before, plus three more tokens:
`PRINT`: `print`
`LP`: `(`
`RP`: `)`
Here’s an example of a program in this calculator language.
print(10)
print(x + 4)
print(100 - (var2 + 20))
And here’s how we might write the grammar for this language.
program -> stmt program
program -> ε
stmt -> PRINT LP expr RP
expr -> ID
expr -> INT
expr -> expr OP expr
expr -> LP expr RP
This is a much more succinct and precise way of saying something like this in plain English:
A program is a sequence of zero or more statements.
A statement is the PRINT token, followed by a left parenthesis, followed by any expression, and ending with a right parenthesis.
An expression can be one of four things: a single identifier, an integer literal, an operation between two sub-expressions, or any sub-expression surrounded by left and right parentheses.
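If you wanted to represent this grammar in code (say, for a hand-written parser), one natural encoding is a dictionary mapping each nonterminal to its list of alternatives. This is just a sketch of one possible representation, not a standard API:

```python
# Uppercase strings are terminals (token types); lowercase are nonterminals.
# Each alternative is a list of symbols; the empty list is an ε production.
GRAMMAR = {
    "program": [
        ["stmt", "program"],
        [],                               # program -> ε
    ],
    "stmt": [
        ["PRINT", "LP", "expr", "RP"],
    ],
    "expr": [
        ["ID"],
        ["INT"],
        ["expr", "OP", "expr"],
        ["LP", "expr", "RP"],
    ],
}

# By convention, the start symbol is the LHS of the first rule.
START_SYMBOL = "program"
```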
Complete example
Putting this all together, let’s look at how our sample program in this calculator language is transformed through both steps of syntax analysis.
Here is the original program:
print(10)
print(x + 4)
print(100 - (var2 + 20))
Which the scanner transforms to the following sequence of tokens. Each token has the token type like LP or INT, along with the actual string which matched (sometimes called the token “spelling”):
Token type Spelling
---------- --------
PRINT print
LP (
INT 10
RP )
PRINT print
LP (
ID x
OP +
INT 4
RP )
PRINT print
LP (
INT 100
OP -
LP (
ID var2
OP +
INT 20
RP )
RP )
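For what it's worth, this token stream can be reproduced with the scanner sketch from earlier in these notes, extended with the calculator token types plus a skipped WHITESPACE type (again a sketch; it reuses the tokenize function defined above):

```python
TOKEN_TYPES = [
    ("WHITESPACE", r"[ \t\n]+"),   # matched, but filtered out below
    ("PRINT", r"print"),
    ("LP",    r"\("),
    ("RP",    r"\)"),
    ("OP",    r"[-+]"),
    ("INT",   r"-?[0-9]+"),
    ("ID",    r"[a-z0-9]+"),
]

program = "print(10)\nprint(x + 4)\nprint(100 - (var2 + 20))"
for token_type, spelling in tokenize(program):
    if token_type != "WHITESPACE":
        print(token_type, spelling)   # prints the table above, row by row
```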
And now, based on the token types and the grammar shown above, these tokens are formed into a parse tree. We will talk much more about how the parse tree is created in the coming week, but for now just keep in mind the following (illustrated in the sketch after this list):
- The leaves of the tree are always tokens.
- The root is the start symbol (first nonterminal in the grammar).
- The internal nodes are nonterminals.
- The children of an internal node correspond to the right-hand side of some production rule for that nonterminal.
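As a concrete illustration of these four properties, here is how the parse tree for the first statement, print(10), could be built by hand. The ParseNode class is hypothetical, invented just for this sketch:

```python
class ParseNode:
    """An internal node: a nonterminal plus its children, in order."""
    def __init__(self, symbol, children):
        self.symbol = symbol        # nonterminal name, e.g. "stmt"
        self.children = children    # ParseNodes and (type, spelling) tokens

# stmt -> PRINT LP expr RP, where expr -> INT
tree = ParseNode("stmt", [
    ("PRINT", "print"),             # leaves are always tokens
    ("LP", "("),
    ParseNode("expr", [("INT", "10")]),
    ("RP", ")"),
])
```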
Here is the complete tree showing how all of those tokens are put together (it’s big!):