This unit

In this unit, over the next three weeks, we will be designing and then implementing new languages that, in addition to basic string operations, can also handle boolean operations and store values in variables.

The semantics of these languages, like in the first unit, are still pretty limited. But the syntax is getting more complicated — most of the “work” in the first unit was really defining and then processing the syntax of the language.

So our focus in class for a few weeks will be formal syntax analysis of source code, also known as scanning and parsing. We will learn about how the things you saw in your CS Theory class, namely regexes, DFAs, and grammars, are actually practical and useful tools that can make our job of defining and implementing new programming languages much simpler.

Summary (for today)

Scanning is the part step of syntax analysis, which is the first step of any interpreter or compiler
Tokens are specified using regular expressions
The full scanner recognizes multiple tokens using a single DFA
When more than one token could potentially match, the choice is made based on maximal munch and precedence.

Regular expressions for describing parts of a program

We have already seen, in homework and in class how regular expressions can be used to specify and to process individual pieces of source code.

For example, take string literals in C++ like "a string" or "something \"complicated\"". We can try to define this using English, like:

A string literal is a double-quote character, followed by a sequence of regular characters or escape sequences, followed by another double-quote character. A regular character is anything other than a double-quote or a backslash. And an escape sequence is a backslash followed by any single character.

But that’s awfully wordy, and maybe prone to misinterpretation. Regular expressions allow us to be clear and precise both to other humans (maybe) and automated compiler tools (definitely). Here is a regex for C++ string literals. If you can understand this fully, you are well on your way to becoming a regex master!

"(\\.|[^"\\])*"

Breaking this down:

The first and last " just mean what they say, to match a double-quote character at the beginning and the end
The | in the middle is an “alternation” character, which acts like an “or” operation matching either the regex on the left or on the right.
The ( and ) define a group, and are there so that the | operator just works inside this group.
The left-side option \\. is a sequence of two characters. The first, specified by \\, matches a single backslash — we write it twice because backslash is also an escape character. And then the dot . matches any single character. So \\. means “a backslash followed by any single character”, i.e., an escape sequence in the string.
[] defines a character class for matching a single character from the group. When the first thing inside the [] is a carat ^, it means to take any character except the ones specified. So [^"\\] means any single character that isn’t a double-quote or a backslash.
The * matches zero or more instances of the previous thing, i.e., it indicates repetition. In this case, it’s repeating the whole thing inside the () that comes right before it.

In Java, we could define an instance of the java.util.Pattern class which searches exactly for C++ string literals, with this line of code:

Pattern literals = Pattern.compile("\"(\\\\.|[^\"\\\\])*\"");

Admittedly, this looks awful. We have to use backslashes again at the level of Java source code to escape each of the backslashes or double-quote characters in the actual regex we are defining. We are now working two levels deep in abstraction: the actual C++ string literal, then the regular expression describing such literals, and then the Java string literal describing the regex. There are escape sequences at every level here, and in fact all three languages (the language of C++, of regexes, and Java) all use the same escape character \, so we end up with the monstrosity you see before you.

The bottom line is, if we can master the art of (or use tools to) generating regexes, they can greatly simplify a program that needs to read structured text like source code. Remember that program you wrote to try and find and print out the string literals in a source code file? Here’s a single statement in Java that uses our regex above to basically do the whole job:

Pattern.compile("\"(\\\\.|[^\"\\\\])*\"")
    .matcher(Files.readString(Paths.get(args[0])))
    .results().map(MatchResult::group).forEach(System.out::println);

Defining and recognizing tokens

A complete source code isn’t just made of of string literals of course. Thinking about a typical C program, a few kinds of tokens come to mind:

Whitespace (spaces, tabs, newlines)
String literals like "this"
Char literals like 'c'
Integer literals like 123
Floating-point literals like 6.7
Identifiers (i.e. variable and function names) such as x or card17
Comments like /* this */
Parentheses and brackets: at least 6 literals for {, }, [, ], (, )
The various operators such as *, +, /, &, `&&, ==, >, … (and many more)
The all-important semicolon ;
A single period . for struct member access
Special keywords such as if, for, while, return

Note that these are defined independently of the context where they are used — that will come next, in parsing. The job of a scanner is just to:

Read the source code
Break the complete source code into a sequence of non-overlapping tokens, according to the various token types and each one’s defining regex
Throw away tokens that are known to be completely meaningless in the language — in this case, comments and whitespace
Pass the resulting token stream along to the parser to continue the work

But still, despite this seemingly straightforwrad job of splitting up source code into regex-matching pieces, there are still a number of potential conflicts we can see from the list above. For example:

Should abc be a single variable name or three variable names right next to each other?
Is 12.34 a single floating-point literal, or an int literal, followed by a single ., followed by another int literal?
Is for a variable name or a keyword?
Is ++ one operator or two?

None of these questions are answered by the regexes alone. Even though we can see what the answer should be each time since we understand C code, the scanner needs concrete rules to help it resolve these ambiguities. Fortunately, there are really just two of them:

Maximal munch: We always prefer to take the longest possible token starting at the current position.
Precedence: The list of tokens is ordered from hightest to lowest precedence. If more than one token type matches at the current position, and they have the same length, we take the one with higher precedence.

So for example, maximal munch explains why 12.34 should be a single floating-point literal, and precedence explains why for should be a keyword rather than an identifier.