This unit

In this unit, over the next three weeks, we will be designing and then implementing new languages that, in addition to basic string operations, can also handle boolean operations and store values in variables.

The semantics of these languages, like in the first unit, are still pretty limited. But the syntax is getting more complicated — most of the “work” in the first unit was really defining and then processing the syntax of the language.

So our focus in class for a few weeks will be formal syntax analysis of source code, also known as scanning and parsing. We will learn about how the things you saw in your CS Theory class, namely regexes, DFAs, and grammars, are actually practical and useful tools that can make our job of defining and implementing new programming languages much simpler.

Summary (for today)

Regular expressions for describing parts of a program

We have already seen, in homework and in class, how regular expressions can be used to specify and to process individual pieces of source code.

For example, take string literals in C++ like "a string" or "something \"complicated\"". We can try to define this using English, like:

A string literal is a double-quote character, followed by a sequence of regular characters or escape sequences, followed by another double-quote character. A regular character is anything other than a double-quote or a backslash. And an escape sequence is a backslash followed by any single character.

But that’s awfully wordy, and maybe prone to misinterpretation. Regular expressions allow us to be clear and precise both to other humans (maybe) and automated compiler tools (definitely). Here is a regex for C++ string literals. If you can understand this fully, you are well on your way to becoming a regex master!

"(\\.|[^"\\])*"
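As a quick sanity check, here is a small sketch (the class name is just for illustration) that tries this regex against the examples above using java.util.regex. Note that every backslash and double-quote in the regex has to be escaped again inside the Java string literal:

```java
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // The regex above, escaped once more for the Java string literal.
        Pattern literal = Pattern.compile("\"(\\\\.|[^\"\\\\])*\"");

        System.out.println(literal.matcher("\"a string\"").matches());                       // true
        System.out.println(literal.matcher("\"something \\\"complicated\\\"\"").matches()); // true
        System.out.println(literal.matcher("\"no closing quote").matches());                 // false
    }
}
```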

Breaking this down:

  1. The first " matches the opening double-quote of the literal.
  2. \\. matches an escape sequence: a (literal) backslash followed by any single character.
  3. [^"\\] matches a regular character: anything other than a double-quote or a backslash.
  4. (\\.|[^"\\])* matches any sequence of escape sequences and regular characters, in any order.
  5. The final " matches the closing double-quote.

In Java, we could define an instance of the java.util.regex.Pattern class that matches exactly C++ string literals, with this line of code:

Pattern literals = Pattern.compile("\"(\\\\.|[^\"\\\\])*\"");

Admittedly, this looks awful. We have to use backslashes again at the level of Java source code to escape each of the backslashes and double-quote characters in the actual regex we are defining. We are now working with three levels of representation: the actual C++ string literal, the regular expression describing such literals, and the Java string literal describing the regex. There are escape sequences at every level, and in fact all three languages (C++, regexes, and Java) use the same escape character \, so we end up with the monstrosity you see before you.
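One way to convince yourself the escaping came out right is to print the Java string and check that the result is the regex we wrote earlier. A throwaway sketch (the class name is arbitrary):

```java
public class EscapeLevels {
    public static void main(String[] args) {
        // Level 3: the Java string literal below...
        String regex = "\"(\\\\.|[^\"\\\\])*\"";
        // ...decodes at runtime to level 2, the regex itself:
        System.out.println(regex);  // "(\\.|[^"\\])*"
        // ...which in turn matches level 1: actual C++ string literals.
    }
}
```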

The bottom line is, if we can master the art of writing regexes (or use tools to generate them), they can greatly simplify a program that needs to read structured text like source code. Remember that program you wrote to try to find and print out the string literals in a source code file? Here’s a single statement in Java that uses our regex above to basically do the whole job:

Pattern.compile("\"(\\\\.|[^\"\\\\])*\"")
    .matcher(Files.readString(Paths.get(args[0])))
    .results().map(MatchResult::group).forEach(System.out::println);

Defining and recognizing tokens

A complete source file isn’t just made up of string literals, of course. Thinking about a typical C program, a few kinds of tokens come to mind:

  1. Keywords, like for, while, and int
  2. Identifiers, like the names of variables and functions
  3. Numeric literals, both integer (12) and floating-point (12.34)
  4. String and character literals
  5. Operators and punctuation, like =, >=, and ;
  6. Comments and whitespace

Note that these are defined independently of the context where they are used — that will come next, in parsing. The job of a scanner is just to:

  1. Read the source code
  2. Break the complete source code into a sequence of non-overlapping tokens, according to the various token types and each one’s defining regex
  3. Throw away tokens that are known to be completely meaningless in the language — in this case, comments and whitespace
  4. Pass the resulting token stream along to the parser to continue the work
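Putting the four steps together, here is a hedged sketch of a tiny scanner in Java. The class, record, and token names are all invented for illustration, the regexes cover only a sliver of C, and the longest-match tie-breaking it uses anticipates the disambiguation rules discussed below:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyScanner {
    // A few illustrative token types, listed from highest to lowest precedence.
    record TokenType(String name, Pattern regex, boolean discard) {}
    record Token(String type, String text) {}

    static final List<TokenType> TYPES = List.of(
        new TokenType("KEYWORD",    Pattern.compile("for|while|if"),           false),
        new TokenType("IDENTIFIER", Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"), false),
        new TokenType("FLOAT",      Pattern.compile("[0-9]+\\.[0-9]+"),        false),
        new TokenType("INT",        Pattern.compile("[0-9]+"),                 false),
        new TokenType("PUNCT",      Pattern.compile("[(){};=<>+*-]"),          false),
        new TokenType("WHITESPACE", Pattern.compile("\\s+"),                   true)
    );

    static List<Token> scan(String src) {                  // step 1: the source is in src
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {                       // step 2: split into tokens
            TokenType best = null;
            int bestEnd = pos;
            for (TokenType t : TYPES) {
                Matcher m = t.regex().matcher(src).region(pos, src.length());
                // lookingAt() tries to match starting exactly at pos
                if (m.lookingAt() && m.end() > bestEnd) {  // strictly longer match wins;
                    best = t;                              // on a tie, the earlier-listed type keeps its spot
                    bestEnd = m.end();
                }
            }
            if (best == null)
                throw new IllegalStateException("no token matches at position " + pos);
            if (!best.discard())                           // step 3: throw away whitespace
                tokens.add(new Token(best.name(), src.substring(pos, bestEnd)));
            pos = bestEnd;
        }
        return tokens;                                     // step 4: hand the stream to the parser
    }
}
```

For example, scanning for (i = 12.34; with this sketch should yield KEYWORD for, PUNCT (, IDENTIFIER i, PUNCT =, FLOAT 12.34, PUNCT ;, with the whitespace dropped.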

But despite this seemingly straightforward job of splitting up source code into regex-matching pieces, there are a number of potential conflicts we can see from the list above. For example:

  1. Should 12.34 be scanned as a single floating-point literal, or as the integer 12, a dot, and the integer 34?
  2. Should for be scanned as a keyword, or as an identifier?
  3. Should >= be scanned as a single operator, or as > followed by =?

None of these questions are answered by the regexes alone. Even though we can see what the answer should be each time since we understand C code, the scanner needs concrete rules to help it resolve these ambiguities. Fortunately, there are really just two of them:

  1. Maximal munch: We always prefer to take the longest possible token starting at the current position.
  2. Precedence: The list of token types is ordered from highest to lowest precedence. If more than one token type matches at the current position, and the matches have the same length, we take the one with higher precedence.

So for example, maximal munch explains why 12.34 should be a single floating-point literal, and precedence explains why for should be a keyword rather than an identifier.
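To see both rules in action on their own, here is a small self-contained sketch (the token names and regexes are made up for the example) that classifies the token starting at the front of a string:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MunchDemo {
    // Insertion order encodes precedence: KEYWORD is listed before IDENTIFIER.
    static final Map<String, Pattern> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("KEYWORD",    Pattern.compile("for|while|if"));
        TYPES.put("IDENTIFIER", Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"));
        TYPES.put("FLOAT",      Pattern.compile("[0-9]+\\.[0-9]+"));
        TYPES.put("INT",        Pattern.compile("[0-9]+"));
    }

    // Return the winning token type at the start of src.
    static String classify(String src) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> e : TYPES.entrySet()) {
            Matcher m = e.getValue().matcher(src);
            if (m.lookingAt() && m.end() > bestLen) {  // strictly longer wins: maximal munch
                best = e.getKey();                     // on a tie, the earlier entry wins: precedence
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("12.34"));    // FLOAT (maximal munch beats INT's "12")
        System.out.println(classify("for"));      // KEYWORD (precedence beats IDENTIFIER)
        System.out.println(classify("fortune"));  // IDENTIFIER (maximal munch: 7 chars beat "for")
    }
}
```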