Lab 3: Tokenization and Parsing

You may work in pairs for this lab. If you choose to do so, only have one partner submit the required files electronically. Be sure to include both partners' names at the top of every submitted file.

This lab is due at the beginning of your next lab period. You should submit a folder called lab03 containing all of the following files:

features.txt: Describes which features you have implemented in a specified format (see below).
calc.ypp: The Bison file describing the parser, with all of your features implemented.
calc.lpp: Flex input file describing the scanner, with all of your features implemented.
Makefile: Should have a rule to compile everything into a program called mycalc.

Flex basics and flex/bison interactions

Flex is a tool that generates scanners. One of its main purposes is to provide yylex() functions for bison-generated parsers. Flex and bison are GNU versions of lex and yacc, which are traditional Unix development tools. They both have manuals online: flex manual, bison manual.

Download the following files into your lab03 directory:

These three files define a simple calculator program. In fact, it's the same simple program we had from Class 7. The big difference is that now, instead of writing our scanner by hand, we're using the standard tool flex to create it.

What do flex and bison do?

Look at the Makefile. We start off with the file calc.ypp that defines the grammar and action rules for our calculator. The call

bison -d calc.ypp

creates files calc.tab.cpp and calc.tab.hpp. The first contains the C++ code for the parser (which is a function called yyparse()), the second is a header file that contains, among other things, the definitions of the token names NUM, OPA, etc., and the variable yylval that we use to pass the bison code the semantic values of tokens. The flex file calc.lpp needs those definitions so that the yylex() function it defines for bison can pass back the information it needs to pass back.

The next thing we have is the flex file calc.lpp. The call

flex -o lex.yy.cpp calc.lpp

creates the file lex.yy.cpp that contains, among other things, the definition of the yylex() function. Of course we haven't talked about flex, so this file is new. The basic idea is that you write regular expressions for the different token types, and actions for each regular expression. The flex generated scanner code tries to match characters from the current input stream to these regular expressions, and when a match is found, it executes the associtated action code. The variable yytext contains the string (in the C sense, i.e. '\0' terminated char*.) of characters that were matched. When more than one match is possible it breaks ties by going with the longest match --- i.e. it follows the "maximal munch" rule. When there are several "longest" matches, it goes with the regular expression listed first in the flex file. Of course, you should realize that flex is simply creating the same program implementing a finite automaton that we talked about in class. The difference is that we users specify tokens by regular expressions. So flex has to convert the regular expressions into finite automata for each token type, which are then combined into a single finite automaton. This is then converted, automatically, into a program.

Exercises

Examine these files carefully and make sure you understand what flex is doing and how flex and bison interact and communicate. Run make and confirm that the mycalc program works as intended.

Adding Features to the Calculator

For the remainder of this lab, you will add features to the simple calculator program. These features will involve significant additions to the scanner (calc.lpp), and some to the parser (calc.ypp) as well.

The features you will implement are listed as A-J below. Everyone must implement features A-E. You must also implement at least two of the features F-J.

Peer Review and features.txt

Next week, each submission will be (anonymously) reviewed by your classmates. The goal of this peer review will be to judge whether all features have been implemented correctly and completely. When developing your lab this week, think about how you would rigorously test every feature. Part of your lab grade will be based on the quality of your review.

To facilitate this review, you must list every feature that you implement in the features.txt file. An example file is here: features.txt. Since I have to process these files very quickly at the beginning of next week's lab, you must follow the following formatting rules:

The first five lines are for you to put your name(s) etc. and will be stripped from the file before it is printed, for anonymity.
For each feature that you worked on and successfully implemented, include a block of text starting with %%, and ending with a completely blank line. The first line (after the %%) should give the name of that feature, and the following lines include any description or details of how that feature works in your program. For instance,
```
%%Feature A
Here are the details for how feature A works in my program.
This is a feature that was SUCCESSFULLY IMPLEMENTED.
This part can contain any amount of text, but no blank lines.
The blank line below ends this block.
```
For each feature that you worked on but did not complete, include a similar block starting with $$ instead.

You will be reviewed by your peers and by your instructor on all features that you worked on, whether successfully implemented or not. You will get the highest grade for features that you claim are successfully implemented, and which actually are. The next highest grades will be for features that are almost complete, and which you acknowledge (with $$) are not completed. Lower grades will be given to features which are found to be buggy, but which claim to be successfully implemented.

Features

Modify the scanner so that it accepts integers in either decimal or binary. Specifically, a string of only 0's and 1's that ends in the letter "b" (with no whitespace in between) should be interpreted as a binary integer. For example:
```
> ./mycalc
10 + 3;
13
10b + 3;
5
```
This should only require changing the flex scanner specification file. The cstdlib function strtol will be useful here (check it out with man strtol). To convert a binary integer stored in the C string called str, you can write strtol(str, NULL, 2).
Modify the scanner and parser so that terminating a statement with ";b" instead of ";" results in the output being printed in binary. To make your life easier ...
```
// print n as a signed binary number
void printsbin(int n)
{
  if (n < 0) { cout << "-"; n = -n; }
  char res[33] = { '\0' };
  int i;
  for(i = 31; n != 0; --i, n /= 2)
    res[i] = (n & 1 ? '1' : '0');
  cout << (i == 31 ? "0" : res + i + 1);
}
```
Now this is going to require modifying the bison file calc.ypp as well as the flex file. It should be a simple change if you understand the system, though: you need to define a new token (I might call it STOPB, or something like that), you need to modify your flex file to recognize the new token, and you need to create a new rule in the grammar to use the token.
```
> ./mycalc
10+3*5;
25
10+3*5;b
11001
10b+3*5;b
10001
10b+3*5;
17
```
Next we add comments! They're not too exciting in a simple calculator, but comments are always a good thing to have. C-sytle comments /* like this */ are not completely straightforward with flex. Why? Well, suppose you added the flex rule:
```
"/*".*"*/" { }
```
This doesn't work because .* doesn't capture \n's, and the whole point of C-style comments is to allow multi-line comments. So we make it like this:
```
"/*"(.|\n)*"*/" { }
```
This doesn't work either, and the problem is more fundamental. Remember that the rule is that the longest match possible is the one flex makes, so in
```
/* start */ x = 4; /* end */
```
flex thinks the whole string is a comment. In this case, we don't want maximal munch! (We have the same problem with string literals.) So we'll take the easy way out and use another common comment convention: A "#" character starts a comment and the comment extends to the next newline character. Here's an example session:
```
> ./mycalc
1+2+3+4+5+6+7+8+9+10; # sum 1..10 computed explicitly
55
10*(10+1)/2; # sum 1..10 computed by Gauss's closed form
55
```
Note that although the newline is part of the comment, the comment does separate tokens. So
```
1000# beginning of one million
000# end of one million
```
is interpreted as 1000 followed by 0, which gives an error. Not a lexical error, though, rather a parse error!

Food for thought

One of the real advantages of using a tool like flex as opposed to writing your scanner by hand is that adding new tokens usually has no effect whatsoever on what you've already done. You simply add a new line to your flex file. When you do this by hand, that's not the case: a new token may change the underlying finite automaton substantially.

Variables

Our calculator is getting a bit more interesting. One thing it doesn't allow, though, is variables. So let's add variables. We'll start with just a single variable called an accumulator. Observe that there are two ingredients to supporting a variable:

Assignment How do we set the value in the accumulator? We need an assignment statement. Let's add a line
```
res: SETACC exp STOP
```
to the grammar, along with a new token type SETACC. This token should correspond to some new operator that we choose; let's use a single exclamation point "!". This statement will set the value of the accumulator to the result of exp.
Now the actual value should be stored in an int somewhere in your code. Should it be in the (bison) parser code or the (flex) scanner code?
Reference Setting the accumulator is kind of useless if we can't actually access that value. We first need to decide how the accumulator will be referenced syntactically. For this, let's use the symbol @, and make a new token called ACC. Then we also need to add this into the grammar: ACC should be another expansion of the nonterminal factor.

Features:

Add assignment and reference to a single accumulator variable to your calculator, as described above. Then we should be able to do:
```
> ./mycalc
! 5;
@;
5
! 3 * 2;
@;
6
@ + 4;
10
```
You may decide to use symbols other than ! and @, but be sure to document this in your features.txt file if you do.

Named variables: Implementing a Symbol Table

Now let's allow many variables instead of just one. This will involve creating a symbol table to map variable names to values.

We won't require any declarations; having a name on the left-hand side of an assignment will be enough to create it. Syntax-wise, we have some decisions to make:

What should we use as an assignment operator? Some popular choices have been :=, =, to as in x to 1, and <- as in x <- 1. The last one is going to be problematic for us (why?). We'll use ":=" for these examples. If you choose to use something else, be sure to document it in features.txt.
What strings can be used as variable names? In Scheme almost anything is okay as a variable name. (Can you think of a string that isn't OK?) In Perl all variable names for numeric values like ours are prefixed with a $ symbol. In C/C++ and Java you have letters, digits and underscores, along with the restrictions that a number cannot begin a variable name. In defining something like this, the "character class" syntax for regular expressions is nice. For example, if you want C/C++ style variable names you'd use:
```
[a-zA-Z_][a-zA-Z0-9_]*
```
You can make whatever restrictions you like on variable names, as long as they include normal, alphabetic names such as var. Document the allowable variable names in your features.txt file.

Semantically, we have a few decisions as well:

Should assignment be an expression, i.e. something that can be nested within other expressions? Let's say no. We'll fit assignment into the grammar as: res: NAME ASN exp STOP, where NAME and ASN are new tokens.
What should happen when an uninitialized variable name is used, i.e. when a name gets used before it's been given a value? Let's print an error message and abort.

Finally, we have some technical implementation issues to worry about.

How will we implement a symbol table?
The C++ Standard Template Library (STL) is an extremely powerful language feature ... not always super easy to use, but very powerful. One of its most useful "containers" is the map. A "map" is essentially the abstract data type the red-black tree was invented to implement. It's a lot like an array that you don't need to index by consecutive integers, a structure called an associative array. Here's an example that uses a map mapping C++-style strings to int's:
```
#include <string>
#include <iostream>
#include <map>
using namespace std;

int main()
{
  map<string,int> T;
  string s;
  for(int i = 1; cin >> s && s != "done"; ++i)
    T[s] = i;
  cout << endl;
  
  cin >> s;
  if (T.find(s) == T.end())
    cout << "String " << s << " was not entered." 
         << endl;
  else
    cout << "String " << s << " was the "
         << T[s] << "th string entered." << endl;
}
```
A sample run of this program:
```
foo
bar
star
cat
done

cat
String cat was the 4th string entered.
```
Note that you must #include <map> to use maps. We can simply declare our symbol table as a map in the initial declarations section of the bison file.
The way this works is that T is initially empty, but as soon as you index it with something, like T[string("the")], the mere act of indexing creates an entry with key "the". One thing about this is that you can't really test to see whether there's already an entry for "the" this way. The way you make that test is: if (T.find(string("the")) == T.end()) ....
How will we pass names as semantic values? The natural way to pass names around as semantic values is to add a field of type string (i.e. C++-style string) in the %union statment, but that doesn't work. Bison is unable to handle any kind of object there that has a constructor. So ... we will pass around C-style strings instead. So you'll add something like
```
%union {
  int val;
  char sym;
  char *str;
};
```
to your bison file so that you can handle semantic values that are C-style strings. One pitfall: in your flex file, yytext is a C-style string giving the text that matched the token's pattern. However, you can't just assign that to yylval.str, because flex may use yytext's memory for later scanning. Instead you need to make a copy of yylval.str. The strdup function from C's string library (that'd be #include <cstring> in C++) does that for you. You can check it out with man strdup. This approach of passing C-style strings as semantic values is fine for this lab, but it has some long term issues, since we may "lose" strings before we have a time to free the memory they use.
Teaching bison about char* semantic values. In your action rules, you've already used things like $1, $2, and $$. Did you notice that seemed to magically do the right thing even though some of the values were integers, and some were characters? No magic really -- the type of each symbol (terminal or non-terminal) depends on what is defined with the %token keywords, just under the union declaration. So.... you'll need to do the right thing with a %token declaration to deal with the 'str' part of the union that you just created above.

Features:

Add a symbol table and variables as described above to your calculator program. Document in features.txt the assignment operator as well as the allowable variable names. If you used := for assignment, we should be able to do something like:
```
> ./mycalc
x := 7;
y := 93;
# Now we compute the sum x + ... + y
y*(y + 1)/2 - (x-1)*x/2;
4350
# Now in binary!
y*(y + 1)/2 - (x-1)*x/2;b
1000011111110
```

Last part: Your choice

You are required to implement at least two of the following "extra" features. Be sure to document what you do in features.txt.

Features:

Allow single-line strings as comments, which should be ignored. If you use double-quotes like "a comment", then we should be able to do things like:
```
> ./mycalc
5 + 3; "A comment"
8
5 + "Another comment" 3;
8
```
Make variable names case-insensitive. So for example we can do
```
> ./mycalc
var := 10;
VAR;
10
vAR := 11;
var;
11
```
Add in support for two new math operators: exponentiation and remainder (i.e., mod). Be sure to document what the operators are called in features.txt.
Allow numbers to be entered in scientific notation using the letter "e". Of course these will still be ints, so still no decimal points and the exponent can't be very big. It's up to you to specify the exact syntax, but this should not require any new grammar rules. For instance,
```
> ./mycalc
21e2;
2100
5e6;
5000000
```
Any other feature that you choose. (Run it by the instructor first.)