Lab 4: Intro to Flex and Bison
This lab is due at 2359 on Wednesday, 29 September. It should be submitted to the course SI413 as Lab04, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.
1 Starter Code
This lab starts a new group of labs. This means two things. First, you need to choose a new lab partner and make sure to notate that on the shared Google sheet.
Second, you need to create a new repo for your shared work running the following command. (One lab partner should run this command first to create the shared repo, and then the other lab partner should run it next to clone the repo.)
roche@ubuntu$
413repo parsing
If that command doesn't work, check the initial setup instructions.
You should now have a folder like
si413/parsing/lab04
with the starter code for this lab and a handy script to submit your code.
You can also browse the starter code from the Gitlab page (USNA intranet only).
2 Flex basics and flex/bison interactions
Flex is a tool that generates scanners. One of its main purposes is to provideyylex()
functions for
bison-generated
parsers. Flex and bison are GNU versions of lex and yacc,
which are traditional Unix development tools.
They both have manuals online:
flex manual,
bison manual.
The files calc.lpp and calc.ypp in the starter code define a simple calculator program. It follows a similar grammar to what we've seen in Unit 4, and is pretty self explanatory: addition, subtraction, multiplication, and division with numbers - using parenthesis too - is what's supported at the moment.
What do flex and bison do?
Look at the Makefile. We start off with the file calc.ypp that defines the grammar and action rules for our calculator. The call
bison -d calc.yppcreates files
calc.tab.cpp
and
calc.tab.hpp
. The first contains the C++ code for
the parser (which is a function called
yyparse()
), the second is a header file that
contains, among
other things, the definitions of the token names NUM, OPA,
etc., and the variable yylval
that we use to pass
the bison code the semantic values of tokens (for instance, the
numerical value). The flex file
calc.lpp
needs those definitions so that the
yylex()
function it defines for bison can pass
back the information it needs to pass back.
The next thing we have is the flex file calc.lpp
.
The call
flex -o lex.yy.cpp calc.lppcreates the file
lex.yy.cpp
that contains, among
other things, the definition of the yylex()
function. Of course we haven't talked about flex, so this file
is new. The basic idea is that you write regular expressions
for the different token types, and actions for each regular
expression. The flex generated scanner code tries to match
characters from the current input stream to these regular
expressions, and when a match is found, it executes the
associated action code. The variable yytext
contains the string (in the C sense, i.e. null-terminated
char
array)
of characters that were matched. When more than one
match is possible, it breaks ties by going with the longest
match - i.e., it follows the "maximal munch" rule.
When there are several "longest" matches, it goes with the regular expression listed first in the flex file. Of course, you should realize that flex is simply creating the same program implementing a finite automaton that we talked about in class. The difference is that we users specify tokens by regular expressions. So flex has to convert the regular expressions into finite automata for each token type, which are then combined into a single finite automaton. This is then converted, automatically, into a program.
Exercises
-
Examine these files carefully and make sure you understand what flex is
doing and how flex and bison interact and communicate. Run
make
and confirm that the calc program works as intended. (Nothing to turn in or test for this part.)
3 Adding Features to the Calculator
For the remainder of this lab, you will add features to the simple
calculator program. These features will involve significant additions to
the scanner (calc.lpp
), and some to the parser
(calc.ypp
) as well.
Exercises
-
Modify the scanner so that it accepts integers in either decimal or binary. Specifically, a string of only 0's and 1's that ends in the letter "b" (with no whitespace in between) should be interpreted as a binary integer. For example:
roche@ubuntu$
./calc
>
10 + 3;
13
>
10b + 3;
5
This should only require changing the flex scanner specification file. The
cstdlib
functionstrtol
will be useful here (check it out withman strtol
). To convert a binary integer stored in the C string calledstr
, you can writestrtol(str, NULL, 2)
. -
Modify the scanner and parser so that
terminating a statement with ";b"
instead of ";" results in the output being printed in
binary. To make your life easier, I've already included
the following function in the starter code:
Now this is going to require modifying the bison file// Prints n as a signed binary number void printbin(int n) { if (n < 0) { resout << '-'; n = -n; } int bit = 1; while (bit > 0 && bit*2 <= n) bit *= 2; while (bit > 0) { if (bit <= n) { n -= bit; resout << '1'; } else resout << '0'; bit /= 2; } return; }
calc.ypp
as well as the flex file. It should be a simple change if you understand the system, though: you need to define a new token (I might call it STOPB, or something like that), you need to modify your flex file to recognize the new token, and you need to create a new rule in the grammar to use the token.roche@ubuntu$
./calc
>
10+3*5;
25
>
10+3*5;b
11001
>
10b+3*5;b
10001
>
10b+3*5;
17
-
Next we add comments! They're not too exciting in a
simple calculator, but comments are always a good thing
to have. C-sytle comments
/* like this */
are not completely straightforward with flex. Why? Well, suppose you added the flex rule:"/*".*"*/" { }
This doesn't work because.*
doesn't capture\n
's, and the whole point of C-style comments is to allow multi-line comments. So we make it like this:"/*"(.|\n)*"*/" { }
This doesn't work either, and the problem is more fundamental. Remember that the rule is that the longest match possible is the one flex makes, so in/* start */ x = 4; /* end */
flex thinks the whole string is a comment. In this case, we don't want maximal munch! (We have the same problem with string literals.) So we'll take the easy way out and use another common comment convention: A "#" character starts a comment and the comment extends to the next newline character. Here's an example session:
Note that although the newline is part of the comment, the comment does separate tokens. Soroche@ubuntu$
./calc
>
1+2+3+4+5+6+7+8+9+10; # sum 1..10 computed explicitly
55
>
10*(10+1)/2; # sum 1..10 computed by Gauss's closed form
55
1000# beginning of one million 000# end of one million
is interpreted as 1000 followed by 0, which gives an error. Not a lexical error, though, rather a parse error!
Food for thought
One of the real advantages of using a tool like flex as opposed to writing your scanner by hand is that adding new tokens usually has no effect whatsoever on what you've already done. You simply add a new line to your flex file. When you do this by hand, that's not the case: a new token may change the underlying finite automaton substantially.
4 One variable
Our calculator is getting a bit more interesting. One thing it doesn't allow, though, is variables. So let's add variables. We'll start with just a single variable called an accumulator. Observe that there are two ingredients to supporting a variable:
- Assignment How do we set the value in the
accumulator? We need an assignment statement. Let's add a line
stmt: SETACC exp STOP
to the grammar, along with a new token typeSETACC
. This token should correspond to some new operator that we choose; let's use a single exclamation point "!
". Remember that we have to add the token's name to the bison file and add a corresponding line to the flex file.
The statement above will set the value of the accumulator to the result ofexp
.
Now the actual value should be stored in anint
somewhere in your code. Should it be in the (bison) parser code or the (flex) scanner code? - Reference Setting the accumulator is kind of
useless if we can't actually access that value. We first need to decide
how the accumulator will be referenced syntactically. For this,
let's use the symbol
@
, and make a new token calledACC
. Then we also need to add this into the grammar:ACC
should be another expansion of the nonterminalfactor
.
Exercises
-
Add assignment and reference to a single accumulator variable
to your calculator, as described above. Then we should be able to do:
roche@ubuntu$
./calc
>
! 5;
>
@;
5
>
! 3 * 2;
>
@;
6
>
@ + 4;
10
5 Many variables
Now let's allow many variables instead of just one. This will involve creating a symbol table to map variable names to values.
We won't require any declarations; having a name on the left-hand side of an assignment will be enough to create it. Syntax-wise, we have some decisions to make:
-
What should we use as an assignment operator? Some
popular choices have been
:=
,=
,to
as inx to 1
, and <- as inx <- 1
. The last one is going to be problematic for us (why?). We'll use ":=
" for this lab, with a token name ASN. -
What strings can be used as variable names? In Scheme
almost anything is okay as a variable name. (Can you think
of a string that isn't OK?) In Perl all variable names
for numeric values like ours are prefixed with a $ symbol.
In C/C++ and Java you have letters, digits and
underscores, along with the restrictions that a number
cannot begin a variable name. In defining something like
this, the "character class" syntax for regular expressions
is nice. For this lab, we'll say variable names can be any string
of at least 1 digits and letters. We can specify this to flex with
the regular expression
[a-zA-Z0-9]+
(We will have to make sure it gets added low enough in the flex file so this doesn't interfere with NUMs.)
Semantically, we have a few decisions as well:
- Should assignment be an expression, i.e. something that
can be nested within other expressions? Let's say no.
We'll fit assignment into the grammar as:
stmt: NAME ASN exp STOP
where NAME and ASN are new tokens. -
What should happen when an uninitialized variable name is
used, i.e. when a name gets used before it's been given a
value? Let's print an error message and abort.
Remember to use the
errout
stream if you want pretty colors!
Finally, we have some technical implementation issues to worry about.
- How will we implement a symbol table?
We will need a mapping from variable names to values for the symbol table. The variable names will be strings of course, and the values will be ints. Fortunately the
map
class from the C++ Standard Template Library (STL) provides this for us to use! It's kind of like an array, except we can use anything we want for the indices (in this case, it will be strings)! Check out this example:#include <string> #include <iostream> #include <map> using namespace std; int main() { map<string,int> T; string s; cout << "Enter strings, followed by \"done\"." << endl; for(int i = 1; cin >> s && s != "done"; ++i) { T[s] = i; // create map entry with key s and value i } cout << endl; cout << "What would you like to search for? "; cin >> s; if (T.count(s) == 0) { cout << "String " << s << " was not entered." << endl; } else { cout << "String " << s << " was the " << T[s] << "th string entered." << endl; } return 0; }
A sample run of this program:
roche@ubuntu$
./mapexample
Enter strings, followed by "done".
foo
bar
cat
star
done
What would you like to search for?
cat
String cat was the 3th string entered.
Note that you must
#include <map>
to use maps. We can simply declare our symbol table as a map near the other global variables in the initial declarations section of the bison file. (Look for the declaration forbool keepgoing
.)The way the example above works is that
T
is initially empty, but as soon as you index it with something, likeT[string("the")]
, the mere act of indexing creates an entry with key"the"
. One thing about this is that you can't really test to see whether there's already an entry for"the"
this way. The way you make that test is:if (T.count(string("the")) == 0) ...
. - How will we pass names as semantic values?
The natural way to pass names around as semantic values is
to add a field of type
string
(i.e. C++-style string) in the%union
statment, but that doesn't work. Bison is unable to handle any kind of object there that has a constructor. So ... we will pass around C-style strings instead. So you'll add something like%union { int val; char sym; char *str; };
to your bison file so that you can handle semantic values that are C-style strings. One pitfall: in your flex file,yytext
is a C-style string giving the text that matched the token's pattern. However, you can't just assign that toyylval.str
, because flex may use yytext's memory for later scanning. Instead you need to make a copy ofyylval.str
. Thestrdup
function from C's string library (that'd be#include <cstring>
in C++) does that for you. You can check it out withman strdup
. This approach of passing C-style strings as semantic values is fine for this lab, but it has some long term issues, since we may "lose" strings before we have a time to free the memory they use. - Teaching bison about char* semantic values. In your action rules, you've already used things like $1, $2, and $$. Did you notice that seemed to magically do the right thing even though some of the values were integers, and some were characters? No magic really -- the type of each symbol (terminal or non-terminal) depends on what is defined with the %token keywords, just under the union declaration. So.... you'll need to do the right thing with a %token declaration to deal with the 'str' part of the union that you just created above.
Exercises
-
Add a symbol table and variables as described above to your
calculator program. Then you should be able to do something like:
This is quite an undertaking! As a review of all that is written above, here are the steps you will need to take. If you stop and test along the way (when possible), it'll make your life much easier.roche@ubuntu$
./calc
>
x := 7;
>
y := 93;
>
# Now we compute the sum x + ... + y
>
y*(y + 1)/2 - (x-1)*x/2;
4350
>
# Now in binary!
>
y*(y + 1)/2 - (x-1)*x/2;b
1000011111110
- Define two new token types
NAME
andASN
in the bison file. - Make two new lines in the flex file for your new tokens. Don't worry about the semantic value yet; just return the token type.
- Make two new grammar rules in the bison file. The first is a
production of
stmt
as shown above. The second is a new production offactor
that's just a variable NAME. You won't be able to write the code for these grammar rules just yet, so for now just have them print out a helpful message. - Make your symbol table. This will be declared with a line like
map<string, int> ST;
in the top part of your bison file, near the declaration forkeepgoing
. - Now write the code for your grammar rules. For now, you won't be
able to get the actual variable name strings! So I would just
pretend like every variable is named "x" or something like that.
You will have to assign to
T[name]
whenever a variable is assigned, and do acount
whenever a variable is accessed (this would be in the rule forfactor
). - OK, time to pass the actual variable names from the scanner to
the parser. You will need to add a new line to the
%union
part of the bison file. Then open the scanner file and use thestrdup
function as discussed above to actually assign toyylval
. - Now just go back and modify the code for the grammar rule that does assignment to use the actual variable name (semantic value) instead of "x" or whatever you used.
- Test, test test!
- Define two new token types
6 Last part: A few more features
Exercises
-
Allow single-line strings as comments, which should be ignored.
If you use double-quotes like
"a comment"
, then we should be able to do things like:roche@ubuntu$
./calc
>
5 + 3; "A comment"
8
>
5 + "Another comment" 3;
8
-
Make variable names case-insensitive. So for example we can do
roche@ubuntu$
./calc
>
var := 10;
>
VAR;
10
>
vAR := 11;
>
var;
11
-
OPTIONAL:
Add in support for two new math operators: exponentiation usng
^
and remainder (i.e., mod) using%
. -
OPTIONAL:
Allow numbers to be entered in scientific notation using the letter
"e". Of course these will still be
int
s, so still no decimal points and the exponent can't be very big. This should not require any new grammar rules. For instance,roche@ubuntu$
./calc
>
21e2;
2100
>
5e6;
5000000
- OPTIONAL: Any other feature that you choose! (Make sure it doesn't cause something else to no longer work though.)