This is the archived website of SI 413 from the Fall 2012 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.
This lab is due at 0900 next Tuesday, September 25. It should contain
(at least) the two files
calc.ypp
and calc.lpp
,
as well as a subfolder
called tests
with your tests. (Details on the testing
are below.)
See the submit page for more info.
tar xzvf [filename]
from the linux
command line.)
Until further notice, all our labs will consist of writing
interpreters for (at first) simple programming languages. You are
still expected to write two good tests per exercise, submitted in
the folder tests/
.
Each test file will look like:
The input is up here. (Possibly on multiple lines.) ; The expected output is down here. See the line with the semicolon? That's what separates them.
This is mostly the same as our previous labs, except that instead
of both parts of the test case being Scheme code fragments, the top
part is input and the bottom part is output. Again, each test file name
should start with something like ex01
to indicate which exercise
is being tested.
You can use the iotest
command to run a single test
manually and see the results, just like schemetest
worked before.
yylex()
functions for
bison-generated
parsers. Flex and bison are GNU versions of lex and yacc,
which are traditional Unix development tools.
They both have manuals online:
flex manual,
bison manual.
The files calc.lpp
and calc.ypp
in the starter code define a simple calculator
program. It's almost the same simple program we had from
Unit 4 in the bisoncalc
The big difference is that now, instead of
writing our scanner by hand, we're using the standard tool flex
to create it. (In addition, NUM tokens don't contain any +/- signs.)
Look at the Makefile. We start off with the file calc.ypp that defines the grammar and action rules for our calculator. The call
bison -d calc.yppcreates files
calc.tab.cpp
and
calc.tab.hpp
. The first contains the C++ code for
the parser (which is a function called
yyparse()
), the second is a header file that
contains, among
other things, the definitions of the token names NUM, OPA,
etc., and the variable yylval
that we use to pass
the bison code the semantic values of tokens (for instance, the
numerical value). The flex file
calc.lpp
needs those definitions so that the
yylex()
function it defines for bison can pass
back the information it needs to pass back.
The next thing we have is the flex file calc.lpp
.
The call
flex -o lex.yy.cpp calc.lppcreates the file
lex.yy.cpp
that contains, among
other things, the definition of the yylex()
function. Of course we haven't talked about flex, so this file
is new. The basic idea is that you write regular expressions
for the different token types, and actions for each regular
expression. The flex generated scanner code tries to match
characters from the current input stream to these regular
expressions, and when a match is found, it executes the
associated action code. The variable yytext
contains the string (in the C sense, i.e. null-terminated
char
array)
of characters that were matched. When more than one
match is possible, it breaks ties by going with the longest
match - i.e., it follows the "maximal munch" rule.
When there are several "longest" matches, it goes with the regular expression listed first in the flex file. Of course, you should realize that flex is simply creating the same program implementing a finite automaton that we talked about in class. The difference is that we users specify tokens by regular expressions. So flex has to convert the regular expressions into finite automata for each token type, which are then combined into a single finite automaton. This is then converted, automatically, into a program.
make
and confirm that the calc program works
as intended.
(Nothing to turn in or test for this part.)
For the remainder of this lab, you will add features to the simple
calculator program. These features will involve significant additions to
the scanner (calc.lpp
), and some to the parser
(calc.ypp
) as well.
Modify the scanner so that it accepts integers in either decimal or binary. Specifically, a string of only 0's and 1's that ends in the letter "b" (with no whitespace in between) should be interpreted as a binary integer. For example:
$ ./calc > 10 + 3; 13 > 10b + 3; 5
This should only require changing the flex scanner specification
file. The cstdlib
function strtol
will be
useful here (check it out with man strtol
).
To convert a binary integer stored in the C string called
str
, you can write strtol(str, NULL, 2)
.
// Prints n as a signed binary number void printbin(int n) { if (n < 0) { resout << '-'; n = -n; } int bit = 1; while (bit > 0 && bit*2 <= n) bit *= 2; while (bit > 0) { if (bit <= n) { n -= bit; resout << '1'; } else resout << '0'; bit /= 2; } return; }Now this is going to require modifying the bison file
calc.ypp
as well as the flex file. It
should be a simple change if you understand the system,
though:
you need to define a new token (I might call it STOPB,
or something like that), you need to modify your flex
file to recognize the new token, and you need to create
a new rule in the grammar to use the token.
$ ./calc > 10+3*5; 25 > 10+3*5;b 11001 > 10b+3*5;b 10001 > 10b+3*5; 17
/* like this */
are not completely straightforward with flex. Why? Well,
suppose you added the flex rule:
"/*".*"*/" { }This doesn't work because
.*
doesn't capture
\n
's, and the whole point of C-style comments
is to allow multi-line comments. So we make it like this:
"/*"(.|\n)*"*/" { }This doesn't work either, and the problem is more fundamental. Remember that the rule is that the longest match possible is the one flex makes, so in
/* start */ x = 4; /* end */flex thinks the whole string is a comment. In this case, we don't want maximal munch! (We have the same problem with string literals.) So we'll take the easy way out and use another common comment convention: A "#" character starts a comment and the comment extends to the next newline character. Here's an example session:
$ ./calc > 1+2+3+4+5+6+7+8+9+10; # sum 1..10 computed explicitly 55 > 10*(10+1)/2; # sum 1..10 computed by Gauss's closed form 55Note that although the newline is part of the comment, the comment does separate tokens. So
1000# beginning of one million 000# end of one millionis interpreted as 1000 followed by 0, which gives an error. Not a lexical error, though, rather a parse error!
One of the real advantages of using a tool like flex as opposed to writing your scanner by hand is that adding new tokens usually has no effect whatsoever on what you've already done. You simply add a new line to your flex file. When you do this by hand, that's not the case: a new token may change the underlying finite automaton substantially.
Our calculator is getting a bit more interesting. One thing it doesn't allow, though, is variables. So let's add variables. We'll start with just a single variable called an accumulator. Observe that there are two ingredients to supporting a variable:
stmt: SETACC exp STOPto the grammar, along with a new token type
SETACC
. This token should correspond to
some new operator that we choose; let's use a single exclamation point
"!
". Remember that we have to add the token's name to
the bison file and add a corresponding line to the
flex file.exp
.int
somewhere
in your code. Should it be in the (bison) parser code or the
(flex) scanner code?@
, and make a new token called
ACC
. Then we also need to add this into the grammar:
ACC
should be another expansion of the nonterminal
factor
.$ ./calc > ! 5; > @; 5 > ! 3 * 2; > @; 6 > @ + 4; 10
Now let's allow many variables instead of just one. This will involve creating a symbol table to map variable names to values.
We won't require any declarations; having a name on the left-hand side of an assignment will be enough to create it. Syntax-wise, we have some decisions to make:
:=
, =
,
to
as in x to 1
,
and <- as in x <- 1
.
The last one is going to be problematic for us (why?).
We'll use ":=
" for this lab, with a token
name ASN.
[a-zA-Z0-9]+(We will have to make sure it gets added low enough in the flex file so this doesn't interfere with NUMs.)
Semantically, we have a few decisions as well:
stmt: NAME ASN exp STOPwhere NAME and ASN are new tokens.
errout
stream if you want
pretty colors!
Finally, we have some technical implementation issues to worry about.
We will need a mapping from variable names to values
for the symbol table. The variable names will be strings of course,
and the values will be ints. Fortunately the map
class from the C++ Standard Template Library (STL) provides
this for us to use! It's kind of like an array, except we can
use anything we want for the indices (in this case, it will be
strings)! Check out this example:
#include <string> #include <iostream> #include <map> using namespace std; int main() { map<string,int> T; string s; for(int i = 1; cin >> s && s != "done"; ++i) T[s] = i; cout << endl; cin >> s; if (T.find(s) == T.end()) cout << "String " << s << " was not entered." << endl; else cout << "String " << s << " was the " << T[s] << "th string entered." << endl; }
A sample run of this program:
foo bar star cat done cat String cat was the 4th string entered.
Note that you must #include <map>
to
use maps. We can simply declare our symbol table as a map
near the other global variables
in the initial declarations section of the bison file.
(Look for the declaration for bool keepgoing
.)
The way the example above works is that T
is initially empty, but as
soon as you index it with something, like
T[string("the")]
,
the mere act of indexing creates
an entry with key "the"
. One thing about
this is that you can't really test to see whether
there's already an entry for "the"
this
way.
The way you make that test is:
if (T.find(string("the")) == T.end()) ...
.
string
(i.e. C++-style
string) in the %union
statment, but that
doesn't work. Bison is unable to handle any kind of
object there that has a constructor. So ... we will pass
around C-style strings instead. So you'll add something
like
%union { int val; char sym; char *str; };to your bison file so that you can handle semantic values that are C-style strings. One pitfall: in your flex file,
yytext
is a C-style string giving the text
that matched the token's pattern. However, you can't just
assign that to yylval.str
, because flex may
use yytext's memory for later scanning. Instead you need
to make a copy of yylval.str
. The
strdup
function from C's string library
(that'd be #include <cstring>
in C++)
does that for you. You can check it out with man
strdup
. This approach of passing C-style strings
as semantic values is fine for this lab, but it has some
long term issues, since we may "lose" strings before we
have a time to free the memory they use.
$ ./calc > x := 7; > y := 93; > # Now we compute the sum x + ... + y > y*(y + 1)/2 - (x-1)*x/2; 4350 > # Now in binary! > y*(y + 1)/2 - (x-1)*x/2;b 1000011111110This is quite an undertaking! As a review of all that is written above, here are the steps you will need to take. If you stop and test along the way (when possible), it'll make your life much easier.
NAME
and ASN
in the bison file.stmt
as shown above. The second is a new
production of factor
that's just a variable NAME.
You won't be able to write the code for these grammar rules just yet,
so for now just have them print out a helpful message.
map<string, int> ST;in the top part of your bison file, near the declaration for
keepgoing
.
T[name]
whenever a variable
is assigned, and do a find
whenever a variable is
accessed (this would be in the rule for factor
).
%union
part of the bison file. Then open the scanner
file and use the strdup
function as discussed above
to actually assign to yylval
.
"a comment"
, then we
should be able to do things like:
> ./calc 5 + 3; "A comment" 8 5 + "Another comment" 3; 8
> ./calc var := 10; VAR; 10 vAR := 11; var; 11
int
s, so still no decimal
points and the exponent can't be very big. It's up to you to specify the
exact syntax, but this should not require any new grammar rules.
For instance,
> ./calc 21e2; 2100 5e6; 5000000