In this section we will tackle the creation of our lexical analyzer, which understands and tokenizes the ptuc language. For this task we will use a tool called flex (as previously mentioned), which is different from jflex and other programs with similar names. The only thing that's remotely compatible with flex is lex, which flex is based on; but since flex is a superset of the features provided by lex, I cannot guarantee that the material and techniques used here will be directly applicable to lex.
A typical flex file is made up of the following sections:
- User code section (optional)
- Definition section (declarations of name definitions and starting conditions)
- Rule section (actual input matching)
- User code section (optional)
The rule section is enclosed within %% markers and the top user code section must be enclosed in %{, %} as shown below (or declared using %top{ ... }).
%{
user code
%}
def. section
%%
rule section
%%
user code
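Before diving into the ptuc lexer itself, here is a minimal, self-contained flex file that follows this exact layout; it simply counts the words on its input (purely illustrative, not part of ptuc -- the file name count.l and the WORD definition are made up for this sketch):
%{
/* a minimal, stand-alone example (not part of ptuc): count the words on stdin */
#include <stdio.h>
int word_count = 0;
%}
%option noyywrap
WORD [a-zA-Z]+
%%
{WORD}   {word_count++;}
.|\n     {/* discard everything else */}
%%
int main(void) {
    yylex();
    printf("words: %d\n", word_count);
    return 0;
}
Running flex count.l and compiling the generated lex.yy.c (e.g. cc lex.yy.c -o count) yields a tiny word counter; the %option noyywrap line just tells flex we won't be chaining multiple input files in this toy example.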
Although optional, the top user code section is quite commonly used, as it is the perfect place for including files, declaring function prototypes for later use and so on -- handy, to say the least. For example, this section in our starting stub looks like the following:
%{
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include "ptucc_parser.tab.h"
#include "cgen.h"
int line_num = 1;
/*
A primitive macro facility:
just one macro is allowed to be defined!
*/
#define MAXMACRO 32
char* mactable[MAXMACRO][2];
int mactable_size = 0;
/* Return 1 on success, 0 on failure (macro table full) */
int set_macro(char* name, char* def);
/* Return def for macro, or NULL if no such macro is defined. */
char* get_macro(char* name);
%}
All these variables can be used in both the definition and rule sections, as we will see later; if they were placed only in the bottom user code section this would not be possible. Keep in mind that all code inside the user code sections is copied as-is to the generated file. Additionally, comments in the rule and definition sections must be indented by at least one whitespace character, otherwise flex treats them as rules; this is illustrated in the following example:
  /* this is a valid comment */
ID [a-zA-Z_][0-9a-zA-Z_]*
/* this is not a valid comment */
The heart of flex lies in regular expressions, as its job is basically to match the input against a specific rule-set we instruct it with. Basic knowledge of them is assumed; if you don't have any, a quick introduction would be the Examples section of Wikipedia. An invaluable tool for evaluating regular expressions in general is regex101, which provides both analysis and testing of regular expressions -- it's really awesome and handy.
First of all, our starting stub already has some sample definitions that perform useful tasks; these are shown below.
ID [a-zA-Z_][0-9a-zA-Z_]*
DIGIT [0-9]
NUMBER {DIGIT}{DIGIT}*
DECIMAL ("."{NUMBER})|({NUMBER}("."{NUMBER}?)?)
REAL {DECIMAL}([eE][+-]?{NUMBER})?
STRING '[^'\n]*'
%x comment
%x macro
In the above segment we can see that the named definitions have the following syntax:
def_name definition
and starting conditions have the following:
%x cond_name
%s cond_name
Named definitions are used to simplify the process of writing rules; so for example we can do the following:
DIGIT [0-9]
NUMBER {DIGIT}{DIGIT}*
The first definition matches exactly one digit in the range [0, 9], so we intuitively labeled that named definition DIGIT. A NUMBER, in turn, consists of one or more such digits, so a regular expression that matches it is the one shown above: {DIGIT}{DIGIT}*. It has to be noted that this expression is equivalent to [0-9][0-9]*, but it's usually much more readable to compose rules out of descriptive named definitions, especially when dealing with more complex rules.
If you don't understand any of the rules, go to regex101 and play with them; you'll gain a lot more by fiddling with them yourself.
There are cases where we want to invoke a custom rule after we match a particular character or sequence; that's not something that can be done using named definitions, so flex supports two types of starting conditions which do exactly that. They can be inclusive or exclusive, indicated by %s and %x respectively. For our intents and purposes we will only use the exclusive type. Conditions are entered using a special action, BEGIN, so we would invoke one like this: BEGIN(comment). Finally, an exclusive (%x) condition means that, as long as another BEGIN action is not invoked, only the rules prefixed with that starting condition will be active -- every other rule is switched off.
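To make the inclusive/exclusive distinction concrete, here is a small sketch (definitions plus rules; illustrative only, with made-up condition names, not ptuc rules):
%s in_str
%x in_cmt
%%
<INITIAL>"'"    {BEGIN(in_str);}
<INITIAL>"(*"   {BEGIN(in_cmt);}
<in_str>"'"     {BEGIN(INITIAL);}
<in_cmt>"*)"    {BEGIN(INITIAL);}
<in_cmt>.|\n    {/* only <in_cmt> rules can fire here */}
[0-9]+          {/* active in INITIAL and inside in_str (inclusive), never inside in_cmt (exclusive) */}
.|\n            {/* discard anything else; also inactive inside in_cmt */}
With %s the unconditioned [0-9]+ rule keeps matching while we are inside in_str, whereas with %x nothing but the <in_cmt> rules can fire until BEGIN(INITIAL) is executed.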
In order to populate the flex rules we have to carefully read our language specification here; I'll only compare the changes against the final file and mention anything noteworthy.
flex is tasked to match the input against our rules; tokens are what we use to let bison know that we matched a flex rule against our current input. Tokens are nothing more than integer values inside an enumeration located in parser_fname.tab.h, which is generated for us automatically by invoking bison with the -d argument on our parser file.
For a token to be valid in both bison and flex it must first be declared as a %token inside bison, like so: %token IDENT. Then bison can automatically generate the token stub by executing bison -d my_parser.y as previously mentioned. The generated file would have the name my_parser.tab.h, which is then included in the flex (.l) file. It has to be noted that not all rules are required to return a token, but most of them should.
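To see the whole round-trip in one place, a bare-bones sketch of the bison side would look like this (token names taken from above; the grammar body is omitted):
/* my_parser.y -- only the token plumbing is shown here */
%token IDENT
%token KW_BEGIN KW_END
%%
program : /* grammar rules would go here */ ;
%%
Running bison -d my_parser.y then emits my_parser.tab.h, and the lexer's %{ ... %} block includes that header so that an action such as return KW_BEGIN; refers to exactly the same integer constant the parser expects.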
Each language has a set of reserved words that perform a specific function and cannot be used for any other purpose (e.g. keywords, operators etc.). The way we handle them is pretty simple, with a slight variation depending on whether our language is case sensitive or insensitive (the insensitive case requires slightly more expressive regular expressions, as sketched a bit further down).
Usually we match a keyword
and return that specific token; this is all done directly in the rules section
as it is shown for some operators below:
"+" {return KW_OP_PLUS;}
"-" {return KW_OP_MINUS;}
"*" {return KW_OP_MUL;}
"/" {return KW_OP_DIV;}
This directly matches the literal characters shown on the left and executes the action shown inside the braces; thus, when for example we see a + sign, we return the KW_OP_PLUS token. In a similar fashion keywords would be matched as follows:
begin {return KW_BEGIN;}
end {return KW_END;}
This again matches the literal strings shown on the left (begin and end) and returns their respective tokens from within the braces. Finally, named definitions can be used as matching rules themselves, like so:
{REAL} {return REAL;}
The above matches the REAL named definition and, if it was indeed matched against our input, returns the REAL token.
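As an aside, if the keywords had to be matched case-insensitively (the variation mentioned earlier), one portable way is to spell out both cases per character -- a sketch, not part of the final ptuc lexer:
[bB][eE][gG][iI][nN]   {return KW_BEGIN;}
[eE][nN][dD]           {return KW_END;}
Alternatively, flex's %option caseless (or the -i flag) makes every pattern in the scanner case-insensitive at once, which is simpler but also affects rules where you may not want it.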
As indicated in the ptuc spec we treat strings with single and double quotes equally; we also allow a limited set of special escape characters (\n, \t, \r, \', \") and no multi-line strings. In our starting stub we had the following named definition for string classification:
STRING '[^'\n]*'
This is extremely inadequate: all it does is handle single-quoted strings, which are naively terminated by either a newline character (\n) or a single quote ('). It handles neither the escape characters nor something like writeString('PI="nice"');, in which we should throw an error, as " must be escaped and cannot appear bare like that; instead this would be written as: writeString('PI=\"nice\"');.
This is accomplished by the following two named definitions:
STRING '(\\[ntr\'\"\\]|[^\'\\"])*'
STR_LIT \"(\\[ntr\'\"\\]|[^\'\\"])*\"
This allows us to capture both types of strings, one class for each. The philosophy behind it is pretty simple: the first part, \\[ntr\'\"\\], is responsible for capturing the allowed escape characters, while the other, [^\'\\"], is responsible for capturing the remaining allowed characters. We also need two classes because strings in single quotes have to be converted into double-quoted strings when emitting C, hence we need a distinct token in order to know when to perform that conversion.
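The conversion itself happens later, on the parser/code-generation side, but to make the idea concrete here is a minimal sketch of what it amounts to (a hypothetical helper, not the actual cgen.h code; a complete version would also need to fix up the \' and \" escapes inside the body):
#include <string.h>

/* sketch: turn a single-quoted ptuc string literal into a double-quoted one
 * by swapping the outer quotes, e.g. 'hello' -> "hello" */
char *single_to_double_quoted(const char *s) {
    size_t len = strlen(s);
    if (len < 2) return NULL;     /* not even a pair of quotes */
    char *out = strdup(s);
    if (out == NULL) return NULL;
    out[0] = '"';                 /* opening quote */
    out[len - 1] = '"';           /* closing quote */
    return out;
}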
Now, as we saw in the specification, we have baked a rudimentary macro facility into ptuc, which is pretty straightforward. Our starting stub already has a working macro substitution framework -- but it's really slow and has a big limitation: its size is fixed at compile time. Let's take a look at how macros are detected and we will deal with that issue later on.
Again we use a starting condition in order to fully process a macro which is placed in the definition section and looks like this:
/* macro expansion starting condition */
%x macro
Now in the rules section we have a regular-expression match that triggers our macro starting condition; that particular rule follows:
@defmacro[ \r\t]+ BEGIN(macro);
This rule literally matches the "@defmacro" string and then expects (and discards) one or more characters from the class [ \r\t]+. The action invokes the exclusive <macro> starting condition in order to process the macro definition. The only rule carrying this starting condition expects an identifier to follow in order to match successfully (otherwise an error is thrown). The rule follows in full.
<macro>{ID} {
/* Store macro name */
char* name = strdup(yytext);
char* def = NULL;
size_t deflen = 0;
char c;
FILE* deff = open_memstream(&def, &deflen);
while((c=input())!='\n') {fputc(c, deff);}
fclose(deff);
if(!set_macro(name, def))
{yyerror("lexer error: failed to define macro '%s'\n", name);}
BEGIN(INITIAL);
};
There are a couple of things that happen here. First, we use open_memstream to open an in-memory file so that we can collect the value of the macro; this is done inside the while loop, where we read characters until we reach the end of the line. The name of the macro is the current value of yytext (as it was matched by the ID named definition).
We then use the set_macro function to insert the (Key, Value) pair into our macro table. Unfortunately the method used in our starting stub is not only slow (we always pay the cost of traversing a list, O(n)) but also limited in size at compile time. The final version uses a hashtable (the one from here) which is both faster (O(1) lookup cost) and more flexible as a data structure. In the end both set_macro and get_macro are wrappers around the respective ht_set and ht_get hashtable functions that attempt to set and get a (Key, Value) pair.
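For reference, the stub's array-based versions essentially boil down to the following (a sketch reconstructed from the declarations shown earlier, so the actual stub may differ in small details):
/* sketch: linear-scan macro table, capacity fixed at MAXMACRO entries */
int set_macro(char* name, char* def) {
    /* overwrite an existing definition, if one is present */
    for (int i = 0; i < mactable_size; i++) {
        if (strcmp(mactable[i][0], name) == 0) {
            free(mactable[i][1]);
            mactable[i][1] = def;
            free(name);
            return 1;
        }
    }
    if (mactable_size >= MAXMACRO)   /* table full: the compile-time limit */
        return 0;
    mactable[mactable_size][0] = name;
    mactable[mactable_size][1] = def;
    mactable_size++;
    return 1;
}

char* get_macro(char* name) {
    /* O(n): every identifier lookup walks the table */
    for (int i = 0; i < mactable_size; i++)
        if (strcmp(mactable[i][0], name) == 0)
            return mactable[i][1];
    return NULL;
}
The final version keeps these same two entry points but forwards them to ht_set and ht_get instead.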
Identifiers thankfully are pretty straightforward in their implementation and the rule is the same as in our starting stub. Let's check it out anyway, in order to see how we perform a macro substitution when needed. The full matching rule for identifiers follows.
{ID} {
char* def = get_macro(yytext);
if(def==NULL) {
yylval.crepr = strdup(yytext);
return IDENT;
}
for(int i=strlen(def); i>0; i--)
{unput(def[i-1]);}
}
There are a couple of things happening here, but the logic behind it is this -- since we perform macro substitution inside our lexer (our parser has no idea of this), we have to check, for every identifier, whether it is a macro whose value needs to be substituted. Let's illustrate this with an example; say we have the following:
@defmacro COOL_DIGIT 10
(* and later on *)
x := COOL_DIGIT;
now x := COOL_DIGIT is an assignment statement, so COOL_DIGIT will be matched as an identifier -- after all, who's to say whether it is another variable or a macro? Thus, when we match an identifier we check whether the value of yytext is the key of a (Key, Value) pair in our macro hashtable. That's accomplished by this:
char* def = get_macro(yytext);
In case def turns out to be NULL, then this is definitely not a macro, so we return that we matched an identifier; in the case that def is not NULL, we have to substitute its value (by pushing it onto our input buffer) in order for it to be read by flex afterwards. That's what the for-loop is for; please also notice that we push the value in reverse order (why?).
Handling numbers is a bit more complex than the named definitions provided in our starting stub; that's mainly because, per our specification, we do not allow the integer part to have leading zeros; with our initial definition, x := 0001 would contain a perfectly legal NUMBER. To alleviate this we create another class called SDIGIT covering the range [1, 9], which is always used for the first digit of a potential integer part. The final result is shown below.
SDIGIT [1-9]
DIGIT [0-9]
SNUMBER (0|{SDIGIT}{DIGIT}*)
NUMBER {DIGIT}{DIGIT}*
DECIMAL ("."{NUMBER})|({SNUMBER}("."{NUMBER}?)?)
REAL {DECIMAL}([eE][+-]?{SNUMBER})?
Please note that we kept both SNUMBER and NUMBER, as we have to allow leading zeros in the fraction part of a floating point number (e.g. .0011, 1.002e+12).
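In the rules section these definitions are then matched and returned just like REAL was earlier; the exact token names below are illustrative (the final lexer's may differ):
{SNUMBER}   {yylval.crepr = strdup(yytext); return POSINT;}
{REAL}      {yylval.crepr = strdup(yytext); return REAL;}
Note that with SNUMBER an input such as 0001 can no longer be matched as a single number, so the parser will end up rejecting the statement instead of silently accepting it.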
Comments are important when reading a source file but not when compiling it -- hence we just discard them. There are a couple of things that we have to take care of, such as incrementing the line counter on multi-line comments and handling some edge cases.
In our starting stub we have an almost complete implementation of how to discard multi-line comments, but nothing is done for single-line ones. Let's look at how multi-line comments are handled first.
First off, in the definition section we have the exclusive starting condition
/* multi-line comment class */
%x comment
And later on deep inside the rule section we have the actual rules that discard the comments.
"(*" BEGIN(comment);
<comment>[^*\n]* /* eat anything that's not a '*' */
<comment>"*"+[^*)\n]* /* eat up '*'s not followed by ')'s */
<comment>\n ++line_num;
<comment>"*"+")" BEGIN(INITIAL);
As you might imagine, the first rule is triggered when the input (current value of yytext) is (*; from then on we can hit any of the four <comment> rules depending on the input. Inspecting the comments it's pretty straightforward to understand what each one does, but do remember that unless we exit the exclusive starting condition (<comment>) no rule other than these four will be tried. Thus, if we never encounter the fourth rule we might as well eat up the whole file (leading to an error in the end).
Using a similar philosophy, single-line comments can be implemented as follows. In the definition section we have to define a new (exclusive) starting condition:
/* single-line comment class */
%x sl_comment
Then we need fewer cases, as we are bounded by the single-line constraint; our hard end limit is the newline (\n) character. The complete implementation is shown below.
/* handle single-line comments */
"//" BEGIN(sl_comment);
<sl_comment>[^\n] /* eat up anything up to newline */
<sl_comment>\n {
++line_num; // increment line number
BEGIN(INITIAL); // go back
}
This is a rough outline of how it looks in the final version (with a couple more additions).
Again, as the specification outlines, apart from basic macro capabilities we also baked some rudimentary module support into ptuc. Our modules only support declarations (of functions, variables etc.) but not direct statements. Modules are handled within flex and are passed to bison as part of the general source; that means bison does not know (at least directly) whether we switched our input buffer in order to parse a module file.
Modules can have their own includes as well, but only up to a certain depth (10 by default); that's done to avoid infinite, circular includes where we would include the same file over and over again. There are a couple of interesting techniques illustrated here, which are the following:
- Multiple input (flex) buffers
- Accurate line tracking (in includes and main source)
- Error tracking with file insight
Let's explain how each one is implemented in our version.
When you want to switch the input to read a file other than the one currently inside the yyin FILE pointer, you should probably use flex's multiple input buffer capabilities. That functionality allows us to switch our source file for reading at will, as outlined in the manual here. If you want something really basic you can probably get away with letting flex manage the buffers for you, but if you want to do anything more sophisticated you'll probably be out of luck with that method.
In our case I wanted to support (somewhat) accurate error reporting and line tracking across files, and to avoid any leaks in case we can't open the included file or hit an error inside it -- so we manage the buffers ourselves. The key function prototypes that we have to learn here are the following:
YY_BUFFER_STATE yy_create_buffer(FILE *f, int size); // creates a yylex buffer.
void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer); // switches to the pointed buffer.
void yy_delete_buffer(YY_BUFFER_STATE buffer); // deletes an allocated buffer.
The outline is pretty simple: once we match the include module rule we open that file and read it in-place. Again, this happens without bison being directly aware of it, although the functions that report errors are. To avoid the infinite, circular include problem (which happens when a module includes itself, hence including itself over and over) we have imposed a soft include depth limit, which in the final version can be changed with the command line argument -d.
Let's first check the rule that starts everything (start condition defined in the definitions and the actual rule in the rule section).
/* include module starting condition */
%x incl_module
/* later on in the rule section */
/* handle module includes */
<incl_module>[ \t]* {/* eat whitespaces */}
<incl_module>[^ \t\n]+{ID} {
if(include_file() == NULL)
{yyerror("could not open include file");}
else {
if(verbose_flag) {
fprintf(stderr, "\n --\n\tincluding module: %s\n --\n",
yybuf_states[yylex_bufidx].fname);
}
}
BEGIN(INITIAL);
}
Here we have two rules that can be matched while satisfying the exclusive starting condition <incl_module>; these serve two different purposes. The first one eats up all the whitespace and tab characters in order to get to the module name, which we expect to match an ID. Thus, upon entering the second rule the value of yytext contains the actual name of the included module. The actual input redirection happens inside the include_file function, which is shown below:
/* include a file */
FILE *
include_file() {
if(yylex_bufidx >= yystack_depth-1) {
yyerror("yylex input buffer stack exhausted, current limit is: %d",
yystack_depth);
return NULL;
}
char *fname = template("%s.ptuc", yytext);
if(fname == NULL) {return NULL;}
/* assign the current include file pointer */
FILE *fptr = fopen(fname, "r");
/* return if we can't open */
if(!fptr) {
yyerror("lexical error: couldn't open %s module", yytext);
free(fname);
return NULL;
}
if(yybuf_states == NULL) {
if((yybuf_states = calloc(yystack_depth,
sizeof(*yybuf_states))) == NULL) {
yyerror("\n -- Error: Could not allocate buffer stack");
free(fname);
return NULL;
}
}
/* now save current state and set-up to switch to the next */
yybuf_states[yylex_bufidx].state = YY_CURRENT_BUFFER;
yylex_bufidx++;
yybuf_states[yylex_bufidx].state = yy_create_buffer(fptr, YY_BUF_SIZE);
yybuf_states[yylex_bufidx].fname = fname;
yybuf_states[yylex_bufidx].incl_lnum = 1;
/* switch the state */
yy_switch_to_buffer(yybuf_states[yylex_bufidx].state);
return fptr;
}
This function is comprised of three main parts, which are the following:
- Check if we can have an additional buffer (bounded by the global variable yystack_depth).
- Try to open the file pointed to by the yytext value after appending the .ptuc extension to it.
- Save the current state, advance the global buffer index (yylex_bufidx), create and populate the new state -- and finally switch to it.
It has to be noted that in order to represent a flex state we only need one variable of type YY_BUFFER_STATE; but since we require a bit of additional information to provide some extra functionality, we created a wrapper structure for it, which is the following:
/* flex input buffers structure */
typedef struct __yybuf_state {
YY_BUFFER_STATE state; // flex buffer
char *fname; // filename we read from
uint32_t incl_lnum; // include file line no. tracker.
} yybuf_state;
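For completeness, the globals that include_file leans on would be declared along these lines (a sketch; exact types and the -d plumbing may differ in the final version):
/* sketch of the include-machinery globals used above */
yybuf_state *yybuf_states = NULL;  /* stack of saved flex buffers, allocated lazily */
int yylex_bufidx = 0;              /* index of the currently active buffer */
int yystack_depth = 10;            /* include depth limit; 10 by default, overridable with -d */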
The other two variables are used to store the name of the file we read from as well as the local line counter for that file. After reaching the end of the switched file the end-of-file rule (<<EOF>>) is matched, so naturally that is the place to swap back to a previous buffer (if any) or terminate our parser. Let's take a look at how we handle <<EOF>> when matched:
<<EOF>> {
/* pop one of the stacked buffers, if any */
if(!pop_delete_buffer()) {
if(yybuf_states)
{free(yybuf_states);}
flex_closure();
return EOF;
}
}
When we match <<EOF>> we check if we have any stacked buffers left; if we do, we pop and delete the current one using the pop_delete_buffer function, which also switches to the previous buffer to be processed; it returns true as long as there was a buffer to pop. If there are no more buffers to process, that means we have read our input in its entirety, so we can terminate our parser and safely return the EOF token -- which is not the same as <<EOF>>: the former is an actual token, while the latter is a special flex pattern matched at the end of the input.
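pop_delete_buffer itself is not shown above, but a minimal sketch of it -- assuming the globals and struct described earlier -- would look roughly like this (the real one may differ in details, e.g. in how it closes the underlying FILE):
/* sketch: tear down the current include buffer (if any) and resume the previous one */
int pop_delete_buffer() {
    if (yybuf_states == NULL || yylex_bufidx == 0)
        return 0;                                    /* main input: nothing left to pop */
    free(yybuf_states[yylex_bufidx].fname);          /* filename was allocated by template() */
    yy_delete_buffer(yybuf_states[yylex_bufidx].state);
    yylex_bufidx--;
    yy_switch_to_buffer(yybuf_states[yylex_bufidx].state);
    return 1;
}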
Since we have the ability to include multiple modules, we have a problem with our starting stub's line counter implementation. Because our modules are read in-place by our lexer, if we just used a single variable to increment the line numbers we would report incorrect offsets, as the end result is effectively one file stitched together from all the inputs. So if we had to include two files, each 20 lines long, the total line counter would end up somewhere within [1, 60], which is wrong: if a problem was found at line 5 inside the second include file, the line reported would be 25 and not 5 (why?). This is fixed by wrapping a separate line counter variable inside our buffer data structure, which serves exactly that purpose.
To keep track of and report accurate line numbers we had to create functions that increment and fetch the correct counter. The functions are the following:
/* increment the line count, depending if we are including a file or not */
void
increment_line_count() {
if(including_file())
{yybuf_states[yylex_bufidx].incl_lnum++;}
else
{line_num++;}
}
/* fetch the line count, depending if we are including a file or not */
uint32_t
fetch_line_count() {
return including_file() ?
yybuf_states[yylex_bufidx].incl_lnum : line_num;
}
A similar problem is faced when we want to report the name of the file in which the problem occurred. For this purpose we again included another variable in our data-structure wrapper that holds the name of the file. The function that returns that value is the following:
/* fetch the currently processed file name */
char *
fetch_incl_name() {
return including_file() ?
yybuf_states[yylex_bufidx].fname : NULL;
}
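Both helpers hinge on including_file, which plausibly boils down to checking whether any buffer has been stacked at all (a sketch; the real check may differ slightly):
/* sketch: we are inside an include when at least one buffer has been pushed */
int including_file() {
    return yybuf_states != NULL && yylex_bufidx > 0;
}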
Discarding insignificant characters is equally important, as they might take up a significant percentage of our input, so we have to take care of that too. Thankfully this is handled almost entirely in our starting stub and only small changes need to be made. We will briefly explain it though.
[ \r\t] {/* skip whitespace */}
\n {increment_line_count();}
<<EOF>> {
/* pop one of the stacked buffers, if any */
if(!pop_delete_buffer()) {
if(yybuf_states)
{free(yybuf_states);}
flex_closure();
return EOF;
}
}
. {yyerror("lexical error: unrecognized literal '%s'\n", yytext);}
The first rule matches and discards all whitespace, carriage return and tab characters. The second rule matches newline characters and increments the line counter. The third rule is responsible for handling the end of the input (<<EOF>>) and finally the . rule matches anything else that was not matched already; this is regarded as an error in our case, hence the call to yyerror.
In this chapter we touched a bit on how to create your own language lexers and explained the techniques I used in order to create my ptuc lexer. Hopefully this was helpful and, of course, I hope you poked around both the starting stub as well as the finished lexer quite a bit. The next part explains how to create your own parsers using bison and is found here.