Lex.html

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">
<!-- saved from url=(0049)http://www.cybercom.net/~zbrad/DotNet/Lex/Lex.htm -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title></title>
<style type="text/css"></style></head>

<body>
<h1></h1>

<h2 align="center">CsLex:<br>A lexical analyzer generator for 
C#<sup><small>(TM)</small></sup><br></h2>
<p align="center"><strong>Brad Merrill<br>
</strong></p><p align="center"><strong>Microsoft
<br></strong></p>
<p align="center">Version 1.0, 20-Sep-1999</p>
<p align="center">Manual revision 24-Sep-1999</p>
<hr>


<h1><a name="SECTION1">1. Introduction</a></h1>
<p>
A lexical analyzer breaks an input stream of characters into tokens.
Writing lexical analyzers by hand can be a tedious process, so software tools have been developed to ease this task.

</p><p>Perhaps the best known such utility is the original C-based Lex.
Lex is a lexical analyzer generator for the UNIX operating system,
targeted to the C programming language. 
</p><p>
Lex takes a specially-formatted specification file containing the details of a 
lexical analyzer. This tool then creates a C source file for the associated 
table-driven lexer. 
</p><p>
The CsLex utility is based upon the Lex lexical analyzer generator model.

CsLex takes a specification file similar to that accepted by Lex,
then creates a C# source file for the corresponding lexical analyzer.
</p><p>
CsLex is loosely based on the JLex tool, which was based on the Lex tool.
This was a significant rewrite, so consequently any
errors are solely the responsibility of the most recent author.
See the credits section for more info.

</p><h1>CsLex Specifications</h1>
<p>
A CsLex input file is organized into three sections, separated by 
double-percent directives (``%%''). A proper CsLex specification has the 
following format.
<br>
<i>user code</i>
<br>%%
<br>
<i>CsLex directives</i>
<br>
%%<br>
<i>regular expression rules</i><br>
The ``%%'' directives distinguish sections of the input file and
must be placed at the beginning of 
their line. The remainder of the line containing the ``%%'' directives may be 
discarded and should not be used to house additional declarations or code. 
</p><p>
The user code section - the first section of the specification file - is 
copied directly into the resulting output file. This area of the specification 
provides space for the implementation of utility classes or return types. 
</p><p>
The CsLex directives section is the second part of the input file. Here, 
macros definitions are given and state names are declared. 
</p><p>
The third section contains the rules of lexical analysis, each of which 
consists of three parts: an optional state list, a regular expression, and an 
action. 
</p><h2>User Code</h2>
<p>
User code precedes the first double-percent directive (``%%'). This code is 
copied verbatim into the lexical analyzer source file that CsLex
outputs, at the top of the file. Therefore, if the lexer source
file needs to begin with a package declaration or with the
importation of an external class, the user code section should
begin with the corresponding declaration. This declaration will
then be copied onto the top of the generated source file.  
</p><h2>CsLex Directives</h2>

The CsLex directive section begins after the first ``%%'' and continues until 
the second ``%%'' delimiter. Each CsLex directive should be
contained on a single line and should begin that line. 

<h3>Internal Code to Lexical Analyzer Class</h3>
<p>
The <i>%{...%}</i> directive allows the user to write C# code to be copied 
into the lexical analyzer class. This directive is used as follows.
<br>
<i>%{ </i><br>
<i>&lt;code&gt; </i><br>
<i>%} </i><br>

To be properly recognized, the <i>%{ </i>and <i>%} </i>should
each be situated at the beginning of a line. The specified C#
code in <i>&lt;code&gt;</i> will be then copied into the lexical
analyzer class created by CsLex.<br>

<i>class Yylex { </i><br>
<i>... &lt;code&gt; ... </i><br>
<i>} </i><br>

This permits the declaration of variables and functions 
internal to the generated lexical analyzer class. Variable names
beginning with <i>yy</i> should be avoided, as these are
reserved for use by the generated lexical analyzer class. 

</p><h3>Initialization Code for Lexical Analyzer Class</h3>
The <i>%init{ ... %init}</i> directive allows the user to write C# code to 
be copied into the constructor for the lexical analyzer class.<br>

<i>%init{ </i><br>
<i>&lt;code&gt;</i><br>
<i>%init} </i><br>
The <i>%init{</i> and <i>%init}</i> directives should be
situated at the beginning of a line. The specified C# code in
<i>&lt;code&gt;</i> will be then copied into the lexical
analyzer class constructor.<br> 

<i>class Yylex { </i><br>
<i>Yylex () { </i><br>
<i>... &lt;code&gt; ... </i><br>
<i>} </i><br>
<i>} </i><br>

This directive permits one-time initializations of the lexical
analyzer class from inside its constructor. Variable names
beginning with <i>yy</i> should be avoided, as these are
reserved for use by the generated lexical analyzer class.

<h3>End-of-File Code for Lexical Analyzer Class</h3>

The <i>%eof{ ... %eof}</i> directive allows the user to write C# code to be 
copied into the lexical analyzer class for execution after the end-of-file is 
reached.<br>

<i>%eof{ </i><br>
<i>&lt;code&gt;</i><br>
<i>%eof} </i><br>

The <i>%eof{</i> and <i>%eof}</i> directives should be situated
at the beginning of a line. The specified C# code in
<i>&lt;code&gt;</i> will be executed at most once, and
immediately after the end-of-file is reached for the input file
the lexical analyzer class is processing.  

<h3>Macro Definitions</h3>

Macro definitions are given in the CsLex directives section of
the specification. Each macro definition is contained on a
single line and consists of a macro name followed by an equal
sign (=), then by its associated definition. The format can
therefore be summarized as follows.
<br> 
<i>&lt;name&gt;</i> = <i>&lt;definition&gt;</i><br>

Non-newline white space, e.g. blanks and tabs, is optional
between the macro name and the equal sign and between the equal
sign and the macro definition. Each macro definition should be
contained on a single line.  

<p>
Macro names should be valid identifiers, e.g. sequences of
letters, digits, and underscores beginning with a letter or
underscore.

</p><p>
Macro definitions should be valid regular expressions, the
details of which are described in another section below.

</p><p>
Macro definitions can contain other macro expansions, in the standard<br>

<i>{&lt;name&gt;} </i>format for macros within regular
expressions.  However, the user should note that these
expressions are macros - not functions or nonterminals - so
mutually recursive constructs using macros are
illegal. Therefore, cycles in macro definitions will have
unpredictable results.

</p><h3>State Declarations</h3>

Lexical states are used to control when certain regular
expressions are matched. These are declared in the CsLex
directives in the following way.<br> 

<i>%state </i>state[0][<i>, state[1], state[2], ...</i>]<br>

Each declaration of a series of lexical states should be
contained on a single line. Multiple declarations can be
included in the same CsLex specification, so the declaration of
many states can be broken into many declarations over multiple
lines.

<p>
State names should be valid identifiers, e.g. sequences of
letters, digits, and underscores beginning with a letter or
underscore.

</p><p>
A single lexical state is implicitly declared by CsLex. This
state is called <i>YYINITIAL</i>, and the generated lexer begins
lexical analysis in this state.

</p><p>
Rules of lexical analysis begin with an optional state list. If
a state list is given, the lexical rule is matched only when the
lexical analyzer is in one of the specified states. If a state
list is not given, the lexical rule is matched when the lexical
analyzer is in any state.

</p><p>
If a CsLex specification does not make use of states, by neither
declaring states nor preceding lexical rules with state lists,
the resulting lexer will remain in state <i>YYINITIAL</i>
throughout execution. Since lexical rules are not prefaced by
state lists, these rules are matched in all existing states,
including the implicitly declared state
<i>YYINITIAL</i>. Therefore, everything works as expected if
states are not used at all.  

</p><p>
States are declared as constant integers within the generated
lexical analyzer class. The constant integer declared for a
declared state has the same name as that state. The user should
be careful to avoid name conflict between state names and
variables declared in the action portion of rules or elsewhere
within the lexical analyzer class. A convenient convention would
be to declare state names in all capitals, as a reminder that
these identifiers effectively become constants.

</p><h3>Character Counting</h3>

Character counting is turned off by default, but can be
activated with the <i>%char</i> directive.<br>

<i>%char</i><br>

The zero-based character index of the first character in the
matched region of text is then placed in the integer variable
<i>yychar</i>.

<h3>Line Counting</h3>

Line counting is turned off by default, but can be activated
with the <i>%line</i> directive.<br>

<i>%line</i><br>

The zero-based line index at the beginning of the matched region
of text is then placed in the integer variable <i>yyline</i>.

<h3>Lexical Analyzer Component Titles</h3>

The following directives can be used to change the name of the
generated lexical analyzer class, the namespace, the tokenizing function,
and the token return type.

<p>
To change the name of the lexical analyzers namespace
from <i>YyNameSpace</i>,
use the <i>%namespace</i> directive.<br> 

<i>%namespace &lt;name&gt;</i><br>

To change the name of the lexical
analyzer class from <i>Yylex</i>, use the <i>%class</i>
directive.<br> 

<i>%class &lt;name&gt;</i><br>

</p><p>
To change the name of the tokenizing function from <i>yylex</i>,
use the <i>%function</i> directive.<br> 

<i>%function &lt;name&gt;</i><br>

</p><p>
To change the name of the return type from the tokenizing
function from <i>Yytoken</i>, use the <i>%type</i>
directive.<br>

<i>%type &lt;name&gt;</i><br>

</p><p>
If the default names are not altering using these directives,
the tokenizing function is envoked with a call to
<i>Yylex.yylex()</i>, which returns the <i>Ytoken</i> type.

</p><p>
To avoid scoping conflicts, names beginning with <i>yy</i> are
normally reserved for lexical analyzer internal functions and
variables.

</p><h3>Default Token Type</h3>

To return an integer type for the return type for the tokenizing function
(and therefore the token type), use the <i>%intwrap</i>
directive.<br> 

<i>%intwrap</i><br>

Under default settings, <i>Yytoken</i> is the return type of the
tokenizing function<br> 

<i>Yylex.yylex()</i>, as in the following code fragment.<br>
<i>class Yylex { ... </i><br>
<i>public Yytoken yylex () {</i><br>
<i>... } </i><br>

The <i>%intwrap</i> directive replaces the previous code with a
revised declaration, in which the token type has been changed to
integer boxed Object.<br>

<i>class Yylex { ... </i><br>
<i>public object yylex () {</i><br>
<i>... } </i><br>

This declaration allows lexical actions to return wrapped
integer codes, as in the following code fragment from a
hypothetical lexical action.<br>

<i>{ ...</i><br>
<i>return ((object) i); </i><br>
<i>... } </i>

<p>
Notice that the effect of <i>%intwrap</i> directive can be
equivalently accomplished using the <i>%type</i> directive, as
follows.<br>

<i>%type object</i><br>

This manually changes the name of the return type from
<i>Yylex.yylex()</i> to<br>

<i>object</i>. 


</p><h3>YYEOF on End-of-File</h3>

<p>
The <i>%yyeof</i> directive causes the constant integer
<i>Yylex.YYEOF</i> to be declared and returned upon
end-of-file.<br>

<i>%yyeof</i><br>

This constant integer is discussed in more detail in a previous
section. Note also that <i>Yylex.YYEOF</i> is a <i>int</i>, so
the user should make sure that this return value is compatible
with the return type of <i>Yylex.yylex()</i>.

</p><h3>Character Sets</h3>

The default settings support an alphabet of character codes
between 0 and 127 inclusive. If the generated lexical analyzer
receives an input character code that falls outside of these
bounds, the lexer may fail.

<p>
The <i>%full</i> directive can be used to extend this alphabet
to include all 8-bit values.<br>

<i>%full</i><br>

If the <i>%full</i> directive is given, CsLex will generate a
lexical analyzer that supports an alphabet of character codes
between 0 and 255 inclusive.

</p><h3>Character Format To and From File</h3>

Under the status quo, CsLex and the lexical analyzer it generates
read from and write to Ascii text files, with byte sized
characters. However, to support further extensions on the CsLex
tool, all internal processing of characters is done using the
16-bit C# character type, although the full range of 16-bit
values is not supported.

<h3>Specifying the Return Value on End-of-File</h3>

The <i>%eofval{ ... %eofval}</i> directive specifies the return
value on end-of-file. This directive allows the user to write
C# code to be copied into the lexical analyzer tokenizing
function <i>Yylex.yylex()</i> for execution when the end-of-file
is reached. This code must return a value compatible with the
type of the tokenizing function <i>Yylex.yylex()</i>.<br>

<i>%eofval{ </i><br>
<i>&lt;code&gt;</i><br>
<i>%eofval} </i><br>

The specified C# code in <i>&lt;code&gt;</i> determines the
return value of <i>Yylex.yylex()</i> when the end-of-file is
reached for the input file the lexical analyzer class is
processing. This will also be the value returned by
<i>Yylex.yylex()</i> each additional time this function is
called after end-of-file is initially reached, so
<i>&lt;code&gt;</i> may be executed more than once. Finally, the
<i>%eofval{</i> and <i>%eofval}</i> directives should be
situated at the beginning of a line.  

<p>
An example of usage is given below. Suppose the return value
desired on end-of-file is <i>(new token(sym.EOF))</i> rather
than the default value <i>null</i>. The user adds the following
declaration to the specification file.<br>

<i>%eofval{ </i><br>
<i>return (new token(sym.EOF)); </i><br>
<i>%eofval} </i><br>

The code is then copied into <i>Yylex.yylex()</i> into the
appropriate place.<br>

<i>public Yytoken yylex () { ... </i><br>
<i>return (new token(sym.EOF)); </i><br>
<i>... } </i><br>

The value returned by <i>Yylex.yylex()</i> upon end-of-file and
from that point onward is now <i>(new token(sym.EOF))</i>.

</p><h3>Specifying an interface to implement</h3>

CsLex allows the user to specify an interface which the
<i>Yylex</i> class will implement. By adding the following
declaration to the input file:<br>

<i>%implements &lt;classname&gt;</i><br>

the user specifies that Yylex will implement
<i>classname</i>. The generated parser class declaration will
look like:<br>

<tt>class Yylex : <i>classname</i> { ... </tt>
<p>
</p><h3>Making the Generated Class Public</h3>

The <i>%public</i> directive causes the lexical analyzer class
generated by CsLex to be a public class.<br>

<i>%public</i><br>

The default behavior adds no access specifier to the generated
class, resulting in the class being visible only from the
current package.

<h2>Regular Expression Rules</h2>

The third part of the CsLex specification consists of a series of
rules for breaking the input stream into tokens. These rules
specify regular expressions, then associate these expressions
with actions consisting of C# source code.

<p>
The rules have three distinct parts: the optional state list,
the regular expression, and the associated action. This format
is represented as follows.<br>

[<i>&lt;states&gt;</i>] <i>&lt;expression&gt; { &lt;action&gt; }</i><br>

Each part of the rule is discussed in a section below.

</p><p>
If more than one rule matches strings from its input, the
generated lexer resolves conflicts between rules by greedily
choosing the rule that matches the longest string. If more than
one rule matches strings of the same length, the lexer will
choose the rule that is given first in the CsLex
specification. Therefore, rules appearing earlier in the
specification are given a higher priority by the generated
lexer.

</p><p>
The rules given in a CsLex specification should match all
possible input. If the generated lexical analyzer receives input
that does not match any of its rules, an error will be raised.

</p><p>
Therefore, all input should be matched by at least one
rule. This can be guaranteed by placing the following rule at
the bottom of a CsLex specification:<br>

<i>. { Console.WriteLine("Unmatched input: " + yytext()); }</i><br>

The dot (.), as described below, will match any input except for
the newline.

</p><h3>Lexical States</h3>

An optional lexical state list preceeds each rule. This list
should be in the following form:<br>

<i>&lt;</i>state[0][<i>, state[1], state[2], ...</i>]<i>&gt;</i><br>

The outer set of brackets ([]) indicate that multiple states are
optional. The greater than (&lt;) and less than (&gt;) symbols
represent themselves and should surround the state list,
preceding the regular expression. The state list specifies under
which initial states the rule can be matched.

<p>
For instance, if <i>yylex()</i> is called with the lexer at
state <i>A</i>, the lexer will attempt to match the input only
against those rules that have <i>A</i> in their state list.

</p><p>
If no state list is specified for a given rule, the rule is
matched against in all lexical states.

</p><h3>Regular Expressions</h3>

Regular expressions should not contain any white space, as white
space is interpreted as the end of the current regular
expression. There is one exception; if (non-newline) white space
characters appear from within double quotes, these characters
are taken to represent themselves. For instance, `` '' is
interpreted as a blank space.

<p>
The alphabet for CsLex is the Ascii character set, meaning
character codes between 0 and 127 inclusive.

</p><p>
The following characters are metacharacters, with special
meanings in CsLex regular expressions.<br>

</p><pre><h4>? * + | ( ) ^ $ / ; . = &lt; &gt; [ ] { } " \</h4></pre><br>

Otherwise, individual characters stand for themselves.

<p>
<i>ef</i> Consecutive regular expressions represents their concatenation. 
</p><p>
<i>e</i>|<i>f</i> The vertical bar (|) represents an option
between the regular expressions that surround it, so matches
either expression <i>e</i> or <i>f</i>.  

</p><p>
The following escape sequences are recognized and expanded: 
<table>
  <tbody>
  <tr>
    <td>\b</td>
    <td>Backspace</td></tr>
  <tr>
    <td>\n</td>
    <td>newline</td></tr>
  <tr>
    <td>\t</td>
    <td>Tab</td></tr>
  <tr>
    <td>\f</td>
    <td>Formfeed</td></tr>
  <tr>
    <td>\r</td>
    <td>Carriage return</td></tr>
  <tr>
    <td>\<i>ddd</i></td>
    <td>The character code corresponding to the number formed by three octal 
      digits <i>ddd</i></td></tr>
  <tr>
    <td>\x<i>dd</i></td>
    <td>The character code corresponding to the number formed by
	two hexadecimal digits <i>dd</i></td></tr> 
  <tr>
    <td>\u<i>dddd</i></td>
    <td>
	The Unicode character code corresponding to the number
	formed by four 	hexidecimal digits
	<i>dddd</i>. <strong>The support of Unicode escape
	sequences of this type is unimplemented.</strong>
    </td>
  </tr>
  <tr>
    <td>\^<i>C</i></td>
    <td>Control character</td></tr>
  <tr>
    <td>\<i>c</i></td>
    <td>A backslash followed by any other character <i>c</i>
	matches itself
    </td>
  </tr>
  </tbody>
</table>

<table>
  <tbody><tr>
  <th>Symbol</th>
  <th>Meaning</th>
  </tr>
  <tr>
    <td> $ </td>
    <td> The dollar sign ($) denotes the end of a line. If
	the dollar sign ends a regular expression, the
	expression is matched only at the end of a
	line. </td>
  </tr>
  <tr>
    <td> . </td>
    <td> The dot (.) matches any character except the newline,
	so this expression is equivalent to [^\n]. </td>
  </tr>
  <tr>
    <td> "..." </td>
    <td> Metacharacters lose their meaning within double quotes
	and represent themselves. The sequence <code>\"</code>
	(which represents the single character <code>"</code>)
	is the only exception. </td>
  </tr>
  <tr>
    <td> <i>{name}</i> </td>
    <td> Curly braces denote a macro expansion, with <i>name</i>
	the declared name of the associated macro. </td>
  </tr>
  <tr>
    <td> * </td>
    <td> The star (*) represents Kleene closure and matches zero
	or more repetitions of the preceding regular
	expression. </td>
  </tr>
  <tr>
    <td> + </td>
    <td> The plus (+) matches one or more repetitions of the
	preceding regular expression, so <i>e</i>+ is equivalent
	to <i>ee</i>*. </td>
  </tr>
  <tr>
    <td> ? </td>
    <td> The question mark (?) matches zero or one repetitions
	of the preceding regular expression. </td>
  </tr>
  <tr>
    <td> (...) </td>
    <td> Parentheses are used for grouping within regular
	expressions. </td>
  </tr>
  <tr valign="top">
    <td> [...] </td>
    <td> Square backets denote a class of characters and match
	any one character enclosed in the backets. If the first
	character following the left bracket ([) is the up arrow
	(^), the set is negated and the expression matches any
	character except those enclosed in the
	backets. Different metacharacter rules hold inside the
	backets, with the following expressions having special
	meanings:  
	<table>
	  <tbody><tr>
	    <td><i>{name}</i></td>
	    <td>Macro expansion</td>
	  </tr>
	  <tr>
	    <td><i>a</i> - <i>b</i></td>
	    <td>Range of character codes from <i>a</i> to
		<i>b</i> to be included in character set</td> 
	  </tr>
	  <tr>
	    <td>"..."</td>
	    <td>All metacharacters within double quotes lose
		their special meanings. The sequence
		<code>\"</code> (which represents the single
		character <code>"</code>) is the only
		exception.</td>
	  </tr>
	  <tr>
	    <td>\</td>
	    <td>Metacharacter following backslash(\) loses its
		special meaning</td>
	  </tr>
	  <tr>
	  </tr>
        </tbody></table>

	For example, [a-z] matches any lower-case letter, [^0-9]
	matches anything except a digit, and [0-9a-fA-F] matches
	any hexadecimal digit. Inside character class brackets,
	a metacharacter following a backslash loses its special
	meaning. Therefore, [\-\\] matches a dash or a
	backslash. Likewise ["A-Z"] matches one of the three
	characters A, dash, or Z. Leading and trailing dashes in
	a character class also lose their special meanings, so
	[+-] and [-+] do what you would expect them to (ie,
	match only '+' and '-').
    </td>
  </tr>
</tbody></table>
	
</p><h3>Associated Actions</h3>

The action associated with a lexical rule consists of C# code
enclosed inside block-delimiting curly braces.<br> 

<i>{ action; return null; } </i><br>

The C# code <i>action</i> is copied, as given, into the
state-driven lexical analyzer method produced by Lex.

<p>
All curly braces contained in <i>action</i> not part of strings
or comments should be balanced.

</p><h4>Actions and Recursion:</h4>

If a null return value is returned from an action, the lexical
analyzer will loop, searching for the next match from the input
stream and returning the value associated with that match.

<p>
The lexical analyzer can be made to recur explicitly with a call
to <i>yylex()</i>, as in the following code fragment.<br> 

<i>{ ...</i> <br>
<i>return yylex();</i> <br>
<i>... } </i><br>

This code fragment causes the lexical analyzer to recur,
searching for the next match in the input and returning the
value associated with that match. The same effect can be had,
however, by simply not returning from a given action. This
results in the lexer searching for the next match, without the
additional overhead of recursion.

</p><p>
The preceding code fragment is an example of tail recursion,
since the recursive call comes at the end of the calling
function's execution. The following code fragment is an example
of a recursive call that is not tail recursive.<br> 

<i>{ ...</i> <br>
<i>next = yylex();</i> <br>
<i>return null;</i><br>
<i>... } </i><br>

Recursive actions that are not tail-recursive work in the
expected way, except that variables such as <i>yyline</i> and
<i>yychar</i> may be changed during recursion.

</p><h4>State Transitions:</h4>

If lexical states are declared in the CsLex directives section,
transitions on these states can be declared within the regular
expression actions. State transitions are made by the following
function call.<br>

<i>yybegin(state);</i><br>

The void function <i>yybegin()</i> is passed the state name
<i>state</i> and effects a transition to this lexical state.

<p>
The state <i>state</i> must be declared within the CsLex
directives section, or this call will result in a compiler error
in the generated source file. The one exception to this
declaration requirement is state <i>YYINITIAL</i>, the lexical
state implicitly declared by CsLex. The generated lexer begins
lexical analysis in state <i>YYINITIAL</i> and remains in this
state until a transition is made.

</p><h4>Available Lexical Values:</h4>

The following values, internal to the <i>Yylex</i> class, are
available within the action portion of the lexical rules.

<table>
  <tbody>
  <tr>
    <th align="left">Variable or Method</th>
    <th align="left">ActivationDirective</th>
    <th align="left">Description</th>
  </tr>
  <tr>
    <td><i>System.String yytext();</i></td>
    <td>Always active.</td>
    <td>Matched portion of the character input stream.</td>
  </tr>
  <tr>
    <td><i>int yychar;</i></td>
    <td><i>%char</i></td>
    <td>Zero-based character index of the first character in the matched 
      portion of the input stream</td>
  </tr>
  <tr>
    <td><i>int yyline;</i></td>
    <td><i>%line</i></td>
    <td>Zero-based line number of the start of the matched portion of the 
      input stream</td>
  </tr>
  </tbody>
</table>

<h2>Generated Lexical Analyzers</h2>

CsLex will take a properly-formed specification and transform it
into a C# source file for the corresponding lexical analyzer.

<p>
The generated lexical analayzer resides in the class
<i>Yylex</i>. There are two constructors to this class, both
requiring a single argument: the input stream to be
tokenized. The input stream may either be of type
System.IO.StreamReader or System.IO.FileReader.  

</p><p>
The access function to the lexer is <i>Yylex.yylex()</i>, which
returns the next token from the input stream. The return type is
<i>Yytoken</i> and the function is declared as follows.<br>

<i>class Yylex { ... </i><br>
<i>public Yytoken yylex () {</i><br>
<i>... } </i><br>

The user must declare the type of <i>Yytoken</i> and can
accomplish this conveniently in the first section of the CsLex
specification, the user code section. For instance, to make
<i>Yylex.yylex()</i> return the boxed (Int32.Box) integer
wrapper type, the user would enter the following code somewhere
preceding the first ``%%''.<br> 

<i>class Yytoken { } </i><br>

Then, in the lexical actions, wrapped integers would be
returned, in something like this way.<br>

<i>{ ...</i><br>
<i>return ((object) i); </i><br>
<i>... } </i><br>

Likewise, in the user code section, a class could be defined declaring 
constants that correspond to each of the token types.<br>

<i>class TokenCodes { ... </i><br>
<i>public static final STRING = 0; </i><br>
<i>public static final INTEGER = 1; </i><br>
<i>... } </i><br>

Then, in the lexical actions, these token codes could be
returned.<br> 

<i>{ ...</i><br>
<i>return ((object) STRING); </i><br>
<i>... } </i><br>

These are simplified examples; in actual use, one would probably
define a token class containing more information than an integer
code.

</p><p>
These examples begin to illustrate the object-oriented
techniques a user could employ to define an arbitrarily complex
token type to be returned by <i>Yylex.yylex()</i>. In
particular, inheritance permits the user to return more than one
token type. If a distinct token type was needed for strings and
integers, the user could make the following declarations.<br> 

<i>class Yytoken { ... } </i><br>
<i>class IntegerToken extends Yytoken { ... }</i><br>
<i>class StringToken extends Yytoken { ... } </i><br>

Then the user could return both <i>IntegerToken</i> and
<i>StringToken</i> types from the lexical actions.

</p><p>
The names of the lexical analyzer class, the tokening function,
and its return type each may be altered using the CsLex
directives.

</p><h2>Credits</h2>

CsLex is a derivative of the JLex implementation by Elliot
Joel Berk and C. Scott Ananian.
<p>
The design and architecture of CsLex, written in C#, is based on
a melding of the JLex implementation and the Lex functional
specification.  JLex was written for the Java language, and it's
direct antecedent Lex, designed and written for the C language.

</p><p>
CsLex distinguishes itself by incorporating a number of C#
language constructs and services; refer to the design notes for
CsLex for more details on C# specific features incorporated into
the CsLex design. 

</p><hr>
<h4>CsLex COPYRIGHT NOTICE, LICENSE AND DISCLAIMER</h4>
CsLex is Copyright 2000 by Brad Merrill
<p>

Permission to use, copy, modify, and distribute this software
and its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all
copies and that both the copyright notice and this permission
notice and warranty disclaimer appear in supporting
documentation, and that the name of the authors or their
employers not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission. 
</p><p>

The authors and their employers disclaim all warranties with
regard to this software, including all implied warranties of
merchantability and fitness. In no event shall the authors or
their employers be liable for any special, indirect or
consequential damages or any damages whatsoever resulting from
loss of use, data or profits, whether in an action of contract,
negligence or other tortious action, arising out of or in
connection with the use or performance of this software.
</p><p>

C# is a trademark of Microsoft Corp. References to the C#
programming language in relation to CsLex are not meant to imply
that Microsoft endorses this product.
</p><hr>

<h4>JLEX COPYRIGHT NOTICE, LICENSE AND DISCLAIMER</h4>
Copyright 1996-2000 by Elliot Joel Berk and C. Scott Ananian 
<p>

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that the name of the authors or their employers not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission. 
</p><p>

The authors and their employers disclaim all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall the authors or their employers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software. 

</p><p>
Java is a trademark of Sun Microsystems, Inc. References to the Java programming language in relation to JLex are not meant to imply that Sun endorses this product. 


</p><hr>
<address>
<a href="http://drgdna/bmerrill">Brad Merrill</a>
<a href="mailto:bmerrill@microsoft.com">&lt;bmerrill@microsoft.com&gt;</a>
</address>
<!-- hhmts start --> Last modified: Mon Sep 18 11:46:32 Pacific Daylight Time 2000 <!-- hhmts end -->
 
</body></html>