<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">
<!-- saved from url=(0049)http://www.cybercom.net/~zbrad/DotNet/Lex/Lex.htm -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>CsLex: A lexical analyzer generator for C#</title>
<style type="text/css"></style></head>
<body>
<h2 align="center">CsLex:<br>A lexical analyzer generator for
C#<sup><small>(TM)</small></sup><br></h2>
<p align="center"><strong>Brad Merrill<br>
</strong></p><p align="center"><strong>Microsoft
<br></strong></p>
<p align="center">Version 1.0, 20-Sep-1999</p>
<p align="center">Manual revision 24-Sep-1999</p>
<hr>
<h1><a name="SECTION1">1. Introduction</a></h1>
<p>
A lexical analyzer breaks an input stream of characters into tokens.
Writing lexical analyzers by hand can be a tedious process, so software tools have been developed to ease this task.
</p><p>Perhaps the best known such utility is the original C-based Lex.
Lex is a lexical analyzer generator for the UNIX operating system,
targeted to the C programming language.
</p><p>
Lex takes a specially-formatted specification file containing the details of a
lexical analyzer. This tool then creates a C source file for the associated
table-driven lexer.
</p><p>
The CsLex utility is based upon the Lex lexical analyzer generator model.
CsLex takes a specification file similar to that accepted by Lex,
then creates a C# source file for the corresponding lexical analyzer.
</p><p>
CsLex is loosely based on the JLex tool, which was in turn based on Lex.
CsLex is a significant rewrite, so any
errors are solely the responsibility of the most recent author.
See the credits section for more information.
</p><h1>CsLex Specifications</h1>
<p>
A CsLex input file is organized into three sections, separated by
double-percent directives (``%%''). A proper CsLex specification has the
following format.
<br>
<i>user code</i>
<br>%%
<br>
<i>CsLex directives</i>
<br>
%%<br>
<i>regular expression rules</i><br>
The ``%%'' directives distinguish sections of the input file and
must be placed at the beginning of
their line. The remainder of the line containing the ``%%'' directives may be
discarded and should not be used to house additional declarations or code.
</p><p>
The user code section - the first section of the specification file - is
copied directly into the resulting output file. This area of the specification
provides space for the implementation of utility classes or return types.
</p><p>
The CsLex directives section is the second part of the input file. Here,
macro definitions are given and state names are declared.
</p><p>
The third section contains the rules of lexical analysis, each of which
consists of three parts: an optional state list, a regular expression, and an
action.
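</p>
<p>
As a minimal sketch of this layout, a complete specification might look
like the following. The <i>Yytoken</i> members, the <i>DIGIT</i> macro, and
the rules themselves are illustrative assumptions, not part of CsLex.
</p>
<pre>
using System;

// Hypothetical token class; the user is expected to supply Yytoken.
class Yytoken {
  public string Text;
  public Yytoken(string text) { Text = text; }
}
%%
DIGIT=[0-9]
%%
{DIGIT}+ { return new Yytoken(yytext()); }
(" "|\t|\r|\n)+ { /* skip white space: a null return makes the lexer search for the next match */ return null; }
. { Console.WriteLine("Unmatched input: " + yytext()); return null; }
</pre>
<p>
Everything before the first ``%%'' is user code, the lines between the two
``%%'' delimiters are directives, and the remaining lines are rules.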
</p><h2>User Code</h2>
<p>
User code precedes the first double-percent directive (``%%''). This code is
copied verbatim into the lexical analyzer source file that CsLex
outputs, at the top of the file. Therefore, if the lexer source
file needs to begin with <i>using</i> directives that import
external namespaces, the user code section should
begin with the corresponding directives. These will
then be copied to the top of the generated source file.
</p><h2>CsLex Directives</h2>
The CsLex directive section begins after the first ``%%'' and continues until
the second ``%%'' delimiter. Each CsLex directive should be
contained on a single line and should begin that line.
<h3>Internal Code to Lexical Analyzer Class</h3>
<p>
The <i>%{...%}</i> directive allows the user to write C# code to be copied
into the lexical analyzer class. This directive is used as follows.
<br>
<i>%{ </i><br>
<i>&lt;code&gt; </i><br>
<i>%} </i><br>
To be properly recognized, the <i>%{</i> and <i>%}</i> should
each be situated at the beginning of a line. The specified C#
code in <i>&lt;code&gt;</i> will then be copied into the lexical
analyzer class created by CsLex.<br>
<i>class Yylex { </i><br>
<i>... &lt;code&gt; ... </i><br>
<i>} </i><br>
This permits the declaration of variables and functions
internal to the generated lexical analyzer class. Variable names
beginning with <i>yy</i> should be avoided, as these are
reserved for use by the generated lexical analyzer class.
</p><h3>Initialization Code for Lexical Analyzer Class</h3>
The <i>%init{ ... %init}</i> directive allows the user to write C# code to
be copied into the constructor for the lexical analyzer class.<br>
<i>%init{ </i><br>
<i>&lt;code&gt;</i><br>
<i>%init} </i><br>
The <i>%init{</i> and <i>%init}</i> directives should be
situated at the beginning of a line. The specified C# code in
<i>&lt;code&gt;</i> will then be copied into the lexical
analyzer class constructor.<br>
<i>class Yylex { </i><br>
<i>Yylex () { </i><br>
<i>... &lt;code&gt; ... </i><br>
<i>} </i><br>
<i>} </i><br>
This directive permits one-time initializations of the lexical
analyzer class from inside its constructor. Variable names
beginning with <i>yy</i> should be avoided, as these are
reserved for use by the generated lexical analyzer class.
<h3>End-of-File Code for Lexical Analyzer Class</h3>
The <i>%eof{ ... %eof}</i> directive allows the user to write C# code to be
copied into the lexical analyzer class for execution after the end-of-file is
reached.<br>
<i>%eof{ </i><br>
<i>&lt;code&gt;</i><br>
<i>%eof} </i><br>
The <i>%eof{</i> and <i>%eof}</i> directives should be situated
at the beginning of a line. The specified C# code in
<i>&lt;code&gt;</i> will be executed at most once, and
immediately after the end-of-file is reached for the input file
the lexical analyzer class is processing.
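<p>
The three code directives described so far can appear together in one
directives section. The following fragment is a sketch only; the
<i>tokenCount</i> field and the console message are hypothetical names
chosen for illustration.
</p>
<pre>
%{
  // Copied into the body of the generated class.
  private int tokenCount;
%}
%init{
  // Copied into the constructor of the generated class.
  tokenCount = 0;
%init}
%eof{
  // Executed at most once, after end-of-file is reached.
  System.Console.WriteLine("Tokens seen: " + tokenCount);
%eof}
</pre>
<p>
Rule actions could then increment <i>tokenCount</i>, since action code is
copied into the same class in which the field is declared.
</p>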
<h3>Macro Definitions</h3>
Macro definitions are given in the CsLex directives section of
the specification. Each macro definition is contained on a
single line and consists of a macro name followed by an equal
sign (=), then by its associated definition. The format can
therefore be summarized as follows.
<br>
<i>&lt;name&gt;</i> = <i>&lt;definition&gt;</i><br>
Non-newline white space, e.g. blanks and tabs, is optional
between the macro name and the equal sign and between the equal
sign and the macro definition.
<p>
Macro names should be valid identifiers, e.g. sequences of
letters, digits, and underscores beginning with a letter or
underscore.
</p><p>
Macro definitions should be valid regular expressions, the
details of which are described in another section below.
</p><p>
Macro definitions can contain other macro expansions, in the standard
<i>{&lt;name&gt;}</i> format for macros within regular
expressions. However, the user should note that these
expressions are macros, not functions or nonterminals, so
recursive and mutually recursive constructs using macros are
illegal. Cycles in macro definitions will have
unpredictable results.
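</p>
<p>
For illustration, the following hypothetical definitions build an
identifier pattern out of two simpler macros. The macro names are
assumptions chosen for this sketch.
</p>
<pre>
DIGIT=[0-9]
ALPHA=[A-Za-z_]
IDENT={ALPHA}({ALPHA}|{DIGIT})*
</pre>
<p>
A rule can then refer to the composed macro as <i>{IDENT}</i>.
No definition refers back to itself, directly or indirectly, so the
expansion is free of cycles.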
</p><h3>State Declarations</h3>
Lexical states are used to control when certain regular
expressions are matched. These are declared in the CsLex
directives in the following way.<br>
<i>%state </i>state[0][<i>, state[1], state[2], ...</i>]<br>
Each declaration of a series of lexical states should be
contained on a single line. Multiple declarations can be
included in the same CsLex specification, so the declaration of
many states can be broken into many declarations over multiple
lines.
<p>
State names should be valid identifiers, e.g. sequences of
letters, digits, and underscores beginning with a letter or
underscore.
</p><p>
A single lexical state is implicitly declared by CsLex. This
state is called <i>YYINITIAL</i>, and the generated lexer begins
lexical analysis in this state.
</p><p>
Rules of lexical analysis begin with an optional state list. If
a state list is given, the lexical rule is matched only when the
lexical analyzer is in one of the specified states. If a state
list is not given, the lexical rule is matched when the lexical
analyzer is in any state.
</p><p>
If a CsLex specification does not make use of states, by neither
declaring states nor preceding lexical rules with state lists,
the resulting lexer will remain in state <i>YYINITIAL</i>
throughout execution. Since lexical rules are not prefaced by
state lists, these rules are matched in all existing states,
including the implicitly declared state
<i>YYINITIAL</i>. Therefore, everything works as expected if
states are not used at all.
</p><p>
States are declared as constant integers within the generated
lexical analyzer class. The constant integer declared for a
declared state has the same name as that state. The user should
be careful to avoid name conflicts between state names and
variables declared in the action portion of rules or elsewhere
within the lexical analyzer class. A convenient convention would
be to declare state names in all capitals, as a reminder that
these identifiers effectively become constants.
</p><h3>Character Counting</h3>
Character counting is turned off by default, but can be
activated with the <i>%char</i> directive.<br>
<i>%char</i><br>
The zero-based character index of the first character in the
matched region of text is then placed in the integer variable
<i>yychar</i>.
<h3>Line Counting</h3>
Line counting is turned off by default, but can be activated
with the <i>%line</i> directive.<br>
<i>%line</i><br>
The zero-based line index at the beginning of the matched region
of text is then placed in the integer variable <i>yyline</i>.
<h3>Lexical Analyzer Component Titles</h3>
The following directives can be used to change the name of the
generated lexical analyzer class, the namespace, the tokenizing function,
and the token return type.
<p>
To change the name of the lexical analyzer's namespace
from <i>YyNameSpace</i>,
use the <i>%namespace</i> directive.<br>
<i>%namespace &lt;name&gt;</i><br>
To change the name of the lexical
analyzer class from <i>Yylex</i>, use the <i>%class</i>
directive.<br>
<i>%class &lt;name&gt;</i><br>
</p><p>
To change the name of the tokenizing function from <i>yylex</i>,
use the <i>%function</i> directive.<br>
<i>%function &lt;name&gt;</i><br>
</p><p>
To change the return type of the tokenizing
function from <i>Yytoken</i>, use the <i>%type</i>
directive.<br>
<i>%type &lt;name&gt;</i><br>
</p><p>
If the default names are not altered using these directives,
the tokenizing function is invoked with a call to
<i>Yylex.yylex()</i>, which returns the <i>Yytoken</i> type.
</p><p>
To avoid scoping conflicts, names beginning with <i>yy</i> are
normally reserved for lexical analyzer internal functions and
variables.
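</p>
<p>
As a sketch, the following directives rename all four components. The names
<i>MyCompiler</i>, <i>Scanner</i>, <i>NextToken</i>, and <i>Token</i> are
hypothetical.
</p>
<pre>
%namespace MyCompiler
%class Scanner
%function NextToken
%type Token
</pre>
<p>
With these directives, the tokenizing function would be invoked as
<i>Scanner.NextToken()</i> and would return the <i>Token</i> type, which
the user must still declare.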
</p><h3>Default Token Type</h3>
To make the tokenizing function return a boxed integer,
which then serves as the token type, use the <i>%intwrap</i>
directive.<br>
<i>%intwrap</i><br>
Under default settings, <i>Yytoken</i> is the return type of the
tokenizing function<br>
<i>Yylex.yylex()</i>, as in the following code fragment.<br>
<i>class Yylex { ... </i><br>
<i>public Yytoken yylex () {</i><br>
<i>... } </i><br>
The <i>%intwrap</i> directive replaces the previous code with a
revised declaration, in which the return type has been changed to
<i>object</i>, which can hold a boxed integer.<br>
<i>class Yylex { ... </i><br>
<i>public object yylex () {</i><br>
<i>... } </i><br>
This declaration allows lexical actions to return wrapped
integer codes, as in the following code fragment from a
hypothetical lexical action.<br>
<i>{ ...</i><br>
<i>return ((object) i); </i><br>
<i>... } </i>
<p>
Notice that the effect of the <i>%intwrap</i> directive can be
equivalently accomplished using the <i>%type</i> directive, as
follows.<br>
<i>%type object</i><br>
This manually changes the name of the return type from
<i>Yylex.yylex()</i> to<br>
<i>object</i>.
</p><h3>YYEOF on End-of-File</h3>
<p>
The <i>%yyeof</i> directive causes the constant integer
<i>Yylex.YYEOF</i> to be declared and returned upon
end-of-file.<br>
<i>%yyeof</i><br>
Note that <i>Yylex.YYEOF</i> is an <i>int</i>, so
the user should make sure that this return value is compatible
with the return type of <i>Yylex.yylex()</i>.
</p><h3>Character Sets</h3>
The default settings support an alphabet of character codes
between 0 and 127 inclusive. If the generated lexical analyzer
receives an input character code that falls outside of these
bounds, the lexer may fail.
<p>
The <i>%full</i> directive can be used to extend this alphabet
to include all 8-bit values.<br>
<i>%full</i><br>
If the <i>%full</i> directive is given, CsLex will generate a
lexical analyzer that supports an alphabet of character codes
between 0 and 255 inclusive.
</p><h3>Character Format To and From File</h3>
Currently, CsLex and the lexical analyzer it generates
read from and write to ASCII text files, with byte-sized
characters. However, to support further extensions to the CsLex
tool, all internal processing of characters is done using the
16-bit C# character type, although the full range of 16-bit
values is not supported.
<h3>Specifying the Return Value on End-of-File</h3>
The <i>%eofval{ ... %eofval}</i> directive specifies the return
value on end-of-file. This directive allows the user to write
C# code to be copied into the lexical analyzer tokenizing
function <i>Yylex.yylex()</i> for execution when the end-of-file
is reached. This code must return a value compatible with the
type of the tokenizing function <i>Yylex.yylex()</i>.<br>
<i>%eofval{ </i><br>
<i>&lt;code&gt;</i><br>
<i>%eofval} </i><br>
The specified C# code in <i>&lt;code&gt;</i> determines the
return value of <i>Yylex.yylex()</i> when the end-of-file is
reached for the input file the lexical analyzer class is
processing. This will also be the value returned by
<i>Yylex.yylex()</i> each additional time this function is
called after end-of-file is initially reached, so
<i>&lt;code&gt;</i> may be executed more than once. Finally, the
<i>%eofval{</i> and <i>%eofval}</i> directives should be
situated at the beginning of a line.
<p>
An example of usage is given below. Suppose the return value
desired on end-of-file is <i>(new token(sym.EOF))</i> rather
than the default value <i>null</i>. The user adds the following
declaration to the specification file.<br>
<i>%eofval{ </i><br>
<i>return (new token(sym.EOF)); </i><br>
<i>%eofval} </i><br>
The code is then copied into <i>Yylex.yylex()</i> at the
appropriate place.<br>
<i>public Yytoken yylex () { ... </i><br>
<i>return (new token(sym.EOF)); </i><br>
<i>... } </i><br>
The value returned by <i>Yylex.yylex()</i> upon end-of-file and
from that point onward is now <i>(new token(sym.EOF))</i>.
</p><h3>Specifying an interface to implement</h3>
CsLex allows the user to specify an interface which the
<i>Yylex</i> class will implement. By adding the following
declaration to the input file:<br>
<i>%implements &lt;classname&gt;</i><br>
the user specifies that <i>Yylex</i> will implement
<i>classname</i>. The generated lexer class declaration will
look like:<br>
<tt>class Yylex : <i>classname</i> { ... </tt>
<p>
</p><h3>Making the Generated Class Public</h3>
The <i>%public</i> directive causes the lexical analyzer class
generated by CsLex to be a public class.<br>
<i>%public</i><br>
The default behavior adds no access specifier to the generated
class, resulting in the class being visible only within the
current assembly.
<h2>Regular Expression Rules</h2>
The third part of the CsLex specification consists of a series of
rules for breaking the input stream into tokens. These rules
specify regular expressions, then associate these expressions
with actions consisting of C# source code.
<p>
The rules have three distinct parts: the optional state list,
the regular expression, and the associated action. This format
is represented as follows.<br>
[<i>&lt;states&gt;</i>] <i>&lt;expression&gt; { &lt;action&gt; }</i><br>
Each part of the rule is discussed in a section below.
</p><p>
If more than one rule matches strings from its input, the
generated lexer resolves conflicts between rules by greedily
choosing the rule that matches the longest string. If more than
one rule matches strings of the same length, the lexer will
choose the rule that is given first in the CsLex
specification. Therefore, rules appearing earlier in the
specification are given a higher priority by the generated
lexer.
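</p>
<p>
The following pair of rules is a sketch of both tie-breaking steps. The
<i>Yytoken</i> constructor taking a string is an assumption made for this
example.
</p>
<pre>
"if" { return new Yytoken("IF"); }
[a-z]+ { return new Yytoken("IDENT"); }
</pre>
<p>
On the input <i>iffy</i>, the second rule wins because it matches four
characters while the first matches only two. On the input <i>if</i>
followed by a blank, both rules match two characters, so the first rule
is chosen because it appears earlier in the specification.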
</p><p>
The rules given in a CsLex specification should match all
possible input. If the generated lexical analyzer receives input
that does not match any of its rules, an error will be raised.
</p><p>
Therefore, all input should be matched by at least one
rule. This can be guaranteed by placing the following rule at
the bottom of a CsLex specification:<br>
<i>. { Console.WriteLine("Unmatched input: " + yytext()); }</i><br>
The dot (.), as described below, will match any input except for
the newline.
</p><h3>Lexical States</h3>
An optional lexical state list precedes each rule. This list
should be in the following form:<br>
<i>&lt;</i>state[0][<i>, state[1], state[2], ...</i>]<i>&gt;</i><br>
The square brackets ([]) indicate that additional states are
optional. The less-than (&lt;) and greater-than (&gt;) symbols
represent themselves and should surround the state list,
preceding the regular expression. The state list specifies under
which initial states the rule can be matched.
<p>
For instance, if <i>yylex()</i> is called with the lexer at
state <i>A</i>, the lexer will attempt to match the input only
against those rules that have <i>A</i> in their state list.
</p><p>
If no state list is specified for a given rule, the rule is
matched in all lexical states.
</p><h3>Regular Expressions</h3>
Regular expressions should not contain any white space, as white
space is interpreted as the end of the current regular
expression. There is one exception; if (non-newline) white space
characters appear from within double quotes, these characters
are taken to represent themselves. For instance, <code>" "</code> is
interpreted as a blank space.
<p>
The alphabet for CsLex is the ASCII character set, meaning
character codes between 0 and 127 inclusive.
</p><p>
The following characters are metacharacters, with special
meanings in CsLex regular expressions.<br>
</p><pre><b>? * + | ( ) ^ $ / ; . = &lt; &gt; [ ] { } " \</b></pre>
Otherwise, individual characters stand for themselves.
<p>
<i>ef</i> Consecutive regular expressions represent their concatenation.
</p><p>
<i>e</i>|<i>f</i> The vertical bar (|) represents an option
between the regular expressions that surround it, so matches
either expression <i>e</i> or <i>f</i>.
</p><p>
The following escape sequences are recognized and expanded:
<table>
<tbody>
<tr>
<td>\b</td>
<td>Backspace</td></tr>
<tr>
<td>\n</td>
<td>newline</td></tr>
<tr>
<td>\t</td>
<td>Tab</td></tr>
<tr>
<td>\f</td>
<td>Formfeed</td></tr>
<tr>
<td>\r</td>
<td>Carriage return</td></tr>
<tr>
<td>\<i>ddd</i></td>
<td>The character code corresponding to the number formed by three octal
digits <i>ddd</i></td></tr>
<tr>
<td>\x<i>dd</i></td>
<td>The character code corresponding to the number formed by
two hexadecimal digits <i>dd</i></td></tr>
<tr>
<td>\u<i>dddd</i></td>
<td>
The Unicode character code corresponding to the number
formed by four hexadecimal digits
<i>dddd</i>. <strong>The support of Unicode escape
sequences of this type is unimplemented.</strong>
</td>
</tr>
<tr>
<td>\^<i>C</i></td>
<td>Control character</td></tr>
<tr>
<td>\<i>c</i></td>
<td>A backslash followed by any other character <i>c</i>
matches that character
</td>
</tr>
</tbody>
</table>
<table>
<tbody><tr>
<th>Symbol</th>
<th>Meaning</th>
</tr>
<tr>
<td> $ </td>
<td> The dollar sign ($) denotes the end of a line. If
the dollar sign ends a regular expression, the
expression is matched only at the end of a
line. </td>
</tr>
<tr>
<td> . </td>
<td> The dot (.) matches any character except the newline,
so this expression is equivalent to [^\n]. </td>
</tr>
<tr>
<td> "..." </td>
<td> Metacharacters lose their meaning within double quotes
and represent themselves. The sequence <code>\"</code>
(which represents the single character <code>"</code>)
is the only exception. </td>
</tr>
<tr>
<td> <i>{name}</i> </td>
<td> Curly braces denote a macro expansion, with <i>name</i>
the declared name of the associated macro. </td>
</tr>
<tr>
<td> * </td>
<td> The star (*) represents Kleene closure and matches zero
or more repetitions of the preceding regular
expression. </td>
</tr>
<tr>
<td> + </td>
<td> The plus (+) matches one or more repetitions of the
preceding regular expression, so <i>e</i>+ is equivalent
to <i>ee</i>*. </td>
</tr>
<tr>
<td> ? </td>
<td> The question mark (?) matches zero or one repetitions
of the preceding regular expression. </td>
</tr>
<tr>
<td> (...) </td>
<td> Parentheses are used for grouping within regular
expressions. </td>
</tr>
<tr valign="top">
<td> [...] </td>
<td> Square brackets denote a class of characters and match
any one character enclosed in the brackets. If the first
character following the left bracket ([) is the caret
(^), the set is negated and the expression matches any
character except those enclosed in the
brackets. Different metacharacter rules hold inside the
brackets, with the following expressions having special
meanings:
<table>
<tbody><tr>
<td><i>{name}</i></td>
<td>Macro expansion</td>
</tr>
<tr>
<td><i>a</i> - <i>b</i></td>
<td>Range of character codes from <i>a</i> to
<i>b</i> to be included in character set</td>
</tr>
<tr>
<td>"..."</td>
<td>All metacharacters within double quotes lose
their special meanings. The sequence
<code>\"</code> (which represents the single
character <code>"</code>) is the only
exception.</td>
</tr>
<tr>
<td>\</td>
<td>Metacharacter following backslash(\) loses its
special meaning</td>
</tr>
</tbody></table>
For example, [a-z] matches any lower-case letter, [^0-9]
matches anything except a digit, and [0-9a-fA-F] matches
any hexadecimal digit. Inside character class brackets,
a metacharacter following a backslash loses its special
meaning. Therefore, [\-\\] matches a dash or a
backslash. Likewise ["A-Z"] matches one of the three
characters A, dash, or Z. Leading and trailing dashes in
a character class also lose their special meanings, so
[+-] and [-+] do what you would expect them to (i.e.,
match only '+' and '-').
</td>
</tr>
</tbody></table>
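</p>
<p>
A few hypothetical macro definitions illustrate how these operators
combine. The macro names are assumptions; each pattern uses only the
constructs listed in the tables of this section.
</p>
<pre>
HEXNUM=0[xX][0-9a-fA-F]+
DECIMAL=[0-9]+"."[0-9]+
SIGN=[+-]?
LINE_END=\r?\n
</pre>
<p>
<i>HEXNUM</i> matches strings such as 0x1F, <i>DECIMAL</i> matches
strings such as 3.14 because the quoted dot stands for itself, <i>SIGN</i>
matches an optional plus or minus, and <i>LINE_END</i> matches a newline
with an optional preceding carriage return.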
</p><h3>Associated Actions</h3>
The action associated with a lexical rule consists of C# code
enclosed inside block-delimiting curly braces.<br>
<i>{ action; return null; } </i><br>
The C# code <i>action</i> is copied, as given, into the
state-driven lexical analyzer method produced by CsLex.
<p>
All curly braces contained in <i>action</i> not part of strings
or comments should be balanced.
</p><h4>Actions and Recursion:</h4>
If an action returns <i>null</i>, the lexical
analyzer will loop, searching for the next match from the input
stream and returning the value associated with that match.
<p>
The lexical analyzer can be made to recur explicitly with a call
to <i>yylex()</i>, as in the following code fragment.<br>
<i>{ ...</i> <br>
<i>return yylex();</i> <br>
<i>... } </i><br>
This code fragment causes the lexical analyzer to recur,
searching for the next match in the input and returning the
value associated with that match. The same effect can be had,
however, by simply not returning from a given action. This
results in the lexer searching for the next match, without the
additional overhead of recursion.
</p><p>
The preceding code fragment is an example of tail recursion,
since the recursive call comes at the end of the calling
function's execution. The following code fragment is an example
of a recursive call that is not tail recursive.<br>
<i>{ ...</i> <br>
<i>next = yylex();</i> <br>
<i>return null;</i><br>
<i>... } </i><br>
Recursive actions that are not tail-recursive work in the
expected way, except that variables such as <i>yyline</i> and
<i>yychar</i> may be changed during recursion.
</p><h4>State Transitions:</h4>
If lexical states are declared in the CsLex directives section,
transitions on these states can be declared within the regular
expression actions. State transitions are made by the following
function call.<br>
<i>yybegin(state);</i><br>
The void function <i>yybegin()</i> is passed the state name
<i>state</i> and effects a transition to this lexical state.
<p>
The state <i>state</i> must be declared within the CsLex
directives section, or this call will result in a compiler error
in the generated source file. The one exception to this
declaration requirement is state <i>YYINITIAL</i>, the lexical
state implicitly declared by CsLex. The generated lexer begins
lexical analysis in state <i>YYINITIAL</i> and remains in this
state until a transition is made.
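</p>
<p>
As a sketch of how states and transitions fit together, the following
fragment skips C-style comments. The state name <i>COMMENT</i> is
hypothetical, and the fragment assumes that each action ends with
<i>return null;</i> so that the lexer continues with the next match.
</p>
<pre>
%state COMMENT
%%
&lt;YYINITIAL&gt;"/*" { yybegin(COMMENT); return null; }
&lt;COMMENT&gt;"*/" { yybegin(YYINITIAL); return null; }
&lt;COMMENT&gt;(.|\n) { return null; }
</pre>
<p>
While the lexer is in <i>COMMENT</i>, only the last two of these three
rules can match. The closing delimiter is preferred over the
single-character rule because it matches the longer string.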
</p><h4>Available Lexical Values:</h4>
The following values, internal to the <i>Yylex</i> class, are
available within the action portion of the lexical rules.
<table>
<tbody>
<tr>
<th align="left">Variable or Method</th>
<th align="left">Activation Directive</th>
<th align="left">Description</th>
</tr>
<tr>
<td><i>System.String yytext();</i></td>
<td>Always active.</td>
<td>Matched portion of the character input stream.</td>
</tr>
<tr>
<td><i>int yychar;</i></td>
<td><i>%char</i></td>
<td>Zero-based character index of the first character in the matched
portion of the input stream</td>
</tr>
<tr>
<td><i>int yyline;</i></td>
<td><i>%line</i></td>
<td>Zero-based line number of the start of the matched portion of the
input stream</td>
</tr>
</tbody>
</table>
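<p>
As an illustration, the following fragment reports the position of each
number it matches. It is a sketch that assumes both counting directives
are enabled; the message format is arbitrary.
</p>
<pre>
%char
%line
%%
[0-9]+ { System.Console.WriteLine("number " + yytext() + " at line " + yyline + ", offset " + yychar); return null; }
</pre>
<p>
Both <i>yyline</i> and <i>yychar</i> are zero-based, so a match that
starts at the first character of the input is reported at line 0,
offset 0.
</p>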
<h2>Generated Lexical Analyzers</h2>
CsLex will take a properly-formed specification and transform it
into a C# source file for the corresponding lexical analyzer.
<p>
The generated lexical analyzer resides in the class
<i>Yylex</i>. There are two constructors to this class, both
requiring a single argument: the input stream to be
tokenized. The input stream may either be of type
System.IO.StreamReader or System.IO.FileReader.
</p><p>
The access function to the lexer is <i>Yylex.yylex()</i>, which
returns the next token from the input stream. The return type is
<i>Yytoken</i> and the function is declared as follows.<br>
<i>class Yylex { ... </i><br>
<i>public Yytoken yylex () {</i><br>
<i>... } </i><br>
The user must declare the type of <i>Yytoken</i> and can
accomplish this conveniently in the first section of the CsLex
specification, the user code section. For instance, to make
<i>Yylex.yylex()</i> return the boxed (Int32.Box) integer
wrapper type, the user would enter the following code somewhere
preceding the first ``%%''.<br>
<i>class Yytoken { } </i><br>
Then, in the lexical actions, wrapped integers would be
returned, in something like this way.<br>
<i>{ ...</i><br>
<i>return ((object) i); </i><br>
<i>... } </i><br>
Likewise, in the user code section, a class could be defined declaring
constants that correspond to each of the token types.<br>
<i>class TokenCodes { ... </i><br>
<i>public const int STRING = 0; </i><br>
<i>public const int INTEGER = 1; </i><br>
<i>... } </i><br>
Then, in the lexical actions, these token codes could be
returned.<br>
<i>{ ...</i><br>
<i>return ((object) TokenCodes.STRING); </i><br>
<i>... } </i><br>
These are simplified examples; in actual use, one would probably
define a token class containing more information than an integer
code.
</p><p>
These examples begin to illustrate the object-oriented
techniques a user could employ to define an arbitrarily complex
token type to be returned by <i>Yylex.yylex()</i>. In
particular, inheritance permits the user to return more than one
token type. If a distinct token type was needed for strings and
integers, the user could make the following declarations.<br>
<i>class Yytoken { ... } </i><br>
<i>class IntegerToken : Yytoken { ... }</i><br>
<i>class StringToken : Yytoken { ... } </i><br>
Then the user could return both <i>IntegerToken</i> and
<i>StringToken</i> types from the lexical actions.
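</p>
<p>
A driver for the generated class might look like the following sketch. It
assumes the default names, including the default namespace
<i>YyNameSpace</i>, the default end-of-file value of <i>null</i>, a
user-supplied <i>Yytoken</i> class, and that the driver is compiled
together with the generated source file. The <i>Driver</i> class and its
command-line argument are hypothetical.
</p>
<pre>
using System;
using System.IO;
using YyNameSpace;

class Driver {
  static void Main(string[] args) {
    // args[0] is assumed to name the input file.
    Yylex lexer = new Yylex(new StreamReader(args[0]));
    Yytoken token;
    // yylex() is assumed to return null once end-of-file is reached.
    while ((token = lexer.yylex()) != null) {
      Console.WriteLine(token);
    }
  }
}
</pre>
<p>
Each call to <i>yylex()</i> returns the next token, so the loop ends when
the input is exhausted.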
</p><p>
The names of the lexical analyzer class, the tokenizing function,
and its return type each may be altered using the CsLex
directives.
</p><h2>Credits</h2>
CsLex is a derivative of the JLex implementation by Elliot
Joel Berk and C. Scott Ananian.
<p>
The design and architecture of CsLex, written in C#, is based on
a melding of the JLex implementation and the Lex functional
specification. JLex was written for the Java language, and its
direct antecedent, Lex, was designed and written for the C language.
</p><p>
CsLex distinguishes itself by incorporating a number of C#
language constructs and services; refer to the design notes for
CsLex for more details on C# specific features incorporated into
the CsLex design.
</p><hr>
<h4>CsLex COPYRIGHT NOTICE, LICENSE AND DISCLAIMER</h4>
CsLex is Copyright 2000 by Brad Merrill
<p>
Permission to use, copy, modify, and distribute this software
and its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all
copies and that both the copyright notice and this permission
notice and warranty disclaimer appear in supporting
documentation, and that the name of the authors or their
employers not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
</p><p>
The authors and their employers disclaim all warranties with
regard to this software, including all implied warranties of
merchantability and fitness. In no event shall the authors or
their employers be liable for any special, indirect or
consequential damages or any damages whatsoever resulting from
loss of use, data or profits, whether in an action of contract,
negligence or other tortious action, arising out of or in
connection with the use or performance of this software.
</p><p>
C# is a trademark of Microsoft Corp. References to the C#
programming language in relation to CsLex are not meant to imply
that Microsoft endorses this product.
</p><hr>
<h4>JLEX COPYRIGHT NOTICE, LICENSE AND DISCLAIMER</h4>
Copyright 1996-2000 by Elliot Joel Berk and C. Scott Ananian
<p>
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that the name of the authors or their employers not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
</p><p>
The authors and their employers disclaim all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall the authors or their employers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.
</p><p>
Java is a trademark of Sun Microsystems, Inc. References to the Java programming language in relation to JLex are not meant to imply that Sun endorses this product.
</p><hr>
<address>
<a href="http://drgdna/bmerrill">Brad Merrill</a>
<a href="mailto:[email protected]"><[email protected]></a>
</address>
<!-- hhmts start --> Last modified: Mon Sep 18 11:46:32 Pacific Daylight Time 2000 <!-- hhmts end -->
</body></html>