Initially extracted from a conversation with @Annieppo and @nicoabie in ##prolog on freenode.
The library started as a very simple and lightweight set of predicates for a common, but very limited, form of lexing. As we extend it, we aim to maintain a modest scope in order to achieve a sweet spot between ease of use and powerful flexibility.
`tokenize` does not aspire to become an industrial-strength lexer generator. We aim to serve most users' needs for the step between raw input and a structured form ready for parsing by a DCG.
If a user is parsing a language with keywords such as `class`, `module`, etc., and wants to distinguish these from variable names, `tokenize` isn't going to give them this out of the box. But it should provide an easy means of achieving this result through a subsequent lexing pass, as sketched below.
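For instance, here is a minimal sketch of such a second pass. It assumes the first pass produces word tokens of the form `word(W)`; the keyword table and the predicate names (`keywordize/2`, `tag_keyword/2`) are hypothetical and not part of the library:

```prolog
% Hypothetical keyword table for the language being lexed.
keyword(class).
keyword(module).

%% keywordize(+Tokens, -Tagged) is det.
%
%  Second lexing pass: rewrite word/1 tokens whose text is a known
%  keyword into keyword/1 tokens, leaving every other token untouched.
keywordize(Tokens, Tagged) :-
    maplist(tag_keyword, Tokens, Tagged).

tag_keyword(word(W), keyword(W)) :-
    keyword(W),
    !.
tag_keyword(Token, Token).
```

A DCG can then match `keyword(class)` directly instead of re-checking every `word/1` token against the keyword list.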
- somebody making a computer language
  - needs to be able to distinguish keywords, variables, and literals
  - needs to be able to identify comments
- somebody making a parser for an interactive fiction game
  - needs to handle input like "William O. N'mutu-O'Connell went to the market"
- somebody wanting to analyze human texts
  - e.g., someone doing analysis of New York Times articles will first want to process the articles into meaningful tokens
- We don't parse.
- Every token generated is callable (i.e., an atom or compound).
  - Example of a possible compound token: `space(' ')`.
  - Example of a possible atom token: `escape`.
- Tokenization needs to return tokens represented with the same arity.
- Users should be able to determine the kind of token by unification (see the sketch after this list).
- Users should be able to clearly see and specify the precedence for tokenization.
  - E.g., given `"-12.3"`, `numbers, punctuation` should yield `[pnct('-'), number(12), pnct('.'), number(3)]`, while `punctuation, numbers` should yield `[number(-12.3)]`.
- E.g., given
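To make the last few principles concrete, here is a minimal sketch of dispatching on token kinds purely by unification. The token functors used (`word/1`, `number/1`, `punct/1`, `space/1`) and the `describe_token/2` predicate are assumptions for illustration, not a fixed interface:

```prolog
%% describe_token(+Token, -Description) is semidet.
%
%  Because every token is callable, each clause head can pick out one
%  kind of token by unification alone, without inspecting functors or
%  arities by hand.
describe_token(word(W),   word_of_length(W, L)) :- atom_length(W, L).
describe_token(number(N), numeric(N)).
describe_token(punct(P),  punctuation(P)).
describe_token(space(_),  whitespace).
```

For example, `maplist(describe_token, [word(hello), space(' '), number(42), punct('.')], Ds)` binds `Ds` to `[word_of_length(hello, 5), whitespace, numeric(42), punctuation('.')]`.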