UCLP is an experimental implementation of PEG parsing in Common Lisp which compiles grammar rules directly to source code at runtime. A parsing expression grammar is a very elegant way of recognizing, parsing, and transforming text- much more powerful than regular expressions without the complexity of a custom-built parser. Note that while it is possible to parse PEGs in guaranteed linear time (at the cost of linear space) with a packrat parser, UCLP does not. Most patterns unless very poorly written will run in linear time anyway, and for ease of use some necessary departures from the strict definition are included.
UCLP patterns are just made of native data structures, so it’s easy to compose patterns and interact with them in code. Unlike regular expressions, PEG syntax is pretty readable even if you aren’t familiar with the specifics. The below example is largely copied from the Janet language documentation:
(defparameter ip-address
'(grammar
:dig (range "09")
:0-4 (range "04")
:0-5 (range "05")
:byte (choice
(sequence "25" :0-5)
(sequence "2" :0-4 :dig)
(sequence "1" :dig :dig)
(between :dig 1 2))
:main (sequence :byte "." :byte "." :byte "." :byte)))
; uclp:match returns two values, a boolean indicating success/failure and a list of captures
(uclp:match ip-address "0.0.0.0") ; -> t nil
(uclp:match ip-address "elephant") ; -> nil
(uclp:match ip-address "256.0.0.0") ; -> nil
(uclp:match ip-address "0.0.0.0moretext") ; -> t nil
If you follow the link you’ll see that the example is almost exactly copied, with the only differences being those forced by the differences in language syntax. UCLP is a very close reproduction of the semantics of Janet’s PEG module. After experiencing how pleasant text parsing is in Janet you’ll also feel the urge to rewrite it for every language you use.
As of now UCLP is usable. Bugs are to be expected, but almost all of Janet’s patterns are supported and there shouldn’t be any significant footguns. Unfortunately it has only been tested on SBCL, but I plan on at least expanding that to some other implementations since there isn’t much implementation-specific code. Except of course that for it to be performant it needs to be compiled quickly to a quick runtime, ideally machine code.
The peg syntax used by UCLP is largely a 1:1 reproduction of Janet’s. Differences are noted below, but otherwise you can safely default to using Janet’s documentation. Many tasks that might require a particular function when using regex can be accomplished with the correct pattern. However, for convenience, there are entrypoints for a few common usecases.
This is the primary entrypoint to UCLP. Matches str
against rule
anchored to start
.
Returns two values on success, a boolean indicating success or failure and a list of
captures, and nil otherwise. Additional arguments are accessable while matching using the
argument
pattern. Note that if using additional arguments you must specify a start
.
Compiles rule
to source ahead of time. Even with SBCL’s lightning quick compilation this
can still save a significant amount of time, and for many rules almost all the
consing. The result is just a closure, and it can be used in other patterns by including
it literally. Does not promise to be nice if used outside its wrapper.
Takes exactly the same arguments as match
but returns in the opposite order- first the
list of captures, and success as the second value. Useful to get at captures without
multiple-value-bind
.
Returns the string resulting from replacing every instance of rule
in str
with
replace
, which may be a string to substitute or a function taking as an argument the
substring matched by rule
. Returns a boolean indicating if matches were found as a
second value.
Behaves like replace-all
, but replaces only the first match.
Gives the position of each match in str
as a list, such that
(defparameter dig '(* "dig:" :d))
(defparameter str "dig:7, dig:8, dig:9")
(every (lambda (position)
(uclp:match dig str position))
(uclp:find-all dig str)) ; => t
Returns the position of the first match, or nil
if there are no matches.
Mostly UCLP adheres to the behavior of Janet pegs, enough so that Janet’s documentation is better than anything I’ll have put together for a little while. However, there are some differences which obviously need to be documented somewhere. Some are due to differences in host language, some are due to taste, and some are just features I haven’t implemented yet.
Any difference in behavior from Janet not mentioned below is a bug, and either the behavior or the documentation will need to be changed.
Janet buffers and strings are simply byte strings, and Janet pegs work on arbitrary
strings of bytes. UCLP expects to work with Common Lisp strings, which in general are
not byte strings but vectors of type character
. As such the patterns intended to work
on bytes are not implemented. These are: uint
, uint-be
, int
, and int-be
. Because
using PEGs to parse binaries is so nice, I plan on at some point implementing some way of
compiling PEGs intended to operate on raw bytes. However, specialization will be at
compile-time and the above patterns will likely be available only in a byte PEG.
I’m not yet sure if UCLP will ever support a general number
pattern. It’s possible
I’ll bring in parse-float
to make a float
as well as a general number
pattern.
- While UCLP does not have
number
, it does haveinteger
, which takes identical arguments and parses an integer usingparse-integer
. - The pattern
(split div-pat fill-pat)
was contributed to Janet but not yet released. It will match whenfill-pat
matches each segment of the input when divided bydiv-pat
. However, it is a little more complicated than that. For one,split
always matches against the entire input. If you don’t want that, you need to use asub
pattern to contain it. This means that something like(match '(split "," 1) "a,b,c,")
will not match! Additionally,fill-pat
is implicitly matched in asub
- so if you want to match everything between your seperators, you can use a pattern like(split "," (any 1))
. Look at the test cases for examples here. - The pattern
(split div-pat fill-pat)
was contributed to Janet but not yet released. It matches if the input string consists of instances offill-pat
split apart bydiv-pat
. Look at the test cases to get a finer idea of the details, but it is equivalent to the following pattern:`(grammar :mid (sub (to (+ ,div-pat -1)) ,fill-pat) ; fill-pat matches up to the next div :main (* :mid (any (* (drop ,div-pat) :mid)) -1))
- Anywhere a string literal can go, including those in
range
orset
, a character or list of characters and strings can also go. This is because Common Lisp strings do not have escape codes like Janet strings. So(range ("a" #\Newline))
in UCLP is the same as(range "a\n")
in Janet. between
,at-least
,at-most
, andlook
all have the pattern as the first argument, unlike in Janet where it is the last argument.backmatch
requires a tag argument, and will not look up captures on the capture stackreplace
takes either a string which it captures literally, or a function which it calls. Taking other datatypes literally will probably be in the next version. But unlike Janet, it will never look the matches up.- Grammars, represented in Janet by tables or structs, are written in UCLP with the
grammar
rule, which is followed by alternating keywords naming rules and patterns implementing them. - Because the reader doesn’t distinguish between
:s
and:S
, the complement of a built-in pattern is prefixed with!
. So:!s
instead of:S
. - The
error
pattern has significant differences from Janet’serror
. See Error below. - In addition to the
+
and*
variants of built-in patterns, UCLP has a maybe variant marked by a?
. So:w?
denotes(? :w)
. - UCLP includes any (
*
), some (+
), and maybe (?
) variants of complement patterns. So:!d+
is(some (if-not :d 1))
. See Aliases below
UCLP offers aliases, keywords that stand in for larger patterns, similar to the Janet’s built-in patterns. And like built-in patterns, aliases are user extensible. However, there are a number of differences which are important to be aware of. Aliases are not first class citizens of UCLP- rather than full mutually recursive subpatterns, they are simple find-and-replace macros, inserted literally. You can reference other aliases from inside one, but if you create a cycle it’ll just blow out the stack. So just be cautious!
Aliases are stored in the alist *aliases*
. You can manipulate *aliases*
directly, or
call the helper functions register-alias!
and register-alias-suite!
. Both take the
name of the alias as a keyword, and the body as a peg expression, and push the new alias
to *aliases*
. However, register-alias-suite!
will also add the complement, some,
maybe, and any variants, like so:
(uclp:register-alias-suite! :v '(set "vV"))
(uclp:match '(* :v (<- :!v+)) "v not a V") ; => t (" not a ")
Because Common Lisp conditions are so different from Janet signals, the error
pattern
has some subtleties in UCLP. It takes arguments of the form
(error &optional pat condition)
.
With 0 arguments, it will raise a peg-error
. With one argument, it will
raise an error only if pat
matches. With two arguments, it will raise a condition
so
long as pat
matches. To specify a particular condition a pattern must be given, but
something like 0
will always match.
UCLP special cases conditions inheriting from peg-error
, which is exported. A
peg-error
has slots pat
, matched
, and caps
. If a pattern is given, these will
automatically be filled by the pattern itself, the text that pattern matched, and the
captures from the pattern respectively. If no pattern is supplied they will all be
nil
. These slots can be accessed with error-pat
, error-matched
, and caps
. By
default peg-error
has a report function giving the pattern and, if nonempty, the
matching substring. Depending on your choice of pattern, this can be tolerably readable.