Replace regex-based lexer with character-at-a-time lexer #406
An additional change in behavior is that the new lexer includes trailing space as part of a recipe line. I.e., if a recipe line contains |
Since there were so few justfiles that saw changes (2 out of 498), and those that did won't change behavior, I'm going to merge this.
Fixes #241, which I've been threatening to do for a long time.
The current lexer uses regexes and is god awful.
The new lexer processes text mostly one character at a time, making decisions about which tokens to emit along the way. It's more verbose than the old lexer, but the new code is much easier to read, understand, and modify.
Also, the new lexer is 4x faster than the old lexer when tested against a corpus of justfiles collected from GitHub. In release mode, the change is more dramatic, with the new lexer being 15x faster.
I suspect that the speed increase is partially due to the old lexer trying to lex tokens by matching regexes in a sequence, which led to a lot of wasted work, whereas the new lexer is usually able to make a decision about which token to emit next by looking at the next character.
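To make the "decide from the next character" idea concrete, here is a minimal sketch of a character-at-a-time lexer. It is not the code in this PR: the `Token` enum, the `lex` function, and the tiny token set are all hypothetical and only meant to show the shape of the approach, where a single peek selects the branch and no regex is tried and discarded.

```rust
// Hypothetical token kinds for illustration; not the token set used by just.
#[derive(Debug, PartialEq)]
enum Token {
    Name(String),
    Colon,
    Equals,
    Comment(String),
    Whitespace,
    Eol,
}

// Lex `src` by peeking at the next character and dispatching on it.
fn lex(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();

    while let Some(&c) = chars.peek() {
        match c {
            ':' => {
                chars.next();
                tokens.push(Token::Colon);
            }
            '=' => {
                chars.next();
                tokens.push(Token::Equals);
            }
            '\n' => {
                chars.next();
                tokens.push(Token::Eol);
            }
            '\r' => {
                chars.next();
                // Treat "\r\n" as a single end-of-line unit.
                if chars.next() != Some('\n') {
                    return Err("expected '\\n' after '\\r'".to_string());
                }
                tokens.push(Token::Eol);
            }
            ' ' | '\t' => {
                // Collapse a run of spaces and tabs into one token.
                while let Some(&w) = chars.peek() {
                    if w == ' ' || w == '\t' {
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Whitespace);
            }
            '#' => {
                // A comment runs up to, but not including, the line ending.
                let mut text = String::new();
                while let Some(&k) = chars.peek() {
                    if k == '\n' || k == '\r' {
                        break;
                    }
                    text.push(k);
                    chars.next();
                }
                tokens.push(Token::Comment(text));
            }
            c if c.is_ascii_alphabetic() || c == '_' => {
                let mut name = String::new();
                while let Some(&n) = chars.peek() {
                    if n.is_ascii_alphanumeric() || n == '_' || n == '-' {
                        name.push(n);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Name(name));
            }
            other => return Err(format!("unexpected character {:?}", other)),
        }
    }

    Ok(tokens)
}
```

Calling `lex("build: # go\n")` on this sketch yields a name, a colon, whitespace, a comment, and an end-of-line token, with each branch chosen from one peeked character.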
Since this is such a massive change, I'm testing it using a new tool called Janus, which downloads all justfiles on GitHub and feeds them to multiple versions of just, looking for differences in behavior. Janus is of course inspired by Rust's crater, and once I finally release it we can close #251.
So far the results from Janus are encouraging. The just+new lexer produces slightly better error messages in a few cases, as well as being able to parse a previously unparsable justfile.
The only change that I need to investigate before landing the rewrite is a change in the handling of Windows newlines at the end of recipe lines. For example, it looks like the text that the old lexer would extract from the line `echo foo\r\n` would be `echo foo\r`, whereas the new lexer correctly recognizes `\r\n` as a unit, and extracts `echo foo` as the line text.
Although the new lexer's behavior is correct, I'm slightly concerned that there might be cases where the new behavior might cause a shebang recipe to fail when previously it would have succeeded.
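The difference in extracted line text can be illustrated with two small helpers. Both are hypothetical stand-ins written for this example, not the old or new lexer code: `line_text_lf_only` mimics the described old behavior by treating only `\n` as the line ending, and `line_text_crlf_aware` mimics the new behavior by recognizing `\r\n` as one unit.

```rust
// Stand-in for the old behavior as described above: only '\n' ends the line,
// so a Windows line ending leaves a trailing '\r' in the extracted text.
fn line_text_lf_only(line: &str) -> &str {
    line.strip_suffix('\n').unwrap_or(line)
}

// Stand-in for the new behavior: "\r\n" is recognized as a single unit and
// removed whole; a bare '\n' is still handled.
fn line_text_crlf_aware(line: &str) -> &str {
    line.strip_suffix("\r\n")
        .or_else(|| line.strip_suffix('\n'))
        .unwrap_or(line)
}
```

For the input `echo foo\r\n`, the first helper returns `echo foo\r` and the second returns `echo foo`. Neither helper trims other trailing whitespace, so a trailing space before the line ending stays in the line text.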