
Miscellaneous feedback #15

Open

fabiospampinato opened this issue Jan 12, 2021 · 3 comments

fabiospampinato commented Jan 12, 2021

I spent some time today benchmarking the library and playing with making a toy/useless Markdown parser with it, so here's some miscellaneous feedback after having interacted with the library some more. Feel free to close this and perhaps open standalone issues for the parts you think are worth addressing.


For the Markdown parser I was trying to write a matcher for headings, and I ran into some problems with that:

  1. I wanted to get the leading hashes, trailing hashes, and the content in between as individual strings in the tag to be transformed. That immediately implies I have to use multiple regexes, because capturing groups within a single regex are lost in the tag. Perhaps this should be changed somehow, as that would be quite powerful.
  2. Not being able to use a single regex in this scenario also means I can't use \1, \2, etc. to reference other capturing groups. In my headings scenario the trailing hashes should only be considered trailing hashes if there are exactly as many of them as there are leading hashes; otherwise they should be considered part of the body. This can't quite be expressed cleanly with the current system, because the first capturing group/matcher can't be referenced (the first sketch after this list shows the constraint in plain regex form).
    1. Addressing the first issue would kind of address this too.
    2. Another option would be to extend the DSL by adding support for \[0-9] references, which in this case would mean referencing the 1st, 2nd... 9th whole sub-matcher.
    3. Perhaps both options should be implemented, I'm not sure.
  3. Continuing with the same scenario, there's another issue once the standalone regex gets broken up:
    Standalone:
    `${/(#{1,6} )(.+?)( #{1,6})/}`
    Broken-up:
    `${/#{1,6} /} ${/.+?/} ${/ #{1,6}/}`
    
    I forgot to take note of what the issue was exactly (🤦‍♂️), but unless I'm misremembering, the issue is that those two expressions don't quite match the same things, because the lazy modifier in the broken-up version doesn't behave the same way (the second sketch after this list demonstrates this).
  4. Custom regex flags are discarded. It would be nice in some scenarios to be able to write a case-insensitive regex, or a regex where "^" and "$" match the start and end of the line respectively, for example.
  5. The DSL to me looks kind of like a reimplementation of a subset of the regex language, so perhaps it should be extended a bit to match it more closely; for example, bounded quantifiers like {1,3} should perhaps be supported too.
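For reference, the trailing-hashes constraint from point 2 can be expressed in a single plain regex with a backreference; this is just a sketch of the semantics the DSL can't currently express, not a proposed API:

const Heading = /^(#{1,6}) (.+?)(?: (\1))?$/;

Heading.exec('## Title ##');  // trailing "##" captured separately, since it equals group 1
Heading.exec('## Title ###'); // "###" !== "##", so it stays as part of the body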
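And a quick plain-JS demonstration of the lazy-modifier difference from point 3, assuming RegHex runs each broken-up regex with sticky matching (which is my understanding):

const input = '## Hello world ##';

// In the combined regex the trailing " #" forces ".+?" to expand:
/(#{1,6} )(.+?)( #{1,6})/.exec(input)[2]; // "Hello world"

// Standalone, nothing follows ".+?" inside the regex, so it stops immediately:
const lazy = /.+?/y;
lazy.lastIndex = 3; // right after "## "
lazy.exec(input)[0]; // "H", a single character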

Now about performance, perhaps the more interesting part of the feedback.

From what I've seen every ~atomic thing the library does is pretty fast, so there shouldn't be any meaningful micro-optimizations available; the root issue seems to be that the library spends too much time on the wrong alternations.

Some actual real numbers first so that the rest of the feedback sounds less crazy:

  1. I've been benchmarking the library with this.zip. Basically there's a parser that matches against a subset of JavaScript, and it's asked to match a bunch of expressions in a loop.
  2. Making this 14-character diff to a single matcher of the parser cut the time it took to run the benchmark by ~40%:
    -  = $`${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Slow
    +  = $`${/(?=.*\?)/} ${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Fast
  3. Basically this is what's happening:
    1. This rule matches a ternary expression.
    2. Most expressions in the benchmark aren't ternary expressions.
    3. Still those expressions could be the boolean condition of a ternary expression.
    4. So the parser parses the entire thing, until it realizes the required "?" character for the ternary expression can't be found.
    5. So it backtracks and eventually matches the entire thing again with another matcher.
    6. That "again" is where the almost-doubled time comes from, and it's what the changed rule avoids.
    7. The only thing the changed rule does is check whether the "?" character is present before running the gazillion other matchers (see the sketch right after this list).
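Expressed as plain JS rather than RegHex, the asymmetry the lookahead exploits is roughly this (a sketch, not the library's actual code):

// A single linear scan for "?" decides whether the expensive ternary
// branch is worth entering at all.
function matchTernary(input, index) {
  if (!input.includes('?', index)) return null; // cheap guard, like /(?=.*\?)/
  // ...only now run LogicalORExpression, TernaryOperatorTrue, Expression,
  // etc., each of which may recurse through many more matchers.
}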

That's kind of the root of the performance problems with RegHex parsers, in my opinion. If I had to guess, with enough sophistication some parsers could become 100x faster or more just by going down branches/alternations more intelligently.

At a high level, RegHex parsers look to me kinda like CPUs: individual patterns are like instructions, each alternation is a branch, etc. It should follow, then, that the same optimizations used for CPUs could/should be used for RegHex. I know next to nothing about that really, but just to mention some potential things that crossed my mind:

  1. In my ternary expression matcher above there's a series of instructions that must all succeed for the entire thing to be considered a match, but not all instructions are the same performance-wise: checking whether the string to match contains a "?" is waaaaay faster than checking whether it starts with something that could be a boolean expression for the ternary operator, so the fastest checks should be performed first.
    1. This optimization specifically could be performed at the parser level with a lookahead; that works, but it's kinda ugly.
    2. Another, more general, approach would be to analyze the regexes, probably at build time so that the cost of the analysis doesn't matter, and extract subsets of the automata that must match under every branch and are cheap to check for, like the presence of "?" and ":", in that order, required by my ternary expression matcher (a naive sketch of this follows the list).
  2. Next, perhaps a branch predictor could be added. The order in which alternations are tried matters for performance; if the 9th alternation in a set of 10 alternations is the one that matched most often in the past, perhaps it should be tried first, so that most of the time we can skip failing to match the first 8 alternations altogether (a minimal sketch of this also follows the list).
    1. This could be pretty tricky to optimize automatically, because you need to know for which chunks of the alternations array the order doesn't matter. Maybe some API could be exposed that just moves the problem to the user, like an "isParallelizable" third argument to the match function or something.
  3. This is kinda out-there, but in some sense RegHex took the individual automata I wrote (~the matchers) and built a larger one by putting the smaller ones in a tree (~the final parser). Currently, if I don't add the lookahead for "?", the parser goes down the branch for the ternary expression, matches a whole lot of stuff, then realizes the "?" character can't be found and goes back to the starting point, taking another branch. Now, it might be interesting to note that there are multiple branches of the tree here that look exactly the same for a while, so they should kind of get merged together into a prefix tree; that way, once the ternary expression ultimately failed to match, RegHex wouldn't go back to the very root of the tree but would take the nearest unexplored branch first instead, which in this particular case would lead to finding a match almost immediately (e.g. "so far that was an OR expression -> there are no more characters left in the input string -> done").
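On the build-time analysis idea (point 1.2), here's a deliberately naive sketch of extracting must-match literal characters from a regex. This is a toy helper of mine, not anything RegHex does; it bails out conservatively on any metacharacter it can't reason about:

function requiredLiterals(regex) {
  const required = [];
  const source = regex.source;
  for (let i = 0; i < source.length; i++) {
    const char = source[i];
    // "?", "*" and "{" can make the previous character optional:
    if (char === '?' || char === '*' || char === '{') {
      required.pop();
      break;
    }
    // Alternations, groups, classes, escapes, anchors: give up.
    if ('|([\\.^$'.includes(char)) break;
    if (char === '+') continue; // "x+" still requires at least one "x"
    required.push(char);
  }
  return required;
}

requiredLiterals(/foo=bar/); // ["f", "o", "o", "=", "b", "a", "r"]
requiredLiterals(/ab?c/);    // ["a"], the "b" and everything after it are uncertain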
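And a minimal sketch of the branch-predictor idea from point 2, assuming matchers are plain functions from (input, index) to a result or null, which is not RegHex's actual internal shape:

function predictedAlternation(...matchers) {
  const hits = matchers.map(() => 0);
  let order = matchers.map((_, i) => i);
  return (input, index) => {
    for (const i of order) {
      const result = matchers[i](input, index);
      if (result != null) {
        hits[i]++;
        // Hot alternatives bubble to the front for the next call.
        order = [...order].sort((a, b) => hits[b] - hits[a]);
        return result;
      }
    }
    return null;
  };
}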

Depending on how many of these fancy optimizations you're willing to spend time on, perhaps a seriously fast JS parser could be written on top of RegHex 🤔 That'd be really cool.


Sorry for the long post, hopefully there's some useful feedback in here.

kitten commented Jan 12, 2021

Don't worry about the length ✌️ I really like the ideas here!
I'm currently investigating what the cost would be to parse the regular expressions themselves and have a tiny regular expression matcher, just for character classes and sequences, that can otherwise reuse the existing codegen to compile down some regular expressions.

Lookaheads are indeed very powerful. I use them quite frequently when writing parsers in RegHex to prevent an expensive branch from being taken when a piece of logic would be too expensive to backtrack on.

For most optimisations I assume, however, that most grammars are already optimised to be LR, meaning that in the best case RegHex could extract a single character class (the start of the regular expression) as a cheap lookahead.
Whether this is worth it will depend on how costly it is, in terms of bundle size increases, to experiment with a non-built-in micro RegExp engine that inserts ad-hoc matchers in place of the regular expressions at compile time.
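As a sketch of that "extract the start of the regular expression" idea, as I'd picture it from the outside (an assumption, not what RegHex implements):

// Pull a leading character class out of a regex's source and use it as
// a one-character guard before invoking the full matcher.
function leadingClassGuard(regex) {
  const match = /^\[(?:[^\]\\]|\\.)*\]/.exec(regex.source);
  return match && new RegExp(match[0], 'y');
}

const guard = leadingClassGuard(/[a-z]+[0-9]/); // /[a-z]/y
// Before taking the branch: guard.lastIndex = index; guard.test(input)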

Personally I'm not too fond of the idea of adding more support for regular expression syntax itself. The case-insensitive flag is an easy one to add but I'd suspect ^ and $ are rarely useful in sticky-like matches. Hard to tell though.

I'll try to carve out some time for the character class conversion. Generally my idea is to compile character classes to an NFA/DFA (we'll see), where the character classes themselves are stored as tries / prefix trees with 4-bit prefixes and 16-bit bitmaps.
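Here's my reading of that data structure (an assumption about the eventual design, not the implementation): a 16-bit code unit is split into 4-bit nibbles, the top nibbles walk a 16-way trie, and the lowest nibble indexes into a 16-bit bitmap stored at the leaf:

function makeCharClass() {
  const root = [];
  return {
    add(codeUnit) {
      let node = root;
      for (let shift = 12; shift >= 8; shift -= 4) {
        const nibble = (codeUnit >> shift) & 0xf;
        node = node[nibble] || (node[nibble] = []);
      }
      const leaf = (codeUnit >> 4) & 0xf;
      node[leaf] = (node[leaf] || 0) | (1 << (codeUnit & 0xf));
    },
    has(codeUnit) {
      let node = root;
      for (let shift = 12; shift >= 8 && node; shift -= 4) {
        node = node[(codeUnit >> shift) & 0xf];
      }
      const bitmap = (node && node[(codeUnit >> 4) & 0xf]) || 0;
      return (bitmap & (1 << (codeUnit & 0xf))) !== 0;
    },
  };
}

const lower = makeCharClass(); // an [a-z] class
for (let c = 0x61; c <= 0x7a; c++) lower.add(c);
lower.has('f'.charCodeAt(0)); // true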

The reason why I think that is interesting is that, if most of the cost goes towards regular expressions, any uplift in matching performance on those, compared to the built-ins, will help tremendously across the board. Using the built-in regular expressions can in theory not only incur the performance cost of the JS engine calling into the regular expression engine, but also GC pressure (since it'd be hard to replace the array-based node structure itself).

@fabiospampinato

Sounds good 👍

It might be useful for guiding optimizations if the library could tell how many regexes it executed vs. the minimum number of regexes it would have executed had it always taken the correct branches. Maybe even your parsers that already use lookaheads heavily spend a significant amount of time on the wrong branches; I'd be surprised if that wasn't the case 🤔
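A hypothetical way to get the first of those numbers without library support: clone each regex and shadow its exec, which RegExp.prototype.test and String.prototype.match also route through:

let executions = 0;

function counted(regex) {
  const clone = new RegExp(regex.source, regex.flags);
  const nativeExec = clone.exec;
  clone.exec = function (input) {
    executions++;
    return nativeExec.call(this, input);
  };
  return clone;
}

// Wrap every pattern fed to the parser with counted(...), run the
// benchmark once, and compare `executions` against the number of
// matchers on the successful parse path.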

@fabiospampinato

Somewhat interestingly, pre-constructing compounded regexes manually, like this:

// Before

const Digit
  = /[0-9]/;

const IdentifierStart
  = /[a-zA-Z$_]/;

const IdentifierRest
  = $`${IdentifierStart} | ${Digit}`;

const Identifier
  = $`${IdentifierStart} ${IdentifierRest}*`;

// After

const Digit
  = /[0-9]/;

const IdentifierStart
  = /[a-zA-Z$_]/;

const IdentifierRest
  = $`${IdentifierStart} | ${Digit}`;

const Identifier
  = /[a-zA-Z$_][a-zA-Z0-9$_]*/; // <-- Changed rule

That one "Identifier" rule change made the entire parser take ~25% less time in my benchmarks (~210ms vs ~280ms).

Perhaps this is also something that could be performed automatically, to some extent, at build time, when compiling the regexes away entirely isn't feasible.
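As a sketch of what that automatic compounding could do (my own helper, not a RegHex API): when every piece of a matcher is a plain regex, their sources can simply be concatenated into a single pattern at build time:

const Digit = /[0-9]/;
const IdentifierStart = /[a-zA-Z$_]/;

// Equivalent to the hand-written /[a-zA-Z$_][a-zA-Z0-9$_]*/ above:
const Identifier = new RegExp(
  `${IdentifierStart.source}(?:${IdentifierStart.source}|${Digit.source})*`
);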
