-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move all regex usage to separate module to add support for fancy-regex #270
Conversation
default = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create"] | ||
default-onig = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-onig"] | ||
default-fancy = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-fancy"] | ||
default = ["default-onig"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the best way to structure the features is, any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key thing about features is that they're additive so that if multiple crates specify different features, the union of them works for both crates. The other nice thing to do is to preserve compatibility with existing crates that depend on us without default features, although I'm okay with breaking that if there's no good way otherwise.
I can't see a clean way of making the features backwards-compatible, so maybe just change the cfg
statements in the regex
module so that if both regex-fancy
and regex-onig
are set then regex-fancy
takes precedence, although I could see the precedence going the other way too, as long as it works with both set.
src/parsing/parser.rs
Outdated
@@ -327,7 +323,7 @@ impl ParseState { | |||
let match_pat = pat_context.match_at(pat_index); | |||
|
|||
if let Some(match_region) = self.search( | |||
line, start, match_pat, captures, search_cache, regions | |||
line, start, match_pat, captures, search_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the previous code reused the regions. I should benchmark what the impact of this is, but I decided for the straightforward API for now because I didn't want to complicate things more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is plausibly an actually important optimization, perhaps especially with onig but allocating an extra Vec on every search with fancy-regex isn't great either. But I forget since it was a long time ago that I optimized this.
I would be happy if you ran the benchmarks with onig in this PR vs the ones in master so we can see if this makes a difference that matters. If so we can just refactor the regex
module interfaces to get a region passed in, then the fancy-regex implementation can clear
the Vec
before adding to it again.
|
||
/// A region contains text positions for capture groups in a match result. | ||
#[derive(Clone, Debug, Eq, PartialEq)] | ||
pub struct Region { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So these types and their methods become public API. Not sure about the naming, it's currently similar to what onig uses, but different from the regex crate/fancy-regex.
pub struct MatchPattern { | ||
pub has_captures: bool, | ||
pub regex_str: String, | ||
pub regex: Regex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note due to how the serialization/deserialization of Regex
just delegates to the String
inside it, this happens to work without changing the binary format for packs (i.e. no need to regenerate packs).
@@ -424,25 +424,35 @@ impl SyntaxDefinition { | |||
} | |||
|
|||
fn resolve_variables(raw_regex: &str, state: &ParserState<'_>) -> String { | |||
state.variable_regex.replace_all(raw_regex, |caps: &Captures<'_>| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the only use of replace_all
. Putting that into the regex API would have been a bit complicated, so I rewrote this part to use search
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Some things I'd like before merging:
- Run the benchmarks with master, this PR on onig, and this PR on fancy-regex
- Add a bit of documentation mentioning fancy-regex and why you might use it (pure Rust!) somewhere (readme, Cargo.toml, doc comment, not sure).
- Fix the CI failure where the build doesn't work with no default features because of references to a missing
regex_impl
@@ -23,13 +22,6 @@ type Dict = serde_json::Map<String, Settings>; | |||
/// A String representation of a `ScopeSelectors` instance. | |||
type SelectorString = String; | |||
|
|||
/// A simple regex pattern, used for checking indentation state. | |||
#[derive(Debug)] | |||
pub struct Pattern { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this refactor gets rid of the duplicate implementation of regex laziness.
src/parsing/parser.rs
Outdated
@@ -327,7 +323,7 @@ impl ParseState { | |||
let match_pat = pat_context.match_at(pat_index); | |||
|
|||
if let Some(match_region) = self.search( | |||
line, start, match_pat, captures, search_cache, regions | |||
line, start, match_pat, captures, search_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is plausibly an actually important optimization, perhaps especially with onig but allocating an extra Vec on every search with fancy-regex isn't great either. But I forget since it was a long time ago that I optimized this.
I would be happy if you ran the benchmarks with onig in this PR vs the ones in master so we can see if this makes a difference that matters. If so we can just refactor the regex
module interfaces to get a region passed in, then the fancy-regex implementation can clear
the Vec
before adding to it again.
default = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create"] | ||
default-onig = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-onig"] | ||
default-fancy = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-fancy"] | ||
default = ["default-onig"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key thing about features is that they're additive so that if multiple crates specify different features, the union of them works for both crates. The other nice thing to do is to preserve compatibility with existing crates that depend on us without default features, although I'm okay with breaking that if there's no good way otherwise.
I can't see a clean way of making the features backwards-compatible, so maybe just change the cfg
statements in the regex
module so that if both regex-fancy
and regex-onig
are set then regex-fancy
takes precedence, although I could see the precedence going the other way too, as long as it works with both set.
Also I must say I love how this feature removes nearly as many lines as it adds, largely due to the regex abstraction removing a lot of duplication. Good work on that. |
Ok, so there's a problem with the Java syntax with fancy-regex, which I've narrowed down to this bug that will need to be fixed: fancy-regex/fancy-regex#37 |
Any progress on getting this fancy-regex feature? Running
|
What can we do to help? Really keen on being able to land this. |
One avenue that might be an easier path towards fixing common Windows woes is helping with rust-onig/rust-onig#126 which should make onig no longer require Clang on Windows, although it may still need to build C I'm not sure. You could maybe make it so that the Alternatively you can try checking out this branch and seeing if you can rebase it and get it passing CI, which may require updating fancy-regex and maybe some more fixes. I think @robinst may have mentioned at some point that he was still occasionally tinkering with this but not sure if that's still true. |
That way we can add fancy-regex support behind a feature.
* Adds a std::error::Error impl for Error * Adds a backtracking limit to mitigate catastrophic backtracking
Without this, some parsing benchmarks took 30% longer to run.
Some of the regexes include `$` and expect it to match end of line. In fancy-regex, `$` means end of text by default. Adding `(?m)` activates multi-line mode which changes `$` to match end of line. This fixes a large number of the failed assertions with syntest.
In fancy-regex, POSIX character classes only match ASCII characters. Sublime's syntaxes expect them to match Unicode characters as well, so transform them to corresponding Unicode character classes.
With the regex crate and fancy-regex, `^` in multi-line mode also matches at the end of a string like "test\n". There are some regexes in the syntax definitions like `^\s*$`, which are intended to match a blank line only. So change `^` to `\A` which only matches at the beginning of text.
Note that this wasn't a problem with Oniguruma because it works on UTF-8 bytes, but fancy-regex works on characters.
Always adding `(?m)` for the entire regex meant that `.` also changed meaning, which is not what we want. The safer option is to use `(?m:$)` for `$` only. That also means we don't have to bother with `\A`. But we do need to parse look-behinds because we can't use `(?m:$)` in it.
Turns out `(?m:$)` works in look-behinds, just not `(?m)$(?-m)` which I was using before.
Includes the fix for fancy-regex/fancy-regex#37 which caused a test failure with the Java syntax.
(Am trying to bring this PR up to date with PR 7 |
fe28a3c
to
a7045b1
Compare
Might be worth introducing a specific feature for this and then have other features depend on it? Not sure.
Should the feature be called regex-rs rather than regex-fancy to match with dump-create-rs and dump-load-rs? |
Confirmed this branch builds fine on windows 10 and OSX using |
The sooner we land it, the sooner people can say |
This works fine for me on windows with the following
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked over all this again and it looks excellent. I really like the new abstractions and all the new tests.
I ran the benchmarks and it looks like this doesn't really change performance in onig mode, and fancy-regex mode is about half the speed:
highlight/"highlight_test.erb"
time: [3.8420 ms 3.9419 ms 4.1724 ms]
change: [+250.05% +314.05% +396.57%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe
highlight/"InspiredGitHub.tmTheme"
time: [36.775 ms 36.969 ms 37.223 ms]
change: [+97.910% +102.09% +106.71%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
Benchmarking highlight/"Ruby.sublime-syntax": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.1s or reduce sample count to 10
highlight/"Ruby.sublime-syntax"
time: [214.52 ms 216.23 ms 219.43 ms]
change: [+369.60% +377.83% +385.88%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking highlight/"jquery.js": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 47.5s or reduce sample count to 10
highlight/"jquery.js" time: [808.28 ms 811.63 ms 820.02 ms]
change: [+91.568% +96.014% +100.90%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking highlight/"parser.rs": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 28.5s or reduce sample count to 10
highlight/"parser.rs" time: [494.35 ms 495.46 ms 497.29 ms]
change: [+85.262% +87.230% +89.678%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
highlight/"scope.rs" time: [45.487 ms 46.328 ms 47.719 ms]
change: [+87.907% +94.569% +100.73%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
I think fancy-regex being substantially slower shouldn't block merging as long as it's not the default so I'm totally fine with merging at the current performance levels. If we ever want to make it the default I'd want to improve the speed first though.
I also pushed a commit that runs the tests for fancy-regex mode in CI. It looks like more syntax tests pass than before (good job!) but my tricky highlight_test.erb
gets mis-highlighted, which causes an HTML comparison test to fail. Try running cargo run --features default-fancy --no-default-features --release --example syncat testdata/highlight_test.erb
to see this.
If you're able to quickly dig in and fix that it would be awesome to launch fancy-regex with no known highlighting flaws. However I can imagine it might be something that's time consuming to diagnose or fix, in which case I'd be willing to launch with that test disabled under fancy-regex and a warning in the readme that although fancy-regex works most of the time there are know cases where it messes up. Because it does work most of the time and for the people who need fancy-regex something is better than nothing.
So yah only thing I can think of right now that this needs before merge is either disabling or fixing that test, and something in the readme, both of which I can take a stab at if @robinst is busy.
Amazing work @robinst, thanks so much I'm happy to see this finally being so close! Also thanks @raphlinus for the initial work on fancy-regex, it's looking like it'll finally get put to use! And thanks @gilescope for pushing to get this done :)
Wow, that's some test case. |
This applies the regex rewriting fixes on this branch to the default packs.
Yay, thanks for helping push this over the finish line :)! The benchmarks match the shallow benchmarking that I've done. I've done some matching of single regexes (ones that don't need fancy features) and I've observed that the first N matches are slow, and only after some warmup do they get fast. It would be good to investigate this more to see if we can get more performance out of it. For the test failure, I actually already looked into that and fixed it in the regex rewriting. All that was needed was updating the packs (which I put off until the end because it's hard to rebase)! Pushed that now and build should go green. If you don't mind adding the bits to the readme @trishume, that would be awesome. |
@robinst If you're able to widdle down a simple benchmark for me, I'd be happy to take a closer look. |
This has the same goal as #34 but with a different approach.
Note that the fancy-regex implementation doesn't compile yet, but I thought it would be useful to get this reviewed earlier rather than later.
I haven't ported over the regex rewriting changes yet, I'm hoping that we can generate regexes that work on both onig and fancy-regex.