Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move all regex usage to separate module to add support for fancy-regex #270

Merged
merged 16 commits into from
Mar 29, 2020

Conversation

robinst
Copy link
Collaborator

@robinst robinst commented Nov 25, 2019

This has the same goal as #34 but with a different approach.

Note that the fancy-regex implementation doesn't compile yet, but I thought it would be useful to get this reviewed earlier rather than later.

I haven't ported over the regex rewriting changes yet, I'm hoping that we can generate regexes that work on both onig and fancy-regex.

default = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create"]
default-onig = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-onig"]
default-fancy = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-fancy"]
default = ["default-onig"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the best way to structure the features is, any thoughts?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key thing about features is that they're additive so that if multiple crates specify different features, the union of them works for both crates. The other nice thing to do is to preserve compatibility with existing crates that depend on us without default features, although I'm okay with breaking that if there's no good way otherwise.

I can't see a clean way of making the features backwards-compatible, so maybe just change the cfg statements in the regex module so that if both regex-fancy and regex-onig are set then regex-fancy takes precedence, although I could see the precedence going the other way too, as long as it works with both set.

@@ -327,7 +323,7 @@ impl ParseState {
let match_pat = pat_context.match_at(pat_index);

if let Some(match_region) = self.search(
line, start, match_pat, captures, search_cache, regions
line, start, match_pat, captures, search_cache
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the previous code reused the regions. I should benchmark what the impact of this is, but I decided for the straightforward API for now because I didn't want to complicate things more.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is plausibly an actually important optimization, perhaps especially with onig but allocating an extra Vec on every search with fancy-regex isn't great either. But I forget since it was a long time ago that I optimized this.

I would be happy if you ran the benchmarks with onig in this PR vs the ones in master so we can see if this makes a difference that matters. If so we can just refactor the regex module interfaces to get a region passed in, then the fancy-regex implementation can clear the Vec before adding to it again.


/// A region contains text positions for capture groups in a match result.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Region {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these types and their methods become public API. Not sure about the naming, it's currently similar to what onig uses, but different from the regex crate/fancy-regex.

pub struct MatchPattern {
pub has_captures: bool,
pub regex_str: String,
pub regex: Regex,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note due to how the serialization/deserialization of Regex just delegates to the String inside it, this happens to work without changing the binary format for packs (i.e. no need to regenerate packs).

@@ -424,25 +424,35 @@ impl SyntaxDefinition {
}

fn resolve_variables(raw_regex: &str, state: &ParserState<'_>) -> String {
state.variable_regex.replace_all(raw_regex, |caps: &Captures<'_>| {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the only use of replace_all. Putting that into the regex API would have been a bit complicated, so I rewrote this part to use search instead.

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some things I'd like before merging:

  • Run the benchmarks with master, this PR on onig, and this PR on fancy-regex
  • Add a bit of documentation mentioning fancy-regex and why you might use it (pure Rust!) somewhere (readme, Cargo.toml, doc comment, not sure).
  • Fix the CI failure where the build doesn't work with no default features because of references to a missing regex_impl

@@ -23,13 +22,6 @@ type Dict = serde_json::Map<String, Settings>;
/// A String representation of a `ScopeSelectors` instance.
type SelectorString = String;

/// A simple regex pattern, used for checking indentation state.
#[derive(Debug)]
pub struct Pattern {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this refactor gets rid of the duplicate implementation of regex laziness.

@@ -327,7 +323,7 @@ impl ParseState {
let match_pat = pat_context.match_at(pat_index);

if let Some(match_region) = self.search(
line, start, match_pat, captures, search_cache, regions
line, start, match_pat, captures, search_cache
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is plausibly an actually important optimization, perhaps especially with onig but allocating an extra Vec on every search with fancy-regex isn't great either. But I forget since it was a long time ago that I optimized this.

I would be happy if you ran the benchmarks with onig in this PR vs the ones in master so we can see if this makes a difference that matters. If so we can just refactor the regex module interfaces to get a region passed in, then the fancy-regex implementation can clear the Vec before adding to it again.

default = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create"]
default-onig = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-onig"]
default-fancy = ["parsing", "assets", "html", "yaml-load", "dump-load", "dump-create", "regex-fancy"]
default = ["default-onig"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key thing about features is that they're additive so that if multiple crates specify different features, the union of them works for both crates. The other nice thing to do is to preserve compatibility with existing crates that depend on us without default features, although I'm okay with breaking that if there's no good way otherwise.

I can't see a clean way of making the features backwards-compatible, so maybe just change the cfg statements in the regex module so that if both regex-fancy and regex-onig are set then regex-fancy takes precedence, although I could see the precedence going the other way too, as long as it works with both set.

@trishume
Copy link
Owner

Also I must say I love how this feature removes nearly as many lines as it adds, largely due to the regex abstraction removing a lot of duplication. Good work on that.

@robinst
Copy link
Collaborator Author

robinst commented Dec 6, 2019

Ok, so there's a problem with the Java syntax with fancy-regex, which I've narrowed down to this bug that will need to be fixed: fancy-regex/fancy-regex#37

@Adarma
Copy link

Adarma commented Mar 4, 2020

Any progress on getting this fancy-regex feature?
Even if performance is worse, it is way better than a failed build on Windows.

Running cargo run --features="default-fancy" --example synstats still tries and fails to build onig_sys:

/c/CODE/Rust/syntect ((fe28a3c...))
$ cargo run --features="default-fancy" --example synstats
   Compiling onig_sys v69.2.0
error: failed to run custom build command for `onig_sys v69.2.0`

Caused by:
  process didn't exit successfully: `C:\CODE\Rust\syntect\target\debug\build\onig_sys-9c435766ff906277\build-script-build` (exit code: 101)
--- stdout
cargo:warning=couldn't execute `llvm-config --prefix` (error: The system cannot find the file specified. (os error 2))
cargo:warning=set the LLVM_CONFIG_PATH environment variable to a valid `llvm-config` executable

--- stderr
thread 'main' panicked at 'Unable to find libclang: "couldn\'t find any valid shared libraries matching: [\'clang.dll\', \'libclang.dll\'], set the `LIBCLANG_PATH` environment variable to a path where one of these files can be found (invalid: [])"', src\libcore\result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

@gilescope
Copy link

What can we do to help? Really keen on being able to land this.

@trishume
Copy link
Owner

trishume commented Mar 7, 2020

One avenue that might be an easier path towards fixing common Windows woes is helping with rust-onig/rust-onig#126 which should make onig no longer require Clang on Windows, although it may still need to build C I'm not sure. You could maybe make it so that the onig crate uses a binary Windows build of the oniguruma library.

Alternatively you can try checking out this branch and seeing if you can rebase it and get it passing CI, which may require updating fancy-regex and maybe some more fixes. I think @robinst may have mentioned at some point that he was still occasionally tinkering with this but not sure if that's still true.

robinst and others added 12 commits March 16, 2020 10:44
That way we can add fancy-regex support behind a feature.
* Adds a std::error::Error impl for Error
* Adds a backtracking limit to mitigate catastrophic backtracking
Without this, some parsing benchmarks took 30% longer to run.
Some of the regexes include `$` and expect it to match end of line. In
fancy-regex, `$` means end of text by default. Adding `(?m)` activates
multi-line mode which changes `$` to match end of line.

This fixes a large number of the failed assertions with syntest.
In fancy-regex, POSIX character classes only match ASCII characters.
Sublime's syntaxes expect them to match Unicode characters as well, so
transform them to corresponding Unicode character classes.
With the regex crate and fancy-regex, `^` in multi-line mode also
matches at the end of a string like "test\n". There are some regexes in
the syntax definitions like `^\s*$`, which are intended to match a blank
line only. So change `^` to `\A` which only matches at the beginning of
text.
Note that this wasn't a problem with Oniguruma because it works on UTF-8
bytes, but fancy-regex works on characters.
Always adding `(?m)` for the entire regex meant that `.` also changed meaning,
which is not what we want. The safer option is to use `(?m:$)` for `$` only.

That also means we don't have to bother with `\A`. But we do need to parse
look-behinds because we can't use `(?m:$)` in it.
Turns out `(?m:$)` works in look-behinds, just not `(?m)$(?-m)` which I was
using before.
Includes the fix for
fancy-regex/fancy-regex#37 which caused a test
failure with the Java syntax.
@gilescope
Copy link

(Am trying to bring this PR up to date with PR 7

@robinst robinst force-pushed the move-regex-use-to-module branch from fe28a3c to a7045b1 Compare March 20, 2020 09:17
Might be worth introducing a specific feature for this and then have
other features depend on it? Not sure.
@gilescope
Copy link

Should the feature be called regex-rs rather than regex-fancy to match with dump-create-rs and dump-load-rs?

@gilescope
Copy link

Confirmed this branch builds fine on windows 10 and OSX using cargo build --features regex-fancy --no-default-features with no onix installed. Maybe we could add a line to the readme to explicitly say if you want to build a pure rust version this is how to do it. Aside from that I'm very happy with this PR.

@gilescope
Copy link

The sooner we land it, the sooner people can say cargo install cargo-expand and it will just work (tm). This will be really great for a lot of rust devs!

@Adarma
Copy link

Adarma commented Mar 21, 2020

This works fine for me on windows with the following Cargo.toml:

syntect = { git = "https://github.com/trishume/syntect", branch = "move-regex-use-to-module", default-features = false, features = ["default-fancy"] }

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked over all this again and it looks excellent. I really like the new abstractions and all the new tests.

I ran the benchmarks and it looks like this doesn't really change performance in onig mode, and fancy-regex mode is about half the speed:

highlight/"highlight_test.erb"
                        time:   [3.8420 ms 3.9419 ms 4.1724 ms]
                        change: [+250.05% +314.05% +396.57%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
highlight/"InspiredGitHub.tmTheme"
                        time:   [36.775 ms 36.969 ms 37.223 ms]
                        change: [+97.910% +102.09% +106.71%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking highlight/"Ruby.sublime-syntax": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.1s or reduce sample count to 10
highlight/"Ruby.sublime-syntax"
                        time:   [214.52 ms 216.23 ms 219.43 ms]
                        change: [+369.60% +377.83% +385.88%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking highlight/"jquery.js": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 47.5s or reduce sample count to 10
highlight/"jquery.js"   time:   [808.28 ms 811.63 ms 820.02 ms]
                        change: [+91.568% +96.014% +100.90%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking highlight/"parser.rs": Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 28.5s or reduce sample count to 10
highlight/"parser.rs"   time:   [494.35 ms 495.46 ms 497.29 ms]
                        change: [+85.262% +87.230% +89.678%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
highlight/"scope.rs"    time:   [45.487 ms 46.328 ms 47.719 ms]
                        change: [+87.907% +94.569% +100.73%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)

I think fancy-regex being substantially slower shouldn't block merging as long as it's not the default so I'm totally fine with merging at the current performance levels. If we ever want to make it the default I'd want to improve the speed first though.

I also pushed a commit that runs the tests for fancy-regex mode in CI. It looks like more syntax tests pass than before (good job!) but my tricky highlight_test.erb gets mis-highlighted, which causes an HTML comparison test to fail. Try running cargo run --features default-fancy --no-default-features --release --example syncat testdata/highlight_test.erb to see this.

If you're able to quickly dig in and fix that it would be awesome to launch fancy-regex with no known highlighting flaws. However I can imagine it might be something that's time consuming to diagnose or fix, in which case I'd be willing to launch with that test disabled under fancy-regex and a warning in the readme that although fancy-regex works most of the time there are know cases where it messes up. Because it does work most of the time and for the people who need fancy-regex something is better than nothing.

So yah only thing I can think of right now that this needs before merge is either disabling or fixing that test, and something in the readme, both of which I can take a stab at if @robinst is busy.

Amazing work @robinst, thanks so much I'm happy to see this finally being so close! Also thanks @raphlinus for the initial work on fancy-regex, it's looking like it'll finally get put to use! And thanks @gilescope for pushing to get this done :)

@gilescope
Copy link

Wow, that's some test case.

This applies the regex rewriting fixes on this branch to the default
packs.
@robinst
Copy link
Collaborator Author

robinst commented Mar 25, 2020

Yay, thanks for helping push this over the finish line :)!

The benchmarks match the shallow benchmarking that I've done. I've done some matching of single regexes (ones that don't need fancy features) and I've observed that the first N matches are slow, and only after some warmup do they get fast. It would be good to investigate this more to see if we can get more performance out of it.

For the test failure, I actually already looked into that and fixed it in the regex rewriting. All that was needed was updating the packs (which I put off until the end because it's hard to rebase)! Pushed that now and build should go green.

If you don't mind adding the bits to the readme @trishume, that would be awesome.

@BurntSushi
Copy link

@robinst If you're able to widdle down a simple benchmark for me, I'd be happy to take a closer look.

@trishume trishume merged commit c0efc8c into master Mar 29, 2020
@trishume
Copy link
Owner

This is now released as v4.0.0! Thanks again everyone for your work, especially @robinst!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants