[WIP] Kinda-working fancy-regex support #34
I took a quick look at this, profiling the highlighting of jquery. It's promising but clearly not compelling yet. It seems to be spending most of its time delegating to regex, but in the VM. This suggests that it's doing backtracking, and might not even be using the NFA (it delegates just to get classes). I have a bunch of ideas on how to optimize more, but don't have insight into specifically what's slow now. The best case would be something like The way to make progress here is to capture which regexes are consuming the most time. I'd add some profiling, something like a lazy_static hash table on the side, so that every time the VM runs it increments a count for that regex, and accumulates the time. Then just go down the list in terms of which regexes burn the most time. I'd be tempted to investigate myself, but am currently trying to really focus on incremental update in xi. Thanks for pushing this forward! |
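The per-regex profiling suggested above could be sketched roughly as follows. This is a minimal, std-only illustration and not code from syntect: `RegexProfiler`, `RegexStats`, `time`, and `worst_offenders` are hypothetical names, and the table is an explicit struct rather than the `lazy_static` global mentioned in the comment (a global would wrap the same map in a `Mutex`). The closures stand in for the real regex match calls.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Accumulated cost of one regex: how often it ran and for how long in total.
#[derive(Debug, Default, Clone, Copy)]
struct RegexStats {
    calls: u64,
    total: Duration,
}

/// Side table keyed by regex source string (hypothetical helper, not part of syntect).
#[derive(Default)]
struct RegexProfiler {
    stats: HashMap<String, RegexStats>,
}

impl RegexProfiler {
    /// Run `f` (the actual match call) and charge its wall-clock time to `pattern`.
    fn time<T>(&mut self, pattern: &str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = f();
        let elapsed = start.elapsed();
        let entry = self.stats.entry(pattern.to_string()).or_default();
        entry.calls += 1;
        entry.total += elapsed;
        result
    }

    /// Patterns sorted by cumulative time, most expensive first.
    fn worst_offenders(&self) -> Vec<(&str, RegexStats)> {
        let mut rows: Vec<(&str, RegexStats)> =
            self.stats.iter().map(|(k, v)| (k.as_str(), *v)).collect();
        rows.sort_by(|a, b| b.1.total.cmp(&a.1.total));
        rows
    }
}

fn main() {
    let mut profiler = RegexProfiler::default();
    // Stand-ins for real regex searches: plain substring checks.
    let found = profiler.time(r"\bfunction\b", || "function f() {}".contains("function"));
    profiler.time(r"\bfunction\b", || "var x = 1;".contains("function"));
    profiler.time(r"=>", || "x => x".contains("=>"));
    assert!(found);
    for (pattern, s) in profiler.worst_offenders() {
        println!("CUM: {:?} CALLS: {} REGEX: {}", s.total, s.calls, pattern);
    }
}
```

Running the same workload once per engine and comparing the two sorted lists would show which patterns are faster or slower with each.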
@raphlinus Yah that was my thought on what to investigate as well. Since it's single-threaded I can do even better than a count for each regex, I can actually measure the total elapsed time and count per regex to figure out which ones are slow. Then I can run it again with Oniguruma and see which regexes are faster with fancy-regex and which are slower. Unfortunately, I'm back to being busy with school work and I'm not sure when/if I'll have time to do this. The perf regression combined with missing features means it's going to be a bunch of work. Not an undoable amount, but still substantial. |
@trishume: I'm trying to collect some per-regex timings, however trying to run the jquery highlighting benchmark fails because of the |
I did some initial timing of the jquery benchmark (measuring how long each regex match took); the results are in this gist: https://gist.github.com/b6bb756f96b58e52b3299b709fa785dd
The code is available in the respective branches in the TimNN/syntect repo. The worst offenders by far (based on both,
CUM: PT7.310148847S AVG: PT0.000004021S REGEX: [_$[:alpha:]][_$[:alnum:]]*(?=\s*[\[.])
CUM: PT17.180939801S AVG: PT0.000004806S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)(?=\s*\()
CUM: PT21.358124652S AVG: PT0.000008019S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)\s*(\.)\s*(prototype)\s*(\.)\s*(?=[_$[:alpha:]][_$[:alnum:]]*\s*=\s*(\s*\b(async\s+)?function\b|\s*(\basync\s*)?([_$[:alpha:]][_$[:alnum:]]*|\(([^()]|\([^()]*\))*\))\s*=>))
CUM: PT22.522137730S AVG: PT0.000008456S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)\s*(\.)\s*(prototype)(?=\s*=\s*(\s*\b(async\s+)?function\b|\s*(\basync\s*)?([_$[:alpha:]][_$[:alnum:]]*|\(([^()]|\([^()]*\))*\))\s*=>))
They seem to match the "best case" mentioned by @raphlinus, which I guess is a good thing? |
@TimNN awesome thank you! That's definitely useful information since it does indeed match up with the case @raphlinus said could be optimized without too much difficulty. Thanks for the help. And yes the jQuery benchmark breaks because of a substitution I perform for nonewlines mode. I fixed the benchmark to use line strings with newline characters, but didn't end up committing it, sorry. |
So I've been hacking a bit on The results however look very promising so far: On my machine, highlighting jquery went from The code is in the Edit: It's probably going to take a bit longer, until I find the time to cleanup / send a PR. |
@TimNN that's awesome! 858ms is still more time than it takes Oniguruma to highlight jQuery on my computer, but my computer also takes less than 1,228ms to highlight with fancy-regex so it is possible that on my computer fancy-regex will be just as fast. I'll try and test your branch on my machine at some point. I was hoping it would actually lead to a significant performance increase eventually but merely matching the performance of Oniguruma is enough for me to make it the default once the compatibility issues are fixed since it will fix #33 and make all dependencies pure rust. I may be able to find the time to fix some of the smaller compatibility issues I listed. Specifically the first two unfinished ones listed (I sorted by estimated difficulty). Some of the issues look difficult though, specifically a full expression parser/rewriter for the character class operators. |
I ran the jquery benchmark again with the oniguruma version and got (Note that my per regex benchmarking code is currently not very efficient since I had planned on collecting more stats than average & total time, so this may slow everything down a bit). Also, using |
@TimNN Excellent. There's probably more optimization possible but that's great for now. It should be theoretically faster than Oniguruma at least on syntaxes which have been optimized for Sublime's sregex engine to not use many/any fancy regex features, which should allow the I'm not sure even that would match Sublime Text's performance though. I think it uses something like https://doc.rust-lang.org/regex/regex/struct.RegexSet.html to match many regexes at once in time proportional only to the number of characters, but with support for extracting captures and match positions (unlike the At the moment |
Should we create issues in fancy-regex for the unsupported syntax? Seems like the better place to discuss these. For fancy character classes ( (see UTS#18 RL1.3) I think that would also help with implementing things such as |
@robinst Good point. I guess fancy-regex would be a better place. I created an issue in regex (rust-lang/regex#341), haven't created any in fancy-regex yet though. |
I would be interested in lending some insight here if you folks wind up seeing bottlenecks inside the regex crate. The regex crate is fast in a lot of cases, but that doesn't mean it's fast in every case, so don't assume that the regex crate will always bail you out. :-) I'll get the ball rolling by throwing some things against the wall and seeing what sticks. If a regex is particularly large, then it's possible that the DFA will be forced to bail out because it's thrashing its cache. When the DFA bails, it falls back to a much slower (by an ~order of magnitude) regex engine. You can tweak how much cache space is available with the
Finally, have you folks seen any problems with performance for compiling the regexes? Does it add any noticeable overhead? |
@BurntSushi cool thank you. If we get around to optimizing it more than @TimNN already has, we could probably use your advice. From the very basic benchmarks I did on this branch I didn't see anything noticeable from Regex compilation. It didn't seem to be very different from Oniguruma. I do make sure to compile each regex at most once and only compile them if they are actually needed. |
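The "compile each regex at most once, and only if needed" policy described above can be illustrated with a small std-only sketch. It is an assumption-laden stand-in rather than syntect's actual code: `LazyRegex` and `Compiled` are hypothetical types, and `Compiled` fakes a regex engine with a substring search so the example runs without external crates.

```rust
use std::sync::OnceLock;

/// Stand-in for a compiled regex (a real engine's `Regex` type would go here).
struct Compiled {
    needle: String,
}

impl Compiled {
    /// Pretend compilation step; with a real engine this is the expensive part.
    fn compile(pattern: &str) -> Compiled {
        Compiled { needle: pattern.to_string() }
    }

    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle.as_str())
    }
}

/// Hypothetical wrapper: keeps the source string and compiles on first use only.
struct LazyRegex {
    source: String,
    compiled: OnceLock<Compiled>,
}

impl LazyRegex {
    fn new(source: &str) -> LazyRegex {
        LazyRegex { source: source.to_string(), compiled: OnceLock::new() }
    }

    /// True once the pattern has been compiled by some earlier match call.
    fn is_compiled(&self) -> bool {
        self.compiled.get().is_some()
    }

    fn is_match(&self, text: &str) -> bool {
        // `get_or_init` runs the closure at most once; later calls reuse the result.
        self.compiled
            .get_or_init(|| Compiled::compile(&self.source))
            .is_match(text)
    }
}

fn main() {
    let re = LazyRegex::new("prototype");
    assert!(!re.is_compiled()); // nothing compiled until first use
    assert!(re.is_match("Foo.prototype.bar = function() {}"));
    assert!(re.is_compiled());
    assert!(!re.is_match("var x = 1;")); // reuses the compiled value
}
```

A syntax definition that is loaded but never used to highlight anything then pays no compilation cost for its patterns.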
Hi, I just wanted to make a quick note that I've submitted a PR to the ST Packages repo that removes the named backrefs compatibility issue from the Markdown syntax, so, depending whether any other syntaxes use named backreferences, you may get away without this support. |
Some comments. First, I took a look at @TimNN 's optimization. It's definitely the optimization I had in mind, but is not quite suitable for merging yet (it changes the output, specifically adding more capture groups than were originally present). I am a bit surprised it's only a 30% gain, I would have expected more. The instrumentation for total time spent, number of invocations, etc., sounds extremely useful, and I recommend that gets checked in. We'll want to track performance on an ongoing basis (assuming we go ahead with fancy-regex, and even then it's extremely useful for making that decision). Where is the time going after the optimization is in place? In my (admittedly not very thorough) testing, the time impact of regex compilation was minimal. For large files, it's spending seconds computing the highlighting. What's the secret to super-fast performance? Is it running multiple regexes in parallel? Is it using |
I just learned something from this conversation on Reddit with @BurntSushi and @raphlinus that may be part of the cause of the lack of performance gains.
I think an optimization where it keeps track of if a certain rule needs the captures or not may help performance with fancy-regex and possibly with Oniguruma as well, although I'm not sure there's much of a penalty to getting captures with Oniguruma, especially since it lets you re-use the capture regions struct between calls. Not sure exactly how much difference it would make, but it would probably help a bit. |
There's a lot of subtlety here. There are various factors at play:
Right. Classical backtracking engines, IME, typically don't impose a penalty for extracting captures. |
FWIW, there are plans (in my head, anyway) to make capture extraction faster, but I can't commit to a timeline. |
I rebased this branch and have been implementing missing features in fancy-regex, see my pull requests. Also, the just released regex 0.2.2 now supports nested character classes and intersections, which means the "Fancy character class syntax So I think the only task that doesn't have a pull request yet is "Ability to use I also found a regex that fancy-regex currently has trouble with (haven't investigated why yet): https://github.com/google/fancy-regex/issues/14 |
I looked into "Ability to use With the following:
let re = onig::Regex::new(r"^a[\z]").unwrap();
println!("{}", re.is_match("a"));
println!("{}", re.is_match("az"));
The regex is compiled, but it prints |
Update: Fixed #76 now. With that, all of the check boxes in the description are done. I've rebased @trishume's fancy-regex branch here: https://github.com/robinst/syntect/tree/fancy-regex The following failing tests are left, and they now fail on assertions (instead of panicking while compiling regexes):
The last one is a test I added here; it shows that the YAML syntax is not working yet:
thread 'parsing::parser::tests::can_parse_yaml' panicked at 'assertion failed: `(left == right)`
- left: `[(0, Push(<source.yaml>)), (0, Push(<string.unquoted.plain.out.yaml>)), (1, Pop(1)), (1, Push(<string.unquoted.plain.out.yaml>)), (2, Pop(1)), (2, Push(<constant.language.boolean.yaml>)), (3, Pop(1)), (3, Push(<punctuation.separator.key-value.mapping.yaml>)), (4, Pop(1)), (5, Push(<string.unquoted.plain.out.yaml>)), (10, Pop(1))]`,
+ right: `[(0, Push(<source.yaml>)), (0, Push(<string.unquoted.plain.out.yaml>)), (0, Push(<entity.name.tag.yaml>)), (3, Pop(2)), (3, Push(<punctuation.separator.key-value.mapping.yaml>)), (4, Pop(1)), (5, Push(<string.unquoted.plain.out.yaml>)), (10, Pop(1))]`', src/parsing/parser.rs:482:8
I had a look at the syntax but it's pretty complex. If someone who knows the syntax wants to track down the problem, that would be cool. (I guess I should learn how to use a debugger for Rust :)). |
I don't know YAML syntax very well, but I cut down the syntax definition to the following, and I think it should still have the same behavior with the
%YAML 1.2
---
# See http://www.sublimetext.com/docs/3/syntax.html
scope: source.yaml-test
name: YAML-Test
variables:
  c_indicator: '[-?:,\[\]{}#&*!|>''"%@`]'
  # plain scalar begin and end patterns
  ns_plain_first_plain_out: |- # c=plain-out
    (?x:
        [^\s{{c_indicator}}]
      | [?:-] \S
    )
  _flow_scalar_end_plain_out: |- # kind of the negation of nb-ns-plain-in-line(c) c=plain-out
    (?x:
      (?=
          \s* $
        | \s+ \#
        | \s* : (\s|$)
      )
    )
contexts:
  main:
    - include: block-mapping
    - include: flow-scalar-plain-out
  block-mapping:
    - match: |
        (?x)
        (?=
          {{ns_plain_first_plain_out}}
          (
              [^\s:]
            | : \S
            | \s+ (?![#\s])
          )*
          \s*
          :
          (\s|$)
        )
      push:
        #- include: flow-scalar-plain-out-implicit-type
        - match: '{{_flow_scalar_end_plain_out}}'
          pop: true
        - match: '{{ns_plain_first_plain_out}}'
          set:
            - meta_scope: string.unquoted.plain.out.yaml entity.name.tag.yaml
              meta_include_prototype: false
            - match: '{{_flow_scalar_end_plain_out}}'
              pop: true
    - match: :(?=\s|$)
      scope: punctuation.separator.key-value.mapping.yaml
  flow-scalar-plain-out:
    # http://yaml.org/spec/1.2/spec.html#style/flow/plain
    # ns-plain(n,c) (c=flow-out, c=block-key)
    #- include: flow-scalar-plain-out-implicit-type
    - match: '{{ns_plain_first_plain_out}}'
      push:
        - meta_scope: string.unquoted.plain.out.yaml
          meta_include_prototype: false
        - match: '{{_flow_scalar_end_plain_out}}'
          pop: true
|
Thanks @keith-hall! That helped, I've noticed a difference with this pattern (narrowed down):
let regex = r"(?=\s*$|\s*:(\s|$))";
let s = "key: value";
println!("{:?}", onig::Regex::new(regex).unwrap().find(s));
println!("{:?}", fancy_regex::Regex::new(regex).unwrap().find(s));
onig returns |
Ok, tracked down the problem and have a fix here: google/fancy-regex#21 With that fix, |
Some of the regexes include `$` and expect it to match end of line. In fancy-regex, `$` means end of text by default. Adding `(?m)` activates multi-line mode which changes `$` to match end of line. This fixes a large number of the failed assertions with syntest.
In fancy-regex, POSIX character classes only match ASCII characters. Sublime's syntaxes expect them to match Unicode characters as well, so transform them to corresponding Unicode character classes.
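As a rough illustration of the POSIX-to-Unicode rewrite described in this commit message, the transformation could look something like the sketch below. The function name matches the one visible in the review diff of this PR, but the body and the mapping table (`POSIX_TO_UNICODE`) are assumptions made for the example, not the actual implementation; only a handful of classes are covered and negated forms such as `[:^alpha:]` are not handled.

```rust
/// Hypothetical mapping from POSIX bracket names to Unicode property escapes.
/// The exact equivalents chosen here are an assumption for illustration.
const POSIX_TO_UNICODE: &[(&str, &str)] = &[
    ("[:alpha:]", r"\p{L}"),
    ("[:alnum:]", r"\p{L}\p{N}"),
    ("[:lower:]", r"\p{Ll}"),
    ("[:upper:]", r"\p{Lu}"),
    ("[:digit:]", r"\p{Nd}"),
];

/// Replace each known POSIX class name with its Unicode counterpart.
/// This is a plain textual substitution; it does not parse the regex.
fn replace_posix_char_classes(regex: String) -> String {
    POSIX_TO_UNICODE
        .iter()
        .fold(regex, |acc, &(posix, unicode)| acc.replace(posix, unicode))
}

fn main() {
    // Identifier pattern taken from the timing list earlier in this thread.
    let original = String::from("[_$[:alpha:]][_$[:alnum:]]*");
    let rewritten = replace_posix_char_classes(original);
    assert_eq!(rewritten, r"[_$\p{L}][_$\p{L}\p{N}]*");
    println!("{}", rewritten);
}
```

Because the POSIX names only appear inside bracket expressions, substituting them in place keeps the surrounding `[...]` intact.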
With the regex crate and fancy-regex, `^` in multi-line mode also matches at the end of a string like "test\n". There are some regexes in the syntax definitions like `^\s*$`, which are intended to match a blank line only. So change `^` to `\A` which only matches at the beginning of text.
Note that this wasn't a problem with Oniguruma because it works on UTF-8 bytes, but fancy-regex works on characters.
Done! Note that you might have to
|
small note for people struggling to get Side note: probably this PR should be updated so that
--- Cargo.toml
+++ [new] Cargo.toml
@@ -15,7 +15,7 @@
[dependencies]
yaml-rust = { version = "0.4", optional = true }
-onig = { version = "3.2.1", optional = true }
+#onig = { version = "3.2.1", optional = true }
walkdir = "2.0"
regex-syntax = { version = "0.4", optional = true }
lazy_static = "1.0"
@@ -25,7 +25,7 @@
flate2 = { version = "1.0", optional = true, default-features = false }
fnv = { version = "1.0", optional = true }
regex = "*"
-fancy-regex = { git = "https://github.com/google/fancy-regex.git" }
+fancy-regex = { git = "https://github.com/google/fancy-regex.git", optional = true }
serde = { version = "1.0", features = ["rc"] }
serde_derive = "1.0"
serde_json = "1.0"
@@ -51,7 +51,7 @@
# Pure Rust dump creation, worse compressor so produces larger dumps than dump-create
dump-create-rs = ["flate2/rust_backend", "bincode"]
-parsing = ["onig", "regex-syntax", "fnv"]
+parsing = ["fancy-regex", "regex-syntax", "fnv"]
# The `assets` feature enables inclusion of the default theme and syntax packages.
# For `assets` to do anything, it requires one of `dump-load-rs` or `dump-load` to be set.
assets = [] |
@keith-hall Pushed a commit with those changes, thanks! |
src/parsing/syntax_definition.rs
RegexOptions::REGEX_OPTION_CAPTURE_GROUP,
Syntax::default())
.unwrap();
println!("compiling {:?}", self.regex_str);
Should this println be here, given that it generates a lot of noise? If it's useful for debugging, maybe it would be best to hide it behind a feature flag as discussed at #146 (comment)
Just commenting it out is fine, see my reply on the comment
@keith-hall I pushed a commit that changes this to only print in case it fails.
I think the println! was only there to see the regex that failed to compile.
It would be great if that change landed, because libgit2 crashes in binaries linking oniguruma, and I'd like to use both libgit2 and syntect together. |
Is https://github.com/google/fancy-regex/issues/44 the only blocker left for this? |
I was also wondering about this. If we don't want to default to fancy-regex yet, maybe it would be worth maintaining both the oniguruma and fancy-regex code paths for now and consumers can choose which one to use from a feature flag. It may then be easier to do performance comparisons etc. |
@@ -5,12 +5,13 @@
//! into this data structure?
use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;
use onig::{Regex, RegexOptions, Region, Syntax};
use fancy_regex;
Why not continue using fancy_regex::Regex here?
You mean why not write use fancy_regex::Regex; here? Yeah there's not really a good reason. I will change it next time I work on this.
@Keats Yes, unless we find another one. @keith-hall A feature sounds like a good idea, yeah. Maybe we can abstract the regex compilation and matching parts a bit to make the feature less painful to maintain (so that it's just in one module and not all over the place). I might have some time this week to work on this. |
/// In fancy-regex, POSIX character classes only match ASCII characters.
/// Sublime's syntaxes expect them to match Unicode characters as well, so transform them to
/// corresponding Unicode character classes.
fn replace_posix_char_classes(regex: String) -> String {
Is there any chance we could do the Sublime syntax replacement ahead of time instead of at run-time?
Sorry to poke at an old issue, but is there any news on this? I'd really like to use |
Unfortunately this is not quite complete and based on an older release of syntect. It would probably take a fair amount of work to complete. I'm not personally interested in doing that work unfortunately, so unless someone else steps up to do it, it's not on the roadmap. |
Ah that's a bummer, I'm sorry for the letdown but I don't think I'd be able to commit to making it happen right now. |
Hey, just an update. I'm working on this again, but with a different approach. I'm moving all the regex usage to a module first, so then we can swap out the implementation using a cargo feature. I'll have a pull request next week :). |
This branch switches the regex engine to fancy-regex or more specifically my fork of it.
Currently it only works for a few syntaxes because of a few different features fancy-regex doesn't support:
\n escape (Everything, but fixed it in my fork)
[\<]
\h escape in character classes (Rust)
nonewlines mode doesn't produce weird regexes.
\k<marker> (Markdown)
[a-w&&[^c-g]z]
The jQuery highlighting benchmark now takes 1s instead of 0.66s, which is super unfortunate given that I'd hoped it would be faster than Oniguruma. I have no idea why it is substantially slower.
@raphlinus @robinst