RegexSet match first [feature request] #259

jayanderson · 2016-07-02T18:19:12Z

The common case (for me at least) is to expect only one of the regular expressions in the set to match. I assume the is_match function works by finding the first match. It'd be nice to be able to know which regular expression that was. The (somewhat contrived) benchmark below shows that performing the full match and then getting the first match is slower than is_match. Could adding a function to return the index of the first match have the same performance as is_match? Let me know if I'm off in my assumptions or if there's something else causing the slowdown.

#[cfg(test)]
mod tests {
  use regex::RegexSet;
  use test::Bencher;

  #[bench]
  fn is_match_regex_set(b: &mut Bencher) {
    let regexset = RegexSet::new(&[
      r"^0+",
      r"^1+",
      r"^2+",
      r"^3+",
      r"^4+",
      r"^a+",
      r"^5+",
      r"^6+",
      r"^7+",
      r"^8+",
      r"^9+",
    ]).unwrap();
    b.iter(|| regexset.is_match("aaaaa"));
  }

  #[bench]
  fn match_first_regex_set(b: &mut Bencher) {
    let regexset = RegexSet::new(&[
      r"^0+",
      r"^1+",
      r"^2+",
      r"^3+",
      r"^4+",
      r"^a+",
      r"^5+",
      r"^6+",
      r"^7+",
      r"^8+",
      r"^9+",
    ]).unwrap();
    b.iter(|| regexset.matches("aaaaa").into_iter().next().unwrap());
  }
}

running 2 tests
test tests::is_match_regex_set    ... bench:          13 ns/iter (+/- 1)
test tests::match_first_regex_set ... bench:         320 ns/iter (+/- 8)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-07-10T02:40:21Z

This sounds interesting and potentially plausible, but I have to think about whether it's possible or not. The RegexSet additions to the DFA can be pretty tricky to reason about. In theory, it seems quite possible, since is_match does indeed have to observe a Match state, and if it observes a Match state, then it must also observe the identifier stored in that Match state (which corresponds to the regex in the set that matched).

I am also a little surprised at the speed difference in your benchmark, particularly given that only one regex will match and all of them are anchored (so that both regexes will short circuit). The only real extra work involved is looping over all of the NFA states in the final DFA state to find the precise Match states (which contain the ids of the regexes that matched). This same exact step would be required for your use case too, which suggests one wouldn't realize any performance gains. Therefore, I think one would have to change how that initial Match state is observed such that we wouldn't have to loop over all of the NFA states to find it at the end.

With that said, it's probably worth an experiment to make sure I've got the performance analysis correct. Unfortunately, that experiment requires a bit of plumbing through both the DFA and the execution planner, so it's probably not very accessible.

jayanderson · 2016-07-11T06:17:02Z

(I don't know good ways to get profiling data and I'm out of my depth in this code and rust really, but here's a try.) Playing with https://github.com/pegasos1/cargo-profiler I was able to run this code for testing:

  let regexset = RegexSet::new(&[
    r"^0+",
    r"^1+",
    r"^2+",
    r"^3+",
    r"^4+",
    r"^a+",
    r"^5+",
    r"^6+",
    r"^7+",
    r"^8+",
    r"^9+",
  ]).unwrap();
  for _ in 0..10000 {
    regexset.is_match("aaaaa");
    //regexset.matches("aaaaa");
  }

I put this in a main file and swapped out which method it ran against. Then I used cargo profiler callgrind -n 20 --bin ./target/release/regex_set to get the callgrind data. Here's with `is_match:

Profiling regex_set with callgrind...

Total Instructions...2,353,786

770,108 (32.7%) dfa.rs:regex::re_set::unicode::RegexSet::is_match
-----------------------------------------------------------------------
300,002 (12.7%) re_set.rs:regex::re_set::unicode::RegexSet::is_match
-----------------------------------------------------------------------
170,004 (7.2%) exec.rs:regex::re_set::unicode::RegexSet::is_match
-----------------------------------------------------------------------
75,295 (3.2%) dl-lookup.c:do_lookup_x
-----------------------------------------------------------------------
63,269 (2.7%) ???:je_arena_tcache_fill_small
-----------------------------------------------------------------------
51,459 (2.2%) ???:je_arena_boot
-----------------------------------------------------------------------
50,979 (2.2%) vfscanf.c:_IO_vfscanf
-----------------------------------------------------------------------
50,000 (2.1%) cell.rs:regex::re_set::unicode::RegexSet::is_match
-----------------------------------------------------------------------
40,504 (1.7%) dl-lookup.c:_dl_lookup_symbol_x
-----------------------------------------------------------------------
34,041 (1.4%) ???:sdallocx
-----------------------------------------------------------------------
33,799 (1.4%) ???:mallocx
-----------------------------------------------------------------------
31,518 (1.3%) dl-machine.h:_dl_relocate_object
-----------------------------------------------------------------------
31,179 (1.3%) strcmp.S:strcmp'2
-----------------------------------------------------------------------
30,077 (1.3%) vec.rs:regex_set::main
-----------------------------------------------------------------------
30,003 (1.3%) cmp.rs:regex_set::main
-----------------------------------------------------------------------
30,003 (1.3%) lib.rs:regex::re_set::unicode::RegexSet::is_match
-----------------------------------------------------------------------
25,087 (1.1%) strtol_l.c:____strtoul_l_internal
-----------------------------------------------------------------------
20,079 (0.9%) dfa.rs:regex::compile::Compiler::new
-----------------------------------------------------------------------
20,008 (0.9%) main.rs:regex_set::main
-----------------------------------------------------------------------
20,006 (0.8%) pthread_self.c:pthread_self
-----------------------------------------------------------------------

Here's with matches:

Profiling regex_set with callgrind...

Total Instructions...27,953,521

4,410,053 (15.8%) dfa.rs:regex::exec::ExecNoSync::many_matches_at
-----------------------------------------------------------------------
3,247,128 (11.6%) ???:rallocx
-----------------------------------------------------------------------
2,584,953 (9.2%) ???:je_arena_ralloc_no_move
-----------------------------------------------------------------------
2,271,827 (8.1%) dfa.rs:regex::dfa::Fsm::next_state
-----------------------------------------------------------------------
1,634,549 (5.8%) ???:sdallocx
-----------------------------------------------------------------------
1,614,489 (5.8%) ???:mallocx
-----------------------------------------------------------------------
1,448,503 (5.2%) ???:je_arena_ralloc
-----------------------------------------------------------------------
1,191,288 (4.3%) dfa.rs:regex::dfa::Fsm::cached_state
-----------------------------------------------------------------------
910,872 (3.3%) sip.rs:_..std..collections..hash..map..DefaultHasher..as..core..hash..Hasher..::write
-----------------------------------------------------------------------
720,003 (2.6%) exec.rs:regex::exec::ExecNoSync::many_matches_at
-----------------------------------------------------------------------
709,978 (2.5%) slice.rs:regex::exec::ExecNoSync::many_matches_at
-----------------------------------------------------------------------
500,400 (1.8%) sip.rs:_..std..collections..hash..map..DefaultHasher..as..core..hash..Hasher..::finish
-----------------------------------------------------------------------
400,118 (1.4%) map.rs:_..std..collections..hash..map..HashMap..K$C$..V$C$..S....::get
-----------------------------------------------------------------------
380,317 (1.4%) raw_vec.rs:_..alloc..raw_vec..RawVec..T....::double
-----------------------------------------------------------------------
360,001 (1.3%) re_set.rs:regex::re_set::unicode::RegexSet::matches
-----------------------------------------------------------------------
290,222 (1.0%) sparse.rs:regex::dfa::Fsm::next_state
-----------------------------------------------------------------------
260,126 (0.9%) vec.rs:regex::dfa::Fsm::next_state
-----------------------------------------------------------------------
239,994 (0.9%) vec.rs:regex::exec::ExecNoSync::many_matches_at
-----------------------------------------------------------------------
210,215 (0.8%) slice.rs:regex::dfa::Fsm::next_state
-----------------------------------------------------------------------
200,042 (0.7%) table.rs:_..std..collections..hash..map..HashMap..K$C$..V$C$..S....::get
-----------------------------------------------------------------------

I'm not sure how helpful this really is. I can play with it a bit more if so.

BurntSushi · 2016-07-11T10:35:33Z

That is certainly interesting. The fact that matches has entries for things like next_state and cached_state but is_match doesn't is surprising. I'll have to investigate when I get the chance.

kevincox · 2018-03-31T22:26:30Z

This would be useful for me as well. I can imagine a number of use cases that require taking only the first match. For example URL routers in web frameworks often use regexes for flexibility and they want to provide a sequential match semantics. In this case only the first match is required. (Although you can imagine a warning in development if a URL would have had multiple matches).

BurntSushi · 2023-03-06T18:45:34Z

While this may not be available on RegexSet in the immediate future, you will be able to do this with regex-automata 0.3 once it's out. It will expose complete support for multi-regex APIs, including even asking for capture groups (in the non-overlapping case anyway).

I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`. - [PERF #787](rust-lang/regex#787), [PERF #891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Co-authored-by: cabr2-bot <[email protected]> Co-authored-by: crapStone <[email protected]> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <[email protected]> Co-authored-by: Calciumdibromid Bot <[email protected]> Co-committed-by: Calciumdibromid Bot <[email protected]>

BurntSushi added enhancement question labels Jul 10, 2016

BurntSushi mentioned this issue Jul 12, 2016

Feature request: conditionals #263

Closed

AlbertoGP mentioned this issue Dec 6, 2017

execute a regex on text streams #425

Open

ohaddahan mentioned this issue Dec 26, 2020

RegexSet - enhancement #732

Closed

BurntSushi mentioned this issue Dec 7, 2021

[Feature request]: match multiple regular expressions simultaneously obtaining capture groups #822

Closed

BurntSushi added the fix-incoming label Mar 6, 2023

BurntSushi closed this as completed in aa64e6d Jul 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RegexSet match first [feature request] #259

RegexSet match first [feature request] #259

jayanderson commented Jul 2, 2016

BurntSushi commented Jul 10, 2016 •

edited

Loading

jayanderson commented Jul 11, 2016

BurntSushi commented Jul 11, 2016

kevincox commented Mar 31, 2018

BurntSushi commented Mar 6, 2023

RegexSet match first [feature request] #259

RegexSet match first [feature request] #259

Comments

jayanderson commented Jul 2, 2016

BurntSushi commented Jul 10, 2016 • edited Loading

jayanderson commented Jul 11, 2016

BurntSushi commented Jul 11, 2016

kevincox commented Mar 31, 2018

BurntSushi commented Mar 6, 2023

BurntSushi commented Jul 10, 2016 •

edited

Loading