[Feature request]: match multiple regular expressions simultaneously obtaining capture groups #822
Comments
This has been requested a few times in one form or another. See #259 and #352 for example. I think there are more, but a quick search doesn't turn them up. Also, the docs for I do think this may be possible one day, once #656 is complete. In particular, I'll leave this open for now, but this is not just a simple matter of adding new APIs. I don't think it makes sense to attempt this until |
Hi @BurntSushi, after reading your comments I have some doubts regarding the internals of If that's the case, the feature requested should be better for performance reasons since processing the DFA and tracing back capture groups could be done in
It's been a while since the last time I implemented a
I'm in no rush :), and I would be pleased to help if necessary |
Nits aside (see below), I believe you are basically correct. And that's exactly how OK, now I'll address some nits. I don't think these are material to the overall point, but I mention them because I think it's important to be precise here. And in particular, I suspect this will have performance implications for your strategy.
So this is not quite correct... Currently, the regex crate doesn't use any DFAs anywhere. It uses NFAs, bounded backtracking and a hybrid NFA/DFA (also called a "lazy DFA"). The issue with DFAs is that they take up a lot of resources, and take exponential time in the worst case to build. The lazy DFA strikes a good balance there that mitigates those downsides. It is generally appropriate to think of a lazy DFA as just a DFA in terms of its search time complexity, although it does have some important differences in its execution model that would surprise you if you went into it thinking it was a traditional DFA. (For example, a lazy DFA builds itself at search time, so there is no pre-populated transition table. Instead, that is written incrementally as needed during search.) But a lazy DFA is only used if possible. Otherwise, bounded backtracking or the Pike VM ("Thompson NFA with capturing support") is used. Once
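To make the "builds itself at search time" point concrete, here is a minimal sketch of a lazy DFA over a hand-assembled toy NFA. It is not how the regex crate implements its hybrid engine (which, among other things, has to bound its cache); the `Nfa`, `LazyDfa` and `example_nfa` names are made up for illustration, and the convention that state 0 is the start state is an assumption of this sketch.

```rust
use std::collections::{BTreeSet, HashMap};

/// A toy Thompson-style NFA. State 0 is the start state (sketch convention).
#[derive(Clone, Copy)]
enum Nfa {
    Byte(u8, usize),     // consume one byte, then move to the given state
    Split(usize, usize), // epsilon transitions to both states
    Match,
}

/// DFA states are sets of NFA states, created only when a search reaches them.
struct LazyDfa<'n> {
    nfa: &'n [Nfa],
    states: Vec<BTreeSet<usize>>,         // DFA state id -> NFA state set
    ids: HashMap<BTreeSet<usize>, usize>, // NFA state set -> DFA state id
    trans: HashMap<(usize, u8), usize>,   // transition table, filled on demand
}

impl<'n> LazyDfa<'n> {
    fn new(nfa: &'n [Nfa]) -> Self {
        LazyDfa { nfa, states: Vec::new(), ids: HashMap::new(), trans: HashMap::new() }
    }

    /// Add `s` and everything reachable from it through epsilon transitions.
    fn closure(&self, set: &mut BTreeSet<usize>, s: usize) {
        if set.insert(s) {
            if let Nfa::Split(a, b) = self.nfa[s] {
                self.closure(set, a);
                self.closure(set, b);
            }
        }
    }

    /// Return the DFA state id for an NFA state set, allocating it if new.
    fn intern(&mut self, set: BTreeSet<usize>) -> usize {
        if let Some(&id) = self.ids.get(&set) {
            return id;
        }
        let id = self.states.len();
        self.states.push(set.clone());
        self.ids.insert(set, id);
        id
    }

    /// Look up a transition, computing and caching it on first use.
    fn step(&mut self, from: usize, byte: u8) -> usize {
        if let Some(&to) = self.trans.get(&(from, byte)) {
            return to;
        }
        let mut next = BTreeSet::new();
        for &s in &self.states[from] {
            if let Nfa::Byte(b, target) = self.nfa[s] {
                if b == byte {
                    self.closure(&mut next, target);
                }
            }
        }
        let to = self.intern(next);
        self.trans.insert((from, byte), to);
        to
    }

    /// Anchored match of the whole haystack: one table step per byte.
    fn is_match(&mut self, hay: &[u8]) -> bool {
        let mut start = BTreeSet::new();
        self.closure(&mut start, 0);
        let mut cur = self.intern(start);
        for &b in hay {
            cur = self.step(cur, b);
        }
        self.states[cur].iter().any(|&s| matches!(self.nfa[s], Nfa::Match))
    }
}

/// Hand-assembled NFA for `(a|b)*ab`.
fn example_nfa() -> Vec<Nfa> {
    vec![
        Nfa::Split(1, 4),   // 0: loop again or leave the loop
        Nfa::Split(2, 3),   // 1: choose `a` or `b`
        Nfa::Byte(b'a', 0), // 2
        Nfa::Byte(b'b', 0), // 3
        Nfa::Byte(b'a', 5), // 4
        Nfa::Byte(b'b', 6), // 5
        Nfa::Match,         // 6
    ]
}
```

Calling `LazyDfa::new(&example_nfa())` and then `is_match(b"aab")` fills in only the transitions that this haystack actually exercises; later searches reuse them and add new ones only when they reach a (state, byte) pair not seen before.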
DFAs do not have the computational power to implement capturing groups. Technically, NFAs don't either, but it is typically very easy to bolt on support of capturing groups to an NFA simulation by virtue of the fact that NFA simulations tend to require tracking the match progress of multiple states. This is relevant to you because it means that if you want to resolve capturing groups, it would have to use one of the slower execution engines instead of the much faster DFA. Note also that when using a DFA, even when using multiple regexes, search time is only proportional to the length of the haystack, so
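As an illustration of why capturing groups bolt onto an NFA simulation so easily, here is a minimal Pike-VM-style sketch in which every live thread carries its own copy of the capture slots. The instruction set, `pike_captures` and `example_prog` are hypothetical names for this sketch and not the regex crate's internals; the search is anchored at both ends to keep the code short.

```rust
#[derive(Clone, Copy)]
enum Inst {
    Byte(u8),            // consume one byte, continue at pc + 1
    Split(usize, usize), // try the first branch with higher priority
    Jmp(usize),
    Save(usize),         // record the current position in a capture slot
    Match,
}

type Slots = Vec<Option<usize>>;

/// Follow epsilon instructions from `pc`, queueing threads that need input.
fn add_thread(
    prog: &[Inst],
    list: &mut Vec<(usize, Slots)>,
    seen: &mut [bool],
    pc: usize,
    mut slots: Slots,
    pos: usize,
) {
    if seen[pc] {
        return; // a higher-priority thread already reached this instruction
    }
    seen[pc] = true;
    match prog[pc] {
        Inst::Jmp(t) => add_thread(prog, list, seen, t, slots, pos),
        Inst::Split(a, b) => {
            add_thread(prog, list, seen, a, slots.clone(), pos);
            add_thread(prog, list, seen, b, slots, pos);
        }
        Inst::Save(i) => {
            slots[i] = Some(pos);
            add_thread(prog, list, seen, pc + 1, slots, pos);
        }
        Inst::Byte(_) | Inst::Match => list.push((pc, slots)),
    }
}

/// Match the whole haystack and return the capture slots of the winning thread.
fn pike_captures(prog: &[Inst], nslots: usize, hay: &[u8]) -> Option<Slots> {
    let mut clist = Vec::new();
    let mut seen = vec![false; prog.len()];
    add_thread(prog, &mut clist, &mut seen, 0, vec![None; nslots], 0);
    let mut matched = None;
    for pos in 0..=hay.len() {
        let mut nlist = Vec::new();
        let mut seen = vec![false; prog.len()];
        for (pc, slots) in clist {
            match prog[pc] {
                Inst::Byte(b) if pos < hay.len() && hay[pos] == b => {
                    add_thread(prog, &mut nlist, &mut seen, pc + 1, slots, pos + 1)
                }
                Inst::Match if pos == hay.len() => {
                    matched = Some(slots);
                    break; // lower-priority threads lose
                }
                _ => {} // thread dies
            }
        }
        clist = nlist;
    }
    matched
}

/// Hand-assembled program for `(a+)(b*)`: slots 0..2 are group 1, 2..4 group 2.
fn example_prog() -> Vec<Inst> {
    vec![
        Inst::Save(0),
        Inst::Byte(b'a'),
        Inst::Split(1, 3),
        Inst::Save(1),
        Inst::Save(2),
        Inst::Split(6, 8),
        Inst::Byte(b'b'),
        Inst::Jmp(5),
        Inst::Save(3),
        Inst::Match,
    ]
}
```

The per-thread `Slots` copies are exactly the bookkeeping a plain DFA has no room for, and they are also the reason this kind of engine does more work per haystack byte than a table-driven DFA.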
I see two possible paths here:
It sounds to me like you have ruled out (2) as being slower than (1) (perhaps due to a belief that a DFA can resolve the capturing groups in one pass), but it is not at all clear to me that it is true. It may look like (2) has to be slower than (1), but the key difference is that (1) has to run an NFA with multiple regexes in it, which will tend to take longer since there are more states to manage and keep track of. In (2), you only need to run it for a single regex that you know matches. There should be less overhead associated with that. Now, there are ways to solve this problem with a single pass over the haystack while retaining the speed you might expect from a DFA. But you have to use something more powerful than a DFA known as a "tagged" DFA. re2c implements this approach and its author has published several papers on the topic. This is well beyond the scope of this library or even |
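The two-pass idea can be sketched as follows, assuming path (2) means "first find out which patterns match, then resolve captures only for those". To stay self-contained the sketch uses a hypothetical `Pattern` trait and a toy `Route` matcher (in the spirit of the routing use case from the original request) instead of real regexes; none of these names come from the regex crate.

```rust
/// Hypothetical stand-in for a compiled pattern with a cheap match test
/// and a more expensive capture-resolving search.
trait Pattern {
    fn is_match(&self, hay: &str) -> bool;
    fn captures<'h>(&self, hay: &'h str) -> Option<Vec<&'h str>>;
}

/// Pass 1 finds the indices of matching patterns; pass 2 resolves captures
/// only for those indices.
fn match_and_capture<'h, P: Pattern>(
    patterns: &[P],
    hay: &'h str,
) -> Vec<(usize, Vec<&'h str>)> {
    let matched: Vec<usize> = patterns
        .iter()
        .enumerate()
        .filter(|(_, p)| p.is_match(hay))
        .map(|(i, _)| i)
        .collect();
    matched
        .into_iter()
        .filter_map(|i| patterns[i].captures(hay).map(|caps| (i, caps)))
        .collect()
}

/// Toy pattern: `/`-separated segments, where `:name` captures one segment.
struct Route {
    segments: Vec<String>,
}

impl Route {
    fn new(pattern: &str) -> Route {
        Route { segments: pattern.split('/').map(String::from).collect() }
    }
}

impl Pattern for Route {
    fn is_match(&self, hay: &str) -> bool {
        let parts: Vec<&str> = hay.split('/').collect();
        parts.len() == self.segments.len()
            && self
                .segments
                .iter()
                .zip(&parts)
                .all(|(seg, part)| seg.starts_with(':') || seg.as_str() == *part)
    }

    fn captures<'h>(&self, hay: &'h str) -> Option<Vec<&'h str>> {
        if !self.is_match(hay) {
            return None;
        }
        Some(
            self.segments
                .iter()
                .zip(hay.split('/'))
                .filter(|(seg, _)| seg.starts_with(':'))
                .map(|(_, part)| part)
                .collect(),
        )
    }
}
```

With the regex crate itself, the first pass would correspond to `RegexSet::matches` and the second to calling `Regex::captures` on each regex whose index was reported, so the slower capture-resolving engine only runs for patterns already known to match.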
All right then, thank you so much for the explanation :)
Exactly, I was expecting capture groups to be resolved in one pass.
I have started to read the paper, I'll probably try to implement it once I fully understand it. My issue is solved now but I think it would be great to implement this feature in the future. Should I close the issue? Thank you so much man, that state-of-the-art review was really helpful :) |
Let's leave it open. It's useful to track this enhancement to |
I just want to drop an interesting note here. I'm implementing a lexer on top of just the regex crate (for comparison to logos, among other things), and got some interesting results. Because I'm manipulating the regex at proc-macro time, I could easily build a giant list of alternations to do "single pass" matching between the set of regex using capture groups. Using a two-pass approach with The single pass approach was building a regex (If anyone wants to see my code for the benchmark, it's here ([RegexSet]), [capturing groups]).) |
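The details of that single-pass regex are cut off above, so the following is only a guess at the general shape: each pattern is wrapped in its own named group and the groups are joined into one alternation. `build_alternation` and the `p<index>` group names are hypothetical and not taken from the linked benchmark code.

```rust
/// Wrap each pattern in a named group `p<index>` and join the groups with `|`.
/// Assumes the input patterns do not already define groups named `p0`, `p1`, ...
fn build_alternation(patterns: &[&str]) -> String {
    patterns
        .iter()
        .enumerate()
        .map(|(i, pat)| format!("(?P<p{}>{})", i, pat))
        .collect::<Vec<_>>()
        .join("|")
}
```

After compiling the resulting string with `Regex::new`, the branch that matched can be recovered by checking which `pN` group participated in the match (for example via `Captures::name`). Note that an alternation reports a single leftmost-first match per position, so unlike `RegexSet` it cannot report several patterns matching the same text.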
FWIW, this is now working (with full capture support) in my in-progress work on regex-automata:
|
OK, so I am unfortunately going to close this. I declared premature success in the previous comment. The problem here is that supporting overlapping matches---even without capturing groups---is incredibly tricky and full of foot guns.
/// Execute an overlapping search, and for each match found, also find its
/// overlapping starting positions.
///
/// N.B. This routine used to be part of the crate API, but 1) it wasn't clear
/// to me how useful it was and 2) it wasn't clear to me what its semantics
/// should be. In particular, a potentially surprising footgun of this routine
/// is that it is worst case *quadratic* in the size of the haystack. Namely, it's
/// possible to report a match at every position, and for every such position,
/// scan all the way to the beginning of the haystack to find the starting
/// position. Typical leftmost non-overlapping searches don't suffer from this
/// because, well, matches can't overlap. So subsequent searches after a match
/// is found don't revisit previously scanned parts of the haystack.
///
/// Its semantics can be strange for other reasons too. For example, given
/// the regex '.*' and the haystack 'zz', the full set of overlapping matches
/// is: [0, 0], [1, 1], [0, 1], [2, 2], [1, 2], [0, 2]. The ordering of
/// those matches is quite strange, but makes sense when you think about the
/// implementation: an end offset is found left-to-right, and then one or more
/// starting offsets are found right-to-left.
///
/// Nevertheless, we provide this routine in our test suite because it's
/// useful to test the low level DFA overlapping search and our test suite
/// is written in a way that requires starting offsets.
fn try_search_overlapping<A: Automaton>(
    re: &Regex<A>,
    input: &Input<'_, '_>,
) -> Result<TestResult> {
    let mut matches = vec![];
    let mut fwd_state = OverlappingState::start();
    let (fwd_dfa, rev_dfa) = (re.forward(), re.reverse());
    while let Some(end) = {
        fwd_dfa.try_search_overlapping_fwd(input, &mut fwd_state)?;
        fwd_state.get_match()
    } {
        let revsearch = input
            .clone()
            .pattern(Some(end.pattern()))
            .earliest(false)
            .range(input.start()..end.offset());
        let mut rev_state = OverlappingState::start();
        while let Some(start) = {
            rev_dfa.try_search_overlapping_rev(&revsearch, &mut rev_state)?;
            rev_state.get_match()
        } {
            // let start = rev_dfa
            //     .try_search_rev(rev_cache, &revsearch)?
            //     .expect("reverse search must match if forward search does");
            let span = ret::Span { start: start.offset(), end: end.offset() };
            // Some tests check that we don't yield matches that split a
            // codepoint when UTF-8 mode is enabled, so skip those here.
            if input.get_utf8()
                && span.start == span.end
                && !input.is_char_boundary(span.end)
            {
                continue;
            }
            let mat = ret::Match { id: end.pattern().as_usize(), span };
            matches.push(mat);
        }
    }
    Ok(TestResult::matches(matches))
}
How to generalize this to capturing groups is not particularly clear to me. I think it's also worth pointing out that the OP of this issue talks about wanting this for perf reasons, but resolving capturing groups almost always needs to take a slower code path. So anyway, I'm going to close this particular request for now because I don't see it happening. However, while |
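The ordering and the quadratic blow-up described in the doc comment can be reproduced with a brute-force sketch that has nothing to do with DFAs: it tries every end offset left to right and, for each one, every start offset right to left. `overlapping_spans` is a hypothetical helper, and the closure standing in for the regex `.*` (anything without a newline) is an assumption of this sketch.

```rust
/// Enumerate every span of `hay` accepted by `is_match`, ordered by end
/// offset ascending and, within one end offset, by start offset descending.
fn overlapping_spans(hay: &str, is_match: impl Fn(&str) -> bool) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    for end in 0..=hay.len() {
        if !hay.is_char_boundary(end) {
            continue; // never split a codepoint
        }
        for start in (0..=end).rev() {
            if hay.is_char_boundary(start) && is_match(&hay[start..end]) {
                spans.push((start, end));
            }
        }
    }
    spans
}
```

For the haystack `zz` and a predicate equivalent to `.*` (`|s| !s.contains('\n')`), this yields (0, 0), (1, 1), (0, 1), (2, 2), (1, 2), (0, 2), the same order as in the comment above. An ASCII haystack of length n without newlines produces (n + 1)(n + 2) / 2 spans, which is where the quadratic worst case comes from.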
Describe your feature request
Given a vector of regular expressions, I'd like to match all of them simultaneously, obtaining an array of matches with capture groups.
Where Captures-i represents an object with an index (or similar name) with value i and a captures (or similar name) value with an Option with captures for the regex at index i.
Of course, matches_and_captures is just an example name for this feature at RegexSet.
Motivation
WASM and wasm-pack, in addition to this feature, would allow us to build a routing algorithm faster than find-my-way, the routing algorithm currently used by Fastify, which is the fastest routing algorithm in the Node.js ecosystem.
I'm open to contributing to this repo in order to achieve this feature and avoid unnecessary forks :)