execute a regex on text streams #425

Open
cessen opened this issue Dec 3, 2017 · 104 comments

@cessen

cessen commented Dec 3, 2017

This is more-or-less a continuation of issue #25 (most of which is actually here).

Preface

I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.

More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex--or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work", than anything else.

And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such a staple of the Rust ecosystem. It's a great piece of software!

My use-case

I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)

Proposal

In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.

The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.

Layered API

I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:

1. Feed chunks of bytes manually, handling matches as we go

let re = Regex::new("...");
let mut matcher = re.streaming_matcher();

for m in matcher.consume("Some bytes from a larger stream of data that") {
    my_data_source.do_something(m.start(), m.end());
    // Note: there is no m.as_str() for this API.
}

for m in matcher.consume(" we don't know how much there might be in total.") {
    // ...
}

2. Give regex an iterator that yields bytes

let re = Regex::new("...");

for m in re.find_iter_from_bytes_iter("Non-contiguous bytes that can be iterated over.".bytes()) {
    // Again, only m.start() and m.end() for this API.
    // ...
}

3. Give regex a slice, just like the current API

let re = Regex::new("...");

for m in re.find_iter("A directly passed slice of data.") {
    m.as_str(); // In this API you can get the str slice of the match.
    // ...
}

I'm of course not suggesting naming schemes here, or even the precise way that these APIs should work. I'm just trying to illustrate the idea. :-)

Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.

Things this doesn't address

BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):

The most interesting bit there is that the regexp functions on readers finds precisely one match and returns byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as it's being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again.
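
For concreteness, that kind of "middleware" reader is just a thin io::Read wrapper that records everything it hands out. This is only a rough sketch of the idea; the Recorder type is made up and nothing here is provided by the regex crate:

use std::io::{self, Read};

/// A reader that forwards to an inner reader while keeping a copy of every
/// byte it has handed out, so that match offsets reported later can be
/// resolved back into text.
struct Recorder<R> {
    inner: R,
    seen: Vec<u8>,
}

impl<R: Read> Read for Recorder<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.seen.extend_from_slice(&buf[..n]);
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let mut rec = Recorder { inner: &b"some stream of data"[..], seen: Vec::new() };
    let mut sink = Vec::new();
    rec.read_to_end(&mut sink)?;
    // Offsets from a hypothetical streaming search could now be sliced out
    // of `rec.seen`. If no match ever occurs, `seen` just keeps growing.
    assert_eq!(rec.seen, b"some stream of data".to_vec());
    Ok(())
}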

The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream).

This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather is a fundamental problem with unbounded streaming searches.

IMO, it doesn't make sense to keep this functionality out of the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.

As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:

// ...
let mut matcher = re.streaming_matcher();
let mut buffer = String::new();
let mut offset = 0;

loop {
    // Get next chunk of streaming data
    buffer.push_str(streaming_data_source.get_data());

    // Handle matches
    for m in matcher.consume(&buffer) {
        let match_contents = &buffer[(m.start() - offset)..(m.end() - offset)];
        // ...
    }

    // Shrink buffer to smallest size that is still guaranteed to contain all data for
    // potential future matches.
    let pot_match_len = matcher.longest_potential_match_len();
    offset += buffer.len() - pot_match_len;
    buffer.truncate_from_front(pot_match_len);
}

That would allow client code to at least only hold onto the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regexes that match unbounded amounts of data.

@BurntSushi
Member

Thanks for taking the time to write this up!

I'm afraid the issue tracker (and the issue you linked) isn't entirely up to date with my opinion on this particular feature. Basically, I agree that this would be useful to have. API design is certainly one aspect of this that needs to be solved, and I thank you for spending time on that. However, the API design is just the surface. The far more difficult aspect of providing a feature like this is the actual implementation work required. My estimation of the level of difficulty for this is roughly "a rewrite of every single regex engine." That's about as bad as it gets and it's why this particular feature hasn't happened yet.

The problem is that the assumption that a regex engine has access to the entire buffer on which it can match is thoroughly baked in across all of the matching routines. For example, as a trivially simple thing to understand (but far from the only issue), consider that none of the existing engines have the ability to pause execution of a match at an arbitrary point, and then pick it back up again. Such a thing is necessary for a streaming match mode.

(There is an interesting compromise point here. For example, the regex crate could accept an R: io::Read and simply search that until it finds a match. This would avoid needing the aforementioned pause/resume functionality, and I think it could be added without rewriting the regex crate, but it would be a significant complication IMO. However, there are problems with this approach, see below.)

Getting back to API design, I think we would want to take a very strong look at Hyperscan, whose documentation can be found here. This particular regex library is probably the fastest in the world, is built on finite automata, and specifically specializes in matching regexes on streams. You'll notice that an interesting knob on their API is the setting of whether to report the start of a match or not. This is important and particularly relevant to the regex crate. In particular, when running the DFA (the fastest regex engine), it is not possible to determine both the start and end of a match in a single pass in the general case. The first pass searches forwards through the string for the end of the match. The second pass then uses a reversed DFA and searches backwards in the text from the end of the match to find the start of the match. Cool, right? And it unfortunately requires that the entire match be buffered in memory. That throws a wrench in things, because now you need to handle the case of "what happens when a match is 2GB?" There's no universal right answer, so you need to expose knobs like Hyperscan does.
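
To sketch the shape of that two-pass search, here's a hand-rolled toy for the pattern a+b (this is not a real DFA; the real engines do this with a compiled forward DFA and a compiled reverse DFA, but the memory implication is the same):

fn forward_end(hay: &[u8]) -> Option<usize> {
    // Forward pass: scan left to right and report the end offset of the
    // leftmost `a+b` match.
    let mut start = 0;
    while start < hay.len() {
        if hay[start] == b'a' {
            let mut i = start;
            while i < hay.len() && hay[i] == b'a' {
                i += 1;
            }
            if i < hay.len() && hay[i] == b'b' {
                return Some(i + 1); // exclusive end offset
            }
        }
        start += 1;
    }
    None
}

fn reverse_start(hay: &[u8], end: usize) -> usize {
    // Reverse pass: walk the reversed pattern `ba+` backwards from the end
    // offset to recover the start offset.
    let mut i = end - 1; // hay[i] is the `b`
    while i > 0 && hay[i - 1] == b'a' {
        i -= 1;
    }
    i
}

fn main() {
    let hay = b"xxaaab yy";
    let end = forward_end(hay).unwrap();
    let start = reverse_start(hay, end);
    assert_eq!((start, end), (2, 6)); // the match is `aaab`
}

Both passes index freely into the buffered text, which is exactly why a 2GB match forces 2GB of buffering.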

The start-of-match stuff throws a wrench in my idea above as well. Even if the regex crate provided an API to search an R: io::Read, it would in turn need to do one of the following things with respect to reporting match locations:

  1. Only provide end of match indices.
  2. Always provide start of match indices, at the cost of running a slower regex engine. (The NFA based engines can find the start and end of match in a single pass.) Note that this is the choice that Go's regex engine makes. Also note that "slow" here approximately means "an order of magnitude slower."
  3. Always provide start of match indices, at the cost of buffering the entire R: io::Read into memory. (Which kind of defeats the purpose of providing this API at all.)

You can also see that RE2 doesn't have such an API either, mostly for the reasons I've already outlined here. That issue does note another problem, in particular, managing the state of the DFA, but that just falls under the bigger problem of "how do you pause/resume each regex engine" that I mentioned above.

For Go's regexp engine, there is someone working on adding a DFA, and even in their work, they defer to the NFA when given a rune reader to use with the DFA. In other words, even with the DFA, that particular implementation chooses (2).

What am I trying to say here? What I'm saying is that a number of people who specialize in this area (including myself) have come to the same conclusion: the feature you're asking for is freaking hard to implement at multiple levels, at least in a fast finite automaton based regex engine. So this isn't really something that we can just think ourselves out of here. The paths available are known, they are just a ton of work.

With that said, this is something that I think would be very cool to have. My plans for the future of regex revolve around one goal: "make it easier to add and maintain new optimizations." This leaves me yearning for a more refined internal architecture, which also incidentally amounts to a rewrite. In the course of doing this, my plan was to re-evaluate the streaming feature and see if I could figure out how to do it. However, unfortunately, the time scale on this project is probably best measured in years, so it won't help you any time soon.

If you need this feature yesterday, then this is the best thing I can come up with, if you're intent on sticking with pure Rust:

  1. Convince yourself that you're OK with a slow regexp implementation.
  2. Fork the regex crate and remove everything except for the Pike VM. Remove the DFA, the literal optimizations and the backtracker. Make it pass the test suite.
  3. Modify the Pike VM to accept an R: io::Read.

Basically, this amounts to "make the same trade off as Go." You could lobby me to make this a feature of the regex crate (with the warning that it will always run slowly), but I'm not particularly inclined to do that because it is still quite a bit of work and I'd rather hold off on adding such things until I more thoroughly understand the problem domain. It's basically a matter of priorities. I don't want to spend a couple months of my time adding a feature that has known bad performance, with no easy route to fixing it.

Apologies for the bad news!

@BurntSushi
Member

@cessen What do you think about closing this ticket as a duplicate of #386?

@cessen
Author

cessen commented Dec 4, 2017

@BurntSushi
Wow, thanks so much for the thorough explanation! Indeed, I didn't realize there were so many other issues involved. The older discussion I looked at made it seem like it would mostly be easy, but it just wasn't clear how the API could work. So I appreciate you clarifying all this!

I think I was also perhaps misled by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex
It doesn't (as far as I can tell) support streaming, but it does take an iterator rather than a slice to do the search. But indeed the iterator is bidirectional.

(As an aside, it seems slightly strange to me that the Rust stdlib doesn't have a bidirectional iterator trait. It does have double-ended, but that's different. Maybe I should write up a Rust RFC.)

As I said before, my own use-case is not urgent. It's only for a toy project anyway. No one is going to die or lose their job. ;-) But I appreciate your "if you need this yesterday" explanation. I may end up doing that at some point, if/when I get around to adding regex to my editor.

@cessen What do you think about closing this ticket as a duplicate of #386?

Ah! I feel quite silly, now! I didn't notice that ticket. Apologies.

Having said that, I think this ticket actually covers a super-set of #386. #386 is essentially just API 2 in this write-up, and doesn't cover the streaming case (API 1). And although my personal use-case only needs API 2, I was intentionally trying to cover the wider range of use-cases covered in #25.

So I guess it depends on how relevant you feel the streaming use-case is?

In any case, I won't be at all offended if you close this as duplicate. :-)

@BurntSushi
Member

Sounds good to me! I closed out #386. :-)

@BurntSushi changed the title from "Regex on text streams - Take 2" to "execute a regex on text streams" on Dec 4, 2017
@AlbertoGP

AlbertoGP commented Dec 6, 2017

That was a greatly useful explanation for me. I'm writing a tool that needs string matching although not general regexps, and have been looking at three alternatives:

  • use the regex crate
  • build on the aho_corasick crate (I know it's used inside regex)
  • fully custom pattern matcher (have done this before, which is why I'm first checking whether I could use yours instead for this specific application)

The ideal for me would be a RegexSet that would give me not just the indexes of matched patterns but also the (start,end) indexes, or rather the (start,end) of the first match, and that could be paused/restarted to feed it the input in blocks, not all at once.
[EDIT: just noticed that the "match first" RegexSet feature is already requested at #259]

I see now why all that is not possible at the moment, and even if implemented it would be slower than what we already have.

Thanks!

@BurntSushi
Member

I think I was also perhaps misled by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex
It doesn't (as far as I can tell) support streaming, but it does take an iterator rather than a slice to do the search. But indeed the iterator is bidirectional.

I didn't respond to this before, but to be clear, C++'s standard library regex engine is backtracking AFAIK, which has a completely different set of trade offs in this space. If you asked me to do this feature request in a backtracking engine, my first thought to you would be "forget everything I just said." :)

@cessen
Author

cessen commented Jan 14, 2018

@BurntSushi
Yeah, that totally makes sense. Thanks!

Also, looking at pikevm.rs, my initial reaction (though I may be misunderstanding some things) is that it seems like turning the main matching loop into something that can be paused/resumed wouldn't be too much work.

Although I don't think I have the time/energy for it right now (I'm putting most of my spare time into Ropey), if I get around to it would you be offended if I took a crack at creating a separate crate to accommodate the use-cases outlined in this issue, using just the slower NFA engine from this crate? My intent in doing so would be twofold:

  1. Give an immediate (even if not optimal) solution for people with these use-cases.
  2. Better explore the API design space.

This would all be with the idea that these use-cases would be folded back into this crate whenever you get the time to tackle your bigger plans for it, and my fork would be then deprecated. But it would give a stop-gap solution in the mean time.

@BurntSushi
Member

@cessen I would not be offended. :) In fact, I suggested that exact idea above I think. You are right of course that the pikevm is more amenable to this change. The bounded backtracker might also be capable of handling it, but I haven't thought about it.

@cessen
Author

cessen commented Jan 14, 2018

Yup! I was double-checking that you were okay with it being a published crate, rather than just an in-project-repo sort of thing. Thanks much!

If I get started on this at some point, I'll post here.

@cessen
Author

cessen commented Jan 18, 2018

I've started poking at this in https://github.com/cessen/streaming_regex

So far I've ripped out everything except the PikeVM engine, and also ripped out the byte regex. My goal is to get the simple case working, and then I can start adding things back in (like the literal matcher, byte regex, etc.).

Now that I've gotten to this point, it's really obvious what you meant @BurntSushi , regarding everything needing to be re-done to make this possible!

The first thing I'm going to try is to rework the PikeVM engine so that it incrementally takes a byte at a time as input. I think this can work even in the unicode case by having a small four-byte buffer to collect the bytes of unicode scalar values, and only executing regex instructions once a full value has been collected. Hopefully that can be done relatively efficiently.
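
A rough sketch of that four-byte buffering idea (a hypothetical helper, not code from the fork):

/// Hypothetical sketch: collect bytes until they form a complete UTF-8 scalar
/// value, then hand the decoded char to the regex instructions.
struct ScalarBuffer {
    buf: [u8; 4],
    len: usize,
}

impl ScalarBuffer {
    fn new() -> Self {
        ScalarBuffer { buf: [0; 4], len: 0 }
    }

    /// Push one byte from the stream. Returns Some(char) once a full scalar
    /// value has been collected, None while the sequence is still incomplete.
    fn push(&mut self, byte: u8) -> Option<char> {
        self.buf[self.len] = byte;
        self.len += 1;
        match std::str::from_utf8(&self.buf[..self.len]) {
            Ok(s) => {
                self.len = 0;
                s.chars().next()
            }
            // `error_len() == None` means "valid so far, needs more bytes".
            Err(e) if e.error_len().is_none() && self.len < 4 => None,
            // Invalid sequence: a real implementation would report an error
            // or substitute U+FFFD; this sketch just discards it.
            Err(_) => {
                self.len = 0;
                None
            }
        }
    }
}

fn main() {
    let mut sb = ScalarBuffer::new();
    let decoded: Vec<char> = "héllo".bytes().filter_map(|b| sb.push(b)).collect();
    assert_eq!(decoded, vec!['h', 'é', 'l', 'l', 'o']);
}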

Once that's done, then building on that for incremental input will hopefully be relatively straight-forward (fingers crossed).

Does that sound like a reasonable approach, @BurntSushi ?

@BurntSushi
Member

@cessen Indeed it does! Exciting. :-)

@cessen
Author

cessen commented Jan 21, 2018

@BurntSushi
I'm getting to the point where I have some more solid API proposals, largely for internal APIs at this point (e.g. the RegularExpression trait). Since I would like this work to be applicable to this crate at some point in the future (even if not directly pulled), I would love to discuss it with you. Should I do that here, or make an issue on my fork and ping you from there?

@BurntSushi
Member

I wouldn't bother honestly. I would do whatever is natural for you. I expect the current internals to be completely redone before something like streaming functionality could get merged.

I'm happy to be a sounding board of course!

@cessen
Author

cessen commented Jan 22, 2018

Oh sure, I'm not expecting it to be mergeable. But I'm hoping that the architecture of this fork will be relevant to your rewrite in the future, so that any lessons learned can be applied. So I'd rather not go off in some direction that's completely tangential to (or fundamentally clashing with) what you have in mind for the future. If that makes sense?

(Edited to add the "not" before "expecting" above. Really bad typo...)

@BurntSushi
Member

Yeah I think so. I'm happy to give feedback as time permits. :)

@cessen
Author

cessen commented Jan 22, 2018

Awesome, thanks! And, of course, I don't mean to squeeze you for time. Apologies for coming off a bit presumptuous earlier--I didn't mean it that way.

I think I'll post here, if that's okay. Let me know if you'd prefer elsewhere!

@cessen
Author

cessen commented Jan 22, 2018

So, the first thing is a bit bigger-picture: I think we'll have a lot more wiggle-room to experiment with fast incremental scanning approaches if we keep the incremental APIs chunk-based rather than byte-based.

This basically amounts to changing API 2 in my original proposal above to take an iterator over byte slices instead of an iterator over bytes. I think that still covers all of the real use-cases. (And if someone really needs to use a byte iterator with the API, they can do their own buffering to pass larger chunks.)

The relevance to what I'm doing right now is that I'm thinking of changing the RegularExpression trait in re_trait.rs to be oriented around sequential chunk input. This mostly involves adding something like an is_last_chunk parameter to the various *_at() methods. It might look something like this:

pub trait RegularExpression {
    fn is_match_at(
        &self,
        chunk: &[u8],           // Analogous to `text` in your code
        offset_in_chunk: usize, // Analogous to `start` in your code
        is_last_chunk: bool,
    ) -> bool;

    fn find_at(
        &self,
        chunk: &[u8],
        offset_in_chunk: usize,
        is_last_chunk: bool,
    ) -> Option<(usize, usize)>;

    // etc...
}

Calling code would pass a chunk at a time, and indicate via is_last_chunk if it's the end of input. This approach puts the onus on implementors of RegularExpression to keep track of how many bytes from start-of-input have already been consumed, but I think that's pretty reasonable...?

The *_iter() methods would be passed iterators over text chunks, and Iterator::size_hint() would be used internally to know when we're at the last chunk:

pub trait RegularExpression {
    // The above stuff

    fn find_iter<'t, I: Iterator<Item = &'t [u8]>>(
        self,
        text: I,
    ) -> Matches<Self> {
        // ...
    }

    // etc...
}

One of the nice things about this approach is that a single contiguous text is just a special case: for find_at() et al. you just pass the text as a byte slice with is_last_chunk = true. And for the iterator methods, you just pass std::iter::once(text).

Using DFA internally on contiguous text falls naturally out of this as well, since we can switch to DFA when we know we're on the last chunk. And this gives us plenty of room to experiment with approaches to faster incremental scanning.

(Incidentally, the reason I'm focusing on the RegularExpression trait here is that it seems to be the glue between the higher-level and lower-level APIs, so I'm thinking it will have a lot of influence over the implementations of both. But I could be wrong about that!)

@BurntSushi
Member

@cessen Everything you said sounds pretty sensible to me. Here are some random thoughts:

  1. Using size_hint on iterators can't be assumed to be correct. You can use a peekable iterator (or equivalent).
  2. You might consider alternatives to the is_last_chunk construction. For example, you could state that "passing an empty chunk indicates end of input."
  3. I think consideration should be given to io::Read as well. I believe it's possible to write an adapter on top of your proposed API that takes an arbitrary io::Read and executes a search, right? If so, that's good. Just wanted to put it on your radar. :)
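
For point 1, a peekable iterator makes the last-chunk determination reliable without trusting size_hint. A sketch (feed_chunks and the engine hand-off are made up):

fn feed_chunks<'a, I>(chunks: I)
where
    I: Iterator<Item = &'a [u8]>,
{
    let mut chunks = chunks.peekable();
    while let Some(chunk) = chunks.next() {
        // Known reliably here, without trusting size_hint.
        let is_last_chunk = chunks.peek().is_none();
        // ...hand (chunk, is_last_chunk) to the regex engine here...
        println!("{} bytes, last = {}", chunk.len(), is_last_chunk);
    }
}

fn main() {
    let rope_chunks: Vec<&[u8]> = vec![&b"Non-contiguous "[..], &b"bytes."[..]];
    feed_chunks(rope_chunks.into_iter());
}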

@cessen
Author

cessen commented Jan 24, 2018

Thanks for the feedback!

  1. As far as I understand from the documentation, size_hint can't be assumed correct for purposes of memory safety, but can be assumed correct for other purposes. Specifically, "[...]That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait's protocol." Of course, I may be interpreting that incorrectly.
  2. Yeah, I was thinking about that too. The down-side to that approach is then we don't know when we're on the last chunk while processing it, which means we lose the ability to e.g. switch to DFA for the last chunk. I agree that the is_last_chunk parameter is awkward and clunky, but I think the benefit may be worth it if it lets other things "just work". However, I'm certainly open to other approaches.
  3. Absolutely! I think that's another benefit of the chunk-based approach, actually: it maps nicely to how io::Read::read() works. It should be pretty trivial to build an io::Read based API on top of API 1 in the OP proposal. In any case, to put you at ease: I also agree it's important to support that, and I'm definitely approaching things with that in mind.

@BurntSushi
Member

@cessen

  1. The key part of the docs is this line: "The default implementation returns (0, None) which is correct for any iterator." The part you quoted is when the reported lower/upper bounds are wrong. But a 0 lower bound is never wrong and an unknown upper bound is also never wrong. An example of something that's wrong is returning 1 as a lower bound for an iterator over an empty slice.

@cessen
Author

cessen commented Jan 24, 2018

Yes, that was my understanding as well. I didn't mean to imply that size_hint guarantees that we know when we're on the last chunk, but rather that when it's indicated that we can know, it's correct and we can take advantage of it.

@cessen
Author

cessen commented Jan 24, 2018

Or, more specifically, if we reach the upper bound (if one is given) we can set is_last_chunk to true.

We would still have to do something like pass an empty chunk at the end when the upper bound isn't given or when we reach the end before the upper bound.

@cessen
Author

cessen commented Jan 25, 2018

So, regardless of all of this, I think your suggestion to use the peekable iterator is a better idea. For some reason I thought not all iterators could be made peekable, but apparently they can. So that would be a great approach!

@cessen
Author

cessen commented Feb 5, 2018

Quick status report:

The PikeVM engine now takes a single token at a time, and doesn't access the Input anymore. Essentially, that means that it can be built off of to do streaming input now, though for utf8 it has to be passed an entire scalar value at a time.

Next step is to get it to take just a single byte at a time.

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction). I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.

Edit:
@BurntSushi
Also, gotta say, all the unit tests you wrote are a life-saver! Would never have been able to get to this point without them!

@BurntSushi
Member

@pascalkuthe Sounds good. There is also the nuclear option: that we add nfa::thompson::pikevm_stream::PikeVM or some such. As in, it's a completely different engine. I'd prefer not to go that way, but if trying to hit both use cases with the same engine leads to a redonkulous API, then the nuclear option might be better.

(The marginal cost of adding another engine is fairly high, but not nearly as high as it was before regex-automata. And certainly less risky. The infrastructure for testing the new engine is in place and it all "just needs to be hooked up." Maintenance going forward is somewhat of a concern, but in practice, the internals of these engines changes very rarely.)

@BurntSushi
Member

One other thing worth asking here for folks who are working on the streaming case: are the APIs for the DFA and the lazy DFA low level enough to make it possible to build a streaming engine on top of it?

@pascalkuthe
Contributor

Yeah, I actually got the DFA working first (both lazy and eager). Apart from #1031 and needing a way to determine the length of a prefilter (so I know how much to buffer), that worked fairly well.

There is obviously some duplication but it's not too bad: https://github.com/pascalkuthe/ropey_regex/blob/master/src/engines/dfa/search.rs

For the DFA in particular, the memchr-based acceleration might be nice to reuse (it's a verbatim copy right now, IIRC), but it's not too much code.
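
To make the "feeding data directly into the DFA" part concrete, this is roughly what driving one of regex-automata's fully compiled DFAs a byte at a time across chunk boundaries looks like (a sketch in the spirit of the low-level Automaton API docs; the anchored date pattern is just an example):

use regex_automata::{
    dfa::{dense, Automaton},
    Input,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // An anchored pattern keeps the example simple: we only need to know
    // whether the whole input matches.
    let dfa = dense::DFA::new(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$")?;

    // The haystack arrives in arbitrary chunks (rope chunks, reads, ...).
    let chunks: [&[u8]; 2] = [&b"2017-"[..], &b"12-03"[..]];

    // An empty Input here just means: the search starts at the very
    // beginning, with no look-behind context.
    let mut state = dfa.start_state_forward(&Input::new(""))?;
    for chunk in &chunks {
        for &byte in chunk.iter() {
            state = dfa.next_state(state, byte);
        }
    }
    // Tell the DFA the input has ended so it can resolve `$`.
    state = dfa.next_eoi_state(state);
    assert!(dfa.is_match_state(state));
    Ok(())
}

The hard parts are everything this sketch skips: reporting match offsets that span chunks, running the reverse pass to find match starts, and hooking up prefilters.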

@pascalkuthe
Contributor

I have been working on this again a bit and I have run into a small roadblock.

I am trying to be very general with my approach, supporting a cursor API that optionally supports backtracking. The goal would be that inputs that support backtracking (like ropes) can use all engines (dfa, hybrid, ...) while fully streaming inputs (basically any Iterator<Item = &[u8]>) could only use the pikevm (and potentially prefilters).

The problem is that the empty module that ensures utf8 codepoint alignment essentially makes all regex engines require some form of backtracking, as it restarts the search when the end of a match is not a valid utf8 codepoint boundary. That would make it impossible to fully support the streaming case in unicode mode.

However, I read the documentation of that module carefully and it sounds like this may not be a huge problem in practice, because most practical cases only involve empty matches at the start of the haystack. These cases don't pose an issue because there is no backtracking involved (note that I require chunks to be Unicode-aligned, so if the match is at a chunk boundary it is automatically Unicode-aligned). The only cases which wouldn't work are:

  • (?m:$) and (?m:^) with unicode line-endings (very niche)
  • (?-u:\B) (also pretty niche)

I would be ok with panicking in these two cases with a cursor that doesn't support backtracking (it happens automatically anyway), since they are pretty niche, as long as the more common cases work.

@BurntSushi is my understanding correct here or do you have any better ideas how to handle this?

@BurntSushi
Member

I'm not quite sure to be honest. (Note that (?m:$) and (?m:^) are not Unicode-aware line anchors. This crate doesn't support them at all. You might be thinking of the support for overriding the line terminator to be any particular byte. The empty module docs discuss that briefly at the very end.)

I wonder if there is perhaps a way to "not solve" this problem. I can think of two different approaches to take there:

  1. No support for stream searching for regexes that can match the empty string. Or special case it somehow.
  2. No support for UTF-8 mode. In your comment, it looks like you're conflating Unicode mode with UTF-8 mode. These modes are fully de-coupled in regex-automata. The only thing you lose by not supporting UTF-8 mode is that you can get empty matches that split a codepoint. You might be able to then handle those in a different way (perhaps via the iterator logic described in the empty module docs). UTF-8 mode is somewhat of a special case for things that are guaranteed to be valid UTF-8, such as Rust's &str. For anything that isn't guaranteed to be valid UTF-8, UTF-8 mode has unspecified behavior.
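
The difference is easiest to see with empty matches; a small sketch against the meta engine, mirroring regex-automata's documented behavior:

use regex_automata::meta::Regex;

fn main() {
    // Unicode mode (what \w etc. match) is separate from UTF-8 mode, which
    // only controls whether *empty* matches may split a codepoint.
    let utf8 = Regex::new("").unwrap();
    // The snowman is 3 bytes; empty matches are only reported at its edges.
    assert_eq!(utf8.find_iter("☃").count(), 2);

    let raw = Regex::builder()
        .configure(Regex::config().utf8(false))
        .build("")
        .unwrap();
    // With UTF-8 mode disabled, empty matches appear at every byte offset.
    assert_eq!(raw.find_iter("☃").count(), 4);
}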

It's also worth pointing out that, at least for the lazy DFA and fully compiled DFAs, you don't have to use the search APIs that handle empty matches. You can implement your own.

@pascalkuthe
Contributor

Thanks for your response @BurntSushi

I'm not quite sure to be honest.

After thinking about it some more I am not sure either. I think it's possible that the NFA/DFA would try to match some other pattern and then terminate at the empty match (like foo| matching fox would require backtracking in the empty module, I think).

You might be thinking of the support for overriding the line terminator to be any particular byte. The empty module docs discuss that briefly at the very end

Yeah sorry, that is what I meant. I didn't want to spell it all out and got tripped up by the documentation referring to that and somehow thought they were the same thing.

It's also worth pointing out that, at least for the lazy DFA and fully compiled DFAs, you don't have to use the search APIs that handle empty matches. You can implement your own.

Yeah, that's what I am doing (I am also reimplementing the pikevm), but I am also reimplementing a copy of the empty module (and using it in my search primitives just like the upstream equivalents), since I ideally wanted to allow the streaming regex engine to handle all the edge cases the regex crate can (for backtracking cursors that should be possible).

I wonder if there is perhaps a way to "not solve" this problem. I can think of two different approaches to take there:

  • No support for stream searching for regexes that can match the empty string. Or special case it somehow.
  • No support for UTF-8 mode. In your comment, it looks like you're conflating Unicode mode with UTF-8 mode. These modes are fully de-coupled in regex-automata. The only thing you lose by not supporting UTF-8 mode is that you can get empty matches that split a codepoint. You might be able to then handle those in a different way (perhaps via the iterator logic described in the empty module docs). UTF-8 mode is somewhat of a special case for things that are guaranteed to be valid UTF-8, such as Rust's &str. For anything that isn't guaranteed to be valid UTF-8, UTF-8 mode has unspecified behavior.

For my personal use case of ropey/helix all of this doesn't matter, since backtracking isn't an issue there. The first option is actually quite interesting to me. We never want an empty match in helix. I would be happy to just remove those states from the DFA/NFA somehow?

I would like to find a way to at least partially support these cases in the future too (probably using the old hack used in the regex crate for iteration; iterating a cursor without backtracking is probably pretty niche anyway, since that only works with earliest-match semantics for similar reasons).

@BurntSushi
Member

I would be happy to just remove those states from the DFA/NFA somehow?

Oh interesting, I wasn't thinking of it this way. I was thinking of it as in "regex compilation returns an error if streaming mode is asked for and it can match the empty string." I'm not sure how to do your idea. What would you do with a regex like (?m:^$) that matches empty lines, for example? I was thinking you'd reject it or perhaps put it through a different path that isn't as optimized. Dunno. It's a half-baked idea.

(It's worth noting that Hyperscan---which supports stream searching---errors on regexes that can match the empty string by default. You have to go out of your way to enable HS_FLAG_ALLOWEMPTY. Although I don't think this is related to streaming, but rather, given Hyperscan's match semantics, it would always result in reporting a match at literally every position.)

ideally wanted to allow the streaming regex engine to handle all the edge cases the regex crate can (for backtracking cursors that should be possible). ...snip... For my personal use case of ropey/helix all of this doesn't matter since backtracking isn't an issue there.

I do wonder if it makes sense to forget about the "true streaming" use case and really just focus on the "non-contiguous storage" use case. It would be an incremental improvement and it would satisfy some real world use cases. It isn't as operationally flexible which is a bummer.

@pascalkuthe
Contributor

Oh interesting, I wasn't thinking of it this way. I was thinking of it as in "regex compilation returns an error if streaming mode is asked for and it can match the empty string." I'm not sure how to do your idea. What would you do with a regex like (?m:^$) that matches empty lines, for example? I was thinking you'd reject it or perhaps put it through a different path that isn't as optimized. Dunno. It's a half-baked idea.

For my specific use case I essentially want to select all text that matches a regex pattern. An empty match essentially leads to no selection and therefore can be ignored (they actually lead to bugs currently, and I was going to work around it downstream by just checking the length of each match in a .filter()), so just "removing" the empty state from the NFA would work quite well for me.

Reporting an error for your example could be fine, but repetitions tend to be a bit annoying. Patterns like (foo)* are pretty common. * and + are often used interchangeably if people don't pay too much attention (like when working interactively in a text editor). So to me, just removing the transitions from the NFA that could lead to matching an empty string would be perfect.

Of course that is kind of specific to my use case, and reporting an error is probably the better general solution (I would love a better solution for (foo)* though; maybe something can be done at a syntactic level?).
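
For concreteness, the difference between the two repetition operators is trivial to check with the existing regex crate (a minimal sketch):

use regex::Regex;

fn main() {
    // `(foo)*` matches the empty string, `(foo)+` does not, which is why a
    // blanket "reject patterns that can match empty" rule would hit a lot of
    // regexes people write interactively.
    let star = Regex::new(r"(foo)*").unwrap();
    let plus = Regex::new(r"(foo)+").unwrap();
    assert!(star.is_match(""));
    assert!(!plus.is_match(""));
}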

I do wonder if it makes sense to forget about the "true streaming" use case and really just focus on the "non-contiguous storage" use case. It would be an incremental improvement and it would satisfy some real world use cases. It isn't as operationally flexible which is a bummer.

I have been tempted by this too. It may indeed make more sense to do that for now. I actually have no special codepath for the non-backtracking case right now. I have a cursor trait:

pub trait Cursor {
    fn chunk(&self) -> &[u8];
    /// Whether this cursor can be used for unicode matching. That
    /// means specifically that it promises that unicode codepoints are never
    /// split across chunk boundaries.
    fn utf8_aware(&self) -> bool;
    /// Returns true if successful, false at EOI.
    fn advance(&mut self) -> bool;
    /// Returns true if successful, false at EOI
    /// or if backtracking is not supported.
    fn backtrack(&mut self) -> bool;
}

and the backtrack function will simply always fail for a cursor that can't backtrack. I have an abstraction around this that tracks the position of the cursor and will panic if backtrack fails at a non-zero offset.

So for regex searches that won't run into these restrictions you can totally still use my crate. I was mostly trying to remove the cases where it would panic, but it might be better to ignore that use case for now.
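
To make the trait concrete, a minimal cursor over an in-memory list of chunks (purely illustrative, not the crate's own implementation) might look like:

/// Illustrative only: a backtracking-capable cursor over pre-split chunks,
/// roughly what a rope adapter boils down to.
struct ChunkCursor<'a> {
    chunks: &'a [&'a [u8]],
    pos: usize,
}

impl Cursor for ChunkCursor<'_> {
    fn chunk(&self) -> &[u8] {
        self.chunks[self.pos]
    }
    fn utf8_aware(&self) -> bool {
        // Promise: the chunking never splits a codepoint.
        true
    }
    fn advance(&mut self) -> bool {
        if self.pos + 1 < self.chunks.len() {
            self.pos += 1;
            true
        } else {
            false
        }
    }
    fn backtrack(&mut self) -> bool {
        if self.pos > 0 {
            self.pos -= 1;
            true
        } else {
            false
        }
    }
}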

Now that you mention it, this API probably isn't too useful for something "fully" streaming, as they would probably want to treat the regex search more like a state machine (using regex-automata directly would work pretty well in that case, I think; I imagine these are highly specific use cases that may even be able to constrain their regex patterns further).

So that was a lot of words to say, you are right I probably shouldn't focus on the fully streaming case for now :D

@BurntSushi
Member

Yeah I like the idea of building a Shitty First Draft that targets what you need specifically first. That will get you (and us) some real world experience with a prototype. And then hopefully it can be iterated on and improved in the future.

@pascalkuthe
Contributor

I managed to implement a fully working meta regex engine (pikevm + dfa + hybrid + prefilter) that passes all regex tests at https://github.com/pascalkuthe/regex-cursor.

I put a short summary in the readme. I think it would be nice to upstream this eventually since I had to duplicate a ton of private code (primarily in the meta engine) and I am a bit worried about maintaining the duplication in the long run. The cursor API that I came up with is very generic and the pikevm implementation could be made to fit fully streaming input in the future (although with heavy limitations; see the readme).

Of course this is just a prototype and there is a long road to getting this upstream, but I want to get the ball rolling a bit. A couple of points:

  • The cursor is a trait so I had to use an enum instead of dynamic dispatch for the strategies. There was a note about that in the code anyway. I doubt it matters much which variant you use so this should be ok to switch IMO.
  • regex-cursor has its own input struct. That is really API contagious so lots of stuff needed to be "needlessly" rewritten. Can we add some abstraction to reduce the duplication? In theory the non-streaming case can be handled by a cursor that only ever yields a single chunk but I imagine that a compiler would have trouble optimizing that back to the original code. Maybe some different abstraction could work that would essentially be an Input trait

There is also some stuff missing:

  • I haven't implemented some of the more advanced strategies that have custom search code
  • overlapping search routines
  • onepass and bounded-backtrack engine
  • a frontend regex API (currently only the equivalent of regex-automata::meta is implemented). I decided to forgo this for now since it is mostly a bunch of copy-paste legwork

@BurntSushi
Member

@pascalkuthe Wow, that is amazing. I'll have to take a deeper look soon.

Maybe some different abstraction could work that would essentially be an Input trait

The main problem here I think, and why I designed it the way I did, was to avoid making everything in the crate polymorphic. My suspicion is that with Input being a trait, compile times would suck even worse than they do today.

@pascalkuthe
Contributor

Yeah, that is a good point; maybe the boilerplate cost is worth paying to avoid bloating compile time. It may also make sense to keep all the cursor stuff behind feature flags. There is a lot of code that is just fully copy-pasted.

@BurntSushi
Member

Yeah I'm definitely overall in favor of finding a way to upstream your work. It's a really nice use case to serve if we can do it. But I'll want to understand more about the trade offs.

Are you planning to publish the crate? (So that it's easy to read the crate docs.)

@pascalkuthe
Contributor

yeah I will publish the crate later today but full disclaimer: The docs still need work. They are nowhere near the great documentation upstream has (most of the docs are also copied from upstream where API was covered).

The public API probably won't be too interesting for you though since it's mostly a carbon copy from regex-automata (including docs). Mostly the cursor trait will be interesting.

@BurntSushi
Member

Yeah no/poor docs is cool. Just want to navigate the types and what not.

@pascalkuthe
Contributor

pascalkuthe commented Jan 25, 2024

I polished the docs up a bit and pushed it to crates.io; it should be on docs.rs in a couple of minutes: https://docs.rs/regex-cursor/

@ratmice

ratmice commented Jan 26, 2024

Had a look at what would be involved with implementing the overlapping search routines, and indeed I see what you mean by lots of copy/paste, since these rely on some pub(crate) fields of e.g. OverlappingState. Not sure how far I want to jump down that rabbit hole.

Edit: FWIW, the impression I get is that it seems overall likely that I'd be better off just running matches over many regexes individually than trying to complete that aspect of the API, if only for maintainability's sake.

@CeleritasCelery
Copy link

@pascalkuthe have you run any benchmarks against your implementation? I am curious how it compares as a baseline.

@pascalkuthe
Contributor

No, I haven't gotten around to that yet. The performance will mostly depend on your collection (a collection with small chunks and a slow cursor will see a much larger impact than one with large chunks and a fast cursor). I will only be benchmarking ropey as that is the only practical use case I have.

There will be some slowdown from the cases being accelerated by the engines/strategies not implemented yet but that is only temporary.

The only thing where performance would really be interesting is the prefilter, since that actually does have some additional complexity. The rest I would expect to be very close to upstream regex if chunk breaks are reasonably rare (which they are in ropey's case).

@CeleritasCelery

CeleritasCelery commented Jan 29, 2024

I am really excited for this feature! Thank you for putting in so much work on path-clearing it. I took some time to benchmark it using rebar to get baseline results. This is for the unicode, curated, dictionary, wild, and reported benchmarks. I only included results where there is more than a 10% difference (there were 54 benchmarks where the difference was the same or less than 10%). I don't know these benchmarks or the regex internals well enough to say which of these come from inherent differences between the two implementations, and which come from missing engines/strategies.

benchmark                                           rust/regex           rust/regex/cursor
---------                                           ----------           -----------------
curated/01-literal/sherlock-en                      23.1 GB/s (1.00x)    20.6 GB/s (1.12x)
curated/04-ruff-noqa/real                           1146.8 MB/s (1.00x)  540.9 MB/s (2.12x)
curated/05-lexer-veryl/single                       7.1 MB/s (1.00x)     696.2 KB/s (10.38x)
curated/06-cloud-flare-redos/original               408.2 MB/s (1.00x)   222.8 MB/s (1.83x)
curated/06-cloud-flare-redos/simplified-long        44.8 GB/s (1.00x)    22.3 GB/s (2.00x)
curated/07-unicode-character-data/parse-line        265.3 MB/s (1.00x)   26.2 MB/s (10.14x)
curated/09-aws-keys/full                            1353.5 MB/s (1.00x)  1119.8 MB/s (1.21x)
curated/11-unstructured-to-json/extract             66.6 MB/s (1.00x)    24.1 MB/s (2.77x)
curated/14-quadratic/2x                             6.0 MB/s (1.10x)     6.7 MB/s (1.00x)
dictionary/search/english                           86.9 MB/s (1.00x)    18.5 MB/s (4.70x)
dictionary/search/english-tiny                      145.5 MB/s (1.00x)   24.2 MB/s (6.00x)
dictionary/search/english-10                        153.7 MB/s (1.00x)   14.4 MB/s (10.65x)
reported/i13-subset-regex/original-ascii            4.1 GB/s (1.00x)     344.4 MB/s (12.15x)
reported/i13-subset-regex/original-unicode          255.2 MB/s (1.00x)   12.0 MB/s (21.22x)
reported/i13-subset-regex/big-ascii                 65.7 MB/s (1.00x)    10.5 MB/s (6.24x)
reported/i13-subset-regex/big-unicode               11.7 MB/s (1.00x)    10.5 MB/s (1.12x)
reported/i13-subset-regex/huge-ascii                11.1 MB/s (1.00x)    9.9 MB/s (1.13x)
reported/i13-subset-regex/huge-unicode              11.2 MB/s (1.00x)    9.9 MB/s (1.14x)
reported/i13-subset-regex/huge-ascii-nosuffixlit    11.1 MB/s (1.00x)    9.9 MB/s (1.13x)
reported/i13-subset-regex/huge-unicode-nosuffixlit  11.2 MB/s (1.00x)    9.8 MB/s (1.15x)
unicode/codepoints/letters-one                      11.3 MB/s (1.00x)    10.1 MB/s (1.12x)
unicode/codepoints/letters-alt                      11.2 MB/s (1.00x)    10.1 MB/s (1.11x)
unicode/overlapping-words/ascii                     42.2 MB/s (1.00x)    19.9 MB/s (2.12x)
unicode/overlapping-words/english                   6.2 MB/s (1.00x)     2.9 MB/s (2.12x)
unicode/overlapping-words/russian                   5.8 MB/s (1.00x)     2.5 MB/s (2.30x)
unicode/word/around-holmes-english                  28.6 GB/s (1.00x)    584.9 MB/s (50.10x)
wild/bibleref/short                                 35.0 MB/s (1.00x)    9.4 MB/s (3.74x)
wild/bibleref/line                                  475.6 MB/s (1.00x)   423.9 MB/s (1.12x)
wild/caddy/caddy                                    349.5 MB/s (1.00x)   36.1 MB/s (9.69x)
wild/dot-star-capture/rust-src-tools                433.9 MB/s (1.00x)   21.1 MB/s (20.55x)
wild/parol-veryl/ascii                              7.0 MB/s (1.00x)     698.4 KB/s (10.30x)
wild/parol-veryl/unicode                            4.9 MB/s (1.00x)     461.0 KB/s (10.81x)
wild/parol-veryl/multi-captures-ascii               18.4 MB/s (1.00x)    5.1 MB/s (3.64x)
wild/ruff/whitespace-around-keywords                213.2 MB/s (1.00x)   91.4 MB/s (2.33x)
wild/ruff/noqa                                      1132.9 MB/s (1.00x)  539.5 MB/s (2.10x)
wild/ruff/space-around-operator                     322.2 MB/s (1.00x)   174.0 MB/s (1.85x)
wild/ruff/shebang                                   680.9 MB/s (1.37x)   930.6 MB/s (1.00x)
wild/rustsec-cargo-audit/original-unix              17.0 GB/s (1.00x)    10.8 GB/s (1.57x)
wild/rustsec-cargo-audit/original-windows           16.4 GB/s (1.00x)    9.8 GB/s (1.67x)
wild/rustsec-cargo-audit/both-slashes               17.1 GB/s (1.00x)    10.3 GB/s (1.65x)
wild/rustsec-cargo-audit/both-alternate             17.1 GB/s (1.00x)    11.4 GB/s (1.51x)

@pascalkuthe
Contributor

I think we can ignore everything under 20% for now.

I haven't gone through all of them, but for most of the big offenders (that are an order of magnitude slower) I think I identified the cause.

  • caused by missing one pass engine
    • wild/dot-star-capture/rust-src-tools
    • wild/caddy/caddy (I think)
    • the various (parol) veryl benchmarks (I think)
  • our prefilter does buffering for aho-corasick instead of feeding data directly into the dfa like we do internally.
    • all the dictionary searches, and likely quite a lot of other ones as well
    • need to use upstream streaming support directly
    • this requires access to regex internals
    • the streaming support upstream is not sufficient for the use case here. Using an iterator instead of a cursor is a problem, leftmost search is not supported, anchored search is not supported, ...
  • missing reverse suffix optimization (one of the missing strategies)
    • reported/i13-subset-regex/original-*

@BurntSushi
Member

Those are really impressive numbers. Wow. Yes there are a few 10x in there, but not many. I was expecting things to be a lot worse. (I still haven't looked at the code yet though.) Nice work @pascalkuthe.

@pascalkuthe
Contributor

Maybe a small update: regex-cursor was merged into helix master a while ago and was included in the new 24.03 release, so it has a pretty large userbase. Seems to hold up great under scrutiny so far.

@8573

8573 commented Aug 9, 2024

I recently saw that another Rust implementation of regexes over streams, ergex, was released a few years ago, which I thought to mention here, as it does not seem to have been known to this discussion previously: https://github.com/deadpixi/ergex. Warning: It is licensed incompatibly with regex.
