-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regex on text streams #14015
Comments
I'm not sure if this is worth the effort. In both cases, the regex library will have to implement a stream interface. Could you give a few cases where you think this might be useful? |
Obvious one: find/replace in big files or stdin. |
I'm not sure why I didn't think of that... |
This actually shouldn't be as hard as one might think. The current implementation only requires a one character lookahead, which should be fine with working on a stream. It specifically never copies the search text, so I think most of the implementation is already amenable to using a stream. What would the API look like? |
@BurntSushi doesn't the VM runner require access to arbitrary byte index too here: https://github.com/mozilla/rust/blob/e87e18064ece0d7eddb269b8211fb8fdf5efaa91/src/libregex/vm.rs#L366-L390 ? |
@utkarshkukreti Hmm, yeah, that piece would have to be re-worked. I believe it's primarily used for skipping across the search text when there is a literal prefix in the regex. I think we could find a way to do that without using raw byte indices. It was just much more convenient to do it that way. (This skipping requires no backtracking. If the prefix is never found, then the regex can't match.) |
I thought of a possible issue with this: changing to a plain |
Yuck. So all the return types would have to change to |
Worse than that: |
That is indeed pretty grim. I don't think it's acceptable for trait Regexable<'a, T: Iterator<char>> {
fn scannable(&self) -> MaybeIterator<'a, T>;
}
enum MaybeIterator<'a, T> {
Iter(T),
Slice(&'a str),
} Then provide impls for the right types. Then the methods could return But yeah, this is going to be harder than I thought. In the iterator's case, we need a way to accrue the matched text. The VM is fundamentally built on the notion of returning indices into an existing string, and the notion simply doesn't work with streams. So that would have to be gutted. This is a bigger change I thought. Damn. |
I just took a look at Go's regexp package. They have some functions that work on streams, but they are a bit crippled. Firstly, there is only the equivalent of The most interesting bit there is that the regexp functions on readers finds precisely one match and returns byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as its being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again. The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream). Hmm. Am I missing anything? Are there any other implementations of regex searching on streams? |
No regex stream matching in Perl 5, Python, .NET, Java. The Internet is full of hacks for when you know the max size of the text to match. I did find this, which claims to have streaming regexes: https://github.com/openresty/sregex |
In-tree regex was removed in #21458 . This should probably be moved to https://github.com/rust-lang/regex |
This issue has been moved to the regex repo: rust-lang/regex#25 |
Don't include lifetime or label apostrophe when renaming Closes rust-lang#13907
Regex library defines methods for find/replace on strings, but what about text streams?
The text was updated successfully, but these errors were encountered: