-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow starting DFA in noncontinuous bytes #1031
Conversation
regex-automtaton already supports transversing the DFA one byte at a time with `next_state`. This is potentially very useful when scanning noncontinuous data like network stream or a rope data structures as commonly used in editors. However, to start the DFA with `start_state_forward`/`start_state_reverse` currently requires an `Input` and will look ahead/look one byte behind the span boundaries. To support that (especially when using prefilters/literal optimization) a streaming use case can not provide such a haystack easily (it can be worked around with a temporary array and copying one byte over but its extremely brittle/hacky). This commit adds the `start_state_forward_with`/`start_state_reverse_with` function which allow passing the information extracted from the Input directly.
Can you show how you would use the existing APIs for the streaming use case? You mention it would be extremely hacky, but in my head it doesn't seem that way? The problem with this addition is that it complicates the API even more than it already is, and more importantly, makes it so the start state can only be computed from the span and/or look-behind. The status quo is that the start state can be computed from anything on an |
So with this PR starting a DFA looks as follows let sid = dfa.start_state_forward_with(
cache,
input.get_anchored(),
input.look_behind(),
input.get_span(),
)?; whereas a workaround would look as follows: let fake_haystack;
let fake_input = match input.look_behind() {
Some(byte) => {
//The second byte is just so that the span is not empty.
// I am not sure that matters otherwise it could be a single-element array
fake_haystack = [byte, byte];
regex_automata::Input::new(&fake_haystack).range(1..)
}
None => regex_automata::Input::new(input.curr_haystack()),
};
let sid = dfa.start_state_forward(
cache,
&fake_input
.anchored(input.get_anchored())
.earliest(input.get_earliest()),
)?; Just to provide some background I am implementing a regex engine for a rope data structure so I have a cursor of byte chunks. That means I am not using In the workaround I am creating a fake input just with the look behind the byte. I can't use the current chunk/haystack because the lookahead byte might be in the previous chunk (if the new DFA starts exactly at a new chunk so the look_behind byte is the last byte of the previous chunk). Doing it this way kind of works but feels pretty hacky. Especially if you plan on using more information from the input in the future (for example looking at the actual content of the haystack inside the span) this solution might break down the road. I am not sure how you would treat something like that but depending on your definition that wouldn't be a breaking change either so I would have to pin a patch version. I was looking for an API that explicitly considers this kind of streaming application a valid use-case and not "I passed an input that happens to behave the same thanks to my knowledge of this function implementation details". Of course, if your plan is to actually evolve that function then providing such a function would be pretty limiting. Perhaps the parameter set could be extended or I could just construct an input with my first chunk and only provide the look_behind byte (or lookahead for reverse scan) byte seperatly? |
I'm respond more later when I have my hands on a keyboard, but the entire point of I'll respond with more detail and thoughts about your code sample tomorrow. |
So I slept on this, and I basically think the In the immediate term, I think it's safe to just use a haystack consisting of a single byte and setting the search span offsets accordingly to get the behavior you desire. In the less immediate term, I think a new API that lets one build up a So if you wanted to work on this, here's what I'm thinking:
How's that sound? |
Thanks for the detailed thoughts! I will try to finish my prototype implementation first and use the workaround. It sounds like that it should be reasonably safe to rely on that for now. I think exposing an extra type makes a lot of sense and would feel clearer. I will try to take a stab at that in the (hopefully) not to distant future. I will close this PR and open a new PR then 👍 |
Aye, and I might take a stab at it too if you don't get to it. In particular, there are interesting API design challenges here. For example, there is probably an invariant where if |
I'm going to re-open this PR because it's hard to track the important API addition I outlined above otherwise. |
I am tackling this with a rough plan to just fix the existing API and put out a |
sorry for nog teggint back to this sooner, I had a contract comeup that is sucking up all my time. |
No worries! I wasn't planning on getting to it myself any time soon, but with the possibility of a breaking release it makes sense to try and get it in. |
It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031
@pascalkuthe What do you think?
There is technically a breaking change here due to a new required method ( |
It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031
It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031
I read trough the docs, and that looks great! I really like the builder style API it makes a lot of sense here (especially if you do endup introducing more complex start state it could avoid future breaking changes). Also great docs, the examples are great! Awesome that you picked this up, this is gonna really help make my streaming DFA search implementation less hacky:) I haven't had the chance to read the code too deeply yet as I am away this weekend. If that would be helpful to you I can try to read trough/review the changes next week. |
Absolutely! I have dreams of having the code merged by then, but in practice I have a lot of other stuff to do, specifically #469. Wouldn't be surprised if that took me into next week.
Yeah exactly. That's part of why I created |
It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031
It turns out that requiring callers to provide an `Input` (and thus a `&[u8]` haystack) is a bit onerous for all cases. Namely, part of the point of `regex-automata` was to expose enough guts to make it tractable to write a streaming regex engine. A streaming regex engine, especially one that does a byte-at-a-time loop, is somewhat antithetical to having a haystack in a single `&[u8]` slice. This made computing start states possible but very awkward and quite unclear in terms of what the implementation would actually do with the haystack. This commit fixes that by exposing a lower level `start_state` method on both of the DFAs that can be called without materializing an `Input`. Instead, callers must create a new `start::Config` value which provides all of the information necessary for the DFA to compute the correct start state. This in turn also exposes the `crate::util::start` module. This is ultimately a breaking change because it adds a new required method to the `Automaton` trait. It also makes `start_state_forward` and `start_state_reverse` optional. It isn't really expected for callers to implement the `Automaton` trait themselves (and perhaps I will seal it so we can do such changes in the future without it being breaking), but still, this is technically breaking. Callers using `start_state_forward` or `start_state_reverse` with either DFA remain unchanged and unaffected. Closes #1031
sorry for the late feedback. It seems you have already merged that commit. Nonetheless I took a look trough the changes in more detail. Looks great to me! Thank you for working on this yourself. I really appreciate that you put effort to add these APIs for more niche (streaming) usecases of regex-automaton |
Sweet! Good to hear. :) |
regex-automtaton already supports transversing the DFA one byte at a time with
next_state
. This is potentially very useful when scanning noncontinuous datalike network stream or a rope data structures as commonly used in editors.
However, to start the DFA with
start_state_forward
/start_state_reverse
currently requires an
Input
and will look ahead/look one byte behind thespan boundaries. To support that (especially when using prefilters/literal
optimization) a streaming use case can not provide such a haystack easily (it
can be worked around with a temporary array and copying one byte over but its
extremely brittle/hacky).
This PR adds the
start_state_forward_with
/start_state_reverse_with
function which allow passing the information extracted from the Input directly.
This is currently a draft because I haven't added proper documentation yet. I wanted to wait whether you were open to this change in general before writing the docs.