Add convenience methods for extracting bits of text #824
I do find this somewhat intriguing. Const generics do make this sort of API reasonably appealing. I haven't given this a ton of thought yet, but one question did pop up. Toward the end, you mention that capturing groups are flattened to remove non-participating capturing groups. You provide an example where both sides of an alternation have the same number of captures. But what happens when the two sides of the branch have a different number of capturing groups? In such a case, I believe it would be almost impossible to use this new `extract` API correctly. I'm not saying that this means we shouldn't add an API like this, but it seems like a possible footgun to me. And I don't think it's a terribly uncommon use case, although certainly not as common as others.
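For concreteness, a small sketch of the situation being described, using a toy pattern and the existing `captures` API (the pattern is only an illustration, not one from the proposal):

```rust
use regex::Regex;

fn main() {
    // Two capture groups on the left branch, one on the right.
    let re = Regex::new(r"(a)(b)|(c)").unwrap();
    for hay in ["ab", "c"] {
        let caps = re.captures(hay).unwrap();
        // Count the groups (beyond the whole match) that actually participated.
        let participating = caps.iter().skip(1).flatten().count();
        println!("{:?}: {} participating group(s)", hay, participating);
    }
    // Prints 2 for "ab" but 1 for "c", so no single `[&str; N]` fits both matches.
}
```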
Yes, I would consider this a programmer error. There are three ways we could deal with it:
I think it can be introduced in other ways too, e.g., via an optional capture group.
And I'm sure there are valid regexes that can be proven to always have the same number of capturing groups but that would still be excluded by such a static analysis. That would be the advantage of approach no. 1: nothing would be disallowed. But I do think approach no. 2 is the most sensible for most users, since it prevents accidental footguns. Then you can be sure that if it works in testing, it always works. Again, people can always use the original API if they want full flexibility.

Briefly browsing through the source code, I think adding a 'deterministic capture group' check to `extract` could boil down to something like

```rust
assert_eq!(
    exec.ro.deterministic_capture_group_count,
    Some(N),
    "extract must always have N capture groups"
);
```
Ideally such properties are computed inductively and not recursively.

In any case, I haven't fully decided whether this is an API I'd want. I need to noodle on it. And ultimately get more eyes on it, since once it's added, that's it. No going back.
Alright, fair enough, in that case it would be an attribute on the HIR's info struct, filled in bottom-up.

For expressions without subexpressions (literals, classes, anchors, and so on):

```rust
info.det_capture_count = Some(0);
```

For repetitions:

```rust
info.det_capture_count = match (rep.kind, rep.hir.det_capture_count) {
    // Always zero.
    (_, Some(0)) => Some(0),
    (RepetitionKind::Range(RepetitionRange::Exactly(0)), _) => Some(0),
    (RepetitionKind::Range(RepetitionRange::Bounded(_, 0)), _) => Some(0),
    // Nondeterministic.
    (RepetitionKind::ZeroOrOne, _) => None,
    (RepetitionKind::ZeroOrMore, _) => None,
    (RepetitionKind::Range(RepetitionRange::AtLeast(0)), _) => None,
    (RepetitionKind::Range(RepetitionRange::Bounded(0, _)), _) => None,
    // Deterministic.
    (RepetitionKind::OneOrMore, Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::Exactly(_)), Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::AtLeast(_)), Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::Bounded(_, _)), Some(n)) => Some(n),
    // Inner count unknown.
    (_, None) => None,
};
```

For groups:

```rust
info.det_capture_count = match grp.kind {
    GroupKind::CaptureIndex(_) => grp.hir.info.det_capture_count.map(|n| n + 1),
    GroupKind::CaptureName(_) => todo!("unresolved question: disallow named captures, or treat them as unnamed?"),
    GroupKind::NonCapturing(_) => grp.hir.info.det_capture_count,
};
```

For concatenations:

```rust
info.det_capture_count = exprs
    .iter()
    .fold(Some(0), |cnt, e| Some(cnt? + e.info.det_capture_count?));
```

For alternations:

```rust
info.det_capture_count = exprs
    .iter()
    .map(|e| e.info.det_capture_count)
    .reduce(|a, b| (a? == b?).then(|| a.unwrap()))
    .unwrap_or(Some(0));
```

It does lead to an unresolved question: what should we do regarding named capture groups when using `extract`?
I think they get treated as unnamed groups. Every capturing group, whether named or not, is addressable by index. And this API merely requires that groups are indexable, which is satisfied by all such groups. Named groups just give a name to a group in addition to an index.
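That matches how the crate already behaves today: a named group keeps its positional index, so it is addressable both ways. For example:

```rust
use regex::Regex;

fn main() {
    // A named group is still group 1.
    let re = Regex::new(r"(?P<year>[0-9]{4})").unwrap();
    let caps = re.captures("launched in 2015").unwrap();
    assert_eq!(&caps[1], "2015");
    assert_eq!(&caps["year"], "2015");
}
```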
Ok, that makes sense, then it's just:

```rust
GroupKind::CaptureIndex(_) | GroupKind::CaptureName(_) => grp.hir.info.det_capture_count.map(|n| n + 1),
```
How is the noodling coming along? As a summary refresher, I believe we agreed that a static analysis is feasible, allowing a deterministic number of capture groups to be efficiently determined at Regex compile time. This would allow

```rust
/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures is not equal to N, or is non-deterministic.
fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])>
```

to always immediately panic when called with an incorrect `N`. This would replace

```rust
use anyhow::Context;
use itertools::Itertools;

let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let (status, x1, x2, y1, y2, z1, z2) = re
    .captures(&line)
    .context("match failed")?
    .iter()
    .skip(1)
    .map(|c| c.unwrap().as_str())
    .collect_tuple()
    .unwrap();
```

with

```rust
use anyhow::Context;

let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;
```
Is it possible to provide a bigger code example? Looking at your samples with fresh eyes, they look unnecessarily convoluted by the fact that you're using tuples. But if you're using tuples, why not just use capture indexing instead? Because you have a tuple, you must know the number of capturing groups at compile time. For example:
I believe this is equivalent to your code but much simpler.
Well, to be apples to apples with the previous example, we have to extract all seven parts:

```rust
use anyhow::Context;

let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let caps = re.captures(&line).context("match failed")?;
let (status, x1, x2, y1, y2, z1, z2) =
    (&caps[1], &caps[2], &caps[3], &caps[4], &caps[5], &caps[6], &caps[7]);
```

This sort of approach really smells though, for multiple reasons:
Those are good points. Does the ecosystem have a way of introducing unstable APIs into crates? I'm not convinced we can get this right on our first try. It would be better, I think, to make it available somehow without committing to it, so we can see how it gets used in practice.
@BurntSushi In this very specific scenario there might be an alternative middle ground. Note that `Regex` already exposes

```rust
/// Returns the number of captures.
pub fn captures_len(&self) -> usize
```

I believe we could fairly reasonably stabilize the following method on `Regex`:

```rust
/// Returns the number of participating captures that this regex will return on a successful match.
/// If this number is non-deterministic and would depend on the input this function returns `None`.
pub fn participating_captures_len(&self) -> Option<usize>
```

This function to me looks well-defined and future-proof (we can always return `None` for cases the analysis can't handle). Then, if this function were available in the `regex` crate, an external crate could provide the convenience layer, e.g.

```rust
use regex::Regex;
use regex_unstable_extensions::Extract;
```

and have `extract` supplied through an extension trait.
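For illustration, such an extension trait could be sketched roughly as below. The crate and trait names are just the hypothetical ones from this comment, and this sketch skips the static check that `participating_captures_len` is meant to provide:

```rust
use regex::Regex;

// Hypothetical extension trait; in the idea above it would live in an external crate.
pub trait Extract {
    fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])>;
}

impl Extract for Regex {
    fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
        // A real version would first assert against `participating_captures_len()`;
        // this sketch just pulls out whatever participated in this particular match.
        let caps = self.captures(text)?;
        let mut it = caps.iter().flatten();
        let whole = it.next()?.as_str();
        let mut groups: [&'t str; N] = [""; N];
        for slot in groups.iter_mut() {
            *slot = it.next()?.as_str();
        }
        Some((whole, groups))
    }
}

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    let (whole, [y, m, d]) = re.extract("2023-07-05").unwrap();
    assert_eq!((whole, y, m, d), ("2023-07-05", "2023", "07", "05"));
}
```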
I like the idea of starting with a `participating_captures_len`-style method.

My preference would be to wait until #656 lands to add any new features, but this is somewhat small, so I think I'd be willing to accept a PR for this. Although it will, I think, require paving some new ground.

(I don't have any time table for #656 landing. It is my primary focus, but I've been working on it for years already. I just don't have the time I used to devote to FOSS.)

Also, small nit, but "non-deterministic" doesn't sound right to me in this context. The real key here is that the compile-time syntactic analysis is impossible to reconcile with the runtime behavior in all cases. I'm not sure how to highlight that pithily, but perhaps being pithy is the wrong approach for something subtle like this anyway. But that wording can be hashed out later.
Also, speaking for my own maintenance bandwidth, I definitely don't have the capacity to maintain a separate unstable crate. That might be something you want to do if you want to go that path. But I think I could take on a `participating_captures_len`-style method in the regex crate itself.
@BurntSushi Just to confirm I read you correctly: you would be open to accepting a PR implementing `participating_captures_len`?

As for the non-deterministic wording, how about:

```rust
/// Returns the number of participating captures that this regex will
/// return on a successful match. If this number can not be statically
/// determined from the regex this function returns `None`.
pub fn participating_captures_len(&self) -> Option<usize>
```
@orlp Yeah! I'll have to think on the name a bit more, and I'll review the docs more carefully in PR review.
Should `extract` then look something like this?

```rust
/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures can not be statically determined to be equal to N.
/// This is likely due to an optional capture group, or a capture group mis-match between the left
/// and right hand side of an alternation. If you need an optional capture group, consider making the
/// contents of the group optional rather than the group itself, e.g. `((?:foo)?)` instead of `(foo)?`.
pub fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
    match self.participating_captures_len() {
        Some(c) if c == N => {
            let caps = self.captures(text)?;
            let mut participating = caps.iter().flatten();
            let whole_match = participating.next().unwrap().as_str();
            let captured = [0; N].map(|_| participating.next().unwrap().as_str());
            Some((whole_match, captured))
        }
        Some(c) => panic!("wrong number of capture groups, expected {} found {}", N, c),
        None => panic!("regex could not statically be determined to have {} capture groups", N),
    }
}
```

With a self-contained usage example:

```rust
let re = Regex::new(r"\s*(\d{4})-(\d{2})-(\d{2})\s*").unwrap();
let [year, month, day] = re.extract(&date).expect("could not parse date").1;
```
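The `((?:foo)?)` versus `(foo)?` distinction in that doc comment can be checked against the current API; the wrapping group always participates, even when it matches nothing:

```rust
use regex::Regex;

fn main() {
    // `(foo)?` may leave group 1 unset entirely...
    let group_optional = Regex::new(r"(foo)?").unwrap();
    let caps = group_optional.captures("bar").unwrap();
    assert!(caps.get(1).is_none());

    // ...while `((?:foo)?)` always sets group 1, possibly to the empty string.
    let contents_optional = Regex::new(r"((?:foo)?)").unwrap();
    let caps = contents_optional.captures("bar").unwrap();
    assert_eq!(caps.get(1).unwrap().as_str(), "");
}
```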
I grant that it looks very appealing, but the footgun here is very concerning to me. I'm not sure how to mitigate it, and if we can't, whether it's still worth adding. I'm just not sold on it, sorry.

Which footgun are you referring to?

Panicking when the number of capturing groups isn't fixed.

I think there are other problems too, particularly with its flexibility. I.e., if you want to only get the first N capturing groups, you still have to write out a type that handles all of the capturing groups.
I would say this is a feature rather than a footgun: it ensures that you'll never get a panic due to a bad index on some unexpected input. It guarantees that if you test your function once, it will handle all possible inputs. I think it will reduce bugs rather than introduce them.
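For reference, this is the kind of bad-index panic being described; it only appears for inputs that take the other branch of the pattern:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(a)|(b)").unwrap();
    // A test with "a" would pass using `&caps[1]`...
    let caps = re.captures("b").unwrap();
    // ...but for "b" group 1 did not participate, so `&caps[1]` would panic here.
    assert!(caps.get(1).is_none());
    assert_eq!(caps.get(2).unwrap().as_str(), "b");
}
```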
I don't think that's all that bad. E.g. if I only wanted the year in the above, I have two options:

```rust
let [year, _, _] = re.extract(&date).expect("could not parse date").1;
let [year, ..] = re.extract::<3>(&date).expect("could not parse date").1;
```

The latter scales even for a very large number of capture groups. Plus, it's not like the more general `captures` API goes away for cases that need it.
That's a decent point. I'd like to revisit this after #656 lands. I think the next step will be to write up the API along with its documentation and doc examples, and put it into the first comment of this issue. At that point, we should request feedback from folks to make sure we aren't missing anything.

I also wonder whether it might be better to flatten the tuple into a single array, where the first element of the array is the entire match. That would be more consistent with the rest of the APIs in the crate.
That is fine.

My only problem with that is that it makes mapping over just the capture groups a bit less convenient. E.g. with the tuple you can write

```rust
let ymd = re.extract(&date).expect("could not parse date").1;
let [year, month, day] = ymd.map(|x| x.parse::<u32>().unwrap());
```

whereas if the result was simply one flat array you'd have to do something like

```rust
// Specify length manually.
let [_, ymd @ ..] = re.extract::<4>(&date).expect("could not parse date");
let [year, month, day] = ymd.map(|x| x.parse::<u32>().unwrap());

// Repeat variables.
let [_, year, month, day] = re.extract(&date).expect("could not parse date");
let [year, month, day] = [year, month, day].map(|x| x.parse::<u32>().unwrap());
```

I just feel that most of the time when extracting you're only interested in the bits you extract, and not the whole match. And when you do want the whole thing, it's there as the first tuple member. That said, this isn't a hill I'm willing to die on, if you'd still prefer a single flat array.
This adds a new routine for computing the static number of capture groups that will appear in every match. If the number of groups is not invariant across all matches, then there is no static capture length. This is meant to help implement higher level convenience APIs for extracting capture groups, such as the one described in #824. We may wind up including such APIs in the regex crate itself, but this commit stops short of that. Instead, we just add this new property which should permit those APIs to exist outside of this crate for now. Closes #908
I've been giving this API proposal a bit more thought. I've thought of some other ideas/questions/concerns:
I want to clarify that I am overall positive on adding this API. I'm just trying to think of every angle I can, because even though it looks small, I consider this to be a major addition.
I suppose. You could design an extract API that uses capture group names as well, but that would involve procedural macros.
I really don't understand what that would do. It sounds like we would lose all the benefits of checked length we discussed above, and it would be significantly less convenient, as you can not pattern match like you can with an array. Perhaps you could elaborate a bit more on how that would work.

EDIT: I believe I misunderstood the suggestion. Instead of

```rust
let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;
```

it would be this?

```rust
let [status, x1, x2, y1, y2, z1, z2] = re.captures(&line).context("match failed")?.extract().1;
```

If so, I would be fine with that. There are pros and cons for both. My main concern is that the API is less discoverable if a user has to click through to `Captures` to find it.
Yes, your edit is correct. It's the same API, just on `Captures`. I am tempted to start there. I'm going to plan on getting this added for the regex 1.9 release.
Here is the API item for `Captures::extract`:

```rust
/// This is a convenience routine for extracting the substrings
/// corresponding to matching capture groups.
///
/// This returns a tuple where the first element corresponds to the full
/// substring of the haystack that matched the regex. The second element is
/// an array of substrings, with each corresponding to the substring
/// that matched for a particular capture group.
///
/// # Panics
///
/// This panics if the number of possible matching groups in this
/// `Captures` value is not fixed to `N` in all circumstances.
/// More precisely, this routine only works when `N` is equivalent to
/// [`Regex::static_captures_len`].
///
/// Stated more plainly, if the number of matching capture groups in a
/// regex can vary from match to match, then this function always panic.
///
/// For example, `(a)(b)|(c)` could produce two matching capture groups
/// or one matching capture group for any given match. Therefore, one
/// cannot use `extract` with such a pattern.
///
/// But a pattern like `(a)(b)|(c)(d)` can be used with `extract` because
/// the number of capture groups in every match is always equivalent,
/// even if the capture _indices_ in each match are not.
///
/// # Example
///
/// ```
/// use regex::Regex;
///
/// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
/// let hay = "On 2010-03-14, I became a Tenneessee lamb.";
/// let Some((full, [year, month, day])) =
/// re.captures(hay).map(|caps| caps.extract()) else { return };
/// assert_eq!("2010-03-14", full);
/// assert_eq!("2010", year);
/// assert_eq!("03", month);
/// assert_eq!("14", day);
/// ```
///
/// # Example: iteration
///
/// This example shows how to use this method when iterating over all
/// `Captures` matches in a haystack.
///
/// ```
/// use regex::Regex;
///
/// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
/// let hay = "1973-01-05, 1975-08-25 and 1980-10-18";
///
/// let mut dates: Vec<(&str, &str, &str)> = vec![];
/// for (_, [y, m, d]) in re.captures_iter(hay).map(|c| c.extract()) {
/// dates.push((y, m, d));
/// }
/// assert_eq!(dates, vec![
/// ("1973", "01", "05"),
/// ("1975", "08", "25"),
/// ("1980", "10", "18"),
/// ]);
/// ```
///
/// # Example: parsing different formats
///
/// This API is particularly useful when you need to extract a particular
/// value that might occur in a different format. Consider, for example,
/// an identifier that might be in double quotes or single quotes:
///
/// ```
/// use regex::Regex;
///
/// let re = Regex::new(r#"id:(?:"([^"]+)"|'([^']+)')"#).unwrap();
/// let hay = r#"The first is id:"foo" and the second is id:'bar'."#;
/// let mut ids = vec![];
/// for (_, [id]) in re.captures_iter(hay).map(|c| c.extract()) {
/// ids.push(id);
/// }
/// assert_eq!(ids, vec!["foo", "bar"]);
/// ```
pub fn extract<const N: usize>(&self) -> (&'t str, [&'t str; N]) { ... }
```
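As a small aside, the method composes nicely with `str::parse` when the captured text needs to become typed values; for example:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
    // N = 3 is inferred from the destructuring pattern.
    let (_, [y, m, d]) = re.captures("2010-03-14").unwrap().extract();
    let (y, m, d): (u32, u32, u32) =
        (y.parse().unwrap(), m.parse().unwrap(), d.parse().unwrap());
    assert_eq!((y, m, d), (2010, 3, 14));
}
```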
@BurntSushi Looks good. One nitpick: "then this function always panic." -> "always panics".

Also, if I'm not mistaken, the iteration example pays a bit of avoidable per-match overhead. The compiler might optimize that out though, in which case I can understand not wanting to introduce a new iterator type in the first iteration (heh) of the design. Should the API become successful, I do think a dedicated `extract_iter` would be worth adding.
Yeah.
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
@BurntSushi Nice! Just spotted a typo.
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

### Release Notes

rust-lang/regex (regex)

#### v1.9.1 (2023-07-07)

This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy.

Bug fixes:

- BUG #1027: Change the allocation strategy for the backtracker to be less aggressive.

#### v1.9.0 (2023-07-05)

This release marks the end of a years long rewrite of the regex crate internals (#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions:

- A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex.
- A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`.
- `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
- The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library.
- Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate.

New features:

- FEATURE #244: One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`.
- FEATURE #259: Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
- FEATURE #476: `std` is now an optional feature. `regex` may be used with only `alloc`.
- FEATURE #644: `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
- FEATURE #675: Anchored search APIs are now available in `regex-automata 0.3`.
- FEATURE #824: Add new `Captures::extract` method for easier capture group access.
- FEATURE #961: Add `regex-lite` crate with smaller binary sizes and faster compile times.
- FEATURE #1022: Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

- PERF #68: Added a one-pass DFA engine for faster capture group matching.
- PERF #510: Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`.
- PERF #787, PERF #891: Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have specific issues devoted to them.)

Bug fixes:

- BUG #429: Fix matching bugs related to `\B` and inconsistencies across internal engines.
- BUG #517: Fix matching bug with capture groups.
- BUG #579: Fix matching bug with word boundaries.
- BUG #779: Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
- BUG #850: Fix matching bug inconsistency between NFA and DFA engines.
- BUG #921: Fix matching bug where literal extraction got confused by `$`.
- BUG #976: Add documentation to replacement routines about dealing with fallibility.
- BUG #1002: Use corpus rejection in fuzz testing.
A lot of my use cases for regex involve extracting some information from text with a simple pattern, where full parsing would be overkill. In this scenario I generally just have `N` subsections of the text I'm looking to extract (possibly many times), where `N` is a small constant. Some examples come from the recent Advent of Code.

In such a scenario it is quite the pain to extract the information I want using the current API. It generally involves something like the `collect_tuple` snippet quoted earlier in this thread. Note that I already had to include a third-party library, `itertools`, just to get `collect_tuple`. Without that we would either have to start repeatedly calling `next()` or `get(i)`, or invoke an unnecessary allocation, and in both cases an extra check is needed to ensure there weren't too many captures (indicating a programmer error).

I think that adding a convenience method for this incredibly common use case would be nice. I would propose an `extract` method (with a sample implementation), and similarly an `extract_iter` returning `impl Iterator<Item = (&'t str, [&'t str; N])>`. The cumbersome example would then collapse to a single array destructuring, as shown in the comparison earlier in the thread, and another example from the README would be simplified in the same way.

Caveats

This API does panic, but I believe that this is fine. It will only ever panic due to a direct programmer error (mismatching possibly participating capture groups to the number of outputs), which would always have been wrong, and does not represent an exceptional but otherwise valid scenario. And anyone for whom this is not acceptable can always still use the original API.

Additionally, we call `.flatten()` on the capturing groups to hide any non-participating capturing groups. I believe this is actually fine as well, allowing use cases where we have an alternation with a similar capture group on either side, without wanting to capture the context.

If this would be a welcome addition I'm more than happy to make a pull request.
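Since `extract_iter` is only described in prose above, here is a rough sketch of that idea as a free function over the current API; the function name and the per-match `expect` calls are illustrative only and not part of any released interface:

```rust
use regex::Regex;

// Illustrative stand-in for the proposed `extract_iter`; a real method would hook
// into the static capture-count check instead of checking each match.
fn extract_iter<'t, const N: usize>(
    re: &'t Regex,
    text: &'t str,
) -> impl Iterator<Item = (&'t str, [&'t str; N])> + 't {
    re.captures_iter(text).map(|caps| {
        let mut it = caps.iter().flatten();
        let whole = it.next().expect("a match always has group 0").as_str();
        let mut groups: [&'t str; N] = [""; N];
        for slot in groups.iter_mut() {
            *slot = it.next().expect("wrong number of participating groups").as_str();
        }
        (whole, groups)
    })
}

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    for (whole, [y, m, d]) in extract_iter::<3>(&re, "1973-01-05 and 1975-08-25") {
        println!("{}: year={} month={} day={}", whole, y, m, d);
    }
}
```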