Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add convenience methods for extracting bits of text #824

Closed
orlp opened this issue Dec 22, 2021 · 33 comments
Closed

Add convenience methods for extracting bits of text #824

orlp opened this issue Dec 22, 2021 · 33 comments

Comments

@orlp
Copy link
Contributor

orlp commented Dec 22, 2021

A lot of my use cases for regex involve extracting some information from text with a simple pattern, where full parsing would be overkill. In this scenario I generally just have N subsections of the text I'm looking to extract (possibly many times), where N is a small constant. Some examples from the recent advent of code:

let re = Regex::new(r"target area:\s*x=(-?\d+)..(-?\d+), y=(-?\d+)..(-?\d+)")?;
let re = Regex::new(r"Player 1 starting position:\s*(\d)\s*Player 2 starting position:\s*(\d)\s*")?;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;

In such a scenario it is quite the pain to extract the information I want using the current API. It generally involves something like this, for the last example:

use anyhow::Context;
use itertools::Itertools;
let (status, x1, x2, y1, y2, z1, z2) = re
    .captures(&line)
    .context("match failed")?
    .iter()
    .skip(1)
    .map(|c| c.unwrap().as_str())
    .collect_tuple()
    .unwrap();

Note that I already had to include a third-party library itertools just to get collect_tuple, without that we would either have to start repeatedly calling next() or get(i), or invoke an unnecessary allocation, and in both cases an extra check is needed to ensure there weren't too many captures (indicating a programmer error).

I think that adding a convenience method for this incredibly common use-case would be nice. I would propose the following (with sample implementation):

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures is not equal to N.
fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
    let caps = re.captures(text)?;
    let mut participating = caps.iter().flatten();
    let whole_match = participating.next().unwrap().as_str();
    let captured = [0; N].map(|_| participating.next().unwrap().as_str());
    assert!(participating.next().is_none(), "too many participating capture groups");
    Some((whole_match, captured))
}

And similarly extract_iter returning impl Iterator<Item = (&'t str, [&'t str; N])>.


Then, the cumbersome example above would become the elegant:

let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

Another example from the README:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
for caps in re.captures_iter(TO_SEARCH) {
    // Note that all of the unwraps are actually OK for this regex
    // because the only way for the regex to match is if all of the
    // capture groups match. This is not true in general though!
    println!("year: {}, month: {}, day: {}",
                caps.get(1).unwrap().as_str(),
                caps.get(2).unwrap().as_str(),
                caps.get(3).unwrap().as_str());
}

This could become:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
for (_date, [year, month, day]) in re.extract_iter(TO_SEARCH) {
    println!("year: {}, month: {}, day: {}", year, month, day);
}

Caveats

This API does panic, but I believe that this is fine. It will only ever panic due to a direct programmer error (mismatching possibly participating capture groups to the number of outputs), which would always have been wrong, and does not represent an exceptional but otherwise valid scenario. And for anyone for which this is not acceptable, they can always still use the original API.

Additionally, we call .flatten() on the capturing groups to hide any non-participating capturing groups. I do believe this is actually fine as well, allowing uses cases such as this, where we have an alternation with a similar capture group on either side, without wanting to capture the context:

let id_strings_re = Regex::new(r#""id:([^"]*)"|'id:([^']*)'"#)?;
let [id] = id_strings_re.extract(text).unwrap().1;

If this would be a welcome addition I'm more than happy to make a pull request.

@BurntSushi
Copy link
Member

I do find this somewhat intriguing. Const generics does make this sort of API reasonably appealing.

I haven't given this a ton of thought yet, but one question did pop up. Toward the end, you mention that capturing groups are flattened to remove non-participating capturing groups. You provide an example where both sides of an alternation have the same number of captures. But what happens when both sides of the branch have a different number of capturing groups? I believe in such a case, it seems almost impossible to correctly use this new extract API. Because the size of the array returned is determined at compile time, but because of the flattening, the actual number of capture matches is determined at runtime. So it's impossible to reconcile.

I'm not saying that this means we shouldn't add an API like this, but it seems like a possible footgun to me. And I don't think it's a terribly uncommon use case, although certainly not as common as others.

@orlp
Copy link
Contributor Author

orlp commented Dec 22, 2021

But what happens when both sides of the branch have a different number of capturing groups? I believe in such a case, it seems almost impossible to correctly use this new extract API. Because the size of the array returned is determined at compile time, but because of the flattening, the actual number of capture matches is determined at runtime. So it's impossible to reconcile.

Yes, I would consider this a programmer error. There are three ways we could deal with it:

  1. As it is now in the sample implementation, only panic if the number of matched participating groups differs from N. A bit questionable, because your program might seem to work correctly in testing, but panic and fail for a different input.

  2. During regex compilation, keep track of the number of capture groups in the syntax tree. When a|b occurs in the regex, and the number of capture groups for a and b differs, set a flag. Then, extract will check this flag at the start, and if set, immediately panic. (This does assume this problem can only be introduced due to alternation, correct me if I'm wrong on this.)

  3. Don't allow any 'fallible capture groups' (that is, a capture group inside an alternation, or any other such constructs I've missed) whatsoever when using extract, again by checking and setting a flag during compilation. I think this is unnecessarily strict.

@BurntSushi
Copy link
Member

I think it can be introduced in other ways, e.g., (?:(\w)(\s))? might have 2 or 0 capturing groups. But either way, your suggestion is I believe workable: such a property is something that can be detected by analyzing the syntax of the regex.

@orlp
Copy link
Contributor Author

orlp commented Dec 22, 2021

And I'm sure that there's valid regexes that can be proven to always have the same number of capturing groups that would be excluded by such a static analysis. This would be the advantage of approach nr. 1, nothing would be disallowed. But I do think approach nr. 2 is the most sensible for most users, preventing accidental footguns. Then you can be sure that if it works in testing, it always works. Again, people can always use the original API if they want full flexibility.

Briefly browsing through the source code I think adding a 'deterministic capture group' check to Hir shouldn't be too difficult by adding a method fn deterministic_capture_group_count(&self) -> Option<usize> which recursively computes the value, using the basic rules such that an Alternation HirKind's children must all have an equal Some(x), else the result is None. This value is then computed once during regex compilation near the end, and stored, and extract can then check

assert_eq!(exec.ro.deterministic_capture_group_count, Some(N), "extract must always have N capture groups")

@BurntSushi
Copy link
Member

Ideally such properties are computed inductively and not recursively. The Hir type has a bunch of "smart" constructors for exactly this reason, and indeed, they are the only public way to compose regular expression syntactic values, precisely so that a number of properties can be computed inductively. This avoids needing to do another recursive pass on the syntax tree.

In any case, I haven't fully decided whether this is an API I'd want. I need to noodle on it. And ultimately get more eyes on it, since once it's added, that's it. No going back.

@orlp
Copy link
Contributor Author

orlp commented Dec 22, 2021

Alright, fair enough, in that case it would be an attribute on HirInfo. Out of curiosity I went over the cases and came up with this:

For empty(), literal(), class(), anchor(), word_boundary():

info.det_capture_count = Some(0);

For repetition(rep):

info.det_capture_count = match (rep.kind, rep.hir.det_capture_count) {
    // Always zero.
    (_, Some(0)) => Some(0),
    (RepetitionKind::Range(RepetitionRange::Exactly(0)), _) => Some(0),
    (RepetitionKind::Range(RepetitionRange::Bounded(_, 0)) => Some(0),

    // Nondeterministic.
    (RepetitionKind::ZeroOrOne, _) => None,
    (RepetitionKind::ZeroOrMore, _) => None,
    (RepetitionKind::Range(RepetitionRange::AtLeast(0)), _) => None,
    (RepetitionKind::Range(RepetitionRange::Bounded(0, _)) => None
    
    // Deterministic.
    (RepetitionKind::OneOrMore, Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::Exactly(_)), Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::AtLeast(_)), Some(n)) => Some(n),
    (RepetitionKind::Range(RepetitionRange::Bounded(_, _)), Some(n)) => Some(n),
};

For group(grp):

info.det_capture_count = match grp.kind {
    GroupKind::CaptureIndex(_) => grp.hir.info.det_capture_count.map(|n| n + 1),
    GroupKind::CaptureName(_) => todo!("unresolved question: disallowing named captures, or treat as unnamed?"),
    GroupKind::NonCapturing(_) => grp.hir.info.det_capture_count,
};

For concat(exprs):

info.det_capture_count = exprs.iter().fold(Some(0), |cnt, e| Some(cnt? + e.info.det_capture_count?));

For alternation(exprs):

info.det_capture_count = exprs.iter().reduce(|a, b| (a? == b?).then(|| a.unwrap())).unwrap_or(Some(0));

It does lead to an unresolved question: what should we do regarding named capture groups when using extract?

@BurntSushi
Copy link
Member

I think they get treated as unnamed groups. Every capturing group, whether named or not, is addressable by index. And this API merely requires that groups are indexable, which is satisfied by all such groups. Named groups just give a name to a group in addition to an index.

@orlp
Copy link
Contributor Author

orlp commented Dec 22, 2021

Ok, that makes sense, then it's just

GroupKind::CaptureIndex(_) | GroupKind::CaptureName(_) => grp.hir.info.det_capture_count.map(|n| n + 1),

@orlp
Copy link
Contributor Author

orlp commented Apr 30, 2022

In any case, I haven't fully decided whether this is an API I'd want. I need to noodle on it. And ultimately get more eyes on it, since once it's added, that's it. No going back.

How is the noodling coming along? As a summary refresher I believe we agreed on a static analysis method being feasible to allow a deterministic number of capture groups be efficiently determined during Regex compile time. This would allow

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures is not equal to N, or is non-deterministic.
fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])>

to always immediately panic when called with an incorrect N or if the regex has an non-deterministic number of capture groups. This makes the extract method either always succeed or always fail, and never a footgun depending on the text input, which was your main concern on the initial proposal. The existence of this method would make using regex (in my opinion) vastly more convenient to use for extracting matches from text, e.g. replacing

use anyhow::Context;
use itertools::Itertools;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let (status, x1, x2, y1, y2, z1, z2) = re
    .captures(&line)
    .context("match failed")?
    .iter()
    .skip(1)
    .map(|c| c.unwrap().as_str())
    .collect_tuple()
    .unwrap();

with

use anyhow::Context;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

@BurntSushi
Copy link
Member

Is it possible to provide a bigger code example? Looking at your samples with fresh eyes, they look unnecessarily convoluted by the fact that you're using tuples. But if you're using tuples, why not just use capture indexing instead? Because you have a tuple, you must know the number of capturing groups at compile time. For example:

let (x, y, z) = (caps[1], caps[2], caps[3]);

I believe this is equivalent to your code but much simpler.

@orlp
Copy link
Contributor Author

orlp commented May 1, 2022

I believe this is equivalent to your code but much simpler.

Well, to be apples to apples with the previous example we have to extract all seven parts:

use anyhow::Context;
let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
let caps = re.captures(&line).context("match failed")?;
let (status, x1, x2, y1, y2, z1, z2) = (caps[1], caps[2], caps[3], caps[4], caps[5], caps[6], caps[7]);

This sort of approach really smells though, for multiple reasons:

  1. We have hardcoded a loop over the captures using manual indexing. This is error prone, and boiler-platey.

  2. This code is not actually equivalent in the sense that we have lost the panic when we mess up and mis-match the number of capture groups to our extraction. For example, say we no longer needed the status but forgot to remove the capture group from the regex, and introduced this bug:

    let re = Regex::new(r"(on|off)\s+x=(-?\d+)..(-?\d+),y=(-?\d+)..(-?\d+),z=(-?\d+)..(-?\d+)")?;
    let caps = re.captures(&line).context("match failed")?;
    let (x1, x2, y1, y2, z1, z2) = (caps[1], caps[2], caps[3], caps[4], caps[5], caps[6]);

    This would just silently 'succeed' because our indexing scheme does not check that the number of capture groups matches our extraction. The extract method would correctly instantly panic, as would the collect_tuple.

  3. The indexing scheme is less flexible because it does not flatten deterministically determined capture groups. E.g. consider an (overly) simplified regex for extracting URLs from HTML:

    let re = Regex::new(r#"<(a)[^>]+href="([^"]+)"|<(img)[^>]+src="([^"]+)""#);
    for (_el, [tag, url]) in re.extract_iter(&html).context("match failed")? {
        println!("{tag} {url}");
    }

    You can't do this with the indexing API, because we have four capture groups that alternatingly participate.

  4. The indexing approach also suffers from your initially raised footgun: captures can participate or not depending on the user input. Because we haven't done any static checking for deterministic capture groups the indexing approach might succeed with test inputs but fail in production. With extract and extract_iter we can rest easy that our number of capture groups always matches our intent exactly, regardless of the string being matched.

@BurntSushi
Copy link
Member

Those are good points.

Does the ecosystem have a way of introducing unstable APIs into crates? I'm not convinced we can get this right on our first try. It would be better I think to make it available somehow without committing to it to see how it gets used in practice.

@orlp
Copy link
Contributor Author

orlp commented May 1, 2022

@BurntSushi tokio does it via a rustc cfg flag: https://docs.rs/tokio/latest/tokio/index.html#unstable-features The most important reason for this (as opposed to a regular feature) is that you can't silently depend on an unstable feature transitively. This way always requires the final person building the crate to set the regex_unstable rustc flag and be aware that they're opting into semver breaking changes. This seems sensible and can definitely be a way to go.

In this very specific scenario there might be an alternative middle-ground. Note that Regex already has this method:

/// Returns the number of captures.
pub fn captures_len(&self) -> usize

I believe could fairly reasonably stabilize the following method on Regex:

/// Returns the number of participating captures that this regex will return on a successful match.
/// If this number is non-deterministic and would depend on the input this function returns `None`.
pub fn participating_captures_len(&self) -> Option<usize>

This function to me looks well-defined and future-proof (we can always return None if no static analysis for future features is possible/reasonable), so even conservatively I don't think we have to worry about getting this function right.

Then if this function were available in regex, we could write extract and extract_iter as proposed as an extension trait in a crate, say, regex_unstable_extensions, where this crate would forever stay < 1.0 to not run into semver stability issues. Then a user would simply state

use regex::Regex;
use regex_unstable_extensions::Extract;

and have extract and extract_iter methods available in their scope for Regex. If we feel these functions are a success and properly designed we can move them into regex proper.

@BurntSushi
Copy link
Member

I like the idea of starting with a participating_captures_len. That's a much easier pill to swallow. And it would need to be implemented anyway for extract to work.

My preference would be to wait until #656 lands to add any new features, but this is somewhat small so I think I'd be willing to accept a PR for this. Although it will require I think paving some new ground in regex-syntax, since today I believe all inductive properties are predicates, where as this one is an integer. I don't think it will be a hard path to pave, but it's worth saying that it will need to be paved, if that makes sense. :)

(I don't have any time table for #656 landing. It is my primary focus, but I've been working on it for years already. I just don't have the time I used to devote to FOSS.)

Also, small nit, but "non-deterministic" doesn't sound right to me in this context. The real key here is that the compile time syntactic analysis is impossible to reconcile with the runtime behavior in all cases. I'm not sure how to highlight that pithily, but perhaps being pithy is the wrong approach for something subtle like this anyway. But that wording can be hashed out later.

@BurntSushi
Copy link
Member

Also, speaking for my own maintenance bandwidth, I definitely don't have the capacity to maintain a separate unstable crate. That might be something you want to do if you want to go that path. But I think I could take on a cfg gate like what tokio does. That looks low cost.

@orlp
Copy link
Contributor Author

orlp commented May 1, 2022

@BurntSushi Just to confirm I read you correctly, you would be open to accepting a PR implementing participating_captures_len as a start?

As for the non-deterministic wording, how about:

/// Returns the number of participating captures that this regex will
/// return on a successful match. If this number can not be statically
/// determined from the regex this function returns `None`.
pub fn participating_captures_len(&self) -> Option<usize>

@BurntSushi
Copy link
Member

@orlp Yeah! I'll have the think on the name a bit more and I'll review docs more carefully in PR review.

@orlp
Copy link
Contributor Author

orlp commented Sep 24, 2022

Should participating_captures_len get merged, the new proposed implementation for extract is:

/// Finds the leftmost-first match in `text` and returns a tuple containing the whole match
/// and its N participating capture groups as strings. If no match is found, `None` is returned.
///
/// # Panics
///
/// Panics if the number of participating captures can not be statically determined to be equal to N.
/// This is likely due to an optional capture group, or a capture group mis-match between the left
/// and right hand side of an alternation. If you need an optional capture group, consider making the
/// contents of rather than the group itself optional, e.g. `((?:foo)?)` instead of `(foo)?`.
pub fn extract<'t, const N: usize>(&self, text: &'t str) -> Option<(&'t str, [&'t str; N])> {
    match self.participating_captures_len() {
        Some(c) if c == N => {
            let caps = self.captures(text)?;
            let mut participating = caps.iter().flatten();
            let whole_match = participating.next().unwrap().as_str();
            let captured = [0; N].map(|_| participating.next().unwrap().as_str());
            Some((whole_match, captured))
        }
        Some(c) => panic!("wrong number of capture groups, expected {} found {}", N, c),
        None => panic!("regex could not statically be determined to have {} capture groups", N),
    }
}

With a self-contained usage example:

let re = Regex::new(r"\s*(\d{4})-(\d{2})-(\d{2})\s*").unwrap();
let [year, month, day] = re.extract(&date).expect("could not parse date").1;

While it and its cousin extract_iter are fairly trivial to implement, I strongly feel that this method would make regex much more convenient out-of-the box for the average user if they are interested in extracting a bit of text from a larger document or simply destructuring text. I think its inclusion in regex would help a lot of people save time, and reduce bugs as the number of matches is now automatically and statically checked.

@BurntSushi
Copy link
Member

I grant that it looks very appealing, but the footgun here is very concerning to me. I'm not sure how to mitigate it, and if we can't, whether it's still worth adding. I'm just not sold on it, sorry.

@orlp
Copy link
Contributor Author

orlp commented Sep 24, 2022

I grant that it looks very appealing, but the footgun here is very concerning to me. I'm not sure how to mitigate it, and if we can't, whether it's still worth adding. I'm just not sold on it, sorry.

Which footgun are you referring to?

@BurntSushi
Copy link
Member

Panicking when the number of capturing groups isn't fixed.

@BurntSushi
Copy link
Member

I think there are other problems too, particularly with its flexibility. i.e., If you want to only get the first N capturing groups, you still have to write out a type that handles all of the capturing groups.

@orlp
Copy link
Contributor Author

orlp commented Sep 24, 2022

Panicking when the number of capturing groups isn't fixed.

I would say this is a feature rather than a footgun, it ensures that you'll never get a panic due to a bad index with some unexpected input. It guarantees that if you test your function once it will handle all possible inputs. I think it will reduce bugs rather than introduce them.

I think there are other problems too, particularly with its flexibility. i.e., If you want to only get the first N capturing groups, you still have to write out a type that handles all of the capturing groups.

I don't think that's all that bad. E.g. if I only wanted the year in the above I have two options:

let [year, _, _] = re.extract(&date).expect("could not parse date").1;
let [year, ..] = re.extract::<3>(&date).expect("could not parse date").1;

The latter scales even for very large number of capture groups. Plus it's not like extract replaces anything, you can still use the old way if it's more convenient.

@BurntSushi
Copy link
Member

I would say this is rather a feature than a footgun, it ensures that you'll never get a panic due to a bad index with some unexpected input. It guarantees that if you test your function once it will handle all possible inputs. I think it will reduce bugs rather than introduce them.

That's a decent point.

I'd like to revisit this after #656 lands. I think the next step will be to write up the API along with its documentation and doc examples, and put it into the first comment of this issue. At that point, we should request feedback from folks to make sure we aren't missing anything.

@BurntSushi
Copy link
Member

I also wonder whether it might be better to flatten the tuple into a single array, where the first element of the array is the entire match. That would be more consistent with the rest of the APIs in the crate.

@orlp
Copy link
Contributor Author

orlp commented Sep 24, 2022

I'd like to revisit this after #656 lands.

That is fine.

I also wonder whether it might be better to flatten the tuple into a single array, where the first element of the array is the entire match.

My only problem with that is that it makes using [T; N]::map for follow-up parsing more annoying, as there isn't an easy way to split arrays. Say with the date example if you wanted it as numbers you could do

let ymd = re.extract(&date).expect("could not parse date").1;
let [year, month, day] = ymd.map(|x| x.parse::<u32>().unwrap());

if the result was simply [&str; N] including the whole string you would need to either manually specify the number of capture groups to extract, or repeat the year, month, day thrice:

// Specify length manually.
let [_, ymd @ ..] = re.extract<4>(&date).expect("could not parse date");
let [year, month, day] = ymd.map(|x| x.parse::<u32>().unwrap());

// Repeat variables.
let [_, year, month, day] = re.extract(&date).expect("could not parse date");
let [year, month, day] = [year, month, day].map(|x| x.parse::<u32>().unwrap());

I just feel that most of the time when extracting you're only interested in the bits you extract, and not the whole match. And when you do want the whole thing it's there as the first tuple member.

That said this isn't a hill I'm willing to die on, if you'd still prefer a flat single array.

BurntSushi added a commit that referenced this issue Mar 4, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Mar 5, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Mar 15, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Mar 15, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Mar 15, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Mar 20, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Mar 21, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 15, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 15, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds a new routine for computing the static number of capture
groups that will appear in every match. If the number of groups is not
invariant across all matches, then there is no static capture length.

This is meant to help implement higher level convenience APIs for
extracting capture groups, such as the one described in #824. We may
wind up including such APIs in the regex crate itself, but this commit
stops short of that. Instead, we just add this new property which should
permit those APIs to exist outside of this crate for now.

Closes #908
@BurntSushi
Copy link
Member

I've been giving this API proposal a bit more thought. I've thought of some other ideas/questions/concerns:

  1. It seems like this API would discourage the use of capture group names in favor of relying on capture group positions.
  2. I believe an extract() method could be added to Captures, which would make the API a bit more orthogonal at the expense of additional typing. So for example, you'd need to do re.captures(haystack).map(|c| c.extract()) instead. It would similarly avoid needing to add a new iterator type, e.g., re.captures_iter(haystack).map(|c| c.extract()).

I want to clarify that I am overall positive on adding this API. I'm just trying to think of every angle I can, because even though it looks small, I consider this to be a major addition.

@orlp
Copy link
Contributor Author

orlp commented May 4, 2023

It seems like this API would discourage the use of capture group names in favor of relying on capture group positions.

I suppose. You could design an extract API that uses capture group names as well, but that would involve procedural macros.

I believe an extract() method could be added to Captures, which would make the API a bit more orthogonal at the expense of additional typing.

I really don't understand what that would do. Would .extract() here just be shorthand for .unwrap().as_str()?

If so, we would lose all the benefits of checked length we discussed above, and it would be significantly less convenient as you can not pattern match like you can with an array. Perhaps you could elaborate a bit more on how .extract() on Captures would work?

EDIT: I believe I misunderstood the .map to be the iterator map rather than a Result map. Would I be correct in understanding that you would propose instead of this:

let [status, x1, x2, y1, y2, z1, z2] = re.extract(&line).context("match failed")?.1;

it would be this?

let [status, x1, x2, y1, y2, z1, z2] = re.captures(&line).context("match failed")?.extract().1;

If so, I would be fine with that. There's pros and cons for both. My main concern is that the API is less discoverable if a user has to click on Captures in the documentation to find out there's an extract method at all.

@BurntSushi
Copy link
Member

Yes, your edit is correct. It's the same API, just on Captures.

I am tempted to start there, because even if we added Regex::{extract, extract_iter}, it would probably make sense to also add it to Captures as well. Starting there is also a little more conservative. If the API turns out to be a bad idea, the fact that it's a little less discoverable will be an advantage. But if it turns out to be a hit, then we can add Regex::{extract, extract_iter} later.

I'm going to plan on getting this added to the regex 1.9 release. And I'll try and mitigate the discoverability problem, at least for now, by using it in top-level examples in the crate docs. (Which I am currently rewriting.)

@BurntSushi
Copy link
Member

Here is the API item for Captures::extract. What do you think?

    /// This is a convenience routine for extracting the substrings
    /// corresponding to matching capture groups.
    ///
    /// This returns a tuple where the first element corresponds to the full
    /// substring of the haystack that matched the regex. The second element is
    /// an array of substrings, with each corresponding to the to the substring
    /// that matched for a particular capture group.
    ///
    /// # Panics
    ///
    /// This panics if the number of possible matching groups in this
    /// `Captures` value is not fixed to `N` in all circumstances.
    /// More precisely, this routine only works when `N` is equivalent to
    /// [`Regex::static_captures_len`].
    ///
    /// Stated more plainly, if the number of matching capture groups in a
    /// regex can vary from match to match, then this function always panic.
    ///
    /// For example, `(a)(b)|(c)` could produce two matching capture groups
    /// or one matching capture group for any given match. Therefore, one
    /// cannot use `extract` with such a pattern.
    ///
    /// But a pattern like `(a)(b)|(c)(d)` can be used with `extract` because
    /// the number of capture groups in every match is always equivalent,
    /// even if the capture _indices_ in each match are not.
    ///
    /// # Example
    ///
    /// ```
    /// use regex::Regex;
    ///
    /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
    /// let hay = "On 2010-03-14, I became a Tenneessee lamb.";
    /// let Some((full, [year, month, day])) =
    ///     re.captures(hay).map(|caps| caps.extract()) else { return };
    /// assert_eq!("2010-03-14", full);
    /// assert_eq!("2010", year);
    /// assert_eq!("03", month);
    /// assert_eq!("14", day);
    /// ```
    ///
    /// # Example: iteration
    ///
    /// This example shows how to use this method when iterating over all
    /// `Captures` matches in a haystack.
    ///
    /// ```
    /// use regex::Regex;
    ///
    /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
    /// let hay = "1973-01-05, 1975-08-25 and 1980-10-18";
    ///
    /// let mut dates: Vec<(&str, &str, &str)> = vec![];
    /// for (_, [y, m, d]) in re.captures_iter(hay).map(|c| c.extract()) {
    ///     dates.push((y, m, d));
    /// }
    /// assert_eq!(dates, vec![
    ///     ("1973", "01", "05"),
    ///     ("1975", "08", "25"),
    ///     ("1980", "10", "18"),
    /// ]);
    /// ```
    ///
    /// # Example: parsing different formats
    ///
    /// This API is particularly useful when you need to extract a particular
    /// value that might occur in a different format. Consider, for example,
    /// an identifier that might be in double quotes or single quotes:
    ///
    /// ```
    /// use regex::Regex;
    ///
    /// let re = Regex::new(r#"id:(?:"([^"]+)"|'([^']+)')"#).unwrap();
    /// let hay = r#"The first is id:"foo" and the second is id:'bar'."#;
    /// let mut ids = vec![];
    /// for (_, [id]) in re.captures_iter(hay).map(|c| c.extract()) {
    ///     ids.push(id);
    /// }
    /// assert_eq!(ids, vec!["foo", "bar"]);
    /// ```
    pub fn extract<const N: usize>(&self) -> (&'t str, [&'t str; N]) { ... }

@orlp
Copy link
Contributor Author

orlp commented May 5, 2023

@BurntSushi Looks good, one nitpick, "then this function always panic." -> always panics.

Also if I'm not mistaken for the captures_iter examples it should just be .captures_iter().extract(), without the map. Otherwise we are checking the static capture length on each iteration.

The compiler might optimize that out though, in which case I can understand not wanting to introduce a new iterator type in the first iteration (heh) of the design. Should the API becomes successful I do think that re.extract and re.extract_iter would provide a decent amount of convenience to the user, but I am more than happy with this initial version.

@BurntSushi
Copy link
Member

Yeah captures_iter().extract() is a nice idea, but I'd prefer to go the conservative route for now.

The static_captures_len check is not something I'm too worried about. If it were done during a find_iter I'd probably care about it more, but extracting captures is already pretty expensive. Remember, the captures and captures_iter APIs are already allocating an entire Vec<usize> internally to store all of the capture offsets. (You have to use lower level APIs to avoid that.)

regex-automata will also expose an extract routine on its Captures type. It doesn't do the static_captures_len check and lets you pull out a few number of groups than what exist. And as the examples show, callers can create a Captures themselves and pass it to regex search routines in order to avoid creating a new allocation for each search. I say all of this just to point out that it's possible to circumvent some of the checks in the higher level more convenient API by dropping down into lower level APIs.

BurntSushi added a commit that referenced this issue Jul 5, 2023
I usually close tickets on a commit-by-commit basis, but this refactor
was so big that it wasn't feasible to do that. So ticket closures are
marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961

Closes #68
Closes #510
Closes #787
Closes #891

Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002

Closes #656
@orlp
Copy link
Contributor Author

orlp commented Jul 5, 2023

@BurntSushi Nice! Just spotted a typo, Capptures instead of Captures in the changelog.

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jul 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

\==================
This is a patch release which fixes a memory usage regression. In the regex
1.9 release, one of the internal engines used a more aggressive allocation
strategy than what was done previously. This patch release reverts to the
prior on-demand strategy.

Bug fixes:

-   [BUG #&#8203;1027](rust-lang/regex#1027):
    Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

\==================
This release marks the end of a [years long rewrite of the regex crate
internals](rust-lang/regex#656). Since this is
such a big release, please report any issues or regressions you find. We would
also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

-   A new `Captures::extract` method for quickly accessing the substrings
    that match each capture group in a regex.
-   A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
    Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
    `(?m:$)` match after and before both `\r` and `\n`, respectively, but never
    between a `\r` and `\n`.
-   `RegexBuilder::line_terminator` was added to further customize the line
    terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
-   The `std` Cargo feature is now actually optional. That is, the `regex` crate
    can be used without the standard library.
-   Because `regex 1.9` may make binary size and compile times even worse, a
    new experimental crate called `regex-lite` has been published. It prioritizes
    binary size and compile times over functionality (like Unicode) and
    performance. It shares no code with the `regex` crate.

New features:

-   [FEATURE #&#8203;244](rust-lang/regex#244):
    One can opt into CRLF mode via the `R` flag.
    e.g., `(?mR:$)` matches just before `\r\n`.
-   [FEATURE #&#8203;259](rust-lang/regex#259):
    Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
-   [FEATURE #&#8203;476](rust-lang/regex#476):
    `std` is now an optional feature. `regex` may be used with only `alloc`.
-   [FEATURE #&#8203;644](rust-lang/regex#644):
    `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
-   [FEATURE #&#8203;675](rust-lang/regex#675):
    Anchored search APIs are now available in `regex-automata 0.3`.
-   [FEATURE #&#8203;824](rust-lang/regex#824):
    Add new `Captures::extract` method for easier capture group access.
-   [FEATURE #&#8203;961](rust-lang/regex#961):
    Add `regex-lite` crate with smaller binary sizes and faster compile times.
-   [FEATURE #&#8203;1022](rust-lang/regex#1022):
    Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

-   [PERF #&#8203;68](rust-lang/regex#68):
    Added a one-pass DFA engine for faster capture group matching.
-   [PERF #&#8203;510](rust-lang/regex#510):
    Inner literals are now used to accelerate searches, e.g., `\w+@&#8203;\w+` will scan
    for `@`.
-   [PERF #&#8203;787](rust-lang/regex#787),
    [PERF #&#8203;891](rust-lang/regex#891):
    Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

-   [BUG #&#8203;429](rust-lang/regex#429):
    Fix matching bugs related to `\B` and inconsistencies across internal engines.
-   [BUG #&#8203;517](rust-lang/regex#517):
    Fix matching bug with capture groups.
-   [BUG #&#8203;579](rust-lang/regex#579):
    Fix matching bug with word boundaries.
-   [BUG #&#8203;779](rust-lang/regex#779):
    Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
-   [BUG #&#8203;850](rust-lang/regex#850):
    Fix matching bug inconsistency between NFA and DFA engines.
-   [BUG #&#8203;921](rust-lang/regex#921):
    Fix matching bug where literal extraction got confused by `$`.
-   [BUG #&#8203;976](rust-lang/regex#976):
    Add documentation to replacement routines about dealing with fallibility.
-   [BUG #&#8203;1002](rust-lang/regex#1002):
    Use corpus rejection in fuzz testing.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9-->

Co-authored-by: cabr2-bot <[email protected]>
Co-authored-by: crapStone <[email protected]>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <[email protected]>
Co-authored-by: Calciumdibromid Bot <[email protected]>
Co-committed-by: Calciumdibromid Bot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants