Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex 1.0 #1620

Merged
merged 8 commits into from
Sep 12, 2016
Merged

regex 1.0 #1620

merged 8 commits into from
Sep 12, 2016

Conversation

BurntSushi
Copy link
Member

@BurntSushi BurntSushi commented May 18, 2016

rendered

Pretty much all of the RFC is implemented and can be tracked at this PR: rust-lang/regex#230

rustdoc output of the corresponding API is also available: http://burntsushi.net/stuff/regex-rfc/regex/

@BurntSushi BurntSushi added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label May 18, 2016
### Expansion concerns
[expansion-concerns]: #expansion-concerns

There are a few possible avenues for expansion, and we take measures to make
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility here is "compatibility" flags, e.g. if there is a breaking change for whatever reason, people can opt-in to it with some sort of flag like (?X), ala (?i).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good idea, added!

@huonw
Copy link
Member

huonw commented May 18, 2016

I wonder if (usize, usize) is the best signature for find/find_iter? It seems to me that "forcing" users to slice themselves is annoying and somewhat error-prone in find, and doubly so for find_iter where one is less likely to slice to manually use the tail for something (the find_iter is already using the tail).

They could return &str, but this makes it harder to manually slice when that is desired (although, the standard library could provide a function on str like the old subslice_offset or maybe fn prefix_suffix(&self, substring: &str) -> (&str, &str) that requires substring be an in-memory substring of self and then returns the pieces before and after that substring). Thus, maybe they could return:

pub struct Found<'a> {
     string: &'a str,
     start: usize,
     end: usize,
}
impl<'a> Found<'a> {
     fn as_str(&self) -> &'a str { self.string }
     fn as_indices(&self) -> (usize, usize) { (self.start, self.end) }
}

(I'm not happy with the names.)

`String`.
* The `Error` type no longer has the `InvalidSet` variant, since the error is
no longer possible. Its `Syntax` variant was also modified to wrap a `String`
instead of a `regex_syntax::Error`. If you need access to specific parse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could wrap a type defined in regex that internally contains a regex_syntax::Error (the type may just implement Error, Display etc. for now), so we can choose to expose more detailed info in future if we want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a universally better idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@BurntSushi
Copy link
Member Author

@huonw I do like (usize, usize) because it's very straight-forward. Returning something like your Found struct is certainly plausible (&str on its own is no good for the reasons you stated), but the extra indirection as defined seems unfortunate. I suppose we could impl Deref on Found to &str, but that also feels unfortunate.

I think I can buy that (usize, usize) is mildly annoying, but I'm not sure how error prone it is if all you want is the matched text. You typically don't need to do any arithmetic; something like re.find(text).map(|(s, e)| &text[s..e]) will work.

Another alternative is adding a new method with a different return type, but I've tried hard to resist that urge.

Another alternative is to just have find return a Captures that only has the first capturing group, but that's less ergonomic then your Found type.

@Diggsey
Copy link
Contributor

Diggsey commented May 18, 2016

Hmm, neither the byte or string based regex implementations would be suited for matching on OsStrs.

It would be nice if the actual matching code, and all FSM related code could be generic over an element type. That way the regular expression syntax would be specific to the use (so it might only define regular expression syntax for unicode strings and byte sequences initially) but the rest of the algorithm is reusable, such that external crates could define their own "front ends" to the matching engine.

It would also allow future expansions to other string-like types (such as defining a regex syntax for matching OsStrs, where the regex itself might be an OsStr, so that it's capable of matching even non-unicode sequences, without depending on the platform-specific representation)

@huonw
Copy link
Member

huonw commented May 19, 2016

I think I can buy that (usize, usize) is mildly annoying, but I'm not sure how error prone it is if all you want is the matched text. You typically don't need to do any arithmetic; something like re.find(text).map(|(s, e)| &text[s..e]) will work.

The failure I was imagining is that code with several strings (possibly several that get regex'd) could accidentally mix up which strings are being sliced: re.find(text).map(|(s, e)| &text2[s..e]). The API as currently written has a disconnect between the input string and the output value, that requires the user to connect them back together right, but by providing direct access to the &str---which most people will be interested in---we completely side-step that.

Of course, you're in closer touch with the API/the crate than me so you're likely to have a better sense for the trade-offs here.

@BurntSushi
Copy link
Member Author

Hmm, neither the byte or string based regex implementations would be suited for matching on OsStrs.

Nope. The FSM implementation is currently very much coupled to UTF-8.

such that external crates could define their own "front ends" to the matching engine

I think this is a valuable thing, but is out of scope for this RFC. This RFC is not for breaking new ground and inventing a general purpose regular expression interface. I perceive this RFC to be for stabilizing well trodden ground. I'll be more clear: I personally don't want to design a regex interface. It's hard work, requires lots of experimentation, lots of use and at least two real implementations of said interface before I could believe it is good enough to stabilize.

@BurntSushi
Copy link
Member Author

Of course, you're in closer touch with the API/the crate than me so you're likely to have a better sense for the trade-offs here.

Your point is totally valid and your suggestion clearly has more safety. But that little bit of extra indirection is a touch annoying. I honestly don't think there is a clear right answer. Perhaps others can weigh in!

@eddyb
Copy link
Member

eddyb commented May 19, 2016

something like re.find(text).map(|(s, e)| &text[s..e]) will work.

What about Range<usize>? That seems to have more semantic information, maybe just enough to make it palatable.

@Veedrac
Copy link

Veedrac commented May 19, 2016

I was on the (likely unfounded) assumption that the -> &str API was to be dealt with by implementing Pattern when that stabilizes.

@BurntSushi
Copy link
Member Author

I was on the (likely unfounded) assumption

There is an implementation of Pattern in the regex crate, but it's only available on a nightly. The Pattern API doesn't fit all the use cases of Regex matching. I suppose we could toss out any methods strictly covered by Pattern (and also block regex 1.0 on Pattern stabilizing), but: 1) that makes for a very janky API experience, where some methods have the haystack as the receiver and others have the regex as a receiver and 2) it seems very unfortunate to block this RFC on stabilization of Pattern. The value of the Pattern impl IMO is that a Regex could be used in a more generic context.

@BurntSushi
Copy link
Member Author

@eddyb I don't follow. Could you explain a touch more please?

@Veedrac
Copy link

Veedrac commented May 19, 2016

I wasn't saying that we should block anything, just that I don't think we need to return anything more than (usize, usize) when &str-returning methods are already going to be available through other convenience APIs.

I'm curious as to what you think about the "janky API", though. Do you think the Pattern API is going to be a bit of a second-class citizen for regex?

@BurntSushi
Copy link
Member Author

I'm curious as to what you think about the "janky API", though. Do you think the Pattern API is going to be a bit of a second-class citizen for regex?

I don't know. I'm not sure what "second class citizen" means in this context. It seems like it depends on how you're using it. I said "janky" at least partially because I misunderstood your suggestion. For example, if we removed the find method from Regex under the logic that it is provided by Pattern, then the public API of Regex would be janky because a critical feature of Regex (find) is over there in this other interface with a method signature different than the other methods on Regex.

If you're not advocating removing find, then my comment doesn't really apply.

@BurntSushi
Copy link
Member Author

What about Range? That seems to have more semantic information, maybe just enough to make it palatable.

@eddyb clarified on IRC that he was talking about changing find to:

pub fn find(&self, text: &str) -> Option<::std::ops::Range<usize>>;

Which is kind of nice, but I don't think actually addresses @huonw's concern. :-/

http://doc.rust-lang.org/regex/regex/index.html#syntax

To my knowledge, the evolution as proposed in this RFC has been followed since
`regex` was created. The syntax has largely remain unchanged with few
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "remained"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

@RustPowers
Copy link

Why the RegexBuilder is using move semantics? This option looks much better: https://is.gd/0aLEJq. But if it is wrong, then why not clone the data only 1 time in the method compile?

@mawww
Copy link

mawww commented May 19, 2016

I'd suggest to really think about the trait based version, or any alternative that allow the use of arbitrary iterators. Regex matching on non &str is an important feature, think about the text editor use case that wont be storing its whole buffer content in a contiguous memory buffer (be it an array of lines, a rope, a gap buffer or anything else). Currently almost every regex engines except the C++ ones (std::regex and boost::regex namely) require a contiguous buffer and are a pain to work with for these cases.

@Diggsey
Copy link
Contributor

Diggsey commented May 19, 2016

I think this is a valuable thing, but is out of scope for this RFC.

This RFC need not define the new API, but it needs to consider the current API in the context of future expansion: will the current API make it difficult to add these features in future without breaking backwards compatibility?

2. It was faster.

Today, (1) can be simulated in practice with the use of a Clippy lint and (2)
is no longer true. In fact, `regex!` is at least one order of magnitude faster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

faster -> slower?

@BurntSushi
Copy link
Member Author

This option looks much better

@RustPowers Could you explain why it is much better? Cloning the pattern string is insignificant (and always will be).

@BurntSushi
Copy link
Member Author

This RFC need not define the new API, but it needs to consider the current API in the context of future expansion: will the current API make it difficult to add these features in future without breaking backwards compatibility?

@Diggsey The current API is defined to match on UTF-8 encoded text (or "ASCII compatible" text when the Unicode flag is disabled in a bytes::Regex). And no, I do not think it is the role of this RFC to design an interface for expanding a regex implementation to arbitrary text encodings.

@BurntSushi
Copy link
Member Author

I'd suggest to really think about the trait based version, or any alternative that allow the use of arbitrary iterators. Regex matching on non &str is an important feature, think about the text editor use case that wont be storing its whole buffer content in a contiguous memory buffer (be it an array of lines, a rope, a gap buffer or anything else). Currently almost every regex engines except the C++ ones (std::regex and boost::regex namely) require a contiguous buffer and are a pain to work with for these cases.

@mawww I think it is definitely a valuable feature, and can be important, for sure. But ultimately, I think it is out of scope for this RFC. Notably, I do not think it is unreasonable to expect implementors of text editors to provide their own regex implementation that works under their particular constraints. I note that the regex-syntax crate is available, which would make a simple Thompson NFA implementation actually feasible to implement because the hard part of that is done for you already: parsing.

Some things to consider:

Matching text that is not contiguous in memory is an orthogonal feature to matching arbitrary iterators, so your request is a little bit murky. I suspect text editors care more about matching text that isn't contiguous in memory rather than being able to use an arbitrary iterator. Matching on an arbitrary iterator is a streaming interface and is challenging for a number of reasons, and generally requires some type of clever buffering if you want to be fast. (There exist specialized regex engines built around the idea of streaming and it permeates their implementation and interface, for good reason.) Matching on non-contiguous memory is probably quite a bit easier, but would be a significant departure from today's implementation, which makes many assumptions about the layout of text in memory for performance reasons. I could conceive of a future where the regex crate offers such an API, but trying to unify it under a trait today in an interface seems ill-advised.

@BurntSushi
Copy link
Member Author

Well, then I suggest a compromise: fresh the API and release a 0.2 version. During the year, Pattern accurately standardize, and then you can release the cherished 1.0. Why wait so long? There's no hurry, besides, libraries which experienced several major version seem unreliable.

I don't want to split the regex API over multiple traits. I want a single cohesive interface because I think that is easiest to consume. I feel strongly about this.

The current regex crate API has been out in the wild and in use for two years at this point, there's no hurry.

Mandatory arguments better to be in the build function rather than in the constructor.

I don't necessarily agree. If mandatory arguments are in the constructor, then there's no way to misuse the API by "forgetting" to call a method before compilation.

It is advisable to implement the borrow chaining.

Yes, this was brought up above and I agreeish.

Why RegexBuilder::new() takes &str regexp pattern, converts it to String and adds to Vec?

Because &str is the most friendly API. There's no compelling reason I'm aware of to make it generic.

I look at the Regex::find_Iter and expect to see Regex::find_iter_mut, but it is not, so the current name is wrong. Maybe should name similar to the Pattern methods?

What are the semantics of find_iter_mut on Regex? Why does its absence imply the name find_iter is wrong?

The constructor Regex::with_size_limit should be remove, for this using RegexBuilder.

Yes, this is in the RFC. It is being removed.

Maybe should add *_unchecked methods to the Captures?

This is out of scope IMO. If someone can write a meaningful benchmark that would benefit from those methods, then I think I'd be happy to oblige, but we don't need to resolve this for 1.0.

Method Regex::as_str seems odd, because why in the compiled regex keep its textual representation?

Because reading the original string of the regex that was compiled is occasionally useful and doesn't really cost us anything.

Needs rightmost methods like a .find, .rfind, .is_match, .is_rmatch, etc.

Do you mean reverse searching? Maybe one day, but it's out of scope for 1.0.

I think it is worth to rename some of the methods and add more

Yes, I'm going to go over the iterator names and make sure they line up with conventions in std. I struggle with whether to use the Iter suffix because the suffix makes it really easy to distinguish which types are iterators and which aren't when reading the documentation. Although, I suppose there are other ways to solve that problem that don't involve the Iter suffix.

@BurntSushi
Copy link
Member Author

@RustPowers Thanks for the feedback! I purposefully didn't respond to a few of your concerns because they were about the implementation. If you'd like to talk about those, I'd be happy to, but please open an issue for each on the regex repo.

@BurntSushi
Copy link
Member Author

Sorry everyone for the delay, but summer hit and it was hard for me to find my way back. I think I'm now finally caught up with everyone's feedback. Here are some changes:

  • find and find_iter now return a Match.
  • The iterators defined on Captures have been removed in favor of one capture_names method defined on Regex.
  • RegexBuilder now uses &mut self instead of self and is eliminated using build instead of compile.
  • Most iterator types have been renamed to match the names found in std::str.
  • Upgrading Unicode definitions shouldn't qualify as a breaking change.

The implementation has also been updated to reflect these changes.

@alexcrichton
Copy link
Member

alexcrichton commented Aug 23, 2016

🔔 This RFC is now entering its final comment period 🔔

The @rust-lang/libs team is inclined to merge this RFC, given the recent updates @BurntSushi has made. @rust-lang/libs if you could check off your name below or write a comment below to the contrary:

@alexcrichton alexcrichton added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Aug 23, 2016
@leoyvens
Copy link

Is implementing Deref for Match an unresolved question?

@ahmedcharles
Copy link

Just an amusing comment, since I've had this discussion about other languages in the past.

Most languages (which support functions on objects) have a regex api where the regex is an object and searching/matching is done with methods on those objects. C++ seems to stand out in that, the regex class has very few methods and searching/matching is done through free functions.

The motivation is that regex's don't actually perform searching or matching, they are only an input into a searching/matching algorithm. I'm mildly curious if this was ever considered? (While it would likely be easy to change, given Rust's privacy model, it's probably too late in the process to have it be worth it.)

@BurntSushi
Copy link
Member Author

@ahmedcharles Not really. The Rust ecosystem generally eschews free functions in favor of types with methods. I also disagree that regexes don't actually perform searching or matching. Much of the API is very closely coupled with regexes. (For example, only a subset of the Regex API methods are available via the generic str::Pattern trait.) Defining a bunch of free functions like fn is_match(re: &Regex, haystack: &str) -> bool would generally be unidiomatic IMO.

@Diggsey
Copy link
Contributor

Diggsey commented Aug 23, 2016

Also, a Regex is not just the pattern to match, but is a compiled representation of that pattern which is directly tied to how the matching algorithm is implemented.

@Ericson2314
Copy link
Contributor

Ericson2314 commented Aug 25, 2016

I feel like allocator support and the synchronization problem are closely tied together---an API to explicitly preallocate and optionally share scratch space solves this nicely. I'm also wary of the bytes duplication / want to extend bytes to supports iterators or arbitrary sized types (streamed, fix-sized atoms). Most importantly, I think the recent acceptance of procedural macros 1.1 can put the wind back into the sails of regex-macros. I assume that the vast majority of use-cases use statically known regexes and thus are more appropriately handled with such an interface (I don't like the smell of lint + unwrap).

Given all this, I think the time will soon be ripe for a big overhaul, in a way it hasn't been since regex was first created. If making a 1.0 doesn't impede progress on a 2.0 at all, sure, go ahead. Otherwise 👎 from me.

@BurntSushi
Copy link
Member Author

Improving regex_macros such that it is on par with the dynamic implementation (which, on the contrary, has been overhauled quite a bit since regex was first created) is a significant project and I strongly disagree with that blocking a 1.0 release in any form. The really important point here is that it's not clear how much faster such an implementation could even be, and it probably comes with significant costs such as code bloat.

The Rust ecosystem hasn't solved the problem of allocator support yet. If it makes sense to build regex on top of that, and if that absolutely requires breaking changes, than a 2.0 release seems fine to me.

@Ericson2314
Copy link
Contributor

@BurntSushi do you know where the maintenance policy for stable rust-lang crates is laid out? If 2.0 can released not because breaking changes are "absolutely required" but just because they're nice to have, that would be good to know.

@BurntSushi
Copy link
Member Author

@Ericson2314 That would probably come under the purview of the rust-lang crates RFC. Specifically:

Crates in rust-lang may issue new major versions, just like any other crates, though such changes should go through the RFC process. While the library subteam is responsible for major decisions about the library after 1.0, its original author(s) will of course wield a great deal of influence, and their objections will be given due weight in the consensus process.

To specifically answer your question, my interpretation is that subsequent major version bumps are largely left up to the discretion of the maintainers, community, libs team and the willingness of someone to write an RFC for it.

@Ericson2314
Copy link
Contributor

Ericson2314 commented Aug 27, 2016

@BurntSushi Thanks for the link and quote. I either crates.io can support per-release versions, or prototypes can be disseminated via git and git dependencies (master vs stable branch for example). Does that sounds good to you as maintainer?

[N.B. I'm reminded that bitflags should probably do this for experimenting with associated constants.]

@BurntSushi
Copy link
Member Author

@Ericson2314 It doesn't sound unreasonable to me, but I can't really know what we'll do because I can't predict what exactly will provoke a new major version upgrade.

@Ericson2314
Copy link
Contributor

That's reasonable.

@aturon
Copy link
Member

aturon commented Aug 29, 2016

Just wanted to say: this is a beautiful and inspiring RFC. Basically every question I had while reading it was answered in the next paragraph.

I'm 👍 on landing this!

My biggest worry is about the bytes/utf8 split, but I pretty much buy your reasoning that the proposed API is the simplest, most ergonomic way to deal with it and that there aren't a lot of advantages for trying to be more clever.

We've talked sometimes about "inherent trait impls", where you implement inherent methods and a trait at the same time. That might eventually provide a way to ensure a consistent API across the two variants while retaining the simplicity and ergonomics of the current setup. But that's something to explore later down the line.

@alexcrichton
Copy link
Member

Ok, the decision of the libs team was to merge and accept, so I will do so! Thanks again for the RFC @BurntSushi!

@alexcrichton alexcrichton merged commit 8434b82 into rust-lang:master Sep 12, 2016
@BurntSushi BurntSushi deleted the regex-1.0 branch September 12, 2016 22:37
@ciphergoth
Copy link
Contributor

@BurntSushi
Copy link
Member Author

@ciphergoth Thanks! Updated!

@Centril Centril added A-nursery Proposals relating to the rust-lang-nursery. A-regex Proposals relating to regular expressions. labels Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-nursery Proposals relating to the rust-lang-nursery. A-regex Proposals relating to regular expressions. final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.