Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp/syntax: add Cut #44254

Open
aclements opened this issue Feb 13, 2021 · 17 comments
Open

regexp/syntax: add Cut #44254

aclements opened this issue Feb 13, 2021 · 17 comments

Comments

@aclements
Copy link
Member

aclements commented Feb 13, 2021

Note: Current proposal is #44254 (comment)


Regular expressions are often embedded in other languages, and the current regexp package makes it difficult to correctly parse such regexps. Common examples of such embedding include awk, Perl, and Javascript, all of which have a /regexp/ expression syntax. In Go, this appears in the testing package's "-test.run" flag, which is a sequence of /-separated regexps; in benchstat v2's filter syntax; and in at least one other place @rsc mentioned that's now slipping my mind.

In general, this is difficult to implement outside regexp itself because the delimiter may appear nested in the regexp. For example, in the testing package, the run expression a[/]b/c matches subtest c of top-level tests matching a[/]b. The first slash is not a separator because it does not appear at the top level of the regexp. The testing package implements a simple, ad hoc parser for this (splitRegexp) but it doesn't get every corner case.

Since this is now a pattern, the regexp package (or perhaps regexp/syntax) should itself implement a "parse until delimiter" function, which would make it easy to parse regular expressions embedded in a larger syntax.

To make a concrete proposal, I propose we add the following function to regexp/syntax:

// ParseUntil parses a regular expression from the beginning of str
// until the string delim appears at the top level of the expression.
// It returns the regular expression prefix of str and the remainder of str.
// If successful, rest will always begin with delim.
// If delim does not appear at the top level of str, it returns str, "", ErrNoDelim.
func ParseUntil(str, delim string) (expr, rest string, err error)

I propose this should return the split input string, rather than the parsed regexp, so it can be composed with any other regexp parsing entry point (e.g., regexp/syntax.Parse or regexp.Compile).

I don't think this operation needs to take Flags, but I'm not positive.

/cc @rsc

@gopherbot gopherbot added this to the Proposal milestone Feb 13, 2021
@jfesler
Copy link

jfesler commented Feb 14, 2021

I would use this feature for config files taking use input. It would be great for my end users. Currently I force them to cope with single/double quoting rules, which is not good when they need to match on quotes. By allowing / as a delimiter, and a rebel aware parser looking for it intelligently, my users would be able to use more familiar regexes directly without figuring out what to escape or how many backslashes to escape with.

Instead of “remainder of string” I would prefer returning bytes consumed.

@aclements
Copy link
Member Author

Instead of “remainder of string” I would prefer returning bytes consumed.

I'm fine either way, but I'm curious what your rationale for preferring bytes consumed is. It seems like one would always take that count and just slice the string to get the remainder.

@mpx
Copy link
Contributor

mpx commented Feb 16, 2021

I've wanted something like this in the past to support handling s/REGEXP/REPLACEMENT/ in the past. As pointed out, that's impractical without parsing the regexp itself.

regexp.Compile and regexp.CompilePOSIX parse slightly differently which might affect the composability if parse flags aren't supported - I haven't thought about this too hard yet, perhaps it isn't necessary.

Given how regexp.Compile and regexp.CompilePOSIX are implemented, it might be better to make the parse/compile steps available to developers:

package syntax
func ParseUntil(s string, flags Flags, delim string) (_ *Regexp, rest string, err error)

package regexp
func CompileSyntax(re *syntax.Regexp, longest bool) (*Regexp, error)

This has some advantages:

  • Avoids duplicating parsing work
  • Should support updating flags after parsing (eg, supporting s/regexp/replacement/flags).

@rsc
Copy link
Contributor

rsc commented Jul 28, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Aug 4, 2021

We don't have to do the full parse, just enough to find the delimiter.
So in practice the work will not be duplicated as people may fear.
Calling it Parse may not be right, though, since it doesn't return a *Regexp.
Perhaps:

func Cut(s, delim string, flags Flags) (expr, rest string, err error)

I think the flags are needed to tell where to stop in Cut("[[:alpha:]/]", "/", ...).
This is similar but not exactly the same as strings.Cut (just accepted #46336)
in that it returns an error instead of ok bool - it can fail whereas strings.Cut cannot.

@rsc
Copy link
Contributor

rsc commented Aug 11, 2021

Any objections to syntax.Cut?

@rsc rsc changed the title proposal: regexp: add "parse until delimiter" operation proposal: regexp/syntax: add Cut Aug 11, 2021
@mpx
Copy link
Contributor

mpx commented Aug 13, 2021

@carlmjohnson the delimiter is a fixed string and it is skipped when returning rest -- same as strings.Cut. Hence returning the delimiter would be redundant.

The syntax.Cut signature should work for me. It also ensures the original regexp text available before parsing (useful for errors/logging). The effective duplicate parsing still seems like a disadvantage to me. Cut also risks a mismatch between the implementations if it doesn't directly reuse the parse logic. However, I'll defer to your expertise on this :).

@earthboundkid
Copy link
Contributor

I misunderstood how this worked. I withdraw the comment. The signature sounds good.

@rsc
Copy link
Contributor

rsc commented Aug 18, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@nightlyone
Copy link
Contributor

@aclements could you update the proposal description to the new function signature?

@rsc
Copy link
Contributor

rsc commented Aug 20, 2021

Note to self: should use regexp.Cut to fix #39904.
I also updated the top comment to add a link to the updated proposal.

@firelizzard18
Copy link
Contributor

@rsc If I understand #39904 correctly, A/B|C/D would be equivalent to (A/B)|(C/D), which is to say the delimiter / is subordinate to the alternation |. So the delimiter is not at the top level of the expression, so regexp.Cut does not apply. ISTM that regexp.Cut would give the same behavior as the current implementation, where A/B|C/D is equivalent to A/(B|C)/D.

@rsc
Copy link
Contributor

rsc commented Aug 25, 2021

To be clear,

Cut(`A/B|C/D`, "/") = "A", "B|C/D", true

Any instance of sep that is not itself enclosed in [ ] or ( ) counts as a separator. "Implicit" parens do not matter.
This is the same as, say, sed expressions.

@rsc
Copy link
Contributor

rsc commented Aug 25, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: regexp/syntax: add Cut regexp/syntax: add Cut Aug 25, 2021
@rsc rsc modified the milestones: Proposal, Backlog Aug 25, 2021
@rsc
Copy link
Contributor

rsc commented Oct 7, 2021

This is the same as, say, sed expressions.

Hmm. Now that I go to implement this, I've noticed this is not the same as sed expressions. In sed 's/[/]/slash/g' is a syntax error: the slash inside the square brackets is still treated as splitting the expression. Similarly, typing /(/) in ed or sam is also a syntax error.

Package testing is doing something very subtle (but mostly correct) by splitting the regexp at a slash and implementing the search for the two different halves. But it's not doing what tools that incorporate regexps into a larger syntax do. The rationale for adding it was that it would help with tools like that, in addition to testing. But it looks like maybe it only helps with testing.

Now on the fence about retracting/declining this proposal. If only testing will use it, it's not that useful.

@mpx
Copy link
Contributor

mpx commented Oct 9, 2021

I'm less interested using this with testing package (although it would be useful), and more interested in enabling parsing the "s/PATTERN/REPLACEMENT/FLAGS", "/PATTERN/", and "m@PATTERN@" idioms via Go. These idioms are commonly known and I'd like to reuse them in various places. Implementing this manually is potentially error prone or impractical.

Perl has documented their approach to finding the end delimeter.

Due to differences, it isn't possible to match how all regexp implementations find their end delimeter. I think it would be best for Go to pick the most well reasoned option that is easiest to use / least likely to trip people up.

@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@cuiweixie
Copy link
Contributor

We don't have to do the full parse, just enough to find the delimiter. So in practice the work will not be duplicated as people may fear. Calling it Parse may not be right, though, since it doesn't return a *Regexp. Perhaps:

func Cut(s, delim string, flags Flags) (expr, rest string, err error)

I think the flags are needed to tell where to stop in Cut("[[:alpha:]/]", "/", ...). This is similar but not exactly the same as strings.Cut (just accepted #46336) in that it returns an error instead of ok bool - it can fail whereas strings.Cut cannot.

May I ask that what is the enum values of Flags, and how it means?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests

9 participants