Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement capture group references in substitution strings #11849

Merged
merged 1 commit into from
Jul 23, 2015

Conversation

malmaud
Copy link
Contributor

@malmaud malmaud commented Jun 24, 2015

Closes #1820.

Use http://www.pcre.org/current/doc/html/pcre2_substitute.html instead of Julia's native replace when the search pattern is a regex. This allows for capture reference in the substitution string.

Incidentally, it is ~4x faster than Julia's native replace implementation and should probably be used even when the search pattern is a not a regex, although I'll save that for another PR if people are interested.

Once potentially breaking change is how empty matches are handled:
Before, replace("abcd", r"b?", "^") == "^a^c^d^".
Now, replace("abcd", r"b?", "^") == "^a^^c^d^".

There are no set of options to PCRE that could cause it to replicate the old behavior.

Another breaking change is that PCRE doesn't support an option to limit how many replacements are made.

@simonster
Copy link
Member

Maybe you could support the limit parameter by constructing a match context and calling pcre2_set_match_limit?

@malmaud
Copy link
Contributor Author

malmaud commented Jun 24, 2015

That's not what match_limit does:

Internally, pcre2_match() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string. This limit is not relevant to pcre2_dfa_match(), which ignores it.

@simonster
Copy link
Member

Huh, I would have thought there'd be some way to do this (or at least to replace only the first match), but it's very possible there's not.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 24, 2015

It is possible to replace the only first match by not setting the SUBSTITUTE_GLOBAL flag, but as far as I can find there is no way direct way to find out where the offset should appropriately be set for replacing the second match.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 25, 2015

What do you think, @vtjnash ? If we want to support a limit parameter besides 1 or infinity, we'll either have to patch pcre's substitute function (it's a pretty trivial change) or port the whole function to julia (it's about 200 lines of C code in pcre2_substitute.c).

@StefanKarpinski
Copy link
Member

Seems like a good candidate for a patch that we can submit upstream.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 25, 2015

I'm skeptical they'd accept it since it would seem to necessarily involve changing the PCRE API.

@nalimilan
Copy link
Member

OTOH if even PCRE doesn't support that option, is it really needed in Julia? I'd say remove it, and if it turns out it's really needed it can always be added back later.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 30, 2015

Since the Julia-native replace has comparable performance now, I'd vote for just changing the julia replace to accept the PCRE $x syntax for including captured groups in the replacement.

@StefanKarpinski
Copy link
Member

That would be cool but it seems like you'd need a special string literal type for that. Maybe s"foo$(x)bar" or something like that where the s stands for "substitute".

@ScottPJones
Copy link
Contributor

😀 and then, if you want to combine that syntax with string interpolation, you could use the Swift (and others) \(...) syntax!

@malmaud
Copy link
Contributor Author

malmaud commented Jul 2, 2015

Sounds good @StefanKarpinski . IMO, we might as well use the more traditional Perl and Python backreference substitution syntax (https://docs.python.org/2/library/re.html#re.sub) instead of the ad-hoc once introduced by PCRE2:

a dollar character is an escape character that can specify the insertion of characters from capturing groups in the pattern. The following forms are recognized:

$$ insert a dollar character
$ insert the contents of group
${} insert the contents of group

@malmaud
Copy link
Contributor Author

malmaud commented Jul 6, 2015

The functionality is implemented now, mimicking the Python API:

replace("abcde", r"(..)(?P<byname>d)", s"\g<byname>xy\1") == "adxybce"

If this API looks good to people, I'll update this PR to include a new section describing it in the manual.

@malmaud malmaud changed the title Use PCRE's implementation of replace for regex search patterns Implement capture group references in substitution strings Jul 6, 2015
end

# Backcapture reference in substitution string
@test replace("abcde", r"(..)(?P<byname>d)", s"\g<byname>xy\1") == "adxybce"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add a test for when the groupname doesn't exist in the regex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@vtjnash
Copy link
Member

vtjnash commented Jul 7, 2015

@StefanKarpinski why does this need a special string literal? why wouldn't general-purpose "literal" string literal be sufficient?

@yuyichao
Copy link
Contributor

yuyichao commented Jul 7, 2015

Maybe to avoid API breaking and performance regression?

@malmaud
Copy link
Contributor Author

malmaud commented Jul 8, 2015

bump @StefanKarpinski

@StefanKarpinski
Copy link
Member

Sorry, I'm traveling and can't take a look at this (writing this on a plane
without wifi; it'll get delivered whenever I resurface).

On Wednesday, July 8, 2015, Jonathan Malmaud [email protected]
wrote:

bump @StefanKarpinski https://github.com/StefanKarpinski


Reply to this email directly or view it on GitHub
#11849 (comment).

@malmaud
Copy link
Contributor Author

malmaud commented Jul 10, 2015

Sorry to keep bugging you @StefanKarpinski , but I'm unbelievably excited to reference capture groups in my substitution strings.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 10, 2015 via email

@malmaud
Copy link
Contributor Author

malmaud commented Jul 10, 2015

No problem. I think the only real question is the one that @vtjnash brings up - should we introduce the s string macro, or just have users deal with escaping unwanted backcapture references in the substitution strings if the pattern is a regex (as Python, Perl, etc. do)?

@kmsquire
Copy link
Member

+1 -- I like this. But I agree with @vtjnash that SubstitutionString should simply be a literal/raw string.

@StefanKarpinski
Copy link
Member

On Friday, July 10, 2015, Kevin Squire [email protected] wrote:

+1 -- I like this. But I agree with @vtjnash https://github.com/vtjnash
that SubstitutionString should simply be a literal/raw string.

A reason to use a custom string type is to allow splicing Julia code into
substitution strings, but that could be done as a later feature while still
allowing literal strings for compatibility.

@StefanKarpinski
Copy link
Member

On Tuesday, July 7, 2015, Jonathan Malmaud [email protected] wrote:

The functionality is implemented now, mimicking the Python API:

replace("abcde", r"(..)(?Pd)", s"\gxy\1") == "adxybce"

If this API looks good to people, I'll update this PR to include a new
section describing it in the manual.

Nice. I like the API. Good reason to use a custom string literal for
substitutions.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 11, 2015

I think the question is whether replace(str, pattern::Regex, "\1") should just always interpret the "\1" in the replacement string as a backcapture reference unless the caller escapes it, which would be the quite ugly "\1".

@kmsquire
Copy link
Member

My understanding is that raw strings wouldn't modify the string at all, and would simply be a generalization of the s_str proposed here. See #11764.

My main concern about adding _str macros to base is that they then become unavailable for use in packages (or user's own code). You can't even qualify to use them:

julia> Base.r"test"
ERROR: UndefVarError: r not defined

Would we really want to interpolate into a SubstitutionString?

@malmaud
Copy link
Contributor Author

malmaud commented Jul 11, 2015

I'd be on board with using raw strings here to avoid backslash hell, and just using string as an alternative to $ interpolation in the relatively rare case where I need interpolation in my substitution string.

@StefanKarpinski
Copy link
Member

I think that always substituting plain strings completely literally – i.e. no \1 substitution – would be for the best and only use a special type with custom literal syntax for interpreting that syntax as doing substitution. Otherwise you end up needing a way to escape those things to indicated that you

@StefanKarpinski
Copy link
Member

@github, could you finally fix this horrible UI bug?

... didn't want to substitute. You also have the issue that computed substitutions run the risk of accidentally substituting.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 14, 2015

So just to be clear @StefanKarpinski , you are in favor of a SubstitutionString type? So if I add some docs to this PR, it should be good to go?

@vtjnash
Copy link
Member

vtjnash commented Jul 14, 2015

if we are going to add a SubstitutionString type, I would rather have the API look like:

SubstitutionString(raw"a\1b")

single character string literals are just too unreadable, and too valuable for packages

@malmaud
Copy link
Contributor Author

malmaud commented Jul 14, 2015

If you're already using raw to avoid escaping "\1", then the only thing
you gain from SubstitutionString is avoiding the risk that a computed
substitution string inadvertently contains a backcapture reference. Is it
worth the verbosity and new type definition for that? In Python, you would
use re.escape on the replacement string if it was computed.

On Tue, Jul 14, 2015 at 1:07 PM, Jameson Nash [email protected]
wrote:

if we are going to add a SubstitutionString type, I would rather have the
API look like:

SubstitutionString(raw"a\1b")

single character string literals are just too unreadable, and too valuable
for packages


Reply to this email directly or view it on GitHub
#11849 (comment).

@hayd
Copy link
Member

hayd commented Jul 14, 2015

What about sub".." or subst".." ?

It's quite difficult to find help for single letter things, atm even if your knew it was @s_str

help?> @b_str
Symbol not found. Falling back on apropos search ...
No help information found.

@StefanKarpinski
Copy link
Member

@malmaud wrote:

So just to be clear @StefanKarpinski , you are in favor of a SubstitutionString type? So if I add some docs to this PR, it should be good to go?

Yes, I'm all in favor of this – I actually couldn't see the patch before since I was reading email offline and composing messages to be sent later (have been traveling until recently). I like this PR as it is and I also like s"..." as the syntax. Writing SubstitutionString(raw"a\1b") is a total non-starter. Using regular expressions to do search and replace is not some rare thing that can have an awkward, verbose syntax. Perl didn't get popular for it's lovely syntax, it got popular because it made working with strings easy, convenient and fast – in no small part due to it's now-ubiquitous regular expressions.

If you're already using raw to avoid escaping "\1", then the only thing you gain from SubstitutionString is avoiding the risk that a computed substitution string inadvertently contains a backcapture reference. Is it worth the verbosity and new type definition for that? In Python, you would use re.escape on the replacement string if it was computed.

Yes, that's entirely worth it. How many Python programs are buggy because someone didn't call re.escape when they should have? This is the worst kind of bug because it is impossible to fine since it works most of the time because user data rarely contains strings like \1. Just because other languages made mistakes like this doesn't me we ought to repeat them.

@kmsquire
Copy link
Member

Probably fine, but it's worth a grep through registered packages to make sure none define s_str.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 20, 2015

@kmsquire Is there an easy way to do that?

@ivarne
Copy link
Member

ivarne commented Jul 20, 2015

Just clone all the packages in METADATA with a script, or ask someone who has already done so, to grep for you. I think there is great karma to be earned if someone have the skill to properly secure such a search service and provide it to the community.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 20, 2015

At any rate, I added documentation and rebased, so this is g2g once they're consensus on this SubstitutionString/s-prefix business.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 20, 2015

Might be nice to have this in before the .4 RC.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 21, 2015

@tkelman @IainNZ You guys know a lot about packages...are you in a position to quickly check if any registered packages use macro s_str?

@malmaud
Copy link
Contributor Author

malmaud commented Jul 21, 2015

@IainNZ
Copy link
Member

IainNZ commented Jul 21, 2015

Strings isn't registered as well

@malmaud
Copy link
Contributor Author

malmaud commented Jul 21, 2015

Still, maybe @vtjnash is right and we should use sub instead of s as the prefix. Less cryptic and less likely to conflict with any future packages.

@StefanKarpinski
Copy link
Member

Personally, I like the fact that s comes after r in the alphabet. Am I incorrect that any module that defines its own s_str macro will just end up seeing that?

@mbauman
Copy link
Member

mbauman commented Jul 21, 2015

I like s for exactly that reason, too. It also evokes the perl/sed syntax s/x/y/ in my head.

ParserCombinator.jl exports s_str, so this breaks uses of s"" after using ParserCombinator unless you add an explicit import (this may be solved independently as part of #12183):

julia> s""
Warning: both Bar and Foo export "@s_str"; uses of it in module Main must be qualified
ERROR: UndefVarError: @s_str not defined

@ScottPJones
Copy link
Contributor

Strings.jl is a work in progress at the moment, I think, @quinnj would have to say if he really wants the s_str macro to be just s"...".

@quinnj
Copy link
Member

quinnj commented Jul 21, 2015

Don't let Strings.jl get in the way of using s_str. It's currently just a convenience constructor for my own custom string type and has no long-term plans.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 21, 2015

@andrewcooke Would it be problematic for you if Base exports s_str?

@andrewcooke
Copy link
Contributor

well it conflicts with ParserCombinator, but i can change it to some other letter (i have all of one user that i know of right now). and also, isn't there some way to choose which you want on conflicts? if not, i assume there will be at some point... (so, no, nothing terrible, but thanks for asking!)

@malmaud
Copy link
Contributor Author

malmaud commented Jul 22, 2015

OK then. Checks pass, should be g2g.

StefanKarpinski added a commit that referenced this pull request Jul 23, 2015
Implement capture group references in substitution strings
@StefanKarpinski StefanKarpinski merged commit 10ae21a into JuliaLang:master Jul 23, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.