Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multiple pair arguments in replace on strings. #30457

Closed
wants to merge 24 commits into from
Closed

Allow multiple pair arguments in replace on strings. #30457

wants to merge 24 commits into from

Conversation

MasonProtter
Copy link
Contributor

@MasonProtter MasonProtter commented Dec 20, 2018

This PR is intended to allow syntax like

replace("abc", 'a'=>'A', 'c'=>'C')

in order to be more in line with other methods of replace like

replace([1, 2, 1, 3], 1=>0, 2=>4)

@MasonProtter
Copy link
Contributor Author

MasonProtter commented Dec 20, 2018

I can't figure out what's causing the whitespace issues. I tried a git push git rebase --whitespace=fix master and a force push and nothing changed. Anyone have any ideas?

Identified by make check-whitespace
@ararslan
Copy link
Member

There was trailing whitespace on the line mentioned in the logs. I've added a commit that fixes it.

@ararslan ararslan added strings "Strings!" needs news A NEWS entry is required for this change labels Dec 20, 2018
@nalimilan
Copy link
Member

See previous (and more ambitious) attempt at #25732.

@MasonProtter
Copy link
Contributor Author

So the MacOS build "exceeded log length and was terminated" causing it to fail whereas the linux and freeBSD builds were fine. How do I fix the MacOS issue?

@nalimilan
Copy link
Member

CI failures indeed look unrelated.

I'd like to raise a more fundamental API question before merging this. Reading the long discussions with @stevengj and @StefanKarpinski at #25396 again, @stevengj was opposed to adding this API because it cannot be made efficient when applying a replacement to multiple strings (e.g. in a loop): compiling a regex once is much faster. The only situation where multiple arguments to replace can be fast is when operating on single characters. So maybe we shouldn't add an API which cannot be made efficient and which would trap users into thinking it's the recommended approach.

@nalimilan nalimilan added needs decision A decision on this change is needed and removed needs news A NEWS entry is required for this change labels Dec 21, 2018
@MasonProtter
Copy link
Contributor Author

MasonProtter commented Dec 21, 2018

That's fair, but I think that if you look into code people write when trying to do multiple replacements, they end up getting forced to write things like

replace(replace(replace(s, "foo"=>"bar"), "baz"=>"Baz"), "..."=> ".")

and it just ends up making a giant code stink that really doesn't need to exist if replace would just follow the convention it uses on arrays.

I think this is a pattern many want to use regardless of performance so what if there was just a note in the docstring warning that this is not ideally performant?

Copy link
Contributor

@KlausC KlausC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perform a successive series of replace operations on s, evaluating the replacements in reps...
sequentially from left to right.

This text could be misleading. Actually the replace operations are not applied to s, but to the intermediate result of applying the previous operations.

Proposal:

Apply first replacement in `reps` to s, then successively the other replacements to the previous intermediate results. If `reps` is empty, return s.

@stevengj
Copy link
Member

stevengj commented Dec 27, 2018

If that is the meaning you want, I don't see a lot of advantage over just telling people to use foldl(replace, reps, init=s) directly in the replace docstring. Furthermore, those semantics make it even harder to optimize to eliminate the temporary strings.

"Use this library function if you want to guarantee the slowest possible way to do multiple replacements" is not a compelling case for something to be in the standard library.

@ararslan
Copy link
Member

It seems to me we should be able to implement the feature with a note in the docstring about potential performance problems, then chip away at improving it over time. It's not like it's flat out impossible to make this efficient; the Magic of Dispatch™ means we can at some point use a completely different implementation for this method if we want.

@stevengj
Copy link
Member

It's not like it's flat out impossible to make this efficient; the Magic of Dispatch™ means we can at some point use a completely different implementation for this method if we want.

It's extremely difficult to make it faster if you define the semantics as a foldl, to the point where I doubt anyone will go to the trouble. Anyone who cares about performance will just implement their own API.

@andyferris
Copy link
Member

I happen to agree that foldl(replace, reps, init=s) is pretty nice actually.

I'd note that without reading the documentation it wouldn't be clear to someone reading the code if these pairs are applied sequentially, or in some more efficient parallel form, or whatever. In particular, in the OP it is written:

in order to be more in line with other methods of replace like

replace([1, 2, 1, 3], 1=>0, 2=>4)

That method isn't sequentially applied; it apparently has different semantics. Consider replace("abac", 'a'=>'A', 'A'=>'z') vs replace([1,2,1,3], 1=>0, 0=>9). I would find it surprising if different methods of replace followed different semantics in this regard.

It seems to me that when/if we eventually have a more efficient, all-at-once algorithm, we would actually want it to use the method signature in this PR. I suppose we could narrow the signature to replace(::AbstractString, ::Pair{Char, Char}...) and then use a streaming algorithm?

@o314
Copy link
Contributor

o314 commented Oct 7, 2019

Situation

There are needs of

Proposal

Let's forget the foldl approach (may be it could stand up to 10 iterations).

One could provide an inductive definition of replace that performs automatically the unrolling up to current Base.

@inline _replace(s::S, pat::Pair{S,S}) where {S<:AbstractString} = replace(s, pat)
@inline _replace(s::S, pat::Pair{S,S}, pats::(Pair{S,S})...) where {S<:AbstractString} = _replace(replace(s, pat), pats...)

const _REPLACE_STRING_PAIRS_MAX = 9
for k in 2:_REPLACE_STRING_PAIRS_MAX
    pats = [Symbol("p",i) for i in 1:k]
    typed_pats = [:($p::Pair{S,S} where {S<:AbstractString}) for p in pats]
    @eval @inline Base.replace(s::AbstractString, $(typed_pats...)) = _replace(s, $(pats...))
end

Lowered code

nreplcalls(f) = begin
    io = IOBuffer()
    code_llvm(io, f, Tuple{String}; optimize=true, debuginfo=:none)
    s = String(take!(io));
    
    s |>                                                    # llvm code
        Base.Fix2(split, "\n") |>                           # llvm code lines
        Base.Fix1(filter, Base.Fix1(occursin,"replace")) |> # llvm code lines with replace
        length                                              # count them
end


@test 3 == nreplcalls(s->replace(s,"b"=>"c","a"=>"b","c"=>"a"))
@test 4 == nreplcalls(s->replace(s,"c"=>"d","b"=>"c","a"=>"b","d"=>"a"))
@test 5 == nreplcalls(s->replace(s,"d"=>"e","c"=>"d","b"=>"c","a"=>"b","e"=>"d"))
@test 6 == nreplcalls(s->replace(s,"e"=>"f","d"=>"e","c"=>"d","b"=>"c","a"=>"b","f"=>"a"))

Code with n pairs has been converted to n calls to replace in llvm.

@carstenbauer
Copy link
Member

What is the status here?

This just came up in https://discourse.julialang.org/t/how-to-replace-the-charactors-in-the-string/30778/24. I was super suprised that we don't have a multi-pair replace for strings (char replacement in this case) while we do have the mentioned replace([1, 2, 1, 3], 1=>0, 2=>4).

I fully agree with @andyferris on the semantics: replace("hallo", 'a'=>'o', 'o'=>'a') should definitely not be a no-op (as it would be for sequential application).

@stevengj
Copy link
Member

stevengj commented Nov 7, 2019

I agree that the foldl (sequential) semantics in this PR are not desirable.

Another issue with this API in general is that any implementation for large numbers of replacements will probably involve constructing a Dict, but this API provides no way to re-use the Dict between calls, making it inherently inefficient for applying the replacement many times to many small strings (which is likely to be a common use case). I also commented on this here.

@MasonProtter
Copy link
Contributor Author

MasonProtter commented Nov 7, 2019

I'm also now convinced that the foldl is not the right way to go. However, I don't have the time or knowledge to do a better job and I only made this PR because I naïvely thought it was a low hanging fruit.

Perhaps I should close this to make room for someone else to do a more serious PR?

@o314
Copy link
Contributor

o314 commented Nov 8, 2019

There are several points to deal with there :

  1. syntax enhancement (the shorter, the better)
  2. non speed regression on change (no foldl)
  3. algorithmic enhancement on large strings.

My previous post addresses 1. - syntax is the asked one, and
2. all code is inlined (upto a given threshold)

3. is not trivial. See dave glick; 2015; multiple string replacement,
aditya goel; 2016?; in-place replace multiple occurrences of a pattern;
has nothing in commons with naive implementation except the moment we switch from a naive to a cleverly algo.

The post i have proposed contains an integer threshold based on the number of allowed replacement pairs we can use as a switch to rely / relay to an upcoming and enhanced implementation of the (many) clever algo the Julia community will not miss the opportunity to propose later, i am sure (:

vtjnash added a commit to vtjnash/julia that referenced this pull request Apr 14, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
JeffBezanson pushed a commit that referenced this pull request Jun 7, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants