Allow multiple pair arguments in replace on strings. #30457

MasonProtter · 2018-12-20T03:11:35Z

This PR is intended to allow syntax like

replace("abc", 'a'=>'A', 'c'=>'C')

in order to be more in line with other methods of replace like

replace([1, 2, 1, 3], 1=>0, 2=>4)

MasonProtter · 2018-12-20T03:50:05Z

I can't figure out what's causing the whitespace issues. I tried git rebase --whitespace=fix master followed by a force push, and nothing changed. Anyone have any ideas?

ararslan · 2018-12-20T06:00:47Z

There was trailing whitespace on the line mentioned in the logs. I've added a commit that fixes it.

nalimilan · 2018-12-20T09:52:29Z

See previous (and more ambitious) attempt at #25732.

MasonProtter · 2018-12-20T21:14:45Z

So the macOS build "exceeded log length and was terminated", causing it to fail, whereas the Linux and FreeBSD builds were fine. How do I fix the macOS issue?

nalimilan · 2018-12-21T09:34:22Z

CI failures indeed look unrelated.

I'd like to raise a more fundamental API question before merging this. Reading the long discussions with @stevengj and @StefanKarpinski at #25396 again, @stevengj was opposed to adding this API because it cannot be made efficient when applying a replacement to multiple strings (e.g. in a loop): compiling a regex once is much faster. The only situation where multiple arguments to replace can be fast is when operating on single characters. So maybe we shouldn't add an API which cannot be made efficient and which would trap users into thinking it's the recommended approach.
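
For illustration, here is a minimal sketch of the compile-once approach mentioned above, assuming a fixed set of literal patterns. The names `rules`, `pattern`, and `multireplace` are hypothetical and not part of this PR. The alternation is written out by hand so that the dots are escaped.

```julia
# Hypothetical lookup table: matched text => replacement.
rules = Dict("foo" => "bar", "baz" => "Baz", "..." => ".")

# The regex is built once and reused for every input string.
pattern = r"foo|baz|\.\.\."

# replace calls the function with each matched substring.
multireplace(s, pattern, rules) = replace(s, pattern => m -> rules[String(m)])

inputs = ["foo baz...", "no match", "foofoo"]
outputs = [multireplace(s, pattern, rules) for s in inputs]
```

Because all alternatives are matched in a single pass over the input, a replacement never sees the output of another replacement.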

MasonProtter · 2018-12-21T15:59:54Z

That's fair, but I think that if you look into code people write when trying to do multiple replacements, they end up getting forced to write things like

replace(replace(replace(s, "foo"=>"bar"), "baz"=>"Baz"), "..."=> ".")

and it just ends up making a giant code stink that really doesn't need to exist if replace would just follow the convention it uses on arrays.

I think this is a pattern many want to use regardless of performance so what if there was just a note in the docstring warning that this is not ideally performant?

KlausC

"Perform a successive series of replace operations on s, evaluating the replacements in reps... sequentially from left to right."

This text could be misleading. The replace operations are not all applied to s; each one is applied to the intermediate result of the previous operations.

Proposal:

"Apply the first replacement in `reps` to s, then successively apply the other replacements to the previous intermediate result. If `reps` is empty, return s."

stevengj · 2018-12-27T16:11:40Z

If that is the meaning you want, I don't see a lot of advantage over just telling people to use foldl(replace, reps, init=s) directly in the replace docstring. Furthermore, those semantics make it even harder to optimize to eliminate the temporary strings.
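
As a small sketch of that suggestion (the variable names are illustrative only), the fold gives the same result as the nested calls quoted earlier in the thread, and returns the input unchanged when there are no pairs:

```julia
s = "foo baz..."
reps = ("foo" => "bar", "baz" => "Baz", "..." => ".")

# Nested form: each call allocates an intermediate string.
nested = replace(replace(replace(s, "foo" => "bar"), "baz" => "Baz"), "..." => ".")

# Fold form: replace(acc, pair) is applied left to right, starting from s.
folded = foldl(replace, reps; init = s)

# With no pairs, the fold simply returns init.
unchanged = foldl(replace, (); init = s)

# Later pairs see the output of earlier ones.
chained = foldl(replace, ("a" => "b", "b" => "c"); init = "ab")
```

The last line illustrates the order dependence: "ab" first becomes "bb" and then "cc".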

"Use this library function if you want to guarantee the slowest possible way to do multiple replacements" is not a compelling case for something to be in the standard library.

ararslan · 2018-12-27T22:11:10Z

It seems to me we should be able to implement the feature with a note in the docstring about potential performance problems, then chip away at improving it over time. It's not like it's flat out impossible to make this efficient; the Magic of Dispatch™ means we can at some point use a completely different implementation for this method if we want.

stevengj · 2018-12-28T14:08:09Z

"It's not like it's flat out impossible to make this efficient; the Magic of Dispatch™ means we can at some point use a completely different implementation for this method if we want."

It's extremely difficult to make it faster if you define the semantics as a foldl, to the point where I doubt anyone will go to the trouble. Anyone who cares about performance will just implement their own API.

andyferris · 2019-01-02T06:07:38Z

I happen to agree that foldl(replace, reps, init=s) is pretty nice actually.

I'd note that without reading the documentation it wouldn't be clear to someone reading the code if these pairs are applied sequentially, or in some more efficient parallel form, or whatever. In particular, in the OP it is written:

in order to be more in line with other methods of replace like

replace([1, 2, 1, 3], 1=>0, 2=>4)

That method isn't sequentially applied; it apparently has different semantics. Consider replace("abac", 'a'=>'A', 'A'=>'z') vs replace([1,2,1,3], 1=>0, 0=>9). I would find it surprising if different methods of replace followed different semantics in this regard.

It seems to me that when/if we eventually have a more efficient, all-at-once algorithm, we would actually want it to use the method signature in this PR. I suppose we could narrow the signature to replace(::AbstractString, ::Pair{Char, Char}...) and then use a streaming algorithm?
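
To make the difference concrete, here is a rough sketch that contrasts the two semantics. `replace_chars` is a hypothetical helper written for this illustration, not the implementation proposed in the PR; it assumes character-to-character pairs only and applies the first matching pair to each original character.

```julia
# Collection method: each element is matched against the original values.
arr = replace([1, 2, 1, 3], 1 => 0, 0 => 9)

# Sequential (foldl) semantics: the second pair sees the output of the first.
seq = foldl(replace, ('a' => 'A', 'A' => 'z'); init = "abac")

# Hypothetical single-pass version for Char pairs.
function replace_chars(s::AbstractString, reps::Pair{Char,Char}...)
    io = IOBuffer()
    for c in s
        out = c
        for (from, to) in reps
            if c == from      # first matching pair wins
                out = to
                break
            end
        end
        print(io, out)
    end
    return String(take!(io))
end

sim = replace_chars("abac", 'a' => 'A', 'A' => 'z')
```

Here `seq` is "zbzc", while `sim` is "AbAc", which mirrors how the array method leaves the newly written 0s alone.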

o314 · 2019-10-07T01:54:59Z

Situation

There is a need for:

enhancing usability (lots of possibilities): the point made by @MasonProtter in #30457 (comment)
being conservative (as much as possible) with benchmarks: the point made by @stevengj in #30457 (comment)

Proposal

Let's forget the foldl approach (maybe it could hold up for up to 10 iterations).

One could provide an inductive definition of replace that automatically unrolls into calls to the current Base method.

@inline _replace(s::S, pat::Pair{S,S}) where {S<:AbstractString} = replace(s, pat)
@inline _replace(s::S, pat::Pair{S,S}, pats::(Pair{S,S})...) where {S<:AbstractString} = _replace(replace(s, pat), pats...)

const _REPLACE_STRING_PAIRS_MAX = 9
for k in 2:_REPLACE_STRING_PAIRS_MAX
    pats = [Symbol("p",i) for i in 1:k]
    typed_pats = [:($p::Pair{S,S} where {S<:AbstractString}) for p in pats]
    @eval @inline Base.replace(s::AbstractString, $(typed_pats...)) = _replace(s, $(pats...))
end

Lowered code

using InteractiveUtils  # provides code_llvm
using Test              # provides @test

# Count the lines of optimized LLVM IR for f(::String) that mention "replace".
function nreplcalls(f)
    io = IOBuffer()
    code_llvm(io, f, Tuple{String}; optimize=true, debuginfo=:none)
    ir = String(take!(io))                               # LLVM code
    lines = split(ir, "\n")                              # LLVM code lines
    return count(line -> occursin("replace", line), lines)  # lines with replace
end


@test 3 == nreplcalls(s->replace(s,"b"=>"c","a"=>"b","c"=>"a"))
@test 4 == nreplcalls(s->replace(s,"c"=>"d","b"=>"c","a"=>"b","d"=>"a"))
@test 5 == nreplcalls(s->replace(s,"d"=>"e","c"=>"d","b"=>"c","a"=>"b","e"=>"d"))
@test 6 == nreplcalls(s->replace(s,"e"=>"f","d"=>"e","c"=>"d","b"=>"c","a"=>"b","f"=>"a"))

Code with n pairs has been converted to n calls to replace in the generated LLVM code.

carstenbauer · 2019-11-07T06:11:32Z

What is the status here?

This just came up in https://discourse.julialang.org/t/how-to-replace-the-charactors-in-the-string/30778/24. I was super surprised that we don't have a multi-pair replace for strings (char replacement in this case) while we do have the mentioned replace([1, 2, 1, 3], 1=>0, 2=>4).

I fully agree with @andyferris on the semantics: replace("hallo", 'a'=>'o', 'o'=>'a') should definitely not be a no-op (as it would be for sequential application).

stevengj · 2019-11-07T14:08:38Z

I agree that the foldl (sequential) semantics in this PR are not desirable.

Another issue with this API in general is that any implementation for large numbers of replacements will probably involve constructing a Dict, but this API provides no way to re-use the Dict between calls, making it inherently inefficient for applying the replacement many times to many small strings (which is likely to be a common use case). I also commented on this here.
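
A minimal sketch of the kind of reuse described here, assuming character-to-character replacements only. `table` and `translate` are hypothetical names, not an existing API: the Dict is built once and then applied to many small strings.

```julia
# Built once, outside the loop over strings.
table = Dict('a' => 'o', 'o' => 'a')

# Each character is looked up in the table; unmapped characters pass through.
translate(table, s) = map(c -> get(table, c, c), s)

words = ["hallo", "foo", "banana"]
translated = [translate(table, w) for w in words]
```

Every lookup uses the original character, so the two pairs swap 'a' and 'o' instead of undoing each other.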

MasonProtter · 2019-11-07T17:56:38Z

I'm also now convinced that the foldl is not the right way to go. However, I don't have the time or knowledge to do a better job and I only made this PR because I naïvely thought it was a low hanging fruit.

Perhaps I should close this to make room for someone else to do a more serious PR?

o314 · 2019-11-08T02:38:19Z

There are several points to deal with here:

1. syntax enhancement (the shorter, the better)
2. no speed regression from the change (no foldl)
3. algorithmic enhancement on large strings

My previous post addresses 1. (the syntax is the one asked for) and 2. (all code is inlined, up to a given threshold).

3. is not trivial. See Dave Glick (2015), multiple string replacement, and Aditya Goel (2016?), in-place replace multiple occurrences of a pattern. It has nothing in common with the naive implementation, except for the point at which we switch from the naive algorithm to a clever one.

The post I proposed contains an integer threshold on the number of allowed replacement pairs. We can use it as a switch to relay to an upcoming, enhanced implementation of one of the (many) clever algorithms that the Julia community will not miss the opportunity to propose later, I am sure (:

This has been attempted before, sometimes in ways fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regex patterns that cause this to exhibit O(n^2) behavior, but only if one of the earlier patterns is a partial match for a later Regex pattern and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. Fixes JuliaLang#35327, JuliaLang#39061, JuliaLang#35414, JuliaLang#29849, JuliaLang#30457, JuliaLang#25396.

MasonProtter added 7 commits December 19, 2018 19:54

Add replace method for many Pairs.

c430383

Remove error on string replace

09d6a88

Update util.jl

4f5b2fa

Added tests for replace

f56617e

remove whitespace

93e343b

Update util.jl

f995dbf

Update util.jl

efac5d4

Remove trailing whitespace

52dc3c9

Identified by make check-whitespace

ararslan reviewed Dec 20, 2018 (base/strings/util.jl)

ararslan added the "strings" and "needs news" labels Dec 20, 2018

vtjnash reviewed Dec 20, 2018 (base/strings/util.jl)

nalimilan reviewed Dec 20, 2018 (test/strings/util.jl)

MasonProtter added 4 commits December 20, 2018 08:52

added test

8377129

Updated docs, removed count kwarg.

07d2015

Update NEWS.md

177e7b2

Update NEWS.md

a62919b

nalimilan reviewed Dec 20, 2018 (base/strings/util.jl)

nalimilan reviewed Dec 20, 2018 (test/strings/util.jl)

nalimilan and others added 4 commits December 20, 2018 09:34

add Milan's fix

b671230

Co-Authored-By: MasonProtter

Add Milan's fix

c3a462f

Co-Authored-By: MasonProtter

remove whitespace

0d3bc89

remove whitespace

2000333

nalimilan added the "needs decision" label and removed the "needs news" label Dec 21, 2018

MasonProtter added 3 commits December 21, 2018 16:36

Add note to docstring about performance

7590186

move text to avoid whitespace issue

d66f228

remove whitespace

3a53400

ararslan reviewed Dec 22, 2018 (base/strings/util.jl)

Change performance note to documenter style

f18ab88

ararslan reviewed Dec 22, 2018 (base/strings/util.jl, NEWS.md)

MasonProtter added 3 commits December 21, 2018 17:11

move note outside of code block

74a2f2d

added space

1d9d373

remove whitespace

1b70688

nalimilan mentioned this pull request Dec 22, 2018

Inconsistent behaviour for replace #28967

Closed

andyferris reviewed Dec 27, 2018 (base/strings/util.jl)

KlausC suggested changes Dec 27, 2018

remove anonymous function

f0127fb

tkf mentioned this pull request Apr 1, 2020

replace does not handle multiple patterns for String #35327

Closed

stevengj mentioned this pull request Apr 12, 2020

[RFC] support multiple s=>r Pairs in strings replace method #35414

Closed

JeffBezanson closed this in 70771b2 Jun 7, 2021
