String search inconsistency #24103

bkamins · 2017-10-11T22:33:03Z

The following searches return unexpected results:

julia> search("1", Char[], 2) # is bounds error for start=3
2

julia> search("", 'a', 1) # is bounds error for start=2
0

julia> search("1", 'a', 2) # is bounds error for start=3
0

julia> search("1", "", 2) # probably should be bounds error, 2:1 is not allowed by the documentation
2:1

julia> search("1", "", 20) # probably should be bounds error - similar case
0:-1

julia> search("∀", "1", 4) # should be bounds error, it is an error for start equal to 3 and 5
0:-1

In general the reason is that the functions handle index out of range by one in a special way. But maybe it is intended?

Additionally there is an inconsistency:

julia> search("123", "") # no match indicator
1:0

julia> search("123", Char[]) # match indicator
1

Is it intended?

In summary I have two questions:

if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;
how matching of empty structs like "" and Char[] should be handled - I would say it should be no match in both cases;

The text was updated successfully, but these errors were encountered:

booleanings · 2017-10-12T02:52:51Z

if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;

I'm not so sure about this. Most languages would just return no match. The purpose (totally assuming) being that you might want to check if a string is contained in a string you don't know the length of and don't care if it's out of bounds. I think the functionality is more in substring when you try to substring further than the start index.

Definitely +1 to the Char[] + "" handling though.

bkamins · 2017-10-12T06:00:30Z

I agree. I simply did not want to change the existing functionality too much (which throws errors).

If we are flexible I would propose the following rules (applied in this order):

if string is empty or chars is empty return no match (0 or 0:-1) respectively
never throw an error but properly handle start values:
- if start<1 then assume start=1
- if start>endof(str) then return no match
- if start is a valid index work as current search works
- if start is an invalid index use nextind(str, start) to start matching (currently search throws an error in this case)

In this way string would never throw an error (if there are any comments to those rules I am open to change them - the key issue for me is to be consistent).

Additionally I would clean up the code not to use string as a name of a variable and rename it to str as string conflicts with a library function.
If this would be agreed I can make a proper PR.

nalimilan · 2017-10-12T09:33:15Z

I'm in favor of throwing an error when an out of bounds index is passed. I don't see the use case for passing an invalid index: if you don't know, don't pass an index, or pass 1. That approach is safer as it helps catching bugs, and we can always change it later; OTC if we accept invalid indices in 1.0 we'll have to retain that behavior in all 1.x releases. It will also be more consistent with recent changes to getindex and SubString.

I'm less sure about what to do when looking for the empty string/empty list of chars. I think the reasoning is that the empty string is found everywhere between two subsequent characters, but it has a zero length, so the first match i 1:0, the second match 2:1 (in ASCII at least), etc. But that doesn't make much sense for an empty list of characters. Either throwing an error or returning 0 sounds fine to me.

Cc: @stevengj

bkamins · 2017-10-12T10:06:34Z

@ararslan When there is a consensus about the correct behavior it would be good to have 7d244be merged as it reimplements rsearch to update it accordingly in the PR following this discussion.

ararslan · 2017-10-12T18:30:47Z

There are still test failures in that PR that I haven't had time to figure out but yeah it'd be good to have that merged soon.

JeffBezanson · 2017-10-13T04:11:02Z

that the empty string is found everywhere between two subsequent characters, but it has a zero length

👍

For an empty list/set of characters, I think the result should be no-match, since we did not find a character that's in the set.

bkamins · 2017-10-13T11:24:22Z

In #24121 I have made a proposal of a consistent implementation following the discussion.
The docstring given there gives all relevant cases that are covered.

In short:

search into an empty string without start index always returns 0 or 0:-1;
search into an empty string with start index given always throws an error;
passed index has to be a valid one unless we search String for Union{Int8, UInt8} in which case it only has to be within bounds (does not have to point to a valid character index);
when a string is passed as the second argument the returned range must meet a predicate s[search(s,x)] == x or 0:-1 is returned;
when a collection of characters is passed as the second argument the returned range must meet a predicate s[search(s,x)] in x or 0 is returned (earlier docstring defined a different predicate here);

nalimilan · 2017-10-16T08:20:36Z

What do people think about the particular case of search("", "abc") and search("", "abc", 1)? 1 is not a valid index for "", so we could raise an error in the second case, and return 0:-1 in the first case. But that will probably be problematic in practice since that would mean there's no index that works with any string, which is likely to create bugs that can go unnoticed until you test with an empty string. Since start("") == 1, I'd be inclined to accept 1 as a special case instead of throwing an error. See #24116 (comment) and following comments.

bkamins · 2017-10-16T08:32:48Z

Another option is to remove a third argument (idx) from all search functions altogether. This would simplify the logic of those functions (and would make them easier to understand by users).

I am not sure how important is the case of using idx is, but the consequences of removing this third argument are:

in the case when it is removed we do not have to check if it is valid (and I assume this is the most common use case so we actually improve the performance a bit here);
if user wants to make a search from an index idx it is simple to pass SubString(s, idx) as an argument to a search function and it should have low overhead (and it might be acceptable if this scenario is not used very often in practice).

nalimilan · 2017-10-16T08:38:44Z

As I noted elsewhere, these functions are going to be merged with find* soon, so they should be consistent with them. So far we always support passing an index. It's been discussed to replace them with a findeach iterator, but even if we add that it's not clear that we wouldn't want to keep the functions accepting an index since they can be useful (e.g. if you need to pass the index around). Also using SubString doesn't sound very intuitive. In general, until the new unified API is implemented, I'd rather avoid changing things too much.

These seem unrelated, but they're actually linked: * If you reverse generic strings by wrapping them in `RevString` then then this generic `reverseind` is incorrect. * In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32. Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs. Close #22611 Close #24613 See also: #10593 #23612 #24103

These seem unrelated, but they're actually linked: * If you reverse generic strings by wrapping them in `RevString` then then this generic `reverseind` is incorrect. * In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32. Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs. Close JuliaLang#22611 Close JuliaLang#24613 See also: JuliaLang#10593 JuliaLang#23612 JuliaLang#24103

bkamins · 2018-02-11T21:40:41Z

@nalimilan This is the state of this issue after removal of search:

julia> findfirst("", "123")
1:0

julia> findfirst(equalto(Char[]), "123")

julia>

Different return value, but this is probably intended.

julia> findnext("", "123", 5)
0:-1

julia> findnext("4", "123", 5)
ERROR: BoundsError: attempt to access "123"
  at index [5]
Stacktrace:
 [1] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:8
 [2] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [3] _search at .\strings\search.jl:233 [inlined]
 [4] findnext(::String, ::String, ::Int64) at .\strings\search.jl:267
 [5] top-level scope

I am not sure if this is intended.

julia> findnext("", "", 10)
0:-1

julia> findnext(r"", "", 10)
ERROR: BoundsError
Stacktrace:
 [1] findnext(::Regex, ::String, ::Int64) at .\regex.jl:271
 [2] top-level scope

julia> findnext(equalto(Char[]), "", 10)
ERROR: BoundsError: attempt to access ""
  at index [10]
Stacktrace:
 [1] findnext(::Base.EqualTo{Array{Char,1}}, ::String, ::Int64) at .\strings\search.jl:107
 [2] top-level scope

I am not sure if this is intended

nalimilan · 2018-02-12T21:38:50Z

Actually I haven't really changed the underlying code when merging search into find* functions, so it's not surprising that the same issues remain. It still looks like we could only allow index 1 for the empty string, and throw an error for other indices.

@StefanKarpinski Opinions about these behaviors?

StefanKarpinski · 2018-02-12T21:44:38Z

We should probably allow 1 through ncodeunits(s)+1 returning nothing and throw a bounds error otherwise. The the default start index of 1 is always allowed and returns nothing if there are no occurrences of the needle.

nalimilan · 2018-02-13T08:41:55Z

Why +1? Just so that 1 is valid for "" without special-casing it?

StefanKarpinski · 2018-02-13T14:33:06Z

Yes, that was my thinking. Usually not allowing any valid index for the empty string case gets ugly.

StefanKarpinski · 2018-02-13T14:40:38Z

Another way to look at it: empty matches are a possibility for regex search and the input value is the start of an empty match in an empty string, so ncodeunits(s)+1 is a valid output index and should therefore also be a valid input index. In general, I've found that you want to throw an error only if the input couldn't possibly lead to an valid output. For example, nextind(s, codeunits(s)+1) is invalid because it's starting out of bounds in the direction that nextind advances. In this case, since findnext(r"", "") == 1:0 it seems clear (to me) that 1 is a valid starting point since it is the starting point of the match that is returned.

JeffBezanson added the strings "Strings!" label Oct 13, 2017

This was referenced Oct 13, 2017

improve regex search functions signatures #24116

Closed

Make search more consistent #24121

Closed

StefanKarpinski mentioned this issue Nov 22, 2017

remove RevString; efficient generic reverseind #24708

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String search inconsistency #24103

String search inconsistency #24103

bkamins commented Oct 11, 2017

booleanings commented Oct 12, 2017

bkamins commented Oct 12, 2017

nalimilan commented Oct 12, 2017

bkamins commented Oct 12, 2017

ararslan commented Oct 12, 2017

JeffBezanson commented Oct 13, 2017

bkamins commented Oct 13, 2017

nalimilan commented Oct 16, 2017

bkamins commented Oct 16, 2017

nalimilan commented Oct 16, 2017

bkamins commented Feb 11, 2018

nalimilan commented Feb 12, 2018

StefanKarpinski commented Feb 12, 2018

nalimilan commented Feb 13, 2018

StefanKarpinski commented Feb 13, 2018

StefanKarpinski commented Feb 13, 2018

String search inconsistency #24103

String search inconsistency #24103

Comments

bkamins commented Oct 11, 2017

booleanings commented Oct 12, 2017

bkamins commented Oct 12, 2017

nalimilan commented Oct 12, 2017

bkamins commented Oct 12, 2017

ararslan commented Oct 12, 2017

JeffBezanson commented Oct 13, 2017

bkamins commented Oct 13, 2017

nalimilan commented Oct 16, 2017

bkamins commented Oct 16, 2017

nalimilan commented Oct 16, 2017

bkamins commented Feb 11, 2018

nalimilan commented Feb 12, 2018

StefanKarpinski commented Feb 12, 2018

nalimilan commented Feb 13, 2018

StefanKarpinski commented Feb 13, 2018

StefanKarpinski commented Feb 13, 2018