Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String search inconsistency #24103

Open
bkamins opened this issue Oct 11, 2017 · 16 comments
Open

String search inconsistency #24103

bkamins opened this issue Oct 11, 2017 · 16 comments
Labels
strings "Strings!"

Comments

@bkamins
Copy link
Member

bkamins commented Oct 11, 2017

The following searches return unexpected results:

julia> search("1", Char[], 2) # is bounds error for start=3
2

julia> search("", 'a', 1) # is bounds error for start=2
0

julia> search("1", 'a', 2) # is bounds error for start=3
0

julia> search("1", "", 2) # probably should be bounds error, 2:1 is not allowed by the documentation
2:1

julia> search("1", "", 20) # probably should be bounds error - similar case
0:-1

julia> search("∀", "1", 4) # should be bounds error, it is an error for start equal to 3 and 5
0:-1

In general the reason is that the functions handle index out of range by one in a special way. But maybe it is intended?

Additionally there is an inconsistency:

julia> search("123", "") # no match indicator
1:0

julia> search("123", Char[]) # match indicator
1

Is it intended?

In summary I have two questions:

  • if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;
  • how matching of empty structs like "" and Char[] should be handled - I would say it should be no match in both cases;
@booleanings
Copy link
Contributor

if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;

I'm not so sure about this. Most languages would just return no match. The purpose (totally assuming) being that you might want to check if a string is contained in a string you don't know the length of and don't care if it's out of bounds. I think the functionality is more in substring when you try to substring further than the start index.

Definitely +1 to the Char[] + "" handling though.

@bkamins
Copy link
Member Author

bkamins commented Oct 12, 2017

I agree. I simply did not want to change the existing functionality too much (which throws errors).

If we are flexible I would propose the following rules (applied in this order):

  • if string is empty or chars is empty return no match (0 or 0:-1) respectively
  • never throw an error but properly handle start values:
    • if start<1 then assume start=1
    • if start>endof(str) then return no match
    • if start is a valid index work as current search works
    • if start is an invalid index use nextind(str, start) to start matching (currently search throws an error in this case)

In this way string would never throw an error (if there are any comments to those rules I am open to change them - the key issue for me is to be consistent).

Additionally I would clean up the code not to use string as a name of a variable and rename it to str as string conflicts with a library function.
If this would be agreed I can make a proper PR.

@nalimilan
Copy link
Member

I'm in favor of throwing an error when an out of bounds index is passed. I don't see the use case for passing an invalid index: if you don't know, don't pass an index, or pass 1. That approach is safer as it helps catching bugs, and we can always change it later; OTC if we accept invalid indices in 1.0 we'll have to retain that behavior in all 1.x releases. It will also be more consistent with recent changes to getindex and SubString.

I'm less sure about what to do when looking for the empty string/empty list of chars. I think the reasoning is that the empty string is found everywhere between two subsequent characters, but it has a zero length, so the first match i 1:0, the second match 2:1 (in ASCII at least), etc. But that doesn't make much sense for an empty list of characters. Either throwing an error or returning 0 sounds fine to me.

Cc: @stevengj

@bkamins
Copy link
Member Author

bkamins commented Oct 12, 2017

@ararslan When there is a consensus about the correct behavior it would be good to have 7d244be merged as it reimplements rsearch to update it accordingly in the PR following this discussion.

@ararslan
Copy link
Member

There are still test failures in that PR that I haven't had time to figure out but yeah it'd be good to have that merged soon.

@JeffBezanson JeffBezanson added the strings "Strings!" label Oct 13, 2017
@JeffBezanson
Copy link
Sponsor Member

that the empty string is found everywhere between two subsequent characters, but it has a zero length

👍

For an empty list/set of characters, I think the result should be no-match, since we did not find a character that's in the set.

@bkamins
Copy link
Member Author

bkamins commented Oct 13, 2017

In #24121 I have made a proposal of a consistent implementation following the discussion.
The docstring given there gives all relevant cases that are covered.

In short:

  • search into an empty string without start index always returns 0 or 0:-1;
  • search into an empty string with start index given always throws an error;
  • passed index has to be a valid one unless we search String for Union{Int8, UInt8} in which case it only has to be within bounds (does not have to point to a valid character index);
  • when a string is passed as the second argument the returned range must meet a predicate s[search(s,x)] == x or 0:-1 is returned;
  • when a collection of characters is passed as the second argument the returned range must meet a predicate s[search(s,x)] in x or 0 is returned (earlier docstring defined a different predicate here);

@nalimilan
Copy link
Member

What do people think about the particular case of search("", "abc") and search("", "abc", 1)? 1 is not a valid index for "", so we could raise an error in the second case, and return 0:-1 in the first case. But that will probably be problematic in practice since that would mean there's no index that works with any string, which is likely to create bugs that can go unnoticed until you test with an empty string. Since start("") == 1, I'd be inclined to accept 1 as a special case instead of throwing an error. See #24116 (comment) and following comments.

@bkamins
Copy link
Member Author

bkamins commented Oct 16, 2017

Another option is to remove a third argument (idx) from all search functions altogether. This would simplify the logic of those functions (and would make them easier to understand by users).

I am not sure how important is the case of using idx is, but the consequences of removing this third argument are:

  • in the case when it is removed we do not have to check if it is valid (and I assume this is the most common use case so we actually improve the performance a bit here);
  • if user wants to make a search from an index idx it is simple to pass SubString(s, idx) as an argument to a search function and it should have low overhead (and it might be acceptable if this scenario is not used very often in practice).

@nalimilan
Copy link
Member

As I noted elsewhere, these functions are going to be merged with find* soon, so they should be consistent with them. So far we always support passing an index. It's been discussed to replace them with a findeach iterator, but even if we add that it's not clear that we wouldn't want to keep the functions accepting an index since they can be useful (e.g. if you need to pass the index around). Also using SubString doesn't sound very intuitive. In general, until the new unified API is implemented, I'd rather avoid changing things too much.

StefanKarpinski added a commit that referenced this issue Nov 22, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 22, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 27, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 28, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 29, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Dec 4, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Dec 4, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
evetion pushed a commit to evetion/julia that referenced this issue Dec 12, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close JuliaLang#22611
Close JuliaLang#24613

See also: JuliaLang#10593 JuliaLang#23612 JuliaLang#24103
@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

@nalimilan This is the state of this issue after removal of search:

julia> findfirst("", "123")
1:0

julia> findfirst(equalto(Char[]), "123")

julia>

Different return value, but this is probably intended.

julia> findnext("", "123", 5)
0:-1

julia> findnext("4", "123", 5)
ERROR: BoundsError: attempt to access "123"
  at index [5]
Stacktrace:
 [1] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:8
 [2] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [3] _search at .\strings\search.jl:233 [inlined]
 [4] findnext(::String, ::String, ::Int64) at .\strings\search.jl:267
 [5] top-level scope

I am not sure if this is intended.

julia> findnext("", "", 10)
0:-1

julia> findnext(r"", "", 10)
ERROR: BoundsError
Stacktrace:
 [1] findnext(::Regex, ::String, ::Int64) at .\regex.jl:271
 [2] top-level scope

julia> findnext(equalto(Char[]), "", 10)
ERROR: BoundsError: attempt to access ""
  at index [10]
Stacktrace:
 [1] findnext(::Base.EqualTo{Array{Char,1}}, ::String, ::Int64) at .\strings\search.jl:107
 [2] top-level scope

I am not sure if this is intended

@nalimilan
Copy link
Member

Actually I haven't really changed the underlying code when merging search into find* functions, so it's not surprising that the same issues remain. It still looks like we could only allow index 1 for the empty string, and throw an error for other indices.

@StefanKarpinski Opinions about these behaviors?

@StefanKarpinski
Copy link
Sponsor Member

We should probably allow 1 through ncodeunits(s)+1 returning nothing and throw a bounds error otherwise. The the default start index of 1 is always allowed and returns nothing if there are no occurrences of the needle.

@nalimilan
Copy link
Member

Why +1? Just so that 1 is valid for "" without special-casing it?

@StefanKarpinski
Copy link
Sponsor Member

Yes, that was my thinking. Usually not allowing any valid index for the empty string case gets ugly.

@StefanKarpinski
Copy link
Sponsor Member

Another way to look at it: empty matches are a possibility for regex search and the input value is the start of an empty match in an empty string, so ncodeunits(s)+1 is a valid output index and should therefore also be a valid input index. In general, I've found that you want to throw an error only if the input couldn't possibly lead to an valid output. For example, nextind(s, codeunits(s)+1) is invalid because it's starting out of bounds in the direction that nextind advances. In this case, since findnext(r"", "") == 1:0 it seems clear (to me) that 1 is a valid starting point since it is the starting point of the match that is returned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

6 participants