make reverseind generic #24613

StefanKarpinski · 2017-11-15T03:50:06Z

It has always bothered me that strings have to define reverseind. But figuring out a correct generic definition for this function has eluded me – until now. I think I've finally figured it out:

1. reverseind(s,i) gives the index in s of the character beginning at byte i in reverse(s).
2. Then ncodeunits(s)-i+1 is the index of the end of that character in s and ncodeunits(s)-i+2 is the index of the beginning of the next character in s (or the index right after the end of s).
3. Therefore prevind(s, ncodeunits(s)-i+2) is always the index of the character in question in s. In other words, this is a generic expression for reverseind(s,i) in terms of prevind and ncodeunits.

Edit: I've replaced sizeof with ncodeunits as suggested below.

This does actually work out:

```julia
julia> s = "∀ x ∃ y"
"∀ x ∃ y"

julia> [reverseind(s,i) for i=1:sizeof(s)]
11-element Array{Int64,1}:
 11
 10
  7
  7
  7
  6
  5
  4
  1
  1
  1

julia> [prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
11-element Array{Int64,1}:
 11
 10
  7
  7
  7
  6
  5
  4
  1
  1
  1
```

The only problem with this is that it requires a generic definition of sizeof(s), which does not exist and arguably should not exist for string types that may not be backed by bytes in the usual way. Instead, I would suggest using nextind(s, endof(s)) and giving this some generic function name. This function is something that specific string types may want to override, but that's much easier to do since for typical string types, it's just the storage size of the string.

cc: @stevengj


oxinabox · 2017-11-15T05:01:57Z

I am now convinced this works, purely empirically.

Fuzzer:


```julia
julia> newcheck(s)=[prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
newcheck (generic function with 1 method)

julia> oldcheck(s)=[reverseind(s,i) for i=1:sizeof(s)]
oldcheck (generic function with 1 method)

julia> for t in [join(rand(Char, rand(1:100))) for _ in 1:10^5]
           Base.Test.@test(newcheck(t) == oldcheck(t))
       end
```

No errors.
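For anyone rerunning this on a current Julia, the same comparison might look roughly like the sketch below. It makes several assumptions that are not in the original post: it uses the `Test` standard library instead of `Base.Test`, `ncodeunits` instead of `sizeof`, and a small hand-picked `alphabet` (a hypothetical choice covering 1- to 4-byte UTF-8 characters) instead of `rand(Char, n)`.

```julia
using Random, Test

# Candidate generic formula from the issue, evaluated at every code unit index.
newcheck(s) = [prevind(s, ncodeunits(s) - i + 2) for i in 1:ncodeunits(s)]
# Reference: whatever `reverseind` Base currently provides for the string type.
oldcheck(s) = [reverseind(s, i) for i in 1:ncodeunits(s)]

# Hypothetical alphabet covering 1-, 2-, 3- and 4-byte UTF-8 characters.
const alphabet = ['a', 'é', '∀', '😀']
rng = MersenneTwister(24613)  # fixed seed so the run is repeatable

@testset "prevind formula vs reverseind" begin
    for _ in 1:1000
        t = join(rand(rng, alphabet, rand(rng, 1:100)))
        @test newcheck(t) == oldcheck(t)
    end
end
```

Like the original fuzzer, this only gives empirical evidence for `String`; it says nothing about other string types.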

stevengj · 2017-11-15T13:25:30Z

sizeof is wrong here, because it is the size in bytes. For example, it will fail for a UTF-16 array.

What you want is the number of code units, which I think should be nextind(s,endof(s))-1.

Since we have a codeunit(s, i) function, it makes sense to have a lengthcodeunits(s) function or similar that gives the maximum index. Not sure of a good name.
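The difference between a byte count and a code unit count can be illustrated with a small sketch. It assumes a current Julia where `ncodeunits` exists, and it uses a plain `Vector{UInt16}` produced by `transcode` as a stand-in for a UTF-16 string type, which is an illustration choice rather than anything proposed in this thread.

```julia
s = "∀ x ∃ y"

# UTF-8: one code unit is one byte, so the two counts agree.
@show sizeof(s) ncodeunits(s)      # 11 and 11

# The same text as UTF-16 code units. Every character here is in the BMP,
# so each one takes exactly one 16-bit unit.
u16 = transcode(UInt16, s)
@show sizeof(u16) length(u16)      # 14 bytes, but only 7 code units
```

Index arithmetic like `n - i + 2` has to use the code unit count (7 here), since string indices count code units, not bytes.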

StefanKarpinski · 2017-11-15T18:33:40Z

Yes, that's right. So the generic definitions would be:

```julia
reverseind(s::AbstractString, i::Int) = prevind(s, ncodeunits(s)-i+2)
ncodeunits(s::AbstractString) = nextind(s, endof(s))-1
```

and you'd have these specific definitions for speed:

```julia
ncodeunits(s::String) = sizeof(s)
ncodeunits(s::UTF16String) = sizeof(s) >> 1
ncodeunits(s::UTF32String) = sizeof(s) >> 2
```

I think everything else falls out of the definition of prevind which is complex for String and UTF16String but just does i-1 for UTF32String.
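As a rough check of these two fallbacks on a current Julia, the sketch below restates them under hypothetical names (`generic_ncodeunits` and `generic_reverseind`, chosen to avoid clashing with Base) and with `lastindex` in place of `endof`, which was renamed in later Julia versions. It then compares them with Base's own `ncodeunits` and `reverseind` on a few `String` and `SubString` values.

```julia
# Hypothetical names, to avoid clashing with Base's own definitions.
generic_ncodeunits(s::AbstractString) = nextind(s, lastindex(s)) - 1
generic_reverseind(s::AbstractString, i::Integer) =
    prevind(s, generic_ncodeunits(s) - i + 2)

# Compare against Base on an empty string, ASCII, multi-byte text and a SubString.
for s in ("", "abc", "∀ x ∃ y", SubString("∀ x ∃ y", 1, 5))
    @assert generic_ncodeunits(s) == ncodeunits(s)
    for i in 1:ncodeunits(s)
        @assert generic_reverseind(s, i) == reverseind(s, i)
    end
end
```

The `- 1` in the fallback relies on string indices being code unit indices, which is the requirement discussed further down the thread.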

stevengj · 2017-11-15T18:38:39Z

I would just define ncodeunits(s::UTF16String) = length(s.data), but yes.

StefanKarpinski · 2017-11-16T19:28:40Z

[Not breaking, so removing the triage label.]

StefanKarpinski · 2017-11-16T19:35:04Z

This relies on the indices into a string being the same as its code unit indices. We haven't formally required that before, but I think that we should – that's how all actual string types we've seen work and it's hard to imagine any other way to do this.

stevengj · 2017-11-16T20:17:36Z

It won't work if we define a StringIndex type (#9297), because then the - 1 might not be defined.

StefanKarpinski · 2017-11-16T21:08:02Z

I suspect we're not going to move ahead with #9297, but if we do then string types will just have to define ncodeunits directly and we can't provide a fallback for them anymore.

StefanKarpinski · 2017-11-20T02:50:15Z

I've realized that there's a complication here. The contract of reverseind is essentially the identity

```julia
s[reverseind(s,i)] == reverse(s)[i]
```

However, there's an assumption baked into this which is a bit of an issue: the type and encoding of reverse(s). Making reverseind generic in the way I've proposed assumes that the type and encoding of reverse(s) is the same as that of s. Until now reverse(::String) has returned a RevString, which has made changing this behavior harder than expected since the generic definition does not work. We can fix this particular issue, but this raises a basic question:

1. Should reverse(s) have the same type and encoding as s? OR...
2. Should reverse(s) return a String – i.e. normalize to standard string type?

We can only have a correct generic fallback for reverseind for one choice, not both – since they dictate different behaviors for reverseind. I'm inclined to go with option 1 for a couple of reasons:

* If someone is working with a specific encoding, it's likely for a reason and we should respect that unless they specifically request changing encodings by converting types.
* An efficient definition of reverseind is possible for the same type, under fairly reasonable assumptions that are valid, e.g. for UTF-8 and UTF-16 (and trivially UTF-32).
* An efficient generic definition of reverseind between String and a generic encoding isn't possible in the same way (as far as I can tell).
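The identity at the start of this comment can be checked directly for `String`, where `reverse(s)` does have the same type and encoding as `s`. The sketch below assumes a current Julia and only visits valid character indices of the reversed string; the name `pairs_rs` is introduced here for illustration.

```julia
s = "∀ x ∃ y"
r = reverse(s)

# `eachindex(r)` yields only the valid character indices of the reversed string.
for i in eachindex(r)
    @assert s[reverseind(s, i)] == r[i]
end

# Pair each character index in `r` with the matching character index in `s`.
pairs_rs = [(i, reverseind(s, i)) for i in eachindex(r)]
# → [(1, 11), (2, 10), (3, 7), (6, 6), (7, 5), (8, 4), (9, 1)]
```

If `reverse(s)` came back in a different encoding, the indices on the left of each pair would count different code units and the arithmetic behind the generic formula would no longer line up.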

stevengj · 2017-11-20T17:42:17Z

@StefanKarpinski, recall that this was discussed in #23612, and in consequence we documented that reverse(s) always returns a String. (That would also argue against a generic reverseind.)

StefanKarpinski · 2017-11-20T17:46:52Z

I think that requiring people to define reverseind for custom string types is pretty unfortunate. It's a complicated and very weird function, yet string types don't work correctly without defining it. That's pretty bad. The other approach that's possible is to double down on RevString and always have reverse(s) return RevString(s).

stevengj · 2017-11-20T19:07:25Z

Having reverse(s) return RevString(s) seems like a step backwards. As discussed in #6165, the only realistic justification for having a reverse(s) function for strings in the first place is to support reverse-order processing with external libraries like PCRE, and only physically reversing the data will accomplish this.

StefanKarpinski · 2017-11-20T22:03:04Z

That seems to argue for reverse(s) generally returning a string of the same type with the same encoding. String types that don't have that property will need to define their own reverseind.

stevengj · 2017-11-20T23:34:03Z

So, no fallback for reverse(s::AbstractString)? I’m fine with that.

These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then this generic `reverseind` is incorrect.
* In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
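The two assumptions named in this commit message can be illustrated for `String` with a short sketch. It assumes a current Julia; the helper `units` is hypothetical and simply groups the UTF-8 code units by character.

```julia
s = "a∀"
r = reverse(s)

# Hypothetical helper: the code units of each character, as separate groups.
units(str) = [collect(codeunits(string(c))) for c in str]

@assert typeof(r) == typeof(s)         # reversed string has the same type
@assert units(r) == reverse(units(s))  # same per-character code units, reversed order
@assert collect(codeunits(r)) == [0xe2, 0x88, 0x80, 0x61]
```

The three bytes of `∀` stay in their original order inside the reversed string; only the order of the characters changes, which is what makes the index arithmetic behind the generic `reverseind` valid.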


StefanKarpinski added the strings and triage labels Nov 15, 2017

StefanKarpinski removed the triage label Nov 16, 2017

StefanKarpinski self-assigned this Nov 16, 2017

stevengj mentioned this issue Nov 21, 2017

Improve SubString nextind/prevind #24255

Closed

StefanKarpinski mentioned this issue Nov 22, 2017

Implements thisind function #24414

Merged

StefanKarpinski added this to the 1.0 milestone Nov 22, 2017

StefanKarpinski mentioned this issue Nov 22, 2017

remove RevString; efficient generic reverseind #24708

Merged

StefanKarpinski closed this as completed in 5167f17 Dec 4, 2017
