Improved nextind and prevind #23805

bkamins · 2017-09-21T11:10:37Z

First part of the implementation of #23765. Extends prevind and nextind with an nchar parameter.
Additionally ensures that prevind and nextind have consistent return values across different types of strings. In particular:

prevind always returns a result between start(str)-1 and endof(str)
nextind always returns a result between start(str) and endof(str)+1

bkamins · 2017-09-21T13:35:00Z

Does anyone have advice on how to test DirectIndexString without loading LegacyStrings, or on how to allow the use of LegacyStrings in tests?

bkamins · 2017-09-21T14:21:52Z

I have added a custom DirectIndexString subtype to Base.Test similar to GenericString.

bkamins · 2017-09-21T14:23:46Z

The other change is that the maximum value returned by nextind is not endof(str)+1 but next(str, endof(str))[2], as this is more consistent (it gives the next character index that would be returned if the string were longer).

nalimilan · 2017-09-21T11:17:31Z

base/strings/basic.jl

@@ -236,14 +236,33 @@ end

 ## Generic indexing functions ##

-prevind(s::DirectIndexString, i::Integer) = Int(i)-1
-nextind(s::DirectIndexString, i::Integer) = Int(i)+1
+function prevind(s::DirectIndexString, i::Integer, nchar::Integer=1)


The definition of DirectIndexString is that each codepoint takes one byte. So you can just do Int(i-nchar-1).

nalimilan · 2017-09-21T11:20:04Z

base/strings/basic.jl

-Get the previous valid string index before `i`.
-Returns a value less than `1` at the beginning of the string.
+Get the `nchar`-th valid string index before `i`.
+Returns a `start(str)-1` at the beginning of the string.


start(str) always returns 1, so no need to change that.

nalimilan · 2017-09-21T11:20:22Z

base/strings/basic.jl

-Returns a value less than `1` at the beginning of the string.
+Get the `nchar`-th valid string index before `i`.
+Returns a `start(str)-1` at the beginning of the string.
+If `i>endof(str)` then `endof(s)` is considered a first valid string index.


"a first" -> "the first"

nalimilan · 2017-09-21T11:20:49Z

base/strings/basic.jl

-    if i > e
-        return e
+function prevind(str::AbstractString, i::Integer, nchar::Integer=1)
+    nchar > 0 || error("nchar must be greater than 0")


Use ArgumentError.

nalimilan · 2017-09-21T11:23:32Z

base/strings/basic.jl

+function prevind(str::AbstractString, i::Integer, nchar::Integer=1)
+    nchar > 0 || error("nchar must be greater than 0")
+    j = Int(i)
+    j <= start(str) && return start(str)-1


Same remark about start.

nalimilan · 2017-09-21T16:53:38Z

base/strings/basic.jl

-            return j
-        end
+
+    while nchar > 0 && j >= start(str)


Would there be a way to avoid checking j in both loops?

nalimilan · 2017-09-21T16:58:54Z

I have added a custom DirectIndexString subtype to Base.Test similar to GenericString.

I wonder why we still have DirectIndexString since it isn't used at all. We should probably move it to LegacyStrings.

bkamins · 2017-09-21T20:08:51Z

@nalimilan Thank you for the comments. Considering them led me to redesign the implementation. In particular, I left the old nextind and prevind untouched and only added methods with an nchar argument.
This way the old functions behave as they did before (which is less consistent, but will not break anything).

For instance, for DirectIndexString the implementation Int(i)-1 etc. is inconsistent with other implementations when the index is out of the string range, but for now I recommend leaving it as is.

Also, as the PR introduced some unexpected errors on CI I will let you know when I am confident it is finished and ready for review.

nalimilan · 2017-09-21T21:00:54Z

I wouldn't use two different functions unless there's a clear performance gain for the one-argument case. Having different methods means they are less tested, and as you said they can be inconsistent. Also don't worry too much about DirectIndexString, which should probably be removed anyway.

More generally, if the handling of out of range indices is a problem for the complexity or performance of the methods, we could choose a different rule (and/or not document it). It probably doesn't make a difference in practice whether we return endof(s)+1 or next(s, endof(s))[2] as long as the value is out of bounds: any attempt to use the index will give a BoundsError anyway.
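To illustrate the point about out-of-range results, here is a minimal sketch. It assumes Julia 1.x, where `lastindex` is the later name for `endof`; that is an assumption about a later release, not the code under review in this PR.

```julia
# Assumes Julia 1.x (`lastindex` replaces the `endof` used in this thread).
s = "∀∃"                        # two 3-byte characters: valid indices 1 and 4
i = nextind(s, lastindex(s))    # one step past the last character

# Whatever exact value is returned, it lies outside the string ...
@assert i > lastindex(s)

# ... so using it as an index fails with a BoundsError.
threw = try
    s[i]
    false
catch err
    err isa BoundsError
end
@assert threw
```

The exact out-of-range value only matters to code that compares indices without dereferencing them, which is the kind of reliance on undocumented behavior discussed in this thread.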

bkamins · 2017-09-21T21:52:46Z

It seems it will now go through CI, so it should be ready for review.
Regarding your comment:

there is a clear performance gain (see benchmarks below) - we do not have to test for nchar and there is no need to set up a loop;
I also thought to merge them initially as you saw, but then strange errors in CI started to pop out;

Performance benchmark:

julia> using BenchmarkTools

julia> function test(s, i)
           print("prevind($s, $i):\t")
           @btime prevind($s, $i)
           print("prevind($s, $i, 1):\t")
           @btime prevind($s, $i, 1)
           print("nextind($s, $i):\t")
           @btime nextind($s, $i)
           print("nextind($s, $i, 1):\t")
           @btime nextind($s, $i, 1)
           nothing
       end
test (generic function with 1 method)

julia> test("test", 0)
prevind(test, 0):         4.665 ns (0 allocations: 0 bytes)
prevind(test, 0, 1):      8.864 ns (0 allocations: 0 bytes)
nextind(test, 0):         4.198 ns (0 allocations: 0 bytes)
nextind(test, 0, 1):      7.931 ns (0 allocations: 0 bytes)

julia> test("test", 1)
prevind(test, 1):         4.665 ns (0 allocations: 0 bytes)
prevind(test, 1, 1):      8.864 ns (0 allocations: 0 bytes)
nextind(test, 1):         4.665 ns (0 allocations: 0 bytes)
nextind(test, 1, 1):      8.864 ns (0 allocations: 0 bytes)

julia> test("test", 10)
prevind(test, 10):        5.598 ns (0 allocations: 0 bytes)
prevind(test, 10, 1):     10.730 ns (0 allocations: 0 bytes)
nextind(test, 10):        4.198 ns (0 allocations: 0 bytes)
nextind(test, 10, 1):     7.931 ns (0 allocations: 0 bytes)

julia> test("∀∃∀", 0)
prevind(∀∃∀, 0):          4.665 ns (0 allocations: 0 bytes)
prevind(∀∃∀, 0, 1):       8.864 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 0):          4.198 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 0, 1):       7.931 ns (0 allocations: 0 bytes)

julia> test("∀∃∀", 1)
prevind(∀∃∀, 1):          4.665 ns (0 allocations: 0 bytes)
prevind(∀∃∀, 1, 1):       8.864 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 1):          6.997 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 1, 1):       10.263 ns (0 allocations: 0 bytes)

julia> test("∀∃∀", 12)
prevind(∀∃∀, 12):         6.998 ns (0 allocations: 0 bytes)
prevind(∀∃∀, 12, 1):      13.062 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 12):         4.198 ns (0 allocations: 0 bytes)
nextind(∀∃∀, 12, 1):      7.931 ns (0 allocations: 0 bytes)

Additionally, I have checked that for nchar>1 it is better performance-wise to use the proposed implementations than to call single-step nextind/prevind in a loop nchar times.

nalimilan · 2017-09-22T08:22:35Z

Thanks for running the benchmarks. Indeed that's a significant difference. I wouldn't have expected it to be so large ("setting up a loop" should be mostly equivalent to checking nchar > 0 twice, but maybe the compiler doesn't optimize for the nchar=1 case).

The CI errors you got still worry me a bit. Even if we keep the one-argument methods for performance, the two-argument ones should give the same result. So I think the CI should pass without the one-argument methods.

nalimilan · 2017-09-22T08:07:59Z

NEWS.md

@@ -224,6 +224,9 @@ This section lists changes that do not have deprecation warnings.
 Library improvements
 --------------------

+  * THe functions `nextind` and `prevind` now accept `nchar` argument that indicates


nalimilan · 2017-09-22T08:08:26Z

base/strings/basic.jl

 """
    prevind(str::AbstractString, i::Integer)
+    prevind(str::AbstractString, i::Integer, nchar::Integer)


Just use a single line with nchar=1. Then below you can simply say that the function goes nchar characters, implementation details do not matter.

nalimilan · 2017-09-22T08:13:56Z

test/strings/basic.jl

+    @test_throws ArgumentError nextind(str, 20, 0)
+end
+
+let str = GenericString("∀α>β:α+1>β")


Couldn't this block be merged with the previous one using a for loop?

Unfortunately no - nextind and prevind return something different in corner cases for String and AbstractString. Both are correct wrt the contract given in the docstring but give different results.

But maybe have a common part and a separate part, so that it's easier to spot the differences?

nalimilan · 2017-09-22T08:20:15Z

base/strings/string.jl

+                j -= 1
+            end
+        end
+        j <= 0 && return j


This check wasn't present in the original code. Would there be a way to avoid it?

this check is needed to make sure that nextind(s,i,1)==nextind(s,i) contract always holds.

Yeah, but I wondered whether there would be a way to avoid it by adapting the code logic. Maybe not.

I do not see one (and this is cheap).

nalimilan · 2017-09-22T08:21:21Z

base/strings/string.jl

@@ -104,6 +104,27 @@ function prevind(s::String, i::Integer)
    @inbounds while j > 0 && is_valid_continuation(codeunit(s,j))
        j -= 1
    end
+


Unrelated change. (Below you also have an empty line which isn't there in other functions.)

bkamins · 2017-09-22T12:01:44Z

@nalimilan Now CI would pass without one argument methods as I have made the contracts nextind(s,i,1)==nextind(s,i) and prevind(s,i,1)==prevind(s,i) to always be met.

The problem is that with i outside or at the boundary of string old prevind and nextind behave inconsistently between String, AbstractString and DirectIndexString and also depending on the value of i (if it is a boundary or outside the string range). In the earlier implementation I have made this consistent and this produced the CI errors (as it seems that some internal functions in base used the behavior of nextind/prevind that was not specified in docstring contract).

This (the use of undocumented properties of nextind/prevind) could probably be cleaned up one day, but I would leave it as a separate task.

nalimilan · 2017-09-22T13:23:33Z

base/strings/basic.jl


 Get the previous valid string index before `i`.
 Returns a value less than `1` at the beginning of the string.
+If `nchar` argument is given the function goes back `nchar` charsacters.


"the nchar" and "characters".

nalimilan · 2017-09-22T13:27:15Z

base/strings/string.jl

+                j -= 1
+            end
+        end
+        j <= 0 && return j


Yeah, but I wondered whether there would be a way to avoid it by adapting the code logic. Maybe not.

nalimilan · 2017-09-22T13:28:03Z

test/strings/basic.jl

+    @test_throws ArgumentError nextind(str, 20, 0)
+end
+
+let str = GenericString("∀α>β:α+1>β")


But maybe have a common part and a separate part, so that it's easier to spot the differences?

nalimilan · 2017-09-22T13:28:39Z

base/strings/basic.jl

@@ -269,11 +285,32 @@ function prevind(s::AbstractString, i::Integer)
    return 0 # out of range
 end

+function prevind(s::AbstractString, i::Integer, nchar::Integer)
+    nchar > 0 || throw(ArgumentError("nchar must be greater than 0"))


Why not accept 0?

Handling of 0:

requires separate logic (additional if)

it is unclear what the result should be if i is not a proper byte index (if one calls prevind one expects to get a proper byte index in return - this is an important invariant of those functions in my opinion)

nalimilan · 2017-09-22T13:32:24Z

base/strings/basic.jl

+            for j = j+1:e
+                isvalid(s,j) && break
+            end
+            isvalid(s,j) || next(s,e)[2] # out of range


I don't understand why you need to call isvalid again here, since it has been called in the last iteration already.

You are right. Line 361 was not needed at all as if j==e after the loop we are sure that s[e] is valid (as this is the contract of endof)

bkamins · 2017-09-22T14:40:57Z

In the tests I have collected the common part.

nalimilan · 2017-09-22T14:41:52Z

base/strings/basic.jl

@@ -355,10 +355,14 @@ function nextind(s::AbstractString, i::Integer, nchar::Integer)
        else
            j > e && return j+1
            j == e && return next(s,e)[2]


And isn't this one redundant, since the loop will be a no-op anyway?

Since you've removed the call to next below I guess this one needs to stay.

nalimilan · 2017-09-22T14:43:06Z

base/strings/basic.jl

@@ -255,7 +255,7 @@ end

 Get the previous valid string index before `i`.
 Returns a value less than `1` at the beginning of the string.
-If `nchar` argument is given the function goes back `nchar` charsacters.
+If `nchar` argument is given the function goes back the `nchar` characters.


Sorry, my suggestion about "the" was about "If the nchar...", not about the second occurrence (but I'm not a native speaker).

bkamins · 2017-09-22T14:50:23Z

The location of "the" is corrected in the comment.

nalimilan · 2017-09-22T15:18:30Z

base/strings/basic.jl

+                j -= 1
+            end
+        end
+        j < 1 && return j


Doesn't the one-argument function return 0 in that case?

BTW, the CI failures are due to whitespace in tests.

It gets triggered only if nchar>1 so it did not apply (for nchar=1 everything was OK) and previously I have made it consistent with String implementation. But you are right that it is more consistent to return 0. Fixed

So that means this line isn't tested? It should definitely be, especially since that code is quite tricky (are there other cases like this?).

OK. I have changed the implementation to satisfy a contract that iterating nextind/prevind k times should give the same as nextind/prevind with nchar=k. It is also fully tested.
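A minimal sketch of that contract, assuming Julia 1.x, where the three-argument methods are available in Base. The helper names `step_nextind` and `step_prevind` are hypothetical and exist only for this illustration. The check stays inside the string, so every single step is well defined.

```julia
# Assumes Julia 1.x. `step_nextind` and `step_prevind` are hypothetical
# helpers that apply the single-step methods k times.
function step_nextind(s::AbstractString, i::Integer, k::Integer)
    for _ in 1:k
        i = nextind(s, i)
    end
    return i
end

function step_prevind(s::AbstractString, i::Integer, k::Integer)
    for _ in 1:k
        i = prevind(s, i)
    end
    return i
end

s = "∀α>β"                      # valid indices: 1, 4, 6, 7
n = length(s)                   # 4 characters
past_end = ncodeunits(s) + 1    # 9, one past the last code unit

# Moving k characters in one call matches k single steps.
for k in 1:n
    @assert nextind(s, 1, k) == step_nextind(s, 1, k)
    @assert prevind(s, past_end, k) == step_prevind(s, past_end, k)
end
```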

StefanKarpinski · 2017-09-22T17:47:06Z

That seems like a sensible contract to me... what else would it mean?

StefanKarpinski · 2017-09-22T17:51:28Z

To me the interesting corner cases are when you start out on a byte that's in the middle of a character – IIRC, there used to be a thisind function to get you to the start of the current character in that situation. We could now use prevind(s, i, 0) to get to the start of the current character – the character whose data the index points into – and nextind(s, i, 0) to get to the start of the next character – if you're already at the start of a character, return i, but if you're in the middle of a character, advance through the trailing bytes.
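For readers on later releases, the mid-character case can be sketched as follows. This assumes Julia 1.x, where `thisind` exists in Base; it does not describe the methods as they stood in this PR.

```julia
# Assumes Julia 1.x, where `thisind` is available in Base.
s = "∀α"    # '∀' occupies bytes 1-3, 'α' occupies bytes 4-5
i = 2       # a byte in the middle of '∀'

@assert !isvalid(s, i)        # not the start of a character
@assert thisind(s, i) == 1    # start of the character containing byte 2
@assert nextind(s, i) == 4    # start of the following character
@assert prevind(s, i) == 1    # from mid-character: rewinds to the start of '∀'
@assert thisind(s, 4) == 4    # already at a character start: unchanged
```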

bkamins · 2017-09-22T18:47:46Z

@StefanKarpinski but then:

if you are in the middle of the character:
- prevind(s,i)==prevind(s,i,1)==prevind(s,i,0)
- nextind(s,i)==nextind(s,i,1)==nextind(s,i,0)
if you are at the start of the character:
- prevind(s,i)==prevind(s,i,1)!=prevind(s,i,0)
- nextind(s,i)==nextind(s,i,1)!=nextind(s,i,0)

I would prefer a separate function for this. Why was thisind removed?

bkamins · 2017-09-23T07:13:59Z

I have rebased the PR to fix merge conflicts.

nalimilan · 2017-09-23T09:38:59Z

That seems like a sensible contract to me... what else would it mean?

We could always return 0 when the index would be out of bounds on the left, and endof(s)+1 or next(s, endof(s))[2] when it would be out of bounds on the right. Returning endof(s)+nchar is quite a weird convention, as it behaves as if characters always used a single byte. Anyway, users who rely on these kinds of details probably aren't doing things correctly. I'd say the best convention is the one that is easiest to implement efficiently (i.e. no additional checks for corner cases like this).

Regarding thisind, I'd rather avoid officially supporting indices which are in the middle of a codepoint. There's no way to obtain such an index, except by doing things the wrong way (like i-1). Such indices already throw an error when passed to getindex, and (hopefully soon) SubString will do the same (#22511). So maybe we could allow prevind(s, i, 0) at some point, but I'd rather not allow it unless we have a strong use case.

nalimilan · 2017-09-23T09:42:30Z

CI failures seem legitimate.

bkamins · 2017-09-23T11:18:47Z

I would focus this PR on non-breaking changes (as it is now).
When we have it merged (and also #22511 and #23765 finished) I can handle two breaking changes (as this will involve fixing other code - as we already know and I want to keep PR as small as possible to avoid merge conflicts):

make prevind/nextind always return 0 or next(s,endof(s))[2] when out of string range;
move DirectIndexString to LegacyStrings (and fix the functionality inconsistency)

Regarding thisind I am indifferent. It would have its uses, but I agree with @nalimilan that we should not encourage invalid indexes in general (and we have isvalid already for testing validity). Also if it were to be reintroduced we would have to decide what to do when (I do not know what previous implementation did in such cases):

index is less than 1;
index is greater than or equal to next(s, endof(s))[2];

should thisind throw an error then?

(CI should go through now)

nalimilan · 2017-09-23T11:35:40Z

test/strings/basic.jl

+
+# prevind and nextind tests
+
+let


Sorry for the repeated change requests, but this file has just been ported to using testsets. Could you change the new tests to use @testset (and remove the comment above)?

Done. Thank you for the commitment. Fixing is not a problem, but resolving conflicts when rebasing is a pain 😃.

Yes, I've experienced this recently with my own PRs. There's too much activity in Julia at the moment -- which is actually a good sign!

StefanKarpinski · 2017-09-23T19:22:04Z

Use case: I have a very large file and I want to do parallel line-oriented processing of it. A natural approach is to pick evenly spread out indices, synchronize each to the nearest character, and then search for the preceding newline and start processing there.
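A rough sketch of this use case, assuming Julia 1.x (`thisind`, `findprev`, `ncodeunits`) and an in-memory `String` instead of a file. The function name `line_aligned_starts` is hypothetical, and the actual parallel processing is left out.

```julia
# Assumes Julia 1.x. `line_aligned_starts` is a hypothetical helper that
# returns byte indices at which line-aligned chunks of `s` begin.
function line_aligned_starts(s::String, nchunks::Integer)
    n = ncodeunits(s)
    starts = Int[1]
    for k in 1:nchunks-1
        guess = 1 + (k * n) ÷ nchunks         # evenly spread byte offset
        guess > n && break
        i = thisind(s, guess)                 # synchronize to a character start
        nl = findprev(isequal('\n'), s, i)    # newline at or before i, if any
        # '\n' is a single byte, so the line begins right after it.
        line_start = nl === nothing ? 1 : nl + 1
        if last(starts) < line_start <= n
            push!(starts, line_start)
        end
    end
    return starts
end

s = "αβγ\nδε\nζ\n"                 # 15 bytes; newlines at bytes 7, 12, 15
starts = line_aligned_starts(s, 3)
@assert all(i -> isvalid(s, i), starts)
```

Chunks that would begin on the same line collapse into one entry, so the function may return fewer than `nchunks` starts.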

bkamins · 2017-09-23T20:57:50Z

Let me propose to move the discussion on thisind to #23765 (I hope this PR can be closed with the functionality already implemented).

bkamins · 2017-09-29T10:07:44Z

Any comments before it could be merged? (I would want to move forward with #23765 using this PR)

bkamins mentioned this pull request Sep 21, 2017

Improvement of the functions for handling string indexing #23765

Closed

nalimilan reviewed Sep 21, 2017

ararslan added the strings "Strings!" label Sep 21, 2017

nalimilan reviewed Sep 22, 2017

bkamins added 2 commits September 23, 2017 09:02

improved prevind and nextind

ea7ecb2

fix rebase conflicts

1b72987

bkamins force-pushed the string_fun branch from 5ef8121 to 1b72987 Compare September 23, 2017 07:12

change i to j in loop

dd1d949

nalimilan reviewed Sep 23, 2017

wrap tests in @testset

710a481

nalimilan approved these changes Sep 23, 2017

bkamins mentioned this pull request Sep 26, 2017

invalid character index in printf with unicode #23880

Closed

KristofferC merged commit f02395b into JuliaLang:master Oct 1, 2017

bkamins deleted the string_fun branch October 1, 2017 12:01

bkamins mentioned this pull request Oct 21, 2017

Remove DirectIndexString from Base #24259

Closed

stevengj mentioned this pull request Jun 12, 2018

thisind, 3-arg length/nextind/prevind, codeunit(s) JuliaLang/Compat.jl#573

Merged
