Improved nextind and prevind #23805
Does anyone have an advice how to test |
I have added a custom |
The other change is that maximum from |
base/strings/basic.jl
Outdated
@@ -236,14 +236,33 @@ end

## Generic indexing functions ##

prevind(s::DirectIndexString, i::Integer) = Int(i)-1
nextind(s::DirectIndexString, i::Integer) = Int(i)+1
function prevind(s::DirectIndexString, i::Integer, nchar::Integer=1)
The definition of `DirectIndexString` is that each codepoint takes one byte. So you can just do `Int(i-nchar-1)`.
base/strings/basic.jl
Outdated
Get the previous valid string index before `i`.
Returns a value less than `1` at the beginning of the string.
Get the `nchar`-th valid string index before `i`.
Returns a `start(str)-1` at the beginning of the string.
`start(str)` always returns 1, so no need to change that.
base/strings/basic.jl
Outdated
Returns a value less than `1` at the beginning of the string.
Get the `nchar`-th valid string index before `i`.
Returns a `start(str)-1` at the beginning of the string.
If `i>endof(str)` then `endof(s)` is considered a first valid string index.
"a first" -> "the first"
base/strings/basic.jl
Outdated
if i > e
return e
function prevind(str::AbstractString, i::Integer, nchar::Integer=1)
nchar > 0 || error("nchar must be greater than 0")
Use `ArgumentError`.
base/strings/basic.jl
Outdated
function prevind(str::AbstractString, i::Integer, nchar::Integer=1)
nchar > 0 || error("nchar must be greater than 0")
j = Int(i)
j <= start(str) && return start(str)-1
Same remark about `start`.
base/strings/basic.jl
Outdated
return j
end

while nchar > 0 && j >= start(str)
Would there be a way to avoid checking `j` in both loops?
I wonder why we still have |
@nalimilan Thank you for the comments. Considering them led me to redesign the implementation. In particular I left old For instance for Also, as the PR introduced some unexpected errors on CI I will let you know when I am confident it is finished and ready for review. |
I wouldn't use two different functions unless there's a clear performance gain for the one-argument case. Having different methods means they are less tested, and as you said they can be inconsistent. Also don't worry too much about More generally, if the handling of out of range indices is a problem for the complexity or performance of the methods, we could choose a different rule (and/or not document it). It probably doesn't make a difference in practice whether we return |
It seems now it will go through CI so it should be good for a review.
Performance benchmark:
Additionally I have checked that for |
Thanks for running the benchmarks. Indeed that's a significant difference. I wouldn't have expected it to be so large ("setting up a loop" should be mostly equivalent to checking The CI errors you got still worry me a bit. Even if we keep the one-argument methods for performance, the two-argument ones should give the same result. So I think the CI should pass without the one-argument methods. |
NEWS.md
Outdated
@@ -224,6 +224,9 @@ This section lists changes that do not have deprecation warnings.
Library improvements
--------------------

* THe functions `nextind` and `prevind` now accept `nchar` argument that indicates
"The"
fixed
base/strings/basic.jl
Outdated
"""
prevind(str::AbstractString, i::Integer)
prevind(str::AbstractString, i::Integer, nchar::Integer)
Just use a single line with `nchar=1`. Then below you can simply say that the function goes `nchar` characters, implementation details do not matter.
fixed
test/strings/basic.jl
Outdated
@test_throws ArgumentError nextind(str, 20, 0)
end

let str = GenericString("∀α>β:α+1>β")
Couldn't this block be merged with the previous one using a `for` loop?
Unfortunately no - `nextind` and `prevind` return something different in corner cases for `String` and `AbstractString`. Both are correct wrt the contract given in the docstring but give different results.
But maybe have a common part and a separate part, so that it's easier to spot the differences?
base/strings/string.jl
Outdated
j -= 1
end
end
j <= 0 && return j
This check wasn't present in the original code. Would there be a way to avoid it?
This check is needed to make sure that the `nextind(s,i,1)==nextind(s,i)` contract always holds.
Yeah, but I wondered whether there would be a way to avoid it by adapting the code logic. Maybe not.
I do not see one (and this is cheap).
base/strings/string.jl
Outdated
@@ -104,6 +104,27 @@ function prevind(s::String, i::Integer)
@inbounds while j > 0 && is_valid_continuation(codeunit(s,j))
j -= 1
end
Unrelated change. (Below you also have an empty line which isn't there in other functions.)
fixed
@nalimilan Now CI would pass without one argument methods as I have made the contracts The problem is that with This (the use of undocumented properties of |
base/strings/basic.jl
Outdated
Get the previous valid string index before `i`.
Returns a value less than `1` at the beginning of the string.
If `nchar` argument is given the function goes back `nchar` charsacters.
"the `nchar`" and "characters".
fixed
@@ -269,11 +285,32 @@ function prevind(s::AbstractString, i::Integer)
return 0 # out of range
end

function prevind(s::AbstractString, i::Integer, nchar::Integer)
nchar > 0 || throw(ArgumentError("nchar must be greater than 0"))
Why not accept `0`?
Handling of `0`:
- requires separate logic (additional `if`)
- is unclear what should be the result if `i` is not a proper byte index (if one calls `prevind` one expects to get a proper byte index in return - this is an important invariant of those functions in my opinion)
base/strings/basic.jl
Outdated
for j = j+1:e
isvalid(s,j) && break
end
isvalid(s,j) || next(s,e)[2] # out of range
I don't understand why you need to call `isvalid` again here, since it has been called in the last iteration already.
You are right. Line 361 was not needed at all as if `j==e` after the loop we are sure that `s[e]` is valid (as this is the contract of `endof`).
In the tests I have collected the common part.
base/strings/basic.jl
Outdated
@@ -355,10 +355,14 @@ function nextind(s::AbstractString, i::Integer, nchar::Integer)
else
j > e && return j+1
j == e && return next(s,e)[2]
And isn't this one redundant, since the loop will be a no-op anyway?
Since you've removed the call to `next` below I guess this one needs to stay.
exactly
base/strings/basic.jl
Outdated
@@ -255,7 +255,7 @@ end

Get the previous valid string index before `i`.
Returns a value less than `1` at the beginning of the string.
If `nchar` argument is given the function goes back `nchar` charsacters.
If `nchar` argument is given the function goes back the `nchar` characters.
Sorry, my suggestion about "the" was about "If the `nchar`...", not about the second occurrence (but I'm not a native speaker).
The location of "the" is corrected in the comment.
base/strings/basic.jl
Outdated
j -= 1
end
end
j < 1 && return j
Doesn't the one-argument function return `0` in that case?
BTW, the CI failures are due to whitespace in tests.
It gets triggered only if `nchar>1` so it did not apply (for `nchar=1` everything was OK) and previously I have made it consistent with `String` implementation. But you are right that it is more consistent to return `0`. Fixed
So that means this line isn't tested? It should definitely be, especially since that code is quite tricky (are there other cases like this?).
OK. I have changed the implementation to satisfy a contract that iterating `nextind`/`prevind` `k` times should give the same as `nextind`/`prevind` with `nchar=k`. It is also fully tested.
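The contract stated in that comment can be checked directly. The following is only a minimal sketch, assuming a current Julia 1.x release rather than the code in this PR: there the count is the third positional argument of `nextind`/`prevind`, and indices outside the string throw a `BoundsError`, so the loops below deliberately stay inside the string. `repeat_step` is a hypothetical helper introduced for illustration.

```julia
# Hypothetical helper: apply a one-step index function (`nextind` or
# `prevind`) `k` times in a row, starting from index `i`.
function repeat_step(step, s, i, k)
    for _ in 1:k
        i = step(s, i)
    end
    return i
end

s = "∀α>β:α+1>β"
starts = collect(eachindex(s))   # start index of every character in `s`

for (p, i) in enumerate(starts)
    # Moving forward k characters from character p lands on character p + k.
    for k in 1:(length(starts) - p)
        @assert nextind(s, i, k) == repeat_step(nextind, s, i, k) == starts[p + k]
    end
    # Moving back k characters from character p lands on character p - k.
    for k in 1:(p - 1)
        @assert prevind(s, i, k) == repeat_step(prevind, s, i, k) == starts[p - k]
    end
end
```

Because the test string mixes one-, two- and three-byte characters, the check exercises steps of different byte widths in both directions.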
That seems like a sensible contract to me... what else would it mean? |
To me the interesting corner cases are when you start out on a byte that's in the middle of a character – IIRC, there used to be a |
@StefanKarpinski but then:
I would prefer a separate function for this. Why |
I have rebased the PR to fix merge conflicts. |
We could always return 0 when the index would be out of bounds on the left, and Regarding |
CI failures seem legitimate. |
I would focus this PR on non-breaking changes (as it is now).
Regarding
shoud (CI should go through now) |
test/strings/basic.jl
Outdated
# prevind and nextind tests

let
Sorry for the repeated change requests, but this file has just been ported to using testsets. Could you change the new tests to use `@testset` (and remove the comment above)?
Done. Thank you for the commitment. Fixing is not a problem, but resolving conflicts when rebasing is a pain 😃.
Yes, I've experienced this recently with my own PRs. There's too much activity in Julia at the moment -- which is actually a good sign!
Use case: I have a very large file and I want to do parallel line-oriented processing of it. A natural approach is to pick evenly spread out indices, synchronize each one to the nearest character, and then search for the preceding newline and start processing there.
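A rough sketch of that use case, assuming a current Julia 1.x release rather than the code under review: `thisind` and `findprev` are taken from present-day Base, and `line_start`, `text`, `offsets` and `chunk_starts` are hypothetical names used only for illustration.

```julia
# Hypothetical helper (not part of this PR): map an arbitrary byte offset `i`
# to the index at which the line containing that byte starts.
function line_start(s::String, i::Integer)
    1 <= i <= ncodeunits(s) || throw(BoundsError(s, i))
    j = thisind(s, i)                   # start of the character that covers byte i
    k = findprev(isequal('\n'), s, j)   # nearest newline at or before j, or nothing
    return k === nothing ? 1 : nextind(s, k)
end

text = "αβγ\nδε\nζ"                     # 14 bytes; newlines at bytes 7 and 12

# Evenly spaced byte offsets, which may fall in the middle of a character.
offsets = 1:4:ncodeunits(text)

# Snap each offset to a line start and drop repeats.
chunk_starts = unique([line_start(text, i) for i in offsets])
```

In this sketch an offset that lands exactly on a newline is mapped to the start of the following line; a real splitter might prefer a different convention.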
Let me propose to move the discussion on |
Any comments before it could be merged? (I would want to move forward with #23765 using this PR) |
First part of implementation of #23765. Extends `prevind` and `nextind` with `nchar` parameter.
Additionally ensures `prevind` and `nextind` have consistent return value across different types of strings. In particular:
- `prevind` always returns a result between `start(str)-1` and `endof(str)`
- `nextind` always returns a result between `start(str)` and `endof(str)+1`