
Iteration fails for strings with unicode characters #33157

Closed
vasile-c opened this issue Sep 4, 2019 · 13 comments

@vasile-c

vasile-c commented Sep 4, 2019

It looks like Unicode characters take up two widths when iterating, with an ERROR: StringIndexError() for the "second" one, while textwidth() accounts for only one width:

julia> "aβc"[1:2]
""

julia> "aβc"[1:3]
ERROR: StringIndexError("aβc", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex(::String, ::UnitRange{Int64}) at ./strings/string.jl:247
 [3] top-level scope at none:0

julia> "aβc"[1:4]
"aβc"

julia> textwidth("aβc")
3

Same when reverse iterating:

julia> "aβc"[1:end]
"aβc"

julia> "aβc"[1:end-1]
ERROR: StringIndexError("aβc", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex(::String, ::UnitRange{Int64}) at ./strings/string.jl:247
 [3] top-level scope at none:0

julia> "aβc"[1:end-2]
""

Maybe also related: #5712, #31780 and #31075?

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Is this a (known) bug, or should we just avoid iterating over strings with Unicode characters?

@KristofferC
Member

textwidth is not what you want:

 help?> textwidth
 search: textwidth

  textwidth(c)

  Give the number of columns needed to print a character.

You might want ncodeunits. Questions like this are better suited for https://discourse.julialang.org
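For the string from this issue, the different measures compare as follows (a minimal sketch of the distinction; β occupies two bytes in UTF-8):

```julia
julia> length("aβc")      # number of characters
3

julia> ncodeunits("aβc")  # number of bytes in the UTF-8 encoding (β takes 2)
4

julia> lastindex("aβc")   # index of the first code unit of the last character
4
```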

@vasile-c
Author

vasile-c commented Sep 4, 2019

The primary issue here was iteration, not textwidth().
I included the textwidth() output just for information, but in retrospect I agree that it may not be the most useful.

I know about https://discourse.julialang.org and look there often, but this looked more like a bug to me.
I expected iteration over a simple string to handle these Unicode peculiarities behind the scenes instead of throwing a StringIndexError().

@KristofferC
Member

KristofferC commented Sep 4, 2019

You are not iterating. You are indexing.

See https://docs.julialang.org/en/v1/manual/strings/#Unicode-and-UTF-8-1.

This means that not every index into a String is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown:
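Concretely, for the string in this issue the valid character indices can be listed with eachindex, and the invalid byte index checked with isvalid (a sketch against the same example string):

```julia
julia> collect(eachindex("aβc"))  # the valid character indices
3-element Array{Int64,1}:
 1
 2
 4

julia> isvalid("aβc", 3)  # byte 3 is the middle of β's two-byte encoding
false
```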

@JeffBezanson
Member

This is iteration:

julia> for c in "aβc"
         println(c)
       end
a
β
c

@vasile-c
Author

vasile-c commented Sep 4, 2019

Indeed, I was not aware of this officially documented behavior (thank you for linking to it).
Otherwise I definitely would have posted this on discourse.

I now understand the reasons behind it and will keep this Unicode string behavior in mind when indexing next time.

It just seemed "broken" from a (data science) user point of view that indexing doesn't follow the same "logic" as iteration, and I didn't even think that it might come from the encoding.

I came across this error when I wanted to exclude the file extension from a string with a Unicode character as the last one before the extension itself:

julia> "[...]_α.csv"[1:end-4]
ERROR: StringIndexError("[...]_α.csv", 8)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex(::String, ::UnitRange{Int64}) at ./strings/string.jl:247
 [3] top-level scope at none:0

Where I expected to get: [...]_α
So here I wasn't even thinking/aware that I was indexing over the Unicode character.
At the least, a note pointing to the documentation inside StringIndexError() would be very helpful (if possible).

But overall, I think that "natural" indexing over strings with [:] should primarily concern valid characters instead of the underlying encoding bytes, and let power users reach for some other appropriate function/operator when they need to index over the raw code units.

After reading the documentation, in order to obtain what I want I should write:

julia> s = "[...]_α.csv"
"[...]_α.csv"

julia> s[collect(eachindex(s))[1:end-4]]
"[...]_α"

Which seems like overkill to me.
But this is only my partially aware opinion and expectation.
There is maybe a solid justification behind this design.

@KristofferC
Member

After reading the documentation, in order to obtain what I want I should write:

No, you should use splitext:

julia> splitext("fdsfdsα.csv")
("fdsfdsα", ".csv")

@JeffBezanson
Member

If string indices referred to characters instead of code units, then S[n] would take O(n) time for very common encodings like UTF-8 and UTF-16.

Also check out prevind and nextind. One common way to do this would be

julia> str = "file.csv"
"file.csv"

julia> str[1:prevind(str,findlast('.', str),1)]
"file"

@StefanKarpinski
Member

Or chop(str, tail=4):

julia> str = "file_α.csv"
"file_α.csv"

julia> chop(str, tail=4)
"file_α"

But since the splitext function is specifically for separating the name and extension, use that.

@vasile-c
Author

vasile-c commented Sep 6, 2019

Thank you all for the different workarounds to my problem at hand. That allowed me to learn some new techniques.

However, I deliberately didn't mention my problem at hand in the initial post, because I was looking for the general way of (character) indexing over a Unicode string.
So it still remains s[collect(eachindex(s))[a:b]] as far as I can tell.
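For what it's worth, a cheaper general-purpose character slice can be written with nextind alone, without materializing all indices. This is a hypothetical helper (not part of Base), assuming valid character positions 1 ≤ a ≤ b ≤ length(s):

```julia
julia> # hypothetical helper: nextind(s, 0, k) gives the index of the k-th character
       charslice(s, a, b) = s[nextind(s, 0, a):nextind(s, 0, b)]
charslice (generic function with 1 method)

julia> charslice("[...]_α.csv", 1, 7)
"[...]_α"
```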

I perfectly understand the O(n) cost, but the question here is: why does s[:] default to indexing over code units instead of characters?
Are there really many more people who rely on s[:] for performance than for usability/consistency?
Why not use a dedicated function/operator for performance, instead of all these workarounds for character indexing?

To my knowledge, [:] refers to the raw bytes only for strings; for anything else that accepts [:], it refers to the underlying meaningful objects.
Also, other languages like Python handle Unicode string indexing "correctly" (i.e. by character).

All this to say that the current Julia behavior (leaky abstraction?) surprised me, and apparently some others too. And as Julia and/or Unicode strings become more and more popular, more and more people will get trapped by the current behavior.

@StefanKarpinski
Member

StefanKarpinski commented Sep 6, 2019

I perfectly understand the O(n) cost, but the question here is why s[:] defaults to indexing over code units instead of characters ?

... because of the O(n) cost. If we defaulted to O(n) indexing, then the obvious working code to do simple things with strings like search and parsing would be O(n^2) in the length of the input string.

Also, other languages like Python handle unicode string indexing "correctly" (= by character).

At very severe cost:

  1. Python cannot handle invalid Unicode input—it just throws an error. It's impossible to write robust string processing programs with Python's string type. You have to use byte arrays if you want to write robust code that doesn't crash on some data.

  2. Python has to examine and transcode every single string it ever touches. Twice: once on the way in and once on the way out. That's "only" O(n) in the length of the input, but it's a really awful performance penalty. It's only mitigated by the fact that in Python if you want something to be fast you have to write it in C anyway, so high-performance "Python" string libraries use C to do the heavy lifting (e.g. searching large bodies of string data) and then try to limit the transcoding tax to being paid on small strings which must be passed into Python.

All this to say that the current Julia behavior (leaky abstraction?) surprised me and some others apparently too.

It's not a leaky abstraction: it works for any kind of string encoding with any kind of code units that have O(1) indexing. It is a more complex abstraction than O(1) character indexing, but there's nothing leaky about it. This is a natural abstraction for variable length encodings.

The bottom line is this: don't do arithmetic on string indices, use string index functions like find*, nextind and prevind. It's slightly annoying but pretty simple and fast. When people do it wrong they get an error that leads them to learn the correct way to do it.
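The recipe above can be sketched against the example string from this issue (a minimal illustration; first also avoids index arithmetic for prefixes):

```julia
julia> s = "aβc"
"aβc"

julia> nextind(s, 1)   # index after the character at index 1
2

julia> nextind(s, 2)   # β is two bytes wide, so the next valid index is 4
4

julia> prevind(s, lastindex(s))  # index of the character before the last one
2

julia> first(s, 2)     # first two characters, no index arithmetic needed
"aβ"
```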

@vasile-c
Author

vasile-c commented Sep 7, 2019

... because of the O(n) cost. If we defaulted to O(n) indexing, then the obvious working code to do simple things with strings like search and parsing would be O(n^2) in the length of the input string.

I never meant to lose the O(1) capability, just to attribute it to a function/operator other than the s[_:_] notation.

Schematically, what I'm suggesting:

  • reserve the s[_:_] notation for character O(n) indexing
  • something else for performant O(1) indexing over code units

Thus you get both: unsurprising, "natural" character indexing over strings with the s[_:_] notation,
while still being able to write performance-critical code with a dedicated O(1) function/operator.
This way the adaptability "burden" falls on the developers/power users instead of the "standard" users.

But to make it clear, I certainly do not grasp the deep implications this kind of change to the behavior of the s[_:_] notation could have on the existing code base, nor whether it would be easy/feasible/desirable to refactor it in order to keep the O(1) performance. So I will stop bothering you with this; I hope it's now clear what I meant.

It's not a leaky abstraction: it works for any kind of string encoding with any kind of code units that have O(1) indexing. It is a more complex abstraction than O(1) character indexing, but there's nothing leaky about it. This is a natural abstraction for variable length encodings.

I was pointing out a potential leaky abstraction with respect to the [_:_] notation, not indexing over strings in general.
When you have an object (e.g. vector, matrix, dataframe, etc.) that accepts the [_:_] notation, you index over its "meaningful" parts, not the raw bytes encoding them.
Moreover, when you feed [_:_] otherwise valid indices, you certainly never expect to get an error. That is where I see a leaky abstraction.

The bottom line is this: don't do arithmetic on string indices, use string index functions like find*, nextind and prevind. It's slightly annoying but pretty simple and fast. When people do it wrong they get an error that leads them to learn the correct way to do it.

Here is the issue: we have always been taught that "you can index over the characters of a string just like you do with a simple array". This held true either because of a one-byte ASCII-like encoding, or at a higher performance cost.
Thus, this isolates Julia from other languages, and even inside Julia it represents an inconsistency in [_:_]-like indexing from my point of view (see above).

To conclude, this Unicode string indexing peculiarity will certainly not force me to stop using Julia. Like you said, it will annoy me until I perhaps end up seeing the benefits (I hope). Above all, I just wanted to report my little feedback "from the field".

@StefanKarpinski
Member

People will always reach for the convenient notation first, and if that's slow, it's an unacceptable performance trap. This is a matter of perspective: Python is slow and doesn't really care about performance traps so much. Julia is fast and we care about performance traps. I prefer a design that is fast by default and gives a sensible error if you use it wrong—which is precisely what string indexing does. In any case, this is how strings work in Julia until 2.0 at least. With more compiler tech, we might be able to create a StringIndex type where incrementing by an integer increases the index by the right number of bytes, but that's not currently possible to make efficient.

When you have an object (e.g. vector, matrix, dataframe, etc.) that accepts the [:] notation, you index over its "meaningful" parts, not the raw bytes encoding them.

Characters aren't really even the right level of grouping then. Why not graphemes or even grapheme clusters? Should string indexing do Unicode normalization behind your back? These are not ridiculous questions, this is what Swift does for string indexing.
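The grapheme point can be illustrated with the Unicode stdlib: "é" written as 'e' plus a combining accent is two characters but one user-perceived character (a sketch; exact REPL display may vary across Julia versions):

```julia
julia> using Unicode

julia> s = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
"é"

julia> length(s)  # two Chars
2

julia> length(collect(graphemes(s)))  # but one grapheme
1

julia> length(Unicode.normalize(s, :NFC))  # NFC folds them into one codepoint
1
```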

Here is the issue: we have always been taught that "you can index over the characters of a string just like you do with a simple array".

Thus, this isolates Julia from other languages, and even inside Julia it represents an inconsistency in [_:_]-like indexing from my point of view (see above).

You've learned Python, which does things that way. Other languages don't. In C, for example, if you want to write Unicode-correct code, you cannot assume that every integer index starts a character—you have to write code much like you do in Julia, finding the valid indices, but without all the handy built-in functions for navigating code units and without the helpful errors when you do it wrong. Rust doesn't allow string indexing at all. In Go, indexing into a string gives you raw bytes. Swift does some very complex stuff and I believe indexes by byte offset and iterates and compares strings by normalized grapheme clusters. There's a lot of diversity in string indexing in the modern world. This is a hard problem with no broadly accepted solution. Even in the Python community, there seems to be significant regret about how they chose to do strings in Python 3.

To conclude, this Unicode string indexing peculiarity will certainly not force me to stop using Julia. Like you said, it will annoy me until I perhaps end up seeing the benefits (I hope). Above all, I just wanted to report my little feedback "from the field".

I'm glad it won't put you off the language and thank you for the report.

@vasile-c
Author

With more compiler tech, we might be able to create a StringIndex type where incrementing by an integer increases the index by the right number of bytes, but that's not currently possible to make efficient.

This would be certainly great if it could be part of Julia 2 🤞

Also, thanks to your pointers, I now better realize that this is not actually just a Julia issue: other languages face the same problem, with no accepted "default" behavior yet for Unicode string indexing.
With the additional constraint that Julia aims to solve the "two languages" problem, making a design decision (satisfying both low-level and high-level users) is even harder.

People will always reach for the convenient notation first, and if that's slow, it's an unacceptable performance trap. This is a matter of perspective: Python is slow and doesn't really care about performance traps so much. Julia is fast and we care about performance traps. I prefer a design that is fast by default and gives a sensible error if you use it wrong—which is precisely what string indexing does.

Now I better understand your vision for Julia and can thus better embrace it.
Thank you for taking the time to respond to this kind of question.
It was very much appreciated and (hopefully) useful to others asking themselves the same thing.
