Iteration fails for strings with unicode characters #33157
You might want |
Primary issue here was iteration, not indexing. I know about https://discourse.julialang.org and look there often, but this looked more like a bug to me. |
You are not iterating. You are indexing. See https://docs.julialang.org/en/v1/manual/strings/#Unicode-and-UTF-8-1.
|
This is iteration:
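A minimal sketch of the contrast, assuming a short example string (error shown in the format of the Julia 1.x REPL used in this thread):

julia> s = "αβc"
"αβc"

julia> for c in s              # iteration: visits each character in turn
           println(c)
       end
α
β
c

julia> s[2]                    # indexing: 2 is a code unit inside 'α'
ERROR: StringIndexError("αβc", 2)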
|
Indeed, I was not aware of this officially documented behavior (thank you for linking to it). I now understand the reasons behind it and will keep this unicode-string behavior in mind the next time I index. It just seemed "broken" from a (data science) user point of view that indexing doesn't follow the same "logic" as iteration, and I didn't even think that it might come from the encoding. I came across this error when I wanted to exclude the file extension from a string with a unicode character as the last one before the extension itself:

julia> "[...]_α.csv"[1:end-4]
ERROR: StringIndexError("[...]_α.csv", 8)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] getindex(::String, ::UnitRange{Int64}) at ./strings/string.jl:247
[3] top-level scope at none:0

Where I expected to get "[...]_α". But overall, I think that the "natural" indexing over strings with unicode characters should follow the same "logic" as iteration. After reading the documentation, in order to obtain what I want I should write:

julia> s = "[...]_α.csv"
"[...]_α.csv"
julia> s[collect(eachindex(s))[1:end-4]]
"[...]_α" Which seems like an overkill to me. |
No, you should use
|
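Two standard options for this particular case are prevind and first; a hedged sketch (the exact call recommended above may have differed), reusing the filename from a later comment:

julia> s = "file_α.csv"
"file_α.csv"

julia> s[1:prevind(s, end, 4)]   # step back 4 characters from the last valid index
"file_α"

julia> first(s, length(s) - 4)   # or: keep all but the last 4 characters
"file_α"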
If string indices referred to characters instead of code units, then every indexing operation would have to be O(n) instead of O(1). Also check out the helper functions for navigating string indices.
|
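A sketch of the usual index-navigation helpers, thisind, nextind and prevind (an assumption about which functions are meant here), again on the filename example:

julia> s = "file_α.csv"
"file_α.csv"

julia> ncodeunits(s), length(s)   # 11 code units, but only 10 characters
(11, 10)

julia> isvalid(s, 7)              # byte 7 is the second code unit of 'α'
false

julia> thisind(s, 7)              # snap back to the start of that character
6

julia> nextind(s, 6), prevind(s, 6)
(8, 5)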
Or:

julia> str = "file_α.csv"
"file_α.csv"
julia> chop(str, tail=4)
"file_α" But since the |
Thank you all for the different workarounds to my problem at hand; that allowed me to learn some new techniques. However, I deliberately didn't mention my problem at hand in the initial post because I was looking for the general way of (character) indexing over a unicode string. I perfectly understand the O(n) cost, but the question here is why character indexing is not the default behavior of []. To my knowledge, most languages let you index a string by character position, just like an array. All this to say that the current Julia behavior (leaky abstraction?) surprised me, and apparently some others too. And as Julia and/or unicode strings become more and more popular, more and more people will get trapped by the current behavior. |
... because of the O(n) cost. If we defaulted to O(n) indexing, then the obvious working code to do simple things with strings like search and parsing would be O(n^2) in the length of the input string.
At very severe cost:
It's not a leaky abstraction: it works for any kind of string encoding with any kind of code units that have O(1) indexing. It is a more complex abstraction than O(1) character indexing, but there's nothing leaky about it. This is a natural abstraction for variable length encodings. The bottom line is this: don't do arithmetic on string indices, use string index functions like |
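As an added illustration (not from the comment itself): scanning a string with the index functions keeps every step O(1), so a whole pass stays O(n) instead of O(n^2).

# Count occurrences of a character by walking the valid code-unit indices;
# each nextind call is O(1), so the scan is linear in the string length.
function count_char(s::AbstractString, c::AbstractChar)
    n, i = 0, firstindex(s)
    while i <= lastindex(s)
        s[i] == c && (n += 1)
        i = nextind(s, i)
    end
    return n
end

count_char("file_α.csv", 'α')   # returns 1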
I never meant to lose the O(1) capability, just to attribute it to another function/operator than the [] indexing syntax. Schematically, what I'm suggesting:
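One possible shape of such a scheme, as a purely hypothetical sketch (this is not the author's actual schematic, nor a real or proposed Julia API): slicing by character position pays an explicit O(n) scan, while the existing code-unit indexing keeps its O(1) behavior.

# Hypothetical illustration only: character-position slicing as the "natural"
# notation, built on top of the current code-unit API.
function charrange(s::AbstractString, r::AbstractUnitRange{<:Integer})
    idx = collect(eachindex(s))      # valid character start positions, O(n)
    return s[idx[first(r)]:idx[last(r)]]
end

charrange("file_α.csv", 1:6)         # "file_α"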
Thus you have both: unsurprising "natural" character indexing over strings with the [] syntax, and O(1) code-unit access through another function. But to make it clear, I certainly do not realize the deep implications this kind of change in the behavior of the [] operator would have.
I was pointing out a potential leaky abstraction with respect to the [] indexing of strings.
Here is the issue: we have always been taught that "you can index over the characters of a string just like you do with a simple array". This held true either because of a one-byte ASCII-like encoding, or at a higher performance cost. To conclude, this unicode string indexing peculiarity will certainly not stop me from using Julia. Like you said, it will annoy me until the moment when I perhaps end up seeing the benefits (I hope). Above all, I just wanted to report my little feedback "from the field". |
People will always reach for the convenient notation first, and if that's slow, it's an unacceptable performance trap. This is a matter of perspective: Python is slow and doesn't really care about performance traps so much. Julia is fast and we care about performance traps. I prefer a design that is fast by default and gives a sensible error if you use it wrong, which is precisely what string indexing does. In any case, this is how strings work in Julia until 2.0 at least. With more compiler tech, we might be able to create a string type that offers character-based indexing without giving up that performance.
Characters aren't really even the right level of grouping then. Why not graphemes or even grapheme clusters? Should string indexing do Unicode normalization behind your back? These are not ridiculous questions; this is what Swift does for string indexing.
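For example (a small sketch using the Unicode standard library), a combining accent makes one user-perceived character out of two code points:

julia> using Unicode

julia> s = "e\u0301"          # 'e' followed by a combining acute accent
"é"

julia> length(s)              # two characters (code points)
2

julia> length(graphemes(s))   # but a single grapheme cluster
1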
You've learned Python, which does things that way. Other languages don't. In C, for example, if you want to write Unicode-correct code, you cannot assume that every integer index starts a character—you have to write code much like you do in Julia, finding the valid indices, but without all the handy built-in functions for navigating code units and without the helpful errors when you do it wrong. Rust doesn't allow string indexing at all. In Go, indexing into a string gives you raw bytes. Swift does some very complex stuff and I believe indexes by byte offset and iterates and compares strings by normalized grapheme clusters. There's a lot of diversity in string indexing in the modern world. This is a hard problem with no broadly accepted solution. Even in the Python community, there seems to be significant regret about how they chose to do strings in Python 3.
I'm glad it won't put you off the language and thank you for the report. |
This would certainly be great if it could be part of Julia 2 🤞 Also, thanks to your pointers, I now better realize that this is not actually just a Julia issue: other languages face the same problem, with no accepted "default" behavior yet for unicode string indexing.
Now I better understand your vision for Julia and thus can better embrace it. |
It looks like unicode characters take two "widths" when iterating, raising

ERROR: StringIndexError()

for the "second" one, even though they account for only one width with textwidth(). The same happens when reverse iterating.

Maybe related to #5712, #31780 and #31075?

Is this a (known) bug, or should we just avoid iterating over strings with unicode characters?
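A minimal sketch of the kind of loop that triggers this, assuming a short example string (the original snippets may have differed):

julia> s = "αβ"
"αβ"

julia> for i in 1:length(s)       # looks natural, but 2 is not a valid index
           println(s[i])
       end
α
ERROR: StringIndexError("αβ", 2)

julia> textwidth('α')             # yet the character has display width 1
1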