Ensure non-ASCII decimal digits are also isdigit
#54447
Conversation
The current (ASCII-only) definition of `isdigit` …
That may be true, but on the other hand, these uses relying on ASCII-only are all undocumented assumptions :/ Worse, the failures in the Unicode stdlib are real - I only tested with:

```julia
julia> for c in ['٣', '٥', '٨', '¹', 'ⅳ']
           show(stdout, MIME"text/plain"(), c)
           println()
       end
'٣': Unicode U+0663 (category Nd: Number, decimal digit)
'٥': Unicode U+0665 (category Nd: Number, decimal digit)
'٨': Unicode U+0668 (category Nd: Number, decimal digit)
'¹': Unicode U+00B9 (category No: Number, other)
'ⅳ': Unicode U+2173 (category Nl: Number, letter)
```

Doubly worse, the test failing here is in a … Would adding a predicate …
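For readers outside Julia, the same category distinction can be checked in Python, whose standard-library `unicodedata` module exposes the Unicode General Category (a sketch for illustration; this is not part of the Julia code under discussion):

```python
import unicodedata

# The characters from the REPL session above: only those in category "Nd"
# (Number, decimal digit) count as decimal digits in the Unicode sense.
for c in ['٣', '٥', '٨', '¹', 'ⅳ']:
    cat = unicodedata.category(c)
    print(f"U+{ord(c):04X} {cat} decimal={cat == 'Nd'}")
```

This matches the REPL output above: the superscript one (No) and the Roman numeral (Nl) are numbers, but not decimal digits.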
Looking through all those old PRs/issues, maybe @stevengj has some additional context/guidance here :)
The docs for `isdigit` …

```julia
julia> c = '４'
'４': Unicode U+FF14 (category Nd: Number, decimal digit)

julia> c in '0':'9'
false
```

So I don't see an issue with the current behavior.
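The same tension, sketched in Python for comparison (the fullwidth digit falls outside the ASCII range, yet Unicode assigns it both the Nd category and a digit value):

```python
import unicodedata

c = '４'  # FULLWIDTH DIGIT FOUR, U+FF14
print('0' <= c <= '9')          # False: outside the ASCII '0'..'9' range
print(unicodedata.category(c))  # 'Nd': a decimal digit per Unicode
print(unicodedata.digit(c))     # 4: its script-independent digit value
```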
There are more than just the ASCII decimal digits in Unicode, hence the example above with a 4 from a different script ;) Both that ４ and the ASCII 4 are decimal digits, and would thus fit the docstring. See e.g. here for a list.
Yes, the docstring should be updated to be explicit about only including ASCII decimal digits then, but I have code that would break if this function returns `true` for other chars out of the range `'0':'9'`.
If I remember correctly, this was following the C `isdigit` function.
I would favor @Seelengrab's proposed behaviour, for the following reasons: …
The TOML parser is bugged with this:

```julia
julia> TOML.parse("４= \"value\"")
ERROR: TOML Parser error:
none:1:1 error: invalid bare key character: '４'
    ４= "value"
    ^
Stacktrace:
 ...

julia> @eval Base.Unicode begin
           isdigit(c::AbstractChar) = category_code(c) == UTF8PROC_CATEGORY_ND
       end;

julia> TOML.parse("４= \"value\"")
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'４', [4]=>'='
Stacktrace:
  [1] string_index_err(s::String, i::Int64)
    @ Base ./strings/string.jl:12
  [2] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:35
  [3] SubString
    @ ./strings/substring.jl:49 [inlined]
  [4] SubString
    @ ./strings/substring.jl:52 [inlined]
  [5] take_substring
    @ ./toml_parser.jl:440 [inlined]
  [6] _parse_key(l::Base.TOML.Parser)
    @ Base.TOML ./toml_parser.jl:623
```
Checking for digits feels quite common inside hot parsing loops. Also, if existing uses assumed that `isdigit` implies ASCII, … Instead of slowing down all uses of this in the ecosystem for the reason "this implementation is a tiny bit more correct", could a new function be added (and referred to from the `isdigit` docstring)?
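The `StringIndexError` above comes from byte-index arithmetic that assumes each digit occupies one byte. A quick check (sketched in Python here; UTF-8 byte lengths are language-independent) shows why non-ASCII decimal digits violate that assumption:

```python
# Each of these characters is a decimal digit (Unicode category Nd),
# but only the ASCII one is a single byte in UTF-8 -- so advancing a
# byte index by 1 per digit lands mid-character for the others.
for c in ['4', '٣', '４']:
    print(f"U+{ord(c):04X}: {len(c.encode('utf-8'))} byte(s)")
```

The fullwidth ４ is three bytes long, which is exactly why the parser's `SubString` call lands on the invalid index 3.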
IMO that is more an issue with assuming that a digit is always a single code unit, which isn't the case in Unicode. IMO this should use … There's this bit (lines 589 to 592 in 5006312):

but that seems inconsistent with the spec: …
I'm happy to add tests & fixes to Base here, but the fact that only the Unicode-stdlib tests failed suggests to me that this is either not well tested, or not that big of a change after all.
Having written some of those hot parsing loops myself, it's quite common to replace the category code-based functions from Base with custom ones specialized for ASCII already - what's one more? IMO, utility functionality provided by Base, working on types defined in Base, should handle the canonical interpretation of those types, and not a special case. That is, the utility functions in this case should follow what Unicode considers a decimal digit.
That's also an option, but then we have the odd situation that only … To be clear, the usual arithmetic people sometimes do (e.g. `c - '0'`) …
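The `c - '0'` style arithmetic mentioned above only works within a single contiguous digit block. A Python sketch of the difference (Unicode allocates each script's decimal digits as a contiguous run of ten, so subtracting that script's zero works, but mixing scripts does not):

```python
import unicodedata

print(ord('7') - ord('0'))     # 7: fine within ASCII
print(ord('٣') - ord('0'))     # 1587: nonsense across scripts
print(ord('٣') - ord('٠'))     # 3: fine within the Arabic-Indic block
print(unicodedata.digit('٣'))  # 3: the script-independent answer
```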
The TOML spec is here https://toml.io/en/v1.0.0#keys
As I wrote and demonstrated, there is code that assumes that `isdigit` only matches characters in `'0':'9'`.
Thanks for pointing that out! Seems like the version I linked is the in-development version, so a future version of TOML is likely going to support Unicode there.
I understand that, but this is a largely undocumented assumption. @jakobnissen also found an issue with the current implementation, which doesn't handle malformed data properly:

```julia
julia> isdigit(reinterpret(Char, 0x30000001))
true
```

which this PR does handle, returning `false`.
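As an aside for contrast (a Python illustration, not part of the PR): Python sidesteps the malformed-scalar question entirely, because `chr` refuses to construct out-of-range code points, whereas Julia's `Char` can carry arbitrary bytes and so its predicates must decide what to return for them:

```python
# 0x30000001 is far beyond the Unicode range (max U+10FFFF), so Python
# cannot even build such a character; Julia's reinterpret(Char, ...) can.
try:
    chr(0x30000001)
except ValueError as e:
    print(e)
```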
Clearly, changing `isdigit` would be breaking. The question is whether we should export a new function that returns `true` for all Unicode decimal digits. It seems more useful, and more general, to simply export some variant of …
It seems we already have it:

julia/stdlib/Unicode/test/runtests.jl — lines 132 to 142 in 2877cbc
No, that returns …
This PR is about `isdigit` …
This would just be all kinds of breaking. Just skim across all the usages of `isdigit` …
The `isdigit` function only checks for ASCII digits — this PR clarifies the docs to make that explicit. See #54447 (comment). Closes #54447.
I noticed that `isdigit` returns `false` for non-ASCII decimal digits, even though Unicode classifies them as decimal digits (category Nd). The docstring doesn't say that this only checks ASCII, and the Unicode category (see here) is an exact match for the definition (ten glyphs making up the numbers zero through nine, in various scripts). So while some things will still slip through (e.g. 四, which is considered a letter by Unicode), this implementation is a tiny bit more correct.

However, it's not just roses and sunshine; there is a downside, in that this won't vectorize anymore due to the `ccall` behind `category_code`. We could bring the double table lookup into pure Julia, but I'm not sure whether that would restore performance here. Maybe it could? As is, this PR comes with a 100x regression in a microbenchmark.

v1.11-alpha2:
This PR:
In a single-invocation benchmark, we "merely" have a 2-3x/10ns regression:
v1.11-alpha2:
This PR:
That being said, we already check `category_code` pretty much everywhere else, and this microbenchmark is extremely unlikely to be representative of an actual workload.
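One possible mitigation for the regression (my own sketch, not something proposed in the thread): keep an ASCII fast path in front of the category lookup, so the overwhelmingly common case stays cheap while non-ASCII input still gets the Unicode-correct answer. Illustrated in Python via `unicodedata`:

```python
import unicodedata

def is_decimal_digit(c: str) -> bool:
    # Cheap ASCII range check first; only fall back to the (slower)
    # Unicode category lookup for non-ASCII input.
    return '0' <= c <= '9' or unicodedata.category(c) == 'Nd'

print([is_decimal_digit(c) for c in '4٣４¹x'])
# [True, True, True, False, False]
```

Whether the Julia equivalent would restore vectorization is a separate question, since the fallback branch still reaches the utf8proc table lookup.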