is_valid_char does not correctly follow the Unicode standard #11171

ScottPJones · 2015-05-06T17:13:49Z

is_valid_char returns false for values which are valid Unicode codepoints.
This is due to a misunderstanding of how the 66 Unicode noncharacter code points are supposed to be handled. See the Unicode FAQ "Private-Use Characters, Noncharacters, and Sentinels".

Here are the relevant sections:

Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.
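
To illustrate the FAQ's point, here is a minimal sketch in Julia, assuming a recent Julia version (1.4 or later for `only`). It enumerates the 66 noncharacters and checks that each one is an ordinary Unicode scalar value that survives a round trip through a UTF-8 string. The names `noncharacters`, `is_scalar_value`, and `roundtrips` are hypothetical helpers for this example, not part of Base or utf8proc.

```julia
# The 66 noncharacters: U+FDD0..U+FDEF (32 code points) plus the last two
# code points of each of the 17 planes, U+nFFFE and U+nFFFF (34 code points).
noncharacters = vcat(
    UInt32.(0xFDD0:0xFDEF),
    [(UInt32(plane) << 16) | low for plane in 0:16 for low in (0xFFFE, 0xFFFF)],
)

# Hypothetical helper: a scalar value is any code point that is not a surrogate.
is_scalar_value(cp::Integer) = 0 <= cp <= 0x10FFFF && !(0xD800 <= cp <= 0xDFFF)

# Round trip: code point -> Char -> UTF-8 String -> Char -> code point.
roundtrips(cp::Integer) = UInt32(only(string(Char(cp)))) == cp

println(length(noncharacters))
println(all(is_scalar_value, noncharacters))
println(all(roundtrips, noncharacters))
```

Under these assumptions, none of the noncharacters needs special treatment: they pass through like any other scalar value.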

[@jiahao - edited formatting of hyperlink]

mschauer · 2015-05-06T17:19:43Z

I agree. As I understand it, this means that functions should handle noncharacters gracefully and not bail out when encountering one.

ScottPJones · 2015-05-06T17:31:56Z

Here is my proposed replacement; I'll submit a PR very shortly:
function is_valid_char(ch::Unsigned) ; !Bool((ch-0xd800<0x800)|(ch>0x10ffff)) ; end
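
The one-liner relies on unsigned wraparound: subtracting 0xd800 maps the surrogate range onto 0 to 0x7ff, while every value below 0xd800 wraps around to a large number, so a single comparison covers both ends of the range. The sketch below, assuming a recent Julia version, checks that reading against a more explicit form. The names `valid_wrapped` and `valid_explicit` are made up for this comparison and do not appear in the PR.

```julia
# Same logic as the proposal above, minus the redundant Bool(...) wrapper.
valid_wrapped(ch::Unsigned) = !((ch - 0xd800 < 0x800) | (ch > 0x10ffff))

# Explicit form: inside the code space and not a surrogate.
valid_explicit(ch::Unsigned) = ch <= 0x10ffff && !(0xd800 <= ch <= 0xdfff)

# Compare both forms exhaustively over the code space plus a margin past U+10FFFF.
agree = all(ch -> valid_wrapped(ch) == valid_explicit(ch), UInt32(0):UInt32(0x11ffff))
println(agree)

# Spot checks: a surrogate, a noncharacter, and a value past the code space.
println(valid_wrapped(0xd800))
println(valid_wrapped(0xfffe))
println(valid_wrapped(0x110000))
```

The wraparound form trades a little readability for one fewer comparison; both accept noncharacters such as U+FFFE and reject only surrogates and values above U+10FFFF.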

jiahao · 2015-05-06T17:41:50Z

What is really meant here is that is_valid_char does not correctly identify valid Unicode scalar values, as opposed to valid characters or valid Unicode code points (the surrogates U+D800 to U+DFFF are valid code points). Only valid Unicode scalar values can have a code unit sequence that can appear in a valid Unicode string.

See Unicode 7.0.0, p119 (pdf):

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

The documentation of is_valid_char should also be changed to

Returns true if the given char or integer is a valid Unicode scalar value.

perhaps even including a reference to the definition in the Unicode standard.
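
To make the distinction concrete, here is a small sketch that separates code points, surrogates, and scalar values along the lines of D76. The function names are hypothetical and chosen only for this illustration, and the docstring follows the wording suggested above. It assumes a recent Julia version.

```julia
# A code point is any integer in the Unicode code space, 0 to 10FFFF (hex).
is_code_point(x::Integer) = 0 <= x <= 0x10FFFF

# Surrogate code points, D800 to DFFF (hex), are valid code points but are
# reserved for UTF-16 and are not scalar values.
is_surrogate(x::Integer) = 0xD800 <= x <= 0xDFFF

"""
Returns true if the given integer is a valid Unicode scalar value
(Unicode Standard, definition D76).
"""
is_scalar_value(x::Integer) = is_code_point(x) && !is_surrogate(x)

# 0x110000 code points minus 0x800 surrogates leaves 1,112,064 scalar values.
n_scalar = count(is_scalar_value, 0:0x10FFFF)
println(n_scalar)
```

With these definitions, U+D800 is a code point but not a scalar value, which is the case the original wording of the documentation blurred.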

jakebolewski · 2015-05-06T18:06:29Z

The relevant function that is_valid_char calls in utf8proc is utf8proc_codepoint_valid:

https://github.com/JuliaLang/utf8proc/blob/7c14ef5f8371e463a01e0f1de971caa600384390/utf8proc.c#L151

jiahao · 2015-05-06T18:12:35Z

Ref #11033

jiahao · 2015-05-06T18:19:17Z

utf8proc_codepoint_valid is not documented, so its meaning could be changed to be in sync with what we have here.

ScottPJones · 2015-05-06T18:23:36Z

@jiahao Good point about specifying Unicode scalar values. It would also be good to fix utf8proc; I have to submit several issues for utf8proc where it doesn't conform to the Unicode standard.
@jakebolewski I won't use utf8proc, Julia is faster than C anyway! ;-)

nalimilan · 2015-05-06T20:03:06Z

@ScottPJones Now, contrary to what was asked in other PRs, it might be better to fix the problem in utf8proc if that's indeed a bug there. :-) As long as we depend on utf8proc at all, better to make it work right.

StefanKarpinski · 2015-05-06T20:05:14Z

utf8proc is justified as C code since it's used by others outside of Julia.

ScottPJones · 2015-05-06T20:14:24Z

@nalimilan I didn't say that I wouldn't get around to fixing it in utf8proc as well, which I hope to do... but I also have other things to do than fixing Julia bugs ;-)
@StefanKarpinski Yes, I understand that... as soon as I get a "round tuit", I'll fix it, but right now, I wanted to get Julia (which I'm using) fixed.

StefanKarpinski · 2015-05-06T20:30:13Z

Thanks, @ScottPJones! Very much appreciated.

ScottPJones · 2015-05-06T20:36:57Z

@StefanKarpinski I'm positively 😊ing from the kind words today! 😉 I do owe all of you a beer (or cider) or two (or three) at the Muddy Charles during JuliaCon, for putting up with me being such a long-winded PITA!

StefanKarpinski · 2015-05-06T20:47:25Z

No worries, @ScottPJones. Glad you've persevered.

ihnorton added the unicode label May 6, 2015

jiahao referenced this issue May 6, 2015

Add missing rand(::AbstractRNG, ::Type{Char}) method

5986e58

use simple rejection sampling over valid codepoint range
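
For readers unfamiliar with the technique named in this commit message, a minimal sketch of rejection sampling over the code point range follows. It is not the code from commit 5986e58; `random_scalar_value` is a hypothetical name, and the sketch assumes the `Random` standard library of a recent Julia version.

```julia
using Random

# Draw uniformly from the whole code space and retry whenever the draw lands
# on a surrogate. Only 0x800 of the 0x110000 candidates are rejected, so
# retries are rare.
function random_scalar_value(rng::AbstractRNG)
    while true
        cp = rand(rng, 0x00000000:0x0010ffff)
        (0xd800 <= cp <= 0xdfff) || return Char(cp)
    end
end

rng = MersenneTwister(1234)
samples = [random_scalar_value(rng) for _ in 1:10_000]
println(length(samples))
```

Because accepted draws are uniform over the non-surrogate code points, the result is uniform over Unicode scalar values without needing to stitch the two valid ranges together.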

jakebolewski closed this as completed in 5b00772 May 7, 2015

jakebolewski added a commit that referenced this issue May 7, 2015

Merge pull request #11175 from ScottPJones/spj/validchar

0aa5cd3

Fix #11171 is_valid_char

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 7, 2015

Fix JuliaLang#11171 is_valid_char

88d5a37

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 9, 2015

Add update on is_valid_char

1f86922

This is per JuliaLang#11171

ScottPJones mentioned this issue May 9, 2015

Add update on is_valid_char #11213

Merged

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 9, 2015

Update NEWS.md

7a76ea2

Add reference to issue JuliaLang#11171

mbauman pushed a commit that referenced this issue May 11, 2015

Add update on is_valid_char

5dbefdd

This is per #11171

mbauman pushed a commit that referenced this issue May 11, 2015

Update NEWS.md

1e14810

Add reference to issue #11171

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015

Fix JuliaLang#11171 is_valid_char

ef38bcc

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015

Add update on is_valid_char

c879cd4

This is per JuliaLang#11171

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015

Update NEWS.md

bfbf175

Add reference to issue JuliaLang#11171

tkelman pushed a commit to tkelman/julia that referenced this issue Jun 6, 2015

Fix JuliaLang#11171 is_valid_char

12a75a1

tkelman pushed a commit to tkelman/julia that referenced this issue Jun 6, 2015

Add update on is_valid_char

8d4f6e6

This is per JuliaLang#11171

tkelman pushed a commit to tkelman/julia that referenced this issue Jun 6, 2015

Update NEWS.md

0856285

Add reference to issue JuliaLang#11171
