RFC: export utf8proc Unicode transformation functionality in Julia #5576

stevengj · 2014-01-27T18:06:09Z

Following up to #5462 and #5434, this patch exposes a bunch of functionality from the bundled utf8proc library in Julia:

is_valid_char(s) is replaced by the implementation in utf8proc, which detects a few more invalid codepoints
is_assigned_char(c): new function to return whether a code point is assigned
charcategory(c): new function to return a Base.UnicodeCategory describing the Unicode general category of the character. I defined a type for this to make it easy to get the general category or the subcategory parts, and for pretty printing, but I expect that most people will just compare these to strings (e.g. "Lu" for uppercase letter).
normalize_string(s::String, normalform::Symbol): new function to perform standard Unicode normalization, e.g. normalize_string(s, :NFKC) for normal form KC.
normalize_string(s::String; keywords...): new function to perform various normalization transformations, including case folding, diacritical stripping, newline conversion, etcetera.
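
For orientation, the named normal forms can be sketched as follows. This is an illustration only, and it assumes present-day Julia (1.x), where the function proposed here as normalize_string is available as Unicode.normalize in the Unicode standard library. The sample strings are illustrative and do not come from the patch.

```julia
using Unicode  # Julia 1.x standard library (assumed); this PR called the function normalize_string

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed   = "\u00e9"    # precomposed U+00E9 (é)

# NFC composes canonically equivalent sequences; NFD decomposes them.
nfc = Unicode.normalize(decomposed, :NFC)
nfd = Unicode.normalize(composed, :NFD)

# NFKC additionally folds compatibility characters, e.g. the ligature U+FB01.
nfkc = Unicode.normalize("\ufb01", :NFKC)

println(nfc == composed, " ", nfd == decomposed, " ", nfkc == "fi")
```

The two spellings of é compare unequal as raw strings, which is why a normalization step is usually applied before comparing or hashing text.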

cc: @jiahao

stevengj · 2014-01-29T17:06:51Z

On second thought, I think it's better to delay implementation of something like charcategory until it is clearer what the needs are for character metadata.

jiahao · 2014-01-31T00:12:42Z

This all looks fantastic. Thanks for putting it all together.

Let's think of what tests we should put in. On top of the very simplest of tests (checking the three new functions on some sample of Unicode strings), one potential concern is if u8_isvalid and utf8proc disagree on whether or not a given string is validly UTF8-encoded. Perhaps the most expedient thing to do is to generate a bunch of random bytestrings and check that is_valid_char and u8_isvalid agree.

jiahao · 2014-01-31T00:16:18Z

I can see charcategory being fruitfully used to implement Unicode-aware methods for string parsing functions like parseint and isupper, for example.

stevengj · 2014-01-31T00:33:28Z

I suspect that functions like isupper would want to call UTF8proc.category_code directly for performance reasons, rather than doing a mapping to some higher-level type. But isupper calls libc's iswupper and so it is already Unicode-aware, right? (Except that Unicode has upper, lower, and title case.)

jiahao · 2014-01-31T04:06:51Z

Yes, isupper does call libc's iswupper. Now I'm just being paranoid and wondering if we should worry about mutual consistency between all the Unicode-aware components.

JeffBezanson · 2014-01-31T04:08:57Z

We know some things are certainly missing: #3721

jiahao · 2014-01-31T04:42:18Z

Unfortunately there are quite a few code points for which u8_isvalid and utf8proc disagree upon their validity:

julia> inconsistencies = [is_valid_utf8(string(char(a))) != is_valid_char((char(a))) for a in 0:0x10ffff]; 
julia> uint32(find(identity, inconsistencies))-1
2114-element Array{Uint32,1}:
 0x0000d800
 0x0000d801
 0x0000d802
 0x0000d803
 0x0000d804
          ⋮
 0x000effff
 0x000ffffe
 0x000fffff
 0x0010fffe
 0x0010ffff

In all cases, is_valid_char is more conservative in declaring them invalid.
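
The count of 2114 coincides with the number of UTF-16 surrogate code points plus the Unicode noncharacters, and the listed values fall in those ranges. The thread does not state this explicitly, so treat it as an inference. A small sketch that reproduces the count from the range definitions alone, using hypothetical helper names that are not part of Base or utf8proc:

```julia
# Hypothetical helpers, defined directly from the Unicode code point ranges.
issurrogate(cp) = 0xd800 <= cp <= 0xdfff
isnoncharacter(cp) = (0xfdd0 <= cp <= 0xfdef) || (cp & 0xffff) >= 0xfffe

codepoints = UInt32(0):UInt32(0x10ffff)
nsurrogates    = count(issurrogate, codepoints)      # 2048
nnoncharacters = count(isnoncharacter, codepoints)   # 66 = 32 + 2 per plane * 17 planes
println(nsurrogates + nnoncharacters)                # 2114
```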

jiahao · 2014-01-31T04:53:46Z

The normalize_string method with all the Bool flags could also use a less surprising default:

julia> a=string(char(0x0041),char(0x030a))
"Å"
julia> b=normalize_string(a)
"A"
julia> length(b)
1
julia> int(b[1])
65

It seems unlikely that this is what most users would want as a default; perhaps a better choice is to specify the flags which would make normalize_string do NFC or some other standard normalization scheme by default.
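
To make the suggested default concrete, here is the same input under NFC, with mark stripping requested explicitly. This is a sketch in present-day Julia (1.x assumed), where the function is spelled Unicode.normalize and char(...) is spelled Char(...). It is not the behavior of the branch under review.

```julia
using Unicode  # Julia 1.x standard library (assumed)

a = string(Char(0x0041), Char(0x030a))   # 'A' + U+030A COMBINING RING ABOVE

# NFC keeps the meaning and composes the pair into the single code point U+00C5.
nfc = Unicode.normalize(a, :NFC)
println(length(a), " -> ", length(nfc))  # 2 -> 1
println(nfc == "\u00c5")                 # true

# Dropping the diacritic is a lossy transformation, so it is requested explicitly.
stripped = Unicode.normalize(a, stripmark=true)
println(stripped)                        # A
```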

jiahao · 2014-01-31T04:59:02Z

There is also the question of how we want to deal with future versions of Unicode standards. The current stable release of utf8proc is v1.1.6, which explicitly states support for Unicode 5.0.0. This will probably change in the future to a more recent Unicode standard, so the issue of mutually compatible Unicode-aware components is most likely something we will have to face sooner or later.

jiahao · 2014-01-31T05:10:52Z

Further inconsistencies:

julia> #Upper case identification
julia> inconsistencies = [isupper(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('A')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
134-element Array{Uint32,1}:
 0x0000023a
 0x0000023b
 0x0000023d
 0x0000023e
 0x00000241
          ⋮
 0x00002ce0
 0x00002ce2
 0x00010426
 0x00010427
 0x0001d7ca
julia> #Lower case identification
julia> inconsistencies = [islower(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('a')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
285-element Array{Uint32,1}:
 0x00000221
 0x00000234
 0x00000235
 0x00000236
 0x00000237
          ⋮
 0x0001044f
 0x0001d4c1
 0x0001d6a4
 0x0001d6a5
 0x0001d7cb
julia> #Digit identification
julia> inconsistencies = [isdigit(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('1')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
280-element Array{Uint32,1}:
 0x00000660
 0x00000661
 0x00000662
 0x00000663
 0x00000664
          ⋮
 0x0001d7fb
 0x0001d7fc
 0x0001d7fd
 0x0001d7fe
 0x0001d7ff

jiahao · 2014-01-31T05:13:12Z

I could keep going with this, but I'll stop with perhaps the most nefarious of the lot:

julia> #Whitespace identification
julia> inconsistencies = [isblank(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code(' ')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
18-element Array{Uint32,1}:
 0x00000009
 0x000000a0
 0x00001680
 0x0000180e
 0x00002000
 0x00002001
 0x00002002
 0x00002003
 0x00002004
 0x00002005
 0x00002006
 0x00002007
 0x00002008
 0x00002009
 0x0000200a
 0x0000202f
 0x0000205f
 0x00003000

jiahao · 2014-01-31T05:18:30Z

(For a more interesting inspection, try map(x->(x, char(x)), uint32(find(identity, inconsistencies))-1); GitHub won't let me paste Unicode characters above 0xffff in here.)

JeffBezanson · 2014-01-31T06:07:06Z

u8_isvalid is not really relevant here; it is only concerned with encoding. It just checks whether a byte stream is well-structured UTF-8, and doesn't know anything about characters.

jiahao · 2014-01-31T15:07:50Z

I'm not sure if the question of whether a string is validly encoded UTF-8 can be entirely decoupled from knowledge of its characters.

Maybe I'm just misunderstanding what is_valid_utf8 does and/or how strings are supposed to work, but I tried constructing a few strings containing invalid byte sequences from the Wikipedia article and this test file, and all of these supposedly invalid byte sequences pass:

julia> is_valid_utf8('\U140000') #Ok
ERROR: syntax: invalid escape sequence

julia> convert(UTF8String, string(char(0x14), char(0x00), char(0x00))) #result is a length-1 UTF8String
"\x14\0\0"

julia> is_valid_utf8(ans)
true

julia> is_valid_utf8(string(char(0x10FFFF+1))) #One past the last defined codepoint
true

julia> is_valid_utf8("\U0010FFFF") #2.3.4 boundary condition
true

julia> is_valid_utf8(string(char(0x80))) #3.1.1 Lone continuation character 
true

julia> string(char(0xed), char(0xa0), char(0x80)) #5.1.1
"í \u80"

julia> is_valid_utf8(ans) #ans is a UTF8String
true

julia> is_valid_utf8("\Udfff") #5.1.7
true

julia> is_valid_utf8("\Ud800\Udc00") #5.2.2
true

JeffBezanson · 2014-01-31T15:18:13Z

Any UTF-8 string constructed from Chars is going to be validly encoded. You are not testing byte sequences. For example:

julia> string(char(0x80)).data
2-element Array{Uint8,1}:
 0xc2
 0x80

Chars are not bytes.

I believe it is true that this routine does not respect the 0x10ffff limit. That could be added easily.

Encoding is orthogonal to code points. Imagine you called is_valid_utf8 and is_valid_utf16 on the same data. They would obviously disagree most of the time.
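
The distinction between code points and bytes can be checked directly. The sketch below uses present-day Julia (1.x assumed), where is_valid_utf8 on raw bytes is spelled isvalid(String, bytes) and the .data field is replaced by codeunits. The byte sequences are illustrative.

```julia
# A string built from a Char is encoded for you: U+0080 becomes two bytes.
s = string(Char(0x80))
println(length(s))              # 1 (one code point)
println(collect(codeunits(s)))  # UInt8[0xc2, 0x80]

# Validating raw bytes is a different question from validating code points.
println(isvalid(String, UInt8[0xc2, 0x80]))  # true: well-formed encoding of U+0080
println(isvalid(String, UInt8[0x80]))        # false: lone continuation byte
println(isvalid(String, UInt8[0xc0, 0x80]))  # false: over-long encoding of U+0000
```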

jiahao · 2014-01-31T15:26:25Z

Thanks for the clarification. Would directly constructing UTF8String([0xed, 0xa0, 0x80]) correspond to the byte sequence ed a0 80 then?

JeffBezanson · 2014-01-31T15:33:25Z

Yes. Just be aware that despite what they may say, the Unicode Consortium does not have the authority to ban the integer 0xd800.

jiahao · 2014-01-31T15:53:19Z

OK, so I guess the question now is how to resolve the inconsistencies between utf8proc and libc's isw* functions.

JeffBezanson · 2014-01-31T16:47:34Z

glibc seems to be quite out of date in this regard. They seem to have a couple of years' backlog in updating their Unicode tables; see, for example, https://sourceware.org/bugzilla/show_bug.cgi?id=14010.
I hope we don't have to start shipping our own libc.

stevengj · 2014-02-01T19:18:20Z

@jiahao, the default of normalize_string(s) without options was supposed to be to do nothing. I was accidentally ignoring the stripmark flag and always stripping diacriticals; will fix this shortly.

jiahao · 2014-02-01T19:26:43Z

A no-op default sounds reasonable.

stevengj · 2014-02-01T19:39:31Z

@jiahao, I did a few spot checks on the isupper and isdigit inconsistencies, and in all the cases I checked utf8proc was giving the correct answer (albeit for fairly obscure characters).

In the case of isblank, the comparison is more subtle. For example, utf8proc correctly classifies '\t' as category Cc (Other, control), as opposed to ' ', which is classified as Zs (Separator, space). On the other hand, utf8proc (correctly) classifies the non-breaking space char(0xa0) as Zs, whereas isblank(char(0xa0)) surprisingly returns false on my machine.
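
The categories mentioned above can be inspected directly. This sketch assumes present-day Julia (1.x), where the unexported internal Base.Unicode.category_abbrev returns the two-letter general category; it was not part of this PR's interface. U+0660 (ARABIC-INDIC DIGIT ZERO) is taken from the digit list earlier in the thread.

```julia
# Base.Unicode.category_abbrev is an unexported internal of Julia 1.x (assumed available).
samples = ('\t', ' ', '\u00a0', 'A', '\u0660')
categories = map(Base.Unicode.category_abbrev, samples)

for (c, cat) in zip(samples, categories)
    println(repr(c), " => ", cat)
end
```

Tab comes out as a control character (Cc) rather than a space separator (Zs), which is why a plain comparison against the category of ' ' flags it as an inconsistency.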

stevengj · 2014-02-01T19:48:32Z

The good news is that Base.UTF8proc.category_code(a)==23 seems to be about as fast as isblank(a) in a quick benchmark.

jiahao · 2014-02-01T20:04:03Z

In addition to the issue referenced by @JeffBezanson above, glibc#14094 also alludes to an incomplete implementation of Unicode character typing.

How much of an issue would it be to replace the libc character class predicates with their utf8proc equivalents? The affected functions in string.jl would be isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper; and possibly isblank also.

stevengj · 2014-02-01T20:30:01Z

It seems fine to me to use Unicode character classes for this sort of thing, as long as it is documented (with some extensions to count certain control characters as "spaces"/"blanks"); more sensible than maintaining backward compatibility with pre-Unicode conventions from K&R.
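
One possible shape for such a predicate is sketched below: Unicode space separators plus an explicit extension for tab. The name unicode_isblank is hypothetical, the choice of extensions is an assumption rather than a decision made in this thread, and the category lookup relies on the unexported Base.Unicode.category_abbrev of present-day Julia (1.x assumed).

```julia
# Hypothetical predicate (not in Base): "blank" means a Unicode space separator (Zs)
# or the tab control character, which Unicode classifies as Cc.
unicode_isblank(c::AbstractChar) =
    c == '\t' || Base.Unicode.category_abbrev(c) == "Zs"

println(unicode_isblank('\u00a0'))  # true: NO-BREAK SPACE is in category Zs
println(unicode_isblank('\n'))      # false: newline is Cc and is not listed as an extension
```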

We could define c_isfoo functions if people really want the libc definitions.

stevengj · 2014-02-01T21:16:33Z

However, I think that changing the behavior of isblank, isalpha, etcetera to use utf8proc should be a separate RFC and pull request.

jiahao · 2014-02-01T22:06:44Z

You could also add some of the tests I transcribed from UAX15 here

stevengj · 2014-02-01T22:22:57Z

Upon reflection, it makes more sense to me to make compose=true the default. Otherwise people may forget to do it when e.g. they casefold, and it's hard to imagine someone calling normalize_string and not at least wanting a canonical composition or decomposition.

stevengj · 2014-02-01T22:44:50Z

Regarding the libc functions, there is also the problem that wchar_t is 16 bits on Windows.
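
To illustrate why a 16-bit wchar_t is a problem: a code point above U+FFFF needs two UTF-16 code units, so it cannot be passed to a single isw* call. The sketch uses U+10426, one of the code points from the uppercase list earlier in the thread, and the transcode function of present-day Julia (1.x assumed).

```julia
c = Char(0x10426)                      # a supplementary-plane code point
units = transcode(UInt16, string(c))   # UTF-16 encoding of the one-character string

println(length(units))                      # 2: a surrogate pair
println(units == UInt16[0xd801, 0xdc26])    # true
```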

stevengj · 2014-02-01T22:49:44Z

@JeffBezanson, so is_valid_utf8 is really just checking whether the UTF8 string is well-formed (i.e. contains no ~~unpaired surrogates~~ wrongly encoded sequences)? Maybe the function should be renamed?

stevengj · 2014-02-01T22:51:15Z

Not sure why Travis is suddenly failing with ccall: could not find function utf8proc_map ...

JeffBezanson · 2014-02-01T23:03:03Z

It only checks UTF-8 byte stream syntax: whether it is possible to reconstruct a sequence of 32-bit integers from the bytes, with no over-long sequences. Surrogates are only used in UTF-16. The function only deals with issues unique to the UTF-8 encoding. This kind of validation is needed before one can even talk about which code points are valid, since if the byte stream is not well-formed you don't even know which alleged code points are there.
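
The kind of check described above can be sketched as follows. This is a minimal illustration written for this discussion, not the actual u8_isvalid implementation; the function name is hypothetical. It validates only lead and continuation byte structure and rejects over-long forms, and it deliberately does not look at which code points result.

```julia
# Hypothetical helper: checks UTF-8 byte-stream syntax only.
function is_wellformed_utf8(bytes::AbstractVector{UInt8})
    i, n = firstindex(bytes), lastindex(bytes)
    while i <= n
        b = bytes[i]
        # Sequence length, smallest code point that length may encode, payload bits.
        if b < 0x80
            len, mincp, cp = 1, 0x00000000, UInt32(b)
        elseif b & 0xe0 == 0xc0
            len, mincp, cp = 2, 0x00000080, UInt32(b & 0x1f)
        elseif b & 0xf0 == 0xe0
            len, mincp, cp = 3, 0x00000800, UInt32(b & 0x0f)
        elseif b & 0xf8 == 0xf0
            len, mincp, cp = 4, 0x00010000, UInt32(b & 0x07)
        else
            return false                       # stray continuation or invalid lead byte
        end
        i + len - 1 <= n || return false       # truncated sequence
        for k in 1:len-1
            c = bytes[i + k]
            c & 0xc0 == 0x80 || return false   # expected a continuation byte
            cp = (cp << 6) | UInt32(c & 0x3f)
        end
        cp >= mincp || return false            # over-long encoding
        i += len
    end
    return true
end

println(is_wellformed_utf8(UInt8[0xed, 0xa0, 0x80]))  # true: syntax is fine, code point is 0xd800
println(is_wellformed_utf8(UInt8[0xc0, 0x80]))        # false: over-long
```

Because the sketch never inspects the decoded value beyond the over-long test, it accepts the bytes for 0xd800 and for values past 0x10ffff, which matches the behavior reported earlier in the thread.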

jiahao · 2014-02-03T16:45:05Z

I've tested this branch separately on my Macbook and on julia.mit and the tests pass. I think this is ready to merge.

nolta · 2014-02-03T19:02:43Z

I don't think we should document the normalize_string keywords. They're non-standard and utf8proc specific.

stevengj · 2014-02-03T20:43:31Z

We should only provide Unicode processing functionality that is specified by an international standard?

nolta · 2014-02-03T21:33:10Z

Not sure if joking, or serious...

timholy · 2014-02-03T21:42:32Z

On my machine, make testall now gives

exception on 5: ERROR: test error during :((normalize_string("ñ",:NFC)=="ñ"))
ccall: could not find function utf8proc_map
 in utf8proc_map at utf8proc.jl:32
 in normalize_string at utf8proc.jl:69
 in anonymous at test.jl:53
 in do_test at test.jl:37
 in runtests at /home/tim/src/julia/test/testdefs.jl:5
 in anonymous at multi.jl:834
 in run_work_thunk at multi.jl:575
 in anonymous at task.jl:834
while loading strings.jl, in expression starting on line 862

stevengj · 2014-02-03T22:01:14Z

@timholy, what is your machine? That function should be linked into libjulia via libutf8proc.

jiahao · 2014-02-03T22:07:53Z

@timholy This produced an error on Travis also, but I was unable to reproduce this on my machines. Would be great to track this one down.

stevengj · 2014-02-03T22:11:47Z

@nolta, being more serious, I would like to see a better argument for removing (or hiding) some functionality, on a case-by-case basis, than "it's nonstandard".

For example, removing diacriticals from unicode strings (ñ → n, etc) is a common need (google "unicode remove diacritical") and many libraries (e.g. ICU) provide this functionality as well. (Moreover, what utf8proc does can be formally defined fairly easily: perform the canonical decomposition and delete characters in classes Mn, Mc, or Me).
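
The formal definition in the parenthetical can be written out directly. This sketch assumes present-day Julia (1.x), with Unicode.normalize standing in for normalize_string and the unexported Base.Unicode.category_abbrev for the category lookup. The helper name strip_diacritics is hypothetical.

```julia
using Unicode  # Julia 1.x standard library (assumed)

# Hypothetical helper: canonical decomposition, then drop mark characters (Mn, Mc, Me).
function strip_diacritics(s::AbstractString)
    decomposed = Unicode.normalize(s, :NFD)
    return filter(c -> !(Base.Unicode.category_abbrev(c) in ("Mn", "Mc", "Me")), decomposed)
end

println(strip_diacritics("ñ"))          # n
println(strip_diacritics("Ångström"))   # Angstrom
```

For these inputs the result agrees with the library's own stripmark option, which is the point of the argument: the behavior can be specified without reference to utf8proc.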

timholy · 2014-02-03T22:13:55Z

tim@diva:~$ uname -a
Linux diva 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

tim@diva:~$ cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.4 LTS"
NAME="Ubuntu"
VERSION="12.04.4 LTS, Precise Pangolin"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu precise (12.04.4 LTS)"
VERSION_ID="12.04"

tim@diva:~$ locate libutf8proc
/home/tim/src/julia/deps/utf8proc-v1.1.6/libutf8proc.a
/home/tim/src/julia/usr/lib/libutf8proc.a

stevengj · 2014-02-03T22:15:02Z

@timholy, can you do nm on libjulia and grep for "utf8proc"? libutf8proc.a should have been linked into libjulia...

timholy · 2014-02-03T22:16:45Z

You beat me to it:

tim@diva:~$ readelf -Ws /home/tim/src/julia/deps/utf8proc-v1.1.6/libutf8proc.a | grep utf8proc_map
    34: 0000000000000ed0   233 FUNC    GLOBAL DEFAULT    1 utf8proc_map

tim@diva:~$ readelf -Ws /home/tim/src/julia/usr/lib/libjulia.so | grep utf8proc_map
 10276: 0000000000150940   233 FUNC    LOCAL  DEFAULT   11 utf8proc_map

timholy · 2014-02-03T22:24:32Z

Maybe someone specified the ccall in terms of \U+F666, which in unicode means utf8proc_map 😄. That's why you can't see the problem on your screen.

nolta · 2014-02-03T22:59:37Z

@stevengj Fair enough. If each option has a well-defined meaning independent of the utf8proc library, then i'm ok with exposing them. I'm pretty sure, however, that the lump option fails that test.

stevengj · 2014-02-04T03:16:21Z

@nolta, lump documentation removed in commit 00760bd

export utf8proc functionality in Julia (followup to JuliaLang#5462 and JuliaLang#5434)

6039a46

stevengj added 2 commits February 1, 2014 17:07

added normalize_string test cases

0f36f0f

added jiahao's tests from unicode SA15

76c99f3

make compose=true the default in normalize_string

a500b83

stevengj added a commit that referenced this pull request Feb 3, 2014

Merge pull request #5576 from stevengj/utf8proc

7e5a31d

RFC: export utf8proc Unicode transformation functionality in Julia

stevengj merged commit 7e5a31d into JuliaLang:master Feb 3, 2014

timholy mentioned this pull request Feb 3, 2014

Concretely specify the type of dims in SubArray #5662

Closed

kmsquire mentioned this pull request Feb 3, 2014

Export utf8proc_* symbols in julia.expmap #5663

Merged

stevengj added a commit that referenced this pull request Feb 4, 2014

don't document utf8proc's 'lump' transformation, as discussed in #5576

00760bd

jiahao added the unicode label Feb 22, 2014

jiahao mentioned this pull request Feb 22, 2014

Parse a minimal set of fullwidth punctuation as synonyms #5903

Open

stevengj mentioned this pull request Feb 24, 2014

improve character category predicates #5939

Closed
