RFC: export utf8proc Unicode transformation functionality in Julia #5576

Merged
merged 4 commits into from
Feb 3, 2014


stevengj
Member

Following up to #5462 and #5434, this patch exposes a bunch of functionality from the bundled utf8proc library in Julia:

  • is_valid_char(s) is replaced by the implementation in utf8proc, which detects a few more invalid codepoints
  • is_assigned_char(c): new function to return whether a code point is assigned
  • charcategory(c): new function to return a Base.UnicodeCategory describing the Unicode general category of the character. I defined a type for this to make it easy to get the general category or the subcategory parts, and for pretty printing, but I expect that most people will just compare these to strings (e.g. "Lu" for uppercase letter).
  • normalize_string(s::String, normalform::Symbol): new function to perform standard Unicode normalization, e.g. normalize_string(s, :NFKC) for normal form KC.
  • normalize_string(s::String; keywords...): new function to perform various normalization transformations, including case folding, diacritical stripping, newline conversion, etcetera.
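For readers on current Julia, the two call styles listed above can be sketched as follows. This assumes the modern `Unicode` standard library, where the function is spelled `Unicode.normalize`; the 2014 name `normalize_string` no longer exists, so the mapping is an assumption about later releases rather than the code in this patch.

```julia
using Unicode

# U+FB01 is the single-code-point ligature "ﬁ".
ligature = "\ufb01"

# Normal-form style: select a standard form by symbol.
nfc  = Unicode.normalize(ligature, :NFC)    # canonical forms keep the ligature
nfkc = Unicode.normalize(ligature, :NFKC)   # compatibility form expands it to "fi"

# Keyword style: request individual transformations instead of a named form.
folded = Unicode.normalize("JúLiA", casefold=true, stripmark=true)
```

The symbol form covers the four standard normalizations, while the keyword form is where the utf8proc-specific options (case folding, mark stripping, newline conversion) live.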

cc: @jiahao

@stevengj
Member Author

On second thought, I think it's better to delay implementation of something like charcategory until it is clearer what the needs are for character metadata.

@jiahao
Member

jiahao commented Jan 31, 2014

This all looks fantastic. Thanks for putting it all together.

Let's think of what tests we should put in. On top of the very simplest of tests (checking the three new functions on some sample of Unicode strings), one potential concern is if u8_isvalid and utf8proc disagree on whether or not a given string is validly UTF8-encoded. Perhaps the most expedient thing to do is to generate a bunch of random bytestrings and check that is_valid_char and u8_isvalid agree.

@jiahao
Member

jiahao commented Jan 31, 2014

I can see charcategory being fruitfully used to implement Unicode-aware methods for string parsing functions like parseint and isupper, for example.

@stevengj
Member Author

I suspect that functions like isupper would want to call UTF8proc.category_code directly for performance reasons, rather than doing a mapping to some higher-level type. But isupper calls libc's iswupper and so it is already Unicode-aware, right? (Except that Unicode has upper, lower, and title case.)

@jiahao
Member

jiahao commented Jan 31, 2014

Yes, isupper does call libc's iswupper. Now I'm just being paranoid and wondering if we should worry about mutual consistency between all the Unicode-aware components.

@JeffBezanson
Sponsor Member

We know some things are certainly missing: #3721

@jiahao
Member

jiahao commented Jan 31, 2014

Unfortunately there are quite a few code points for which u8_isvalid and utf8proc disagree upon their validity:

julia> inconsistencies = [is_valid_utf8(string(char(a))) != is_valid_char((char(a))) for a in 0:0x10ffff]; 
julia> uint32(find(identity, inconsistencies))-1
2114-element Array{Uint32,1}:
 0x0000d800
 0x0000d801
 0x0000d802
 0x0000d803
 0x0000d804
          ⋮
 0x000effff
 0x000ffffe
 0x000fffff
 0x0010fffe
 0x0010ffff

In all cases, is_valid_char is more conservative in declaring them invalid.

@jiahao
Member

jiahao commented Jan 31, 2014

The normalize_string method with all the Bool flags could also use a less surprising default:

julia> a=string(char(0x0041),char(0x030a))
"Å"
julia> b=normalize_string(a)
"A"
julia> length(b)
1
julia> int(b[1])
65

It seems unlikely that this is what most users would want as a default; perhaps a better choice is to specify the flags which would make normalize_string do NFC or some other standard normalization scheme by default.

@jiahao
Member

jiahao commented Jan 31, 2014

There is also the question of how we want to deal with future versions of Unicode standards. The current stable release of utf8proc is v1.1.6, which explicitly states support for Unicode 5.0.0. This will probably change in the future to a more recent Unicode standard, so the issue of mutually compatible Unicode-aware components is most likely something we will have to face sooner or later.

@jiahao
Member

jiahao commented Jan 31, 2014

Further inconsistencies:

julia> #Upper case identification
julia> inconsistencies = [isupper(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('A')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
134-element Array{Uint32,1}:
 0x0000023a
 0x0000023b
 0x0000023d
 0x0000023e
 0x00000241
          ⋮
 0x00002ce0
 0x00002ce2
 0x00010426
 0x00010427
 0x0001d7ca
julia> #Lower case identification
julia> inconsistencies = [islower(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('a')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
285-element Array{Uint32,1}:
 0x00000221
 0x00000234
 0x00000235
 0x00000236
 0x00000237
          ⋮
 0x0001044f
 0x0001d4c1
 0x0001d6a4
 0x0001d6a5
 0x0001d7cb
julia> #Digit identification
julia> inconsistencies = [isdigit(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code('1')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
280-element Array{Uint32,1}:
 0x00000660
 0x00000661
 0x00000662
 0x00000663
 0x00000664
          ⋮
 0x0001d7fb
 0x0001d7fc
 0x0001d7fd
 0x0001d7fe
 0x0001d7ff

@jiahao
Member

jiahao commented Jan 31, 2014

I could keep going with this, but I'll stop with perhaps the most nefarious of the lot:

julia> #Whitespace identification
julia> inconsistencies = [isblank(char(a)) != (Base.UTF8proc.category_code(char(a))==Base.UTF8proc.category_code(' ')) for a in 0:0x10ffff];
julia> uint32(find(identity, inconsistencies))-1
18-element Array{Uint32,1}:
 0x00000009
 0x000000a0
 0x00001680
 0x0000180e
 0x00002000
 0x00002001
 0x00002002
 0x00002003
 0x00002004
 0x00002005
 0x00002006
 0x00002007
 0x00002008
 0x00002009
 0x0000200a
 0x0000202f
 0x0000205f
 0x00003000

@jiahao
Member

jiahao commented Jan 31, 2014

(For a more interesting inspection, try map(x->(x, char(x)), uint32(find(identity, inconsistencies))-1); GitHub won't let me paste Unicode characters above 0xffff in here.)

@JeffBezanson
Sponsor Member

u8_isvalid is not really relevant here; it is only concerned with encoding. It just checks whether a byte stream is well-structured UTF-8, and doesn't know anything about characters.

@jiahao
Member

jiahao commented Jan 31, 2014

I'm not sure if the question of whether a string is validly encoded UTF-8 can be entirely decoupled from knowledge of its characters.

Maybe I'm just misunderstanding what is_valid_utf8 does and/or how strings are supposed to work, but I tried constructing a few strings containing invalid byte sequences from the Wikipedia article and this test file, and all of these supposedly invalid byte sequences pass:

julia> is_valid_utf8('\U140000') #Ok
ERROR: syntax: invalid escape sequence

julia> convert(UTF8String, string(char(0x14), char(0x00), char(0x00))) #result is a length-1 UTF8String
"\x14\0\0"

julia> is_valid_utf8(ans)
true

julia> is_valid_utf8(string(char(0x10FFFF+1))) #One past the last defined codepoint
true

julia> is_valid_utf8("\U0010FFFF") #2.3.4 boundary condition
true

julia> is_valid_utf8(string(char(0x80))) #3.1.1 Lone continuation character 
true

julia> string(char(0xed), char(0xa0), char(0x80)) #5.1.1
"í \u80"

julia> is_valid_utf8(ans) #ans is a UTF8String
true

julia> is_valid_utf8("\Udfff") #5.1.7
true

julia> is_valid_utf8("\Ud800\Udc00") #5.2.2
true

@JeffBezanson
Sponsor Member

Any UTF-8 string constructed from Chars is going to be validly encoded. You are not testing byte sequences. For example:

julia> string(char(0x80)).data
2-element Array{Uint8,1}:
 0xc2
 0x80

Chars are not bytes.

I believe it is true that this routine does not respect the 0x10ffff limit. That could be added easily.

Encoding is orthogonal to code points. Imagine you called is_valid_utf8 and is_valid_utf16 on the same data. They would obviously disagree most of the time.
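The distinction between a Char and a byte can be sketched in current Julia as follows. This assumes today's spellings (`Char`, `codeunits`, `isvalid(String, bytes)`), which differ from the 2014 names used in this thread.

```julia
# A Char is a code point. Encoding code point 0x80 as UTF-8 takes two bytes.
s = string(Char(0x80))
encoded = collect(codeunits(s))          # the bytes 0xc2, 0x80

# Any string built from Chars this way is well-formed UTF-8.
from_chars_ok = isvalid(s)

# The same number used as a raw byte is a lone continuation byte,
# which is not a well-formed UTF-8 sequence on its own.
raw_byte_ok = isvalid(String, UInt8[0x80])
```

To test a byte sequence, the bytes have to be supplied directly; going through Chars re-encodes them first.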

@jiahao
Member

jiahao commented Jan 31, 2014

Thanks for the clarification. Would directly constructing UTF8String([0xed, 0xa0, 0x80]) correspond to the byte sequence ed a0 80 then?

@JeffBezanson
Sponsor Member

Yes. Just be aware that despite what they may say, the Unicode Consortium does not have the authority to ban the integer 0xd800.

@jiahao
Member

jiahao commented Jan 31, 2014

Ok, so I guess the question now is how to resolve the inconsistencies between utf8proc and libc's isw* functions.

@JeffBezanson
Sponsor Member

glibc seems to be quite out of date in this regard. They seem to have a couple years of backlog in updating their unicode tables, for example https://sourceware.org/bugzilla/show_bug.cgi?id=14010.
I hope we don't have to start shipping our own libc.

@stevengj
Member Author

stevengj commented Feb 1, 2014

@jiahao, the default of normalize_string(s) without options was supposed to be to do nothing. I was accidentally ignoring the stripmark flag and always stripping diacriticals; will fix this shortly.

@jiahao
Member

jiahao commented Feb 1, 2014

A no-op default sounds reasonable.

@stevengj
Member Author

stevengj commented Feb 1, 2014

@jiahao, I did a few spot checks on the isupper and isdigit inconsistencies, and in all the cases I checked utf8proc was giving the correct answer (albeit for fairly obscure characters).

In the case of isblank, the comparison is more subtle. e.g. utf8proc correctly classifies '\t' as category Cc (Other, control) as opposed to ' ' which is classified as Zs (Separator, space). On the other hand, utf8proc (correctly) classified the non-breaking space char(0xa0) as Zs, whereas isblank(char(0xa0)) surprisingly returns false on my machine.
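The category distinction described above can be checked with a short sketch in current Julia. It assumes the internal `Base.Unicode` module, whose `category_code` function and `UTF8PROC_CATEGORY_*` constants are not public API and may change; the module layout postdates this thread.

```julia
# Internal module: the names used from it below are not public API.
const U = Base.Unicode

# Tab is a control character (Cc), not a space separator (Zs).
tab_is_cc = U.category_code('\t') == U.UTF8PROC_CATEGORY_CC

# Both the ordinary space and the no-break space U+00A0 are Zs.
space_is_zs = U.category_code(' ') == U.UTF8PROC_CATEGORY_ZS
nbsp_is_zs  = U.category_code('\u00a0') == U.UTF8PROC_CATEGORY_ZS

# A whitespace predicate therefore needs more than a Zs test:
# Julia's isspace accepts the tab as well as the no-break space.
tab_is_space  = isspace('\t')
nbsp_is_space = isspace('\u00a0')
```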

@stevengj
Member Author

stevengj commented Feb 1, 2014

The good news is that Base.UTF8proc.category_code(a)==23 seems to be about as fast as isblank(a) in a quick benchmark.

@jiahao
Member

jiahao commented Feb 1, 2014

In addition to the issue referenced by @JeffBezanson above, glibc#14094 also alludes to an incomplete implementation of Unicode character typing.

How much of an issue would it be to replace the libc character class predicates with their utf8proc equivalents? The affected functions in string.jl would be isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper; and possibly isblank also.

@stevengj
Member Author

stevengj commented Feb 1, 2014

It seems fine to me to use Unicode character classes for this sort of thing, as long as it is documented (with some extensions to count certain control characters as "spaces"/"blanks"); more sensible than maintaining backward compatibility with pre-Unicode conventions from K&R.

We could define c_isfoo functions if people really want the libc definitions.

@stevengj
Member Author

stevengj commented Feb 1, 2014

However, I think that changing the behavior of isblank, isalpha, etcetera to use utf8proc should be a separate RFC and pull request.

@jiahao
Member

jiahao commented Feb 1, 2014

You could also add some of the tests I transcribed from UAX15 here

@stevengj
Member Author

stevengj commented Feb 1, 2014

Upon reflection, it makes more sense to me to make compose=true the default. Otherwise people may forget to do it when e.g. they casefold, and it's hard to imagine someone calling normalize_string and not at least wanting a canonical composition or decomposition.
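A minimal sketch of what a composing default looks like, using the earlier A-plus-ring example. It assumes current Julia, where the function is `Unicode.normalize` and `compose=true` is the keyword default; it does not show the behavior of the branch under review here.

```julia
using Unicode

# "A" followed by U+030A COMBINING RING ABOVE: two code points.
a = string(Char(0x0041), Char(0x030a))

# No options: canonical composition merges the pair into U+00C5.
composed = Unicode.normalize(a)

# Mark stripping has to be requested explicitly.
stripped = Unicode.normalize(a, stripmark=true)
```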

@stevengj
Member Author

stevengj commented Feb 1, 2014

Regarding the libc functions, there is also the problem that wchar_t is 16 bits on Windows.

@stevengj
Member Author

stevengj commented Feb 1, 2014

@JeffBezanson, so is_valid_utf8 is really just checking whether the UTF8 string is well-formed (i.e. contains no unpaired surrogates or wrongly encoded sequences)? Maybe the function should be renamed?

@stevengj
Member Author

stevengj commented Feb 1, 2014

Not sure why Travis is suddenly failing with ccall: could not find function utf8proc_map ...

@JeffBezanson
Sponsor Member

It only checks UTF-8 byte stream syntax: whether it is possible to reconstruct a sequence of 32-bit integers from the bytes, with no over-long sequences. Surrogates are only used in UTF-16. The function only deals with issues unique to the UTF-8 encoding. This kind of validation is needed before one can even talk about which code points are valid, since if the byte stream is not well-formed you don't even know which alleged code points are there.
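The kind of byte-level check described here can be illustrated with `isvalid(String, bytes)` in current Julia. That function is an assumption standing in for the 2014 `is_valid_utf8`, and its rules in later releases may be stricter than the routine discussed in this thread.

```julia
# Well-formed: the three-byte encoding of U+20AC (the euro sign).
euro = UInt8[0xe2, 0x82, 0xac]
euro_ok = isvalid(String, euro)

# Truncated: a three-byte lead followed by only one continuation byte.
truncated_ok = isvalid(String, UInt8[0xe2, 0x82])

# Over-long: U+0000 spelled with two bytes instead of one.
overlong_ok = isvalid(String, UInt8[0xc0, 0x80])

# Code points can only be recovered from the well-formed sequence.
decoded = String(copy(euro))
```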

@jiahao
Member

jiahao commented Feb 3, 2014

I've tested this branch separately on my Macbook and on julia.mit and the tests pass. I think this is ready to merge.

stevengj added a commit that referenced this pull request Feb 3, 2014
RFC: export utf8proc Unicode transformation functionality in Julia
@stevengj stevengj merged commit 7e5a31d into JuliaLang:master Feb 3, 2014
@nolta
Member

nolta commented Feb 3, 2014

I don't think we should document the normalize_string keywords. They're non-standard and utf8proc specific.

@stevengj
Member Author

stevengj commented Feb 3, 2014

We should only provide Unicode processing functionality that is specified by an international standard?

@nolta
Member

nolta commented Feb 3, 2014

Not sure if joking, or serious...

@timholy
Sponsor Member

timholy commented Feb 3, 2014

On my machine, make testall now gives

exception on 5: ERROR: test error during :((normalize_string("ñ",:NFC)=="ñ"))
ccall: could not find function utf8proc_map
 in utf8proc_map at utf8proc.jl:32
 in normalize_string at utf8proc.jl:69
 in anonymous at test.jl:53
 in do_test at test.jl:37
 in runtests at /home/tim/src/julia/test/testdefs.jl:5
 in anonymous at multi.jl:834
 in run_work_thunk at multi.jl:575
 in anonymous at task.jl:834
while loading strings.jl, in expression starting on line 862

@stevengj
Member Author

stevengj commented Feb 3, 2014

@timholy, what is your machine? That function should be linked into libjulia via libutf8proc.

@jiahao
Member

jiahao commented Feb 3, 2014

@timholy This produced an error on Travis also, but I was unable to reproduce this on my machines. Would be great to track this one down.

@stevengj
Member Author

stevengj commented Feb 3, 2014

@nolta, being more serious, I would like to see a better argument for removing (or hiding) some functionality, on a case-by-case basis, than "it's nonstandard".

For example, removing diacriticals from unicode strings (ñ → n, etc) is a common need (google "unicode remove diacritical") and many libraries (e.g. ICU) provide this functionality as well. (Moreover, what utf8proc does can be formally defined fairly easily: perform the canonical decomposition and delete characters in classes Mn, Mc, or Me).
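The formal definition given above (canonical decomposition, then deleting characters in classes Mn, Mc, or Me) can be written out by hand as a sketch. The helper name `strip_marks` is hypothetical, and the code assumes current Julia: `Unicode.normalize` for the decomposition, plus the internal, non-public `Base.Unicode.category_code` and its `UTF8PROC_CATEGORY_*` constants. It is not utf8proc's implementation.

```julia
using Unicode

# General categories of combining marks (internal constants, not public API).
const MARK_CATEGORIES = (Base.Unicode.UTF8PROC_CATEGORY_MN,
                         Base.Unicode.UTF8PROC_CATEGORY_MC,
                         Base.Unicode.UTF8PROC_CATEGORY_ME)

# Hypothetical helper: decompose canonically (NFD), then drop every mark.
function strip_marks(s::AbstractString)
    decomposed = Unicode.normalize(s, :NFD)
    return filter(c -> !(Base.Unicode.category_code(c) in MARK_CATEGORIES), decomposed)
end

plain_n    = strip_marks("ñ")
plain_word = strip_marks("Ångström")
```

Because each step is defined by the Unicode standard (NFD and the general categories), the result does not depend on any utf8proc-specific behavior.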

@timholy
Sponsor Member

timholy commented Feb 3, 2014

tim@diva:~$ uname -a
Linux diva 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

tim@diva:~$ cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.4 LTS"
NAME="Ubuntu"
VERSION="12.04.4 LTS, Precise Pangolin"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu precise (12.04.4 LTS)"
VERSION_ID="12.04"

tim@diva:~$ locate libutf8proc
/home/tim/src/julia/deps/utf8proc-v1.1.6/libutf8proc.a
/home/tim/src/julia/usr/lib/libutf8proc.a

@stevengj
Member Author

stevengj commented Feb 3, 2014

@timholy, can you do nm on libjulia and grep for "utf8proc"? libutf8proc.a should have been linked into libjulia...

@timholy
Sponsor Member

timholy commented Feb 3, 2014

You beat me to it:

tim@diva:~$ readelf -Ws /home/tim/src/julia/deps/utf8proc-v1.1.6/libutf8proc.a | grep utf8proc_map
    34: 0000000000000ed0   233 FUNC    GLOBAL DEFAULT    1 utf8proc_map

tim@diva:~$ readelf -Ws /home/tim/src/julia/usr/lib/libjulia.so | grep utf8proc_map
 10276: 0000000000150940   233 FUNC    LOCAL  DEFAULT   11 utf8proc_map

@timholy
Sponsor Member

timholy commented Feb 3, 2014

Maybe someone specified the ccall in terms of \U+F666, which in unicode means utf8proc_map 😄. That's why you can't see the problem on your screen.

@nolta
Member

nolta commented Feb 3, 2014

@stevengj Fair enough. If each option has a well-defined meaning independent of the utf8proc library, then i'm ok with exposing them. I'm pretty sure, however, that the lump option fails that test.

@stevengj
Member Author

stevengj commented Feb 4, 2014

@nolta, lump documentation removed in commit 00760bd

Labels
domain:unicode Related to unicode characters and encodings