Unicode character properties #14387

ScottPJones · 2015-12-13T05:43:14Z

This is still a WIP, based on the comments in #14347, to be subjected to some bikeshedding and performance testing, before deprecating some (not all) of the old very abbreviated character classification functions, such as isalnum, isnumber, iscntrl, etc.

Suggestions on how to improve it are quite welcome, of course!

Edit: I think (after tests complete) that this could be merged, and other changes such as deprecating some of the unused is* methods should be done in a follow-on PR.

nalimilan · 2015-12-13T16:56:48Z

base/unicode/UnicodeError.jl


 type UnicodeError <: Exception
-    errmsg::AbstractString      ##< A UTF_ERR_ message
+    errmsg::AbstractString      ##< An Unicode.ERR_ message


"A Unicode" if I'm not mistaken.

nalimilan · 2015-12-13T17:20:36Z

Thanks, this looks interesting!

ScottPJones · 2015-12-13T18:32:06Z

@nalimilan Thanks very much for your input, it really does help improve my code!
Let me know how you like the version I just pushed, I think it addresses most of your review comments.

nalimilan · 2015-12-13T19:34:01Z

base/unicode/properties.jl

+"""Unicode character categories"""
+abstract CharCategory   <: UnicodeProperty
+
+"""Unicode letter character category"""


Missing uppercase L. It would also be clearer if you added some kind of quote around the names of the categories.

ScottPJones · 2015-12-13

Absolutely right, I'll do that as soon as I get home.

nalimilan · 2015-12-13T19:37:00Z

What do you think of my proposal at #14387 (comment)?

ScottPJones · 2015-12-13T21:37:00Z

I'll have to see how the performance compares, using in with a tuple, instead of <: and types.
I think it would work already with what I have here, this latest version allows you to just say Unicode.Zs etc. (and I've added doc strings for all of them).

ScottPJones · 2015-12-13T23:19:46Z

OK, using in like that is about 225 x slower for checking isnumber.
I hope you like this last version.

nalimilan · 2015-12-14T09:22:53Z

base/unicode/properties.jl

+             BegQuotePunct, EndQuotePunct, OtherPunct,
+             MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol,
+             SpaceSeparator, LineSeparator, ParagraphSeparator,
+	     ControlChar, FormatChar, SurrogateChar, PrivateUseChar]


Wrong indentation here.

ScottPJones · 2015-12-14

Darn Emacs had put a \t in, so the indentation looked correct for me! Fixed.

nalimilan · 2015-12-14T09:49:30Z

> OK, using in like that is about 225 x slower for checking isnumber.
> I hope you like this last version.

@ScottPJones Too bad. I'm still not a fan of the double API, enum vs. types. Using types for something that can only be known at run time doesn't sound great. Also, you note yourself in the code that they are slower than manually checking the integer code range using <=. This is an interesting design issue which could be relevant for other situations too, so it's worth discussing.

I can see two solutions:

- In C, one would use bitwise operations like cat & UpperCase. This offers a relatively nice API which doesn't require mastering the underlying implementation details.
- We could also add a custom type for aggregate categories for which we would override in.
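For illustration, the two options above can be combined: store an aggregate category as a bit mask and overload `in` for it. The following is a minimal sketch in current Julia syntax, not code from this PR. `Cat`, `CatMask`, `Upper` and `Letter` are hypothetical names, and only five categories are listed to keep it short.

```julia
# Hypothetical names throughout; only a handful of categories are listed.
@enum Cat::UInt8 Lu=1 Ll=2 Lt=3 Nd=9 Po=18

# An aggregate category is a set of category codes stored as bits in a UInt32.
struct CatMask
    bits::UInt32
end

# One bit per category code.
CatMask(c::Cat) = CatMask(UInt32(1) << UInt8(c))

# Unions of categories, so that aggregates can be written as Lu | Lt.
Base.:|(a::CatMask, b::CatMask) = CatMask(a.bits | b.bits)
Base.:|(a::Cat, b::Cat) = CatMask(a) | CatMask(b)
Base.:|(a::CatMask, b::Cat) = a | CatMask(b)

# Membership test: one shift and one bitwise AND.
Base.in(c::Cat, m::CatMask) = (m.bits & (UInt32(1) << UInt8(c))) != 0

const Upper  = Lu | Lt
const Letter = Lu | Ll | Lt
```

With these definitions a caller can write `Lu in Upper` without knowing how the codes are numbered, while the test itself stays a C-style mask operation.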

ScottPJones · 2015-12-14T13:54:57Z

The name CharType (or CategoryType) is meant to show that it is a Julia type (trait), not the numeric code (i.e. CategoryCode); it doesn't mean anything about Unicode.
I'd actually just wanted to have a type (trait) API, because I think it gives you a lot of flexibility to then define fast functions that take a type, and give back the code or a bit (1<<code), or return other information about that category. Another reason I wanted to use types instead of just codes is that I think it would be much easier to handle some of the other Unicode properties that way (charprop being intended to handle all sorts of Unicode properties, such as script(s), writing direction, etc., or the (up-to) 4-level categories).
I left in the (pseudo) enum style (enums just don't work at that stage unfortunately), because of the current performance issues with the <: operator, but I think the performance could be fixed in the future.
I think this is flexible enough, that you could easily add CategoryMask, which, instead of a code 0:29,
would return the bits in a UInt32, to allow for the C masking style with bit operations.

What do you think about the latest updates?

ScottPJones · 2015-12-14T14:00:26Z

Oh, a big reason I much prefer the type (trait) API over the enum(-like) style, is that the code using the enum style depends on the assigned numeric code points being ordered in a certain way, instead of following a hierarchy which you can examine with super(t) (hopefully soon to be supertype(t)!), and subtypes(t), and it is easily extensible by adding things like:
typealias AlphaNumeric Union{Letter, Number}

nalimilan · 2015-12-14T14:17:52Z

The fact that you need to put Type inside the name of a type is an indication that there's a problem IMHO. I have nothing against passing types to charprop to request other properties, but I don't like returning types for something that is essentially a run time property. The essential character of types is that they are used for dispatch, and useful to optimize code when known at compile time. Using them for something else isn't a good idea.

A frequent convention when you pass a type as the first argument to a Julia function is that you get in return an instance of that type (and not a subtype of that type). Here, you would pass the (pseudo-)enum type, and you would get an enum value. In my mind, "Unicode category" is one of the kinds of information you may want to get, just like "upper case", "spacing", "block", etc. Unicode categories should not be types, but values.

If you don't like the idea of comparing integer values, we could as well create another enum consisting of aggregate categories, and another type like MajorCategory to pass to charprop to get this information. Anyway, this shouldn't be a very common use, since these major categories are very broad and specific properties are more often needed.
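A rough sketch of the convention described above, where the property type is passed as the first argument and an enum value of that type comes back. This is written in current Julia syntax and is not the PR's implementation. `GenCat`, `MajorCat` and their members are hypothetical names, and the classifier is a toy built on Base predicates rather than a real Unicode table lookup.

```julia
# Hypothetical sketch: property types go in, plain enum values come out.
@enum GenCat UpperLetter LowerLetter DecimalDigit OtherChar
@enum MajorCat MajorLetter MajorNumber MajorOther

# Toy classifier built on Base predicates; a real version would look the
# category up in the Unicode tables.
function charprop(::Type{GenCat}, c::Char)
    isuppercase(c) && return UpperLetter
    islowercase(c) && return LowerLetter
    isdigit(c) && return DecimalDigit
    return OtherChar
end

# Aggregate ("major") category derived from the general one.
function charprop(::Type{MajorCat}, c::Char)
    gc = charprop(GenCat, c)
    (gc == UpperLetter || gc == LowerLetter) && return MajorLetter
    gc == DecimalDigit && return MajorNumber
    return MajorOther
end
```

Each method returns values of a single enum type, so the result type follows from the first argument alone. The category itself remains an ordinary run-time value.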

ScottPJones · 2015-12-14T14:39:29Z

OK, I'd thought you'd like using types (traits), per this response:
#14347 (comment)
but that's OK. I would still like to keep the type system I came up with (for some of the reasons I stated above), but I can get rid of the charprop that returns the type.

nalimilan · 2015-12-14T14:47:22Z

Yeah, I wrote that before I realized returning an enum made more sense.

ScottPJones · 2015-12-15T12:40:24Z

I think that this latest version is flexible, easy to use, and is demonstrably much faster for tests like
isnumber, isalpha, isupper, isalnum, isprint, isgraph (where they have to match multiple categories).

With this PR:

julia> @benchmark isalnum('ℵ')
================ Benchmark Results ========================
     Time per evaluation: 3.99 ns [3.98 ns, 4.01 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 11501
   Number of evaluations: 115133501
         R² of OLS model: 0.994
 Time spent benchmarking: 0.50 s

From master:

julia> @benchmark isalnum('ℵ')
================ Benchmark Results ========================
     Time per evaluation: 5.34 ns [5.32 ns, 5.35 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 11201
   Number of evaluations: 86502301
         R² of OLS model: 0.998
 Time spent benchmarking: 0.50 s

So the old methods are about 34% slower. (Thanks for the mask idea, @nalimilan!)

tkelman · 2015-12-15T12:51:20Z

Using types for this just seems like an abuse of the type system. -1 on this scheme.

ScottPJones · 2015-12-15T13:10:53Z

If you consider this an abuse of the type system, then there are many examples of abusing the type system in Julia, and I'd say that the whole traits scheme also "abuses" the type system.
This is no different from the following:

abstract Algorithm

immutable InsertionSortAlg <: Algorithm end
immutable QuickSortAlg     <: Algorithm end
immutable MergeSortAlg     <: Algorithm end

or all of the types defined for functors:

abstract Func{N}

immutable IdFun <: Func{1} end
...

or things like ElementwiseMaxFun and ElementwiseMinFun, etc.

ScottPJones · 2015-12-15T19:07:08Z

@tkelman @nalimilan Please check out this latest version - it no longer uses types for the categories, and sets up masks for all of the general categories, as well as the aggregate categories such as AlphaNumeric, Print or Graph. The names all have documentation added, so you can do
?Category.LowerCase or ?Category.Ll and get documentation for the categories, with either long readable names or short two character names (from the Unicode standard). Very little is exported, pretty much just Unicode, Category, and the function charprop.

tkelman · 2015-12-15T19:25:49Z

Category should not be exported.

StefanKarpinski · 2015-12-15T19:57:14Z

Those uses of types are an excellent list of things we'd like to get rid of.

ScottPJones · 2015-12-15T20:07:32Z

Running a full test suite now, with Category not exported from Base.

ScottPJones · 2015-12-15T23:16:57Z

OK, just got home, tests passed so it now has Category removed from the exports.

ScottPJones · 2015-12-15T23:19:52Z

@StefanKarpinski Is there anything in the documentation that states that types should not be used that way, or any PRs or issues about removing the places in Base where you say you'd like to get rid of them? What other methods do you suggest, that would have the same performance (i.e. allow the compiler to determine things at compile-time instead of run-time), to replace those techniques?

mason-bially · 2015-12-16T00:23:02Z

Well, @JeffBezanson's proposal (expanded from) in #6975 might provide features for a better convention for such problems (although I personally think just a protocol system is the better answer to issue #6975 specifically, I'd be curious if there are other reasons for the more general answer, like problems being encountered here).


nalimilan · 2015-12-16T08:03:37Z

> @StefanKarpinski Is there anything in the documentation that states that types should not be used that way, or any PRs or issues about removing the places in Base where you say you'd like to get rid of them? What other methods do you suggest, that would have the same performance (i.e. allow the compiler to determine things at compile-time instead of run-time), to replace those techniques?

Wait, are we talking about returning a type to represent the category (which I don't like either), or about passing a type as an argument to specify which information you want (which I support)? Only the latter allows the compiler to reason about types.

MithrandirMiles · 2015-12-16T15:01:37Z

@nalimilan No - the only types left in this are the ones for specifying which property charprop returns.
Currently, I've only implemented Unicode.Category.Code, but there are others, as per your original suggestion, that can be added in subsequent PRs.
Based on @tkelman's feedback, only the Unicode module and the charprop function are exported from Base.
With this last version using masks (thanks again for that suggestion!) it also gives a nice performance improvement, as well as IMO being more readable. (Note: since these are in the Unicode module,
they can just say Category.Code instead of Unicode.Category.Code)

isupper(c::Char)  = charprop(Category.Code, c) in Category.Upper
isalpha(c::Char)  = charprop(Category.Code, c) in Category.Letter
isnumber(c::Char) = charprop(Category.Code, c) in Category.Number
isalnum(c::Char)  = charprop(Category.Code, c) in Category.AlphaNumeric
ispunct(c::Char)  = charprop(Category.Code, c) in Category.Punctuation
isprint(c::Char)  = charprop(Category.Code, c) in Category.Print
# true in principle if a printer would use ink
isgraph(c::Char)  = charprop(Category.Code, c) in Category.Graph

as opposed to the old

# true for Unicode upper and mixed case
function isupper(c::Char)
    ccode = category_code(c)
    return ccode == UTF8PROC_CATEGORY_LU || ccode == UTF8PROC_CATEGORY_LT
end
isalpha(c::Char)  = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO)
isnumber(c::Char) = (UTF8PROC_CATEGORY_ND <= category_code(c) <= UTF8PROC_CATEGORY_NO)
function isalnum(c::Char)
    ccode = category_code(c)
    return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_LO) ||
           (UTF8PROC_CATEGORY_ND <= ccode <= UTF8PROC_CATEGORY_NO)
end
ispunct(c::Char) = (UTF8PROC_CATEGORY_PC <= category_code(c) <= UTF8PROC_CATEGORY_PO)
isprint(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_ZS)
# true in principal if a printer would use ink
isgraph(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_SO)
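To make the comparison concrete, here is a small sketch of how a mask test relates to the range tests in the old definitions. It is not the PR's code. The numeric category codes are assumed to follow utf8proc's numbering (LU=1 through LO=5, ND=9 through NO=11), and `rangemask`, `inmask` and the `*_MASK` constants are hypothetical names.

```julia
# Category codes, assumed to match utf8proc's numbering (check utf8proc.h).
const LU = 1
const LT = 3
const LO = 5
const ND = 9
const NO = 11

# Build a mask with one bit set for each code in lo:hi.
rangemask(lo, hi) = reduce(|, (UInt32(1) << code for code in lo:hi))

const LETTER_MASK = rangemask(LU, LO)
const NUMBER_MASK = rangemask(ND, NO)
const ALNUM_MASK  = LETTER_MASK | NUMBER_MASK
const UPPER_MASK  = (UInt32(1) << LU) | (UInt32(1) << LT)

# Mask test: one shift and one AND, however many categories are in the set.
inmask(code, mask) = (mask & (UInt32(1) << code)) != 0

# Tests written in the style of the old definitions.
isalnum_range(code) = (LU <= code <= LO) || (ND <= code <= NO)
isupper_pair(code)  = code == LU || code == LT

# Both styles agree on every category code 0:29.
@assert all(inmask(code, ALNUM_MASK) == isalnum_range(code) for code in 0:29)
@assert all(inmask(code, UPPER_MASK) == isupper_pair(code) for code in 0:29)
```

The range tests rely on related categories having adjacent codes. The mask version has no such dependency, since any set of codes can be combined with `|`.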

nalimilan · 2015-12-16T15:03:38Z

@StefanKarpinski @tkelman Can you develop your stance about the proposed charprop API (cf. my previous comment)? Do you think it's OK to pass (not return) types to choose the property to retrieve?

nalimilan · 2015-12-16T15:09:44Z

@MithrandirMiles Using another account to escape your banning isn't likely to lead to a productive discussion. If you want to continue contributing to Julia, I think you should show that you have understood what the community reproaches you with, and show concrete signs that you will change your behavior. Else, I don't think anybody will merge this PR (nor future PRs from you), which I would find quite unfortunate.

tkelman · 2015-12-16T15:10:54Z

Passing types in as flags strikes me as a bit less ugly than returning types, though it seems like symbols would be a more conventional API (depending on whether that can be made type stable).

However I'd rather not continue interacting on this PR, considering the events of #14397 (comment) which have been blatantly ignored here. If someone else wants to wait a few weeks then take a crack at cleaning up these functions I think everyone's blood pressure would be better off for it.

StefanKarpinski · 2015-12-16T18:02:22Z

I agree with @tkelman. Closing this PR. The subject can be revisited later.

tkelman added the unicode label (Related to unicode characters and encodings) Dec 13, 2015

nalimilan reviewed Dec 13, 2015

ScottPJones force-pushed the spj/charcategory branch from e6488e3 to 96fcf6f Compare December 13, 2015 18:27

ScottPJones changed the title ~~WIP: Unicode character properties~~ Unicode character properties Dec 13, 2015

nalimilan reviewed Dec 13, 2015

ScottPJones force-pushed the spj/charcategory branch from 96fcf6f to fbffc85 Compare December 13, 2015 23:09

nalimilan reviewed Dec 14, 2015

ScottPJones force-pushed the spj/charcategory branch 3 times, most recently from 1010f09 to f24826e Compare December 15, 2015 03:52

ScottPJones force-pushed the spj/charcategory branch from e4a759b to 4e02490 Compare December 16, 2015 03:41

ScottPJones added 7 commits December 15, 2015 23:28

Unicode character properties

8365c84

Update to use submodule

2c53e52

Clean up naming, don't want any Cat fights!

8b23e47

Add category masks

c05d463

Update per comments on use of ?:

f4ae3ec

Add newline Fix indentation (Emacs and tabs)

Remove types for general categories

0166695

Remove Category from exports

2dd14d8

Updated tests to use Unicode.Category

ScottPJones force-pushed the spj/charcategory branch from 4e02490 to 2dd14d8 Compare December 16, 2015 04:28

StefanKarpinski closed this Dec 16, 2015

nalimilan mentioned this pull request Jan 30, 2017

isnumber("") return true #14156

Closed
