Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode character properties #14387

Closed
wants to merge 7 commits into from

Conversation

ScottPJones
Copy link
Contributor

This is still a WIP, based on the comments in #14347, to be subjected to some bikeshedding and performance testing, before deprecating some (not all) of the old very abbreviated character classification functions, such as isalnum, isnumber, iscntrl, etc.

Suggestions on how to improve it are quite welcome, of course!

Edit: I think (after tests complete) that this could be merged, and other changes such as deprecating some of the unused is* methods should be done in a follow-on PR.

@tkelman tkelman added the unicode Related to unicode characters and encodings label Dec 13, 2015

type UnicodeError <: Exception
errmsg::AbstractString ##< A UTF_ERR_ message
errmsg::AbstractString ##< An Unicode.ERR_ message
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A Unicode" if I'm not mistaken.

@nalimilan
Copy link
Member

Thanks, this looks interesting!

@ScottPJones
Copy link
Contributor Author

@nalimilan Thanks very much for your input, it really does help improve my code!
Let me know how you like the version I just pushed, I think it addresses most of your review comments.

@ScottPJones ScottPJones changed the title WIP: Unicode character properties Unicode character properties Dec 13, 2015
"""Unicode character categories"""
abstract CharCategory <: UnicodeProperty

"""Unicode letter character category"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing uppercase L. It would also be clearer if you added some kind of quote around the names of the categories.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely right, I'll do that as soon as I get home.

@nalimilan
Copy link
Member

What do you think of my proposal at #14387 (comment)?

@ScottPJones
Copy link
Contributor Author

I'll have to see how the performance compares, using in with a tuple, instead of <: and types.
I think it would work already with what I have here, this latest version allows you to just say Unicode.Zs etc. (and I've added doc strings for all of them).

@ScottPJones
Copy link
Contributor Author

OK, using in like that is about 225 x slower for checking isnumber.
I hope you like this last version.

BegQuotePunct, EndQuotePunct, OtherPunct,
MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol,
SpaceSeparator, LineSeparator, ParagraphSeparator,
ControlChar, FormatChar, SurrogateChar, PrivateUseChar]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indentation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Darn Emacs had put a \t in, so the indentation looked correct for me! Fixed.

@nalimilan
Copy link
Member

OK, using in like that is about 225 x slower for checking isnumber.
I hope you like this last version.

@ScottPJones Too bad. I'm still not a fan a the double API enum vs. types. Using types for something that can only be known at run time doesn't sound great. Also, you note yourself in the code that they are slower than manually checking the integer code range using <=. This is an interesting design issue which could be relevant for other situations too, so it's worth discussing.

I can see two solutions:

  1. In C, one would use bitwise operations like cat & UpperCase. This offers a relatively nice API which doesn't require mastering the underlying implementation details.
  2. We could also add a custom type for aggregate categories for which we would override in.

@ScottPJones
Copy link
Contributor Author

The name CharType (or CategoryType) is meant to show that it is a julia type (trait), not the numeric code (i.e. CategoryCode), it doesn't mean anything about Unicode.
I'd actually just wanted to have a type (trait) API, because I think it gives you a lot of flexibility to then define fast functions that take a type, and give back the code or a bit (1<<code), or return other information about that category. Another reason I wanted to use types instead of just codes is that I think it would be much easier to handle some of the other Unicode properties that way (charprop being intended to handle all sorts of Unicode properties, such as script(s), writing direction, etc., or the (up-to) 4-level categories).
I left in the (pseudo) enum style (enums just don't work at that stage unfortunately), because of the current performance issues with the <: operator, but I think the performance could be fixed in the future.
I think this is flexible enough, that you could easily add CategoryMask, which, instead of a code 0:29,
would return the bits in a UInt32, to allow for the C masking style with bit operations.

What do you think about the latest updates?

@ScottPJones
Copy link
Contributor Author

Oh, a big reason I much prefer the type (trait) API over the enum(-like) style, is that the code using the enum style depends on the assigned numeric code points being ordered in a certain way, instead of following a hierarchy which you can examine with super(t) (hopefully soon to be supertype(t)!), and subtypes(t), and it is easily extensible by adding things like:
typealias AlphaNumeric Union{Letter, Number}

@nalimilan
Copy link
Member

The fact that you need to put Type inside the name of a type is an indication that there's a problem IMHO. I have nothing against passing types to charprop to request other properties, but I don't like returning types for something that is essentially a run time property. The essential character of types is that they are used for dispatch, and useful to optimize code when know at compile time. Using them for something else isn't a good idea.

A frequent convention when you pass a type as the first argument to a Julia function is that you get in return an instance of that type (and not a subtype of that type). Here, you would pass the (pseudo-)enum type, and you would get an enum value. In my mind, "Unicode category" is one of the kinds of information you may want to get, just like "upper case", "spacing", "block", etc. Unicode categories should not be types, but values.

If you don't like the idea of comparing integer values, we could as well create another enum consisting of aggregate categories, and another type like MajorCategory to pass to charprop to get this information. Anyway, this shouldn't be a very common use, since these major categories are very broad and specific properties are more often needed.

@ScottPJones
Copy link
Contributor Author

OK, I'd thought you'd like using types (traits), per this response:
#14347 (comment)
but that's OK. I would still like to keep the type system I came up with (for some of the reasons I stated above), but I can get rid of the charprop that returns the type.

@nalimilan
Copy link
Member

Yeah, I wrote that before I realized returning and enum made more sense.

@ScottPJones ScottPJones force-pushed the spj/charcategory branch 3 times, most recently from 1010f09 to f24826e Compare December 15, 2015 03:52
@ScottPJones
Copy link
Contributor Author

I think that this latest version is flexible, easy to use, and is demonstrably much faster for tests like
isnumber, isalpha, isupper, isalnum, isprint, isgraph (where they have to match multiple categories).

With this PR:

julia> @benchmark isalnum('')
================ Benchmark Results ========================
     Time per evaluation: 3.99 ns [3.98 ns, 4.01 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 11501
   Number of evaluations: 115133501
         R² of OLS model: 0.994
 Time spent benchmarking: 0.50 s

From master:

julia> @benchmark isalnum('')
================ Benchmark Results ========================
     Time per evaluation: 5.34 ns [5.32 ns, 5.35 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 11201
   Number of evaluations: 86502301
         R² of OLS model: 0.998
 Time spent benchmarking: 0.50 s

So the old methods are about 34% slower. (Thanks for the mask idea, @nalimilan!)

@tkelman
Copy link
Contributor

tkelman commented Dec 15, 2015

Using types for this just seems like an abuse of the type system. -1 on this scheme.

@ScottPJones
Copy link
Contributor Author

If you consider this an abuse of the type system, then there are many examples of abusing the type system in Julia, and I'd say that the whole traits scheme also "abuses" the type system.
This is no different from the following:

abstract Algorithm

immutable InsertionSortAlg <: Algorithm end
immutable QuickSortAlg     <: Algorithm end
immutable MergeSortAlg     <: Algorithm end

or all of the types defined for functors:

abstract Func{N}

immutable IdFun <: Func{1} end
...

or things like ElementwiseMaxFun and ElementwiseMinFun, etc.

@ScottPJones
Copy link
Contributor Author

@tkelman @nalimilan Please check out this latest version - it no longer uses types for the categories, and sets up masks for all of the general categories, as well as the aggregate categories such as AlphaNumeric, Print or Graph. The names all have documentation added, so you can do
?Category.LowerCase or ?Category.Ll and get documentation for the categories, with either long readable names or short two character names (from the Unicode standard). Very little is exported, pretty much just Unicode, Category, and the function charprop.

@tkelman
Copy link
Contributor

tkelman commented Dec 15, 2015

Category should not be exported.

@StefanKarpinski
Copy link
Member

Those uses of types are an excellent list of things we'd like to get rid of.

@ScottPJones
Copy link
Contributor Author

Running a full test suite now, with Category not exported from Base.

@ScottPJones
Copy link
Contributor Author

OK, just got home, tests passed so it now has Category removed from the exports.

@ScottPJones
Copy link
Contributor Author

@StefanKarpinski Is there anything in the documentation that states that types should not be used that way, or any PRs or issues about removing the places in Base where you say you'd like to get rid of them? What other methods do you suggest, that would have the same performance (i.e. allow the compiler to determine things at compile-time instead of run-time), to replace those techniques?

@mason-bially
Copy link
Contributor

Well, @JeffBezanson 's proposal (expanded from) in #6975 might provide features for a better convention for such problems (although I personally think just a protocol system is the better answer to issue #6975 specifically, I'd be curious if there are other reasons for the more general answer, like problems being encountered here).

@nalimilan
Copy link
Member

@StefanKarpinski Is there anything in the documentation that states that types should not be used that way, or any PRs or issues about removing the places in Base where you say you'd like to get rid of them? What other methods do you suggest, that would have the same performance (i.e. allow the compiler to determine things at compile-time instead of run-time), to replace those techniques?

Wait, are we taking about returning a type to represent the category (which I don't like either), or about passing a type as an argument to specify which information you want (which I support)? Only the latter allows the compiler to reason about types.

@MithrandirMiles
Copy link

@nalimilan No - the only types left in this are the ones for specifying which property charprop returns.
Currently, I've only implemented Unicode.Category.Code, but there are others, as per your original suggestion, that can be added in subsequent PRs.
Based on @tkelman's feedback, only the Unicode module and the charprop function are exported from Base.
With this last version using masks (thanks again for that suggestion!) it also gives a nice performance improvement, as well as IMO being more readable. (Note: since these are in the Unicode module,
they can just say Category.Code instead of Unicode.Category.Code)

isupper(c::Char)  = charprop(Category.Code, c) in Category.Upper
isalpha(c::Char)  = charprop(Category.Code, c) in Category.Letter
isnumber(c::Char) = charprop(Category.Code, c) in Category.Number
isalnum(c::Char)  = charprop(Category.Code, c) in Category.AlphaNumeric
ispunct(c::Char)  = charprop(Category.Code, c) in Category.Punctuation
isprint(c::Char)  = charprop(Category.Code, c) in Category.Print
# true in principle if a printer would use ink
isgraph(c::Char)  = charprop(Category.Code, c) in Category.Graph

as opposed to the old

# true for Unicode upper and mixed case
function isupper(c::Char)
    ccode = category_code(c)
    return ccode == UTF8PROC_CATEGORY_LU || ccode == UTF8PROC_CATEGORY_LT
end
isalpha(c::Char)  = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO)
isnumber(c::Char) = (UTF8PROC_CATEGORY_ND <= category_code(c) <= UTF8PROC_CATEGORY_NO)
function isalnum(c::Char)
    ccode = category_code(c)
    return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_LO) ||
           (UTF8PROC_CATEGORY_ND <= ccode <= UTF8PROC_CATEGORY_NO)
end
ispunct(c::Char) = (UTF8PROC_CATEGORY_PC <= category_code(c) <= UTF8PROC_CATEGORY_PO)
isprint(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_ZS)
# true in principal if a printer would use ink
isgraph(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_SO)

@nalimilan
Copy link
Member

@StefanKarpinski @tkelman Can you develop your stance about the proposed charprop API (cf. my previous comment)? Do you think it's OK to pass (not return) types to choose the property to retrieve?

@nalimilan
Copy link
Member

@MithrandirMiles Using another account to escape your banning isn't likely to lead to a productive discussion. If you want to continue contributing to Julia, I think you should show that you have understood what the community reproaches you with, and show concrete signs that you will change your behavior. Else, I don't think anybody will merge this PR (nor future PRs from you), which I would find quite unfortunate.

@tkelman
Copy link
Contributor

tkelman commented Dec 16, 2015

Passing types in as flags strikes me as a bit less ugly than returning types, though seems like symbols would be a more conventional API (depending on whether that can be made type stable).

However I'd rather not continue interacting on this PR, considering the events of #14397 (comment) which have been blatantly ignored here. If someone else wants to wait a few weeks then take a crack at cleaning up these functions I think everyone's blood pressure would be better off for it.

@StefanKarpinski
Copy link
Member

I agree with @tkelman. Closing this PR. The subject can be revisited later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants