Unicode character properties #14387
type UnicodeError <: Exception
errmsg::AbstractString ##< A UTF_ERR_ message
errmsg::AbstractString ##< An Unicode.ERR_ message
"A Unicode" if I'm not mistaken.
Thanks, this looks interesting! |
e6488e3
to
96fcf6f
Compare
@nalimilan Thanks very much for your input, it really does help improve my code! |
"""Unicode character categories""" | ||
abstract CharCategory <: UnicodeProperty | ||
|
||
"""Unicode letter character category""" |
Missing uppercase L. It would also be clearer if you added some kind of quote around the names of the categories.
Absolutely right, I'll do that as soon as I get home.
What do you think of my proposal at #14387 (comment)? |
I'll have to see how the performance compares, using |
96fcf6f
to
fbffc85
Compare
OK, using |
BegQuotePunct, EndQuotePunct, OtherPunct,
MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol,
SpaceSeparator, LineSeparator, ParagraphSeparator,
ControlChar, FormatChar, SurrogateChar, PrivateUseChar]
Wrong indentation here.
Darn Emacs had put a \t in, so the indentation looked correct for me! Fixed.
@ScottPJones Too bad. I'm still not a fan of the double API enum vs. types. Using types for something that can only be known at run time doesn't sound great. Also, you note yourself in the code that they are slower than manually checking the integer code range using I can see two solutions:
|
The name What do you think about the latest updates? |
Oh, a big reason I much prefer the type (trait) API over the enum(-like) style, is that the code using the enum style depends on the assigned numeric code points being ordered in a certain way, instead of following a hierarchy which you can examine with |
The fact that you need to put A frequent convention when you pass a type as the first argument to a Julia function is that you get in return an instance of that type (and not a subtype of that type). Here, you would pass the (pseudo-)enum type, and you would get an enum value. In my mind, "Unicode category" is one of the kinds of information you may want to get, just like "upper case", "spacing", "block", etc. Unicode categories should not be types, but values. If you don't like the idea of comparing integer values, we could as well create another enum consisting of aggregate categories, and another type like |
OK, I'd thought you'd like using types (traits), per this response: |
Yeah, I wrote that before I realized returning an enum made more sense.
1010f09
to
f24826e
Compare
I think that this latest version is flexible, easy to use, and is demonstrably much faster for tests like

With this PR:

julia> @benchmark isalnum('ℵ')
================ Benchmark Results ========================
Time per evaluation: 3.99 ns [3.98 ns, 4.01 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
Memory allocated: 0.00 bytes
Number of allocations: 0 allocations
Number of samples: 11501
Number of evaluations: 115133501
R² of OLS model: 0.994
Time spent benchmarking: 0.50 s

From master:

julia> @benchmark isalnum('ℵ')
================ Benchmark Results ========================
Time per evaluation: 5.34 ns [5.32 ns, 5.35 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
Memory allocated: 0.00 bytes
Number of allocations: 0 allocations
Number of samples: 11201
Number of evaluations: 86502301
R² of OLS model: 0.998
Time spent benchmarking: 0.50 s

So the old methods are about 34% slower. (Thanks for the mask idea, @nalimilan!)
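For readers following along, the mask idea credited in this comment can be sketched roughly as follows. This is not the PR's code: `CategoryMask`, `category_mask`, the `*_MASK` constants and `isalnum_sketch` are hypothetical names, the numeric codes assume utf8proc's general-category numbering, and the sketch uses current Julia syntax (`struct`) rather than the `immutable`/`type` keywords seen elsewhere in this thread.

```julia
# Sketch only: all names here are hypothetical, not taken from the PR.
# Codes assume utf8proc's numbering: Lu = 1 ... Lo = 5, Nd = 9 ... No = 11.
struct CategoryMask
    bits::UInt32
end

# Set one bit per general-category code in `codes`.
category_mask(codes) =
    CategoryMask(reduce(|, (UInt32(1) << c for c in codes); init = UInt32(0)))

# `code in mask` becomes one shift and one bitwise AND instead of range comparisons.
Base.in(code::Integer, m::CategoryMask) = (m.bits & (UInt32(1) << code)) != 0

const LETTER_MASK = category_mask(1:5)    # Lu, Ll, Lt, Lm, Lo
const NUMBER_MASK = category_mask(9:11)   # Nd, Nl, No
const ALNUM_MASK  = CategoryMask(LETTER_MASK.bits | NUMBER_MASK.bits)

# Base.Unicode.category_code is an internal, unexported helper in current Julia;
# relying on it is an assumption of this sketch.
isalnum_sketch(c::Char) = Base.Unicode.category_code(c) in ALNUM_MASK
```

With a mask, an aggregate category such as "letter or number" needs no assumption that the underlying codes are contiguous, which is one of the concerns raised earlier in the thread about range checks.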
Using types for this just seems like an abuse of the type system. -1 on this scheme. |
If you consider this an abuse of the type system, then there are many examples of abusing the type system in Julia, and I'd say that the whole traits scheme also "abuses" the type system.

abstract Algorithm
immutable InsertionSortAlg <: Algorithm end
immutable QuickSortAlg <: Algorithm end
immutable MergeSortAlg <: Algorithm end

or all of the types defined for functors:

abstract Func{N}
immutable IdFun <: Func{1} end
...

or things like
@tkelman @nalimilan Please check out this latest version - it no longer uses types for the categories, and sets up masks for all of the general categories, as well as the aggregate categories such as |
|
Those uses of types are an excellent list of things we'd like to get rid of. |
Running a full test suite now, with |
OK, just got home, tests passed so it now has |
@StefanKarpinski Is there anything in the documentation that states that types should not be used that way, or any PRs or issues about removing the places in Base where you say you'd like to get rid of them? What other methods do you suggest, that would have the same performance (i.e. allow the compiler to determine things at compile-time instead of run-time), to replace those techniques? |
Well, @JeffBezanson 's proposal (expanded from) in #6975 might provide features for a better convention for such problems (although I personally think just a protocol system is the better answer to issue #6975 specifically, I'd be curious if there are other reasons for the more general answer, like problems being encountered here). |
e4a759b
to
4e02490
Compare
Add newline Fix indentation (Emacs and tabs)
Updated tests to use Unicode.Category
4e02490
to
2dd14d8
Compare
Wait, are we talking about returning a type to represent the category (which I don't like either), or about passing a type as an argument to specify which information you want (which I support)? Only the latter allows the compiler to reason about types.
@nalimilan No - the only types left in this are the ones for specifying which property

isupper(c::Char) = charprop(Category.Code, c) in Category.Upper
isalpha(c::Char) = charprop(Category.Code, c) in Category.Letter
isnumber(c::Char) = charprop(Category.Code, c) in Category.Number
isalnum(c::Char) = charprop(Category.Code, c) in Category.AlphaNumeric
ispunct(c::Char) = charprop(Category.Code, c) in Category.Punctuation
isprint(c::Char) = charprop(Category.Code, c) in Category.Print
# true in principle if a printer would use ink
isgraph(c::Char) = charprop(Category.Code, c) in Category.Graph

as opposed to the old

# true for Unicode upper and mixed case
function isupper(c::Char)
    ccode = category_code(c)
    return ccode == UTF8PROC_CATEGORY_LU || ccode == UTF8PROC_CATEGORY_LT
end
isalpha(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO)
isnumber(c::Char) = (UTF8PROC_CATEGORY_ND <= category_code(c) <= UTF8PROC_CATEGORY_NO)
function isalnum(c::Char)
    ccode = category_code(c)
    return (UTF8PROC_CATEGORY_LU <= ccode <= UTF8PROC_CATEGORY_LO) ||
           (UTF8PROC_CATEGORY_ND <= ccode <= UTF8PROC_CATEGORY_NO)
end
ispunct(c::Char) = (UTF8PROC_CATEGORY_PC <= category_code(c) <= UTF8PROC_CATEGORY_PO)
isprint(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_ZS)
# true in principal if a printer would use ink
isgraph(c::Char) = (UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_SO)
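The "pass a type to select the property, get a plain value back" convention discussed in this exchange might look something like the sketch below. It is only an illustration in current Julia syntax: the selector types `CategoryCode` and `UpperCase` and these `charprop` methods are hypothetical, loosely modeled on the `charprop(Category.Code, c)` calls quoted in this comment, and are not the PR's implementation.

```julia
# Sketch only: hypothetical selector types and methods, not the PR's code.
abstract type UnicodeProperty end
struct CategoryCode <: UnicodeProperty end   # select the general-category code
struct UpperCase    <: UnicodeProperty end   # select "is this an uppercase letter?"

# The selector is known at compile time, so dispatch can pick the method (and
# its return type) statically; the result itself is an ordinary run-time value.
# Base.Unicode.category_code is internal to Base; using it is an assumption here.
charprop(::Type{CategoryCode}, c::Char) = Int(Base.Unicode.category_code(c))
charprop(::Type{UpperCase}, c::Char) = isuppercase(c)
```

Under these assumptions, `charprop(CategoryCode, 'A')` yields an integer code and `charprop(UpperCase, 'A')` yields a `Bool`, so the category stays a value rather than becoming a type.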
@StefanKarpinski @tkelman Can you develop your stance about the proposed |
@MithrandirMiles Using another account to escape your banning isn't likely to lead to a productive discussion. If you want to continue contributing to Julia, I think you should show that you have understood what the community reproaches you with, and show concrete signs that you will change your behavior. Else, I don't think anybody will merge this PR (nor future PRs from you), which I would find quite unfortunate. |
Passing types in as flags strikes me as a bit less ugly than returning types, though seems like symbols would be a more conventional API (depending on whether that can be made type stable). However I'd rather not continue interacting on this PR, considering the events of #14397 (comment) which have been blatantly ignored here. If someone else wants to wait a few weeks then take a crack at cleaning up these functions I think everyone's blood pressure would be better off for it. |
I agree with @tkelman. Closing this PR. The subject can be revisited later. |
This is still a WIP, based on the comments in #14347, to be subjected to some bikeshedding and performance testing, before deprecating some (not all) of the old very abbreviated character classification functions, such as isalnum, isnumber, iscntrl, etc.

Suggestions on how to improve it are quite welcome, of course!

Edit: I think (after tests complete) that this could be merged, and other changes such as deprecating some of the unused is* methods should be done in a follow-on PR.