Option to use PCRE2_UCP in regular expressions ("u" regex flag) #27084
Comments
Good catch. The current behavior doesn't sound very intuitive given that our strings are supposed to be Unicode:
julia> match(r"\w+", "café")
RegexMatch("caf")
So maybe we should make UCP the default, with a flag to disable it when performance is a concern and the caller only cares about ASCII. |
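For code that should not depend on which default ends up being chosen, one option is to spell out the character classes instead of relying on `\w`. The sketch below assumes PCRE2's `\p{L}` (letter) and `\p{N}` (number) property escapes, which are available in UTF mode whether or not UCP is set; the variable names are illustrative only.

```julia
# "caf\u00e9" is "café" written with a precomposed é (U+00E9),
# so the input does not depend on how the source file is normalized.
word = "caf\u00e9"

# Explicit Unicode classes: any letter, any number, or an underscore.
explicit = match(r"[\p{L}\p{N}_]+", word)

# Explicit ASCII-only class, for callers who only care about ASCII.
ascii_only = match(r"[A-Za-z0-9_]+", word)

println(explicit.match)    # café
println(ascii_only.match)  # caf
```

Both patterns state their intent directly, so their meaning would not change if the default interpretation of `\w` did.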
Though that is not backward compatible? |
Yes, that's why it would have to happen before 1.0. |
Does anybody know what other languages do? |
(Just some tentative info from some searching around - keep grains of salt handy, and please point out any errors you find in this.) Perl automatically matches as if PCRE2_UCP was set, and so implicitly does Unicode Character Property-based matching by default. Python barely has any Unicode regex support in the default […]. PHP, as far as I can tell, doesn't use PCRE2_UCP at all, things like […]. The .NET documentation says it automatically uses UCP-based matching for strings with Unicode encoding, which seems to be the default encoding. Ruby above version 1.9 uses the Onigmo library, which is halfway between explicit and implicit UCP support: for POSIX character classes like […]
Java 7 and above work similarly to PHP, in that UCP support has to be explicitly enabled (with […] |
Additional data:
Overall it really looks like we should set UCP by default. |
It does seem like we should use UCP by default. Possible transition path:
It's kind of annoying for everyone to need to add a flag to all their regexes and then delete it again, especially when they may well have wanted UCP behavior in the first place. We could potentially be smart about it and only give the warning for regular expressions whose meaning has changed. |
Agreed. We could check whether the regex contains common patterns like |
There is some dissenting opinion: https://discourse.julialang.org/t/regex-pcre2-and-the-pcre2-ucp-ucp-flag/10930. |
I think we should just change the default and add a flag to enable the old behavior. |
I agree. If you really want only the ASCII ones or whatever, just use |
Fixes #27084. Regexes now match based on Unicode character properties, rather than just ASCII character properties, e.g. `match(r"\w+", "café")` will now match the entire word (and not just `caf`). This behavior can be disabled with the `a` flag to the regex string macro (e.g. `r"\w+"a`).
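A minimal sketch of the behavior described in the commit message, assuming Julia 1.0 or later (where the change is in effect). The second input is the example string from the original issue report; the variable names are illustrative only.

```julia
# Default matching uses Unicode character properties (UCP).
# "caf\u00e9" is "café" written with a precomposed é (U+00E9).
unicode_word = match(r"\w+", "caf\u00e9")

# The `a` flag restores ASCII-only character classes.
ascii_word = match(r"\w+"a, "caf\u00e9")

# Example string from the issue report: Tamil letters followed by "_2".
tamil = "~~கசடதபற_2!! "
unicode_tamil = match(r"\w+", tamil)
ascii_tamil = match(r"\w+"a, tamil)

println(unicode_word.match)   # café
println(ascii_word.match)     # caf
println(unicode_tamil.match)  # கசடதபற_2
println(ascii_tamil.match)    # _2
```

With the `a` flag, `\w` skips the Tamil letters entirely and the first match starts at the underscore.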
Per the PCRE2 documentation:
(This is only a subset of the changes; other parts of the documentation mention changes to POSIX character class interpretations, among other things.)
There's a `const UCP = UInt32(0x00020000)` defined in `base/pcre_h.jl`, but it doesn't seem to be used for anything. It would be very useful to have a regex flag that allowed us to specify to the regex library when we wanted UCP set.
Perl seems to automatically set its `/u` flag (http://perldoc.perl.org/perlre.html under "Character set modifiers") when in 'unicode_strings' mode, which is the default mode in recent Perl versions. Julia could do the same (since Julia is also in 'unicode strings' mode by default) and make PCRE2_UCP the default mode, but the PCRE2 documentation says "Matching these sequences is noticeably slower when PCRE2_UCP is set" - if that's still the case in practice and there's a performance impact, then UCP can be left just as a non-default flag that the user can enable by saying, e.g., `match(r"\w+"u, "~~கசடதபற_2!! ")`. Currently this returns: