Option to use PCRE2_UCP in regular expressions ("u" regex flag) #27084

Closed
digital-carver opened this issue May 12, 2018 · 11 comments

@digital-carver
Contributor

Per the PCRE2 documentation:

If the PCRE2_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore

(This is only a subset of the changes, other parts of the documentation mention changes to POSIX character class interpretations among other things.)

There's a const UCP = UInt32(0x00020000) defined in base/pcre_h.jl, but it doesn't seem to be used for anything. It would be very useful to have a regex flag that allowed us to specify to the regex library when we wanted UCP set.

Perl seems to set its /u flag automatically (http://perldoc.perl.org/perlre.html, under "Character set modifiers") when in 'unicode_strings' mode, which is the default mode in recent Perl versions. Julia could do the same, since Julia is also in 'unicode strings' mode by default, and make PCRE2_UCP the default. However, the PCRE2 documentation says "Matching these sequences is noticeably slower when PCRE2_UCP is set". If that is still the case in practice and there is a performance impact, UCP can instead be a non-default flag that the user enables explicitly, e.g. match(r"\w+"u, "~~கசடதபற_2!! "). Currently this returns:

ERROR: LoadError: ArgumentError: unknown regex flag: u
Stacktrace:
 [1] Regex(::String, ::String) at .\regex.jl:43
 [2] @r_str(::LineNumberNode, ::Module, ::Any, ::Vararg{Any,N} where N) at .\regex.jl:83
in expression starting at REPL[76]:1
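
Until such a flag exists, the property classes quoted from the PCRE2 documentation can be spelled out by hand, since `\p{...}` already works in Julia regexes without any flag. The following is only a sketch: the constant names are illustrative and not part of Base, and the `\w` equivalent follows the definition quoted above (letters, numbers, underscore) rather than whatever a given PCRE2 version does internally.

```julia
# Hand-written equivalents of the UCP definitions quoted above.
# UNICODE_WORD and UNICODE_DIGIT are illustrative names, not Base API.
const UNICODE_WORD  = r"[\p{L}\p{N}_]+"   # \w under UCP: letters, numbers, underscore
const UNICODE_DIGIT = r"\p{Nd}+"          # \d under UCP: decimal digits in any script

word   = match(UNICODE_WORD, "~~கசடதபற_2!! ")
digits = match(UNICODE_DIGIT, "room ४२")   # Devanagari digits

println(word.match)     # the Tamil word plus "_2", without the surrounding punctuation
println(digits.match)   # the two Devanagari digits
```

This gives Unicode-aware matching, but every pattern has to carry the longer class, which is the inconvenience a dedicated flag would remove.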
@nalimilan
Member

Good catch. The current behavior doesn't sound very intuitive given that our strings are supposed to be Unicode:

julia> match(r"\w+", "café")
RegexMatch("caf")

So maybe we should make UCP the default, with a flag to disable it when performance is a concern and the caller only cares about ASCII.

@nalimilan nalimilan added the unicode Related to unicode characters and encodings label May 12, 2018
@elextr

elextr commented May 13, 2018

So maybe we should make UCP the default

Though that is not backward compatible?

@nalimilan
Member

Yes, that's why it would have to happen before 1.0.

@nalimilan nalimilan added the triage This should be discussed on a triage call label May 13, 2018
@JeffBezanson
Sponsor Member

Does anybody know what other languages do?

@digital-carver
Contributor Author

(Just some tentative info from some searching around - keep grains of salt handy, and please point out any errors you find in this.)

Perl automatically matches as if PCRE2_UCP was set, and so implicitly does Unicode Character Property-based matching by default.

Python barely has any Unicode regex support in the default re module (Python 2 has practically no support). The regex module is commonly recommended for doing any Unicode matching in Python, and that uses UCP matching automatically, unless it's a bytestring or a string marked with the ASCII flag.

PHP, as far as I can tell, doesn't use PCRE2_UCP at all: things like \w and [:alpha:] always match only within the ASCII range. It's possible to use Unicode categories and script names with a /u suffix, which enables things like \pL and \p{Greek}. Based on those, people bake their own versions of the Unicode equivalents to \w, [:alnum:], etc. (I believe Julia's current behaviour matches PHP's, except that Julia doesn't need a suffix to enable \p patterns.)

.NET documentation says it automatically uses UCP based matching for strings with Unicode encoding, which seems to be the default encoding.

Ruby above version 1.9 uses the Onigmo library, which is halfway between explicit and implicit UCP support: for POSIX character classes like [:alpha:], it uses Unicode properties automatically, as if PCRE2_UCP was set. For the shorthand character classes like \w, the Unicode property mode has to be enabled with (?u):

$ ruby -e 'print(/[[:alpha:]]+/.match("$café."))'
café
$ ruby -e 'print(/\w+/.match("$café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("$café."))'
café

Java 7 and above works similarly to PHP, in that UCP support has to be explicitly enabled (with Pattern.UNICODE_CHARACTER_CLASS or (?U)).

@nalimilan
Member

Additional data:

  • Rust has a custom implementation which behaves like PCRE with UCP.
  • Swift doesn't have built-in regexes apparently (though it's planned). Currently it includes NSRegularExpression, which is based on ICU, which in turn behaves as PCRE with UCP.
  • Go uses the re2 engine, which treats things like \w and [:alpha:] as sets of ASCII characters (and provides a different syntax for matching Unicode categories).

Overall it really looks like we should set UCP by default.

@StefanKarpinski
Sponsor Member

It does seem like we should use UCP by default. Possible transition path:

  1. 0.7: add flags for turning UCP mode on and off
  2. 0.7: deprecate not using either one — use the off flag to keep old non-UCP behavior
  3. 1.0: remove warning for no-flag, flip its behavior from non-UCP to UCP

It's kind of annoying for everyone to need to add a flag to all their regexes and then delete it again, especially when they may well have wanted UCP behavior in the first place. We could potentially be smart about it and only give the warning for regular expressions whose meaning has changed.

@nalimilan
Member

Agreed. We could check whether the regex contains common patterns like \w, [:alnum:], etc. IIUC the number of patterns affected by UCP is limited.
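
A rough sketch of such a check, assuming the affected constructs are the shorthand escapes `\d`, `\s`, `\w`, `\b` (and their uppercase negations) plus POSIX classes like `[:alnum:]`. The names `UCP_SENSITIVE` and `ucp_sensitive` are hypothetical and not part of Base; this is a text-level heuristic, not a parser, so it can misfire on an escaped backslash followed by one of those letters.

```julia
# Hypothetical helper, not part of Base: flags pattern strings whose meaning
# could change when PCRE2_UCP is switched on.
# Looks for \d \D \s \S \w \W \b \B, or a POSIX class such as [:alpha:] / [:^alpha:].
const UCP_SENSITIVE = r"\\[dDsSwWbB]|\[:\^?[a-z]+:\]"

ucp_sensitive(pattern::AbstractString) = occursin(UCP_SENSITIVE, pattern)

for p in (raw"\w+", raw"[[:alnum:]]+", raw"[a-z]+", raw"\p{L}+")
    println(rpad(p, 14), " => ", ucp_sensitive(p))
end
```

A deprecation warning could then be limited to patterns for which a check along these lines returns true, leaving regexes like `[a-z]+` or `\p{L}+` untouched.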

@StefanKarpinski
Sponsor Member

@JeffBezanson
Sponsor Member

I think we should just change the default and add a flag to enable the old behavior.

@Keno
Member

Keno commented May 17, 2018

I agree. If you really want only the ASCII ones or whatever, just use [a-zA-Z0-9_] or enable the ASCII flag.
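
For illustration, the explicit-class option might look like the sketch below. It does not depend on whether UCP is on or off, because the class lists its ASCII members directly; the variable names are only for this example.

```julia
# An explicit ASCII class matches the same characters regardless of UCP mode.
ascii_word = r"[a-zA-Z0-9_]+"

m = match(ascii_word, "café_1 au lait")
println(m.match)   # stops before the accented character
```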

@Keno Keno removed the triage This should be discussed on a triage call label May 17, 2018
@JeffBezanson JeffBezanson added this to the 1.0 milestone May 17, 2018
@Keno Keno self-assigned this May 21, 2018
Keno added a commit that referenced this issue May 21, 2018
Fixes #27084. Regexes now match based on unicode character properties,
rather than just ASCII character properties, e.g. `match(r"\w+", "café")`
will now match the entire word (and not just `caf`). This behavior can
be disabled with the `a` flag to the regex string macro (e.g. `r"\w+"a`).
Liozou pushed a commit to Liozou/julia that referenced this issue May 24, 2018
Fixes JuliaLang#27084. Regexes now match based on unicode character properties,
rather than just ASCII character properties, e.g. `match(r"\w+", "café")`
will now match the entire word (and not just `caf`). This behavior can
be disabled with the `a` flag to the regex string macro (e.g. `r"\w+"a`).
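
The behaviour described in the commit message can be checked with a short snippet. This assumes a Julia version that includes the referenced change (it was milestoned for 1.0), and that the `é` in the input is the single precomposed character.

```julia
# Default: Unicode character properties decide what \w matches.
unicode_match = match(r"\w+", "café")

# With the `a` flag: \w is restricted to ASCII word characters again.
ascii_match = match(r"\w+"a, "café")

println(unicode_match.match)   # the whole word
println(ascii_match.match)     # only the ASCII prefix
```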