Option to use PCRE2_UCP in regular expressions ("u" regex flag) #27084

digital-carver · 2018-05-12T13:54:11Z

If the PCRE2_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore

(This is only a subset of the changes; other parts of the documentation mention changes to POSIX character class interpretations, among other things.)

There's a const UCP = UInt32(0x00020000) defined in base/pcre_h.jl, but it doesn't seem to be used for anything. It would be very useful to have a regex flag that allowed us to specify to the regex library when we wanted UCP set.

Perl seems to automatically set its /u flag (http://perldoc.perl.org/perlre.html under "Character set modifiers") when in 'unicode_strings' mode, which is the default mode in recent Perl versions. Julia could do the same (since Julia is also in 'unicode strings' mode by default) and make PCRE2_UCP the default mode. However, the PCRE2 documentation says "Matching these sequences is noticeably slower when PCRE2_UCP is set". If that's still the case in practice and there's a performance impact, then UCP can be left as a non-default flag that the user enables explicitly, e.g. match(r"\w+"u, "~~கசடதபற_2!! "). Currently this returns:

ERROR: LoadError: ArgumentError: unknown regex flag: u
Stacktrace:
 [1] Regex(::String, ::String) at .\regex.jl:43
 [2] @r_str(::LineNumberNode, ::Module, ::Any, ::Vararg{Any,N} where N) at .\regex.jl:83
in expression starting at REPL[76]:1

nalimilan · 2018-05-12T18:09:36Z

Good catch. The current behavior doesn't sound very intuitive given that our strings are supposed to be Unicode:

julia> match(r"\w+", "café")
RegexMatch("caf")

So maybe we should make UCP the default, with a flag to disable it when performance is a concern and the caller only cares about ASCII.

elextr · 2018-05-13T03:22:47Z

> So maybe we should make UCP the default

Though that is not backward compatible?

nalimilan · 2018-05-13T09:28:52Z

Yes, that's why it would have to happen before 1.0.

JeffBezanson · 2018-05-15T15:39:29Z

Does anybody know what other languages do?

digital-carver · 2018-05-15T19:52:39Z

(Just some tentative info from some searching around - keep grains of salt handy, and please point out any errors you find in this.)

Perl automatically matches as if PCRE2_UCP was set, and so implicitly does Unicode Character Property-based matching by default.

Python barely has any Unicode regex support in the default re module (Python 2 has practically no support). The regex module is commonly recommended for doing any Unicode matching in Python, and that uses UCP matching automatically, unless it's a bytestring or a string marked with the ASCII flag.

PHP, as far as I can tell, doesn't use PCRE2_UCP at all; things like \w and [:alpha:] always match only within the ASCII range. It's possible to use Unicode categories and script names with a /u suffix, which enables things like \pL and \p{Greek}. Based on those, people bake their own versions of the Unicode equivalents to \w, [:alnum:], etc. (I believe Julia's current behaviour matches PHP's, except that Julia doesn't need a suffix to enable \p patterns.)

.NET documentation says it automatically uses UCP based matching for strings with Unicode encoding, which seems to be the default encoding.

Ruby above version 1.9 uses the Onigmo library, which is halfway between explicit and implicit UCP support: for POSIX character classes like [:alpha:], it uses Unicode properties automatically, as if PCRE2_UCP was set. For the shorthand character classes like \w, the Unicode property mode has to be enabled with (?u):

$ ruby -e 'print(/[[:alpha:]]+/.match("$café."))'
café
$ ruby -e 'print(/\w+/.match("$café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("$café."))'
café

Java 7 and above work similarly to PHP, in that UCP support has to be explicitly enabled (with Pattern.UNICODE_CHARACTER_CLASS or (?U)).

nalimilan · 2018-05-15T20:36:33Z

Additional data:

Rust has a custom implementation which behaves like PCRE with UCP.
Swift apparently doesn't have built-in regexes (though they're planned). Currently it includes NSRegularExpression, which is based on ICU, which in turn behaves as PCRE with UCP.
Go uses the re2 engine, which treats things like \w and [:alpha:] as sets of ASCII characters (and provides a different syntax for matching Unicode categories).

Overall it really looks like we should set UCP by default.

StefanKarpinski · 2018-05-16T12:49:01Z

It does seem like we should use UCP by default. Possible transition path:

0.7: add flags for turning UCP mode on and off
0.7: deprecate not using either one — use the off flag to keep old non-UCP behavior
1.0: remove warning for no-flag, flip its behavior from non-UCP to UCP

It's kind of annoying for everyone to need to add a flag to all their regexes and then delete it again, especially when they may well have wanted UCP behavior in the first place. We could potentially be smart about it and only give the warning for regular expressions whose meaning has changed.

nalimilan · 2018-05-16T15:05:33Z

Agreed. We could check whether the regex contains common patterns like \w, [:alnum:], etc. IIUC the number of patterns affected by UCP is limited.
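
A rough sketch of such a check, assuming a plain substring scan is good enough for deciding when to warn. The names `UCP_SENSITIVE_TOKENS` and `ucp_sensitive` are hypothetical and not part of Base; the token list is based on what the PCRE2 documentation describes as affected by PCRE2_UCP (the shorthand classes, `\b`/`\B`, and POSIX classes) and may not be exhaustive.

```julia
# Hypothetical helper (not part of Base): flag patterns whose meaning may
# change when PCRE2_UCP is enabled.
const UCP_SENSITIVE_TOKENS = [
    raw"\d", raw"\D",   # decimal digits
    raw"\s", raw"\S",   # whitespace
    raw"\w", raw"\W",   # word characters
    raw"\b", raw"\B",   # word boundaries, defined in terms of \w and \W
    "[:",               # POSIX classes such as [:alpha:] or [:alnum:]
]

# Plain substring scan: it over-approximates, e.g. an escaped backslash
# followed by `w` is also flagged.
ucp_sensitive(pattern::AbstractString) =
    any(tok -> occursin(tok, pattern), UCP_SENSITIVE_TOKENS)

println(ucp_sensitive(raw"\w+"))          # true
println(ucp_sensitive("[a-z]+"))          # false
println(ucp_sensitive("[[:alnum:]]+"))    # true
```

A scan like this errs on the side of warning too often, which seems acceptable for a deprecation message; a precise check would need to parse the pattern.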

StefanKarpinski · 2018-05-17T03:14:28Z

There is some dissenting opinion: https://discourse.julialang.org/t/regex-pcre2-and-the-pcre2-ucp-ucp-flag/10930.

JeffBezanson · 2018-05-17T18:57:34Z

I think we should just change the default and add a flag to enable the old behavior.

Keno · 2018-05-17T18:59:33Z

I agree. If you really want only the ASCII ones or whatever, just use [a-zA-Z0-9_] or enable the ASCII flag.

Fixes #27084. Regexes now match based on unicode character properties, rather than just ASCII character properties, e.g. `match(r"\w+", "café")` will now match the entire word (and not just `caf`). This behavior can be disabled with the `a` flag to the regex string macro (e.g. `r"\w+"a`).
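
For illustration, the behavior described in the message above can be checked as follows on Julia 0.7 or later. The variable names are only for this example; the explicit character class is the alternative suggested earlier in the thread.

```julia
# Default matching uses Unicode character properties (UCP).
m_default = match(r"\w+", "café")
println(m_default.match)   # café

# The `a` flag restores ASCII-only character classes.
m_ascii = match(r"\w+"a, "café")
println(m_ascii.match)     # caf

# An explicit ASCII class is unaffected by UCP either way.
m_explicit = match(r"[a-zA-Z0-9_]+", "café")
println(m_explicit.match)  # caf
```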

nalimilan added the unicode label (Related to unicode characters and encodings) on May 12, 2018

nalimilan added the triage label (This should be discussed on a triage call) on May 13, 2018

Keno removed the triage label (This should be discussed on a triage call) on May 17, 2018

JeffBezanson added this to the 1.0 milestone May 17, 2018

Keno self-assigned this May 21, 2018

Keno mentioned this issue May 21, 2018

Make UCP option the default for regex matching #27189

Merged

JeffBezanson closed this as completed in #27189 May 22, 2018
