- Markus Scherer
- Mathias Bynens
This proposal reached stage 4 of the TC39 process during the 2023-may-16 meeting. It was merged into the ECMAScript spec on 2023-jun-15, on track for inclusion in the ECMAScript 2024 snapshot.
As of the 2021-may-25 TC39 meeting, this proposal officially subsumes the properties of strings proposal.
In ECMAScript regex character classes, we propose to add syntax & semantics for the following set operations:
- difference/subtraction (in A but not in B)
- intersection (in both A and B)
- nested character classes (needed to enable the above)
In addition, by merging with the properties of strings proposal, we also propose to add certain Unicode properties of strings, and string literals in character classes.
For a JavaScript developer-facing explanation of this proposal, see our feature article on v8.dev.
Many regular expression engines support named character properties, mostly reflecting Unicode character properties, to avoid hardcoding character classes that may require hundreds of ranges and that may change with new versions of Unicode.
However, a character property is often just a starting point. It is common to need additions (union), exceptions (subtraction), and “both this and that” (intersection). See the recommendation to support set operations in UTS #18: Unicode Regular Expressions.
ECMAScript regular expression patterns already support one set operation in limited form: one can create a union of characters, ranges, and classes, as long as those classes are `CharacterClassEscape`s like `\s` or `\p{Decimal_Number}`.
A web search for questions about regular expressions with such set operations reveals workarounds such as hardcoding the ranges resulting from set operations (losing the benefits of named properties) and lookahead assertions (which are unintuitive for this purpose and perform less well).
We propose adding syntax & semantics for difference and intersection, as well as nested character classes.
We propose to extend the syntax for character classes to add support for set difference/subtraction, set intersection, and nested character classes.
Within regular expression patterns, we propose enabling the following functionality.
// difference/subtraction
[A--B]
// intersection
[A&&B]
// nested character class
[A--[0-9]]
Throughout these high-level examples, `A` and `B` can be thought of as placeholders for a character class (e.g. `[a-z]`) or a property escape (e.g. `\p{ASCII}`) and maybe (subject to discussion of specifics) single characters and/or character ranges. See the illustrative examples section for concrete real-world use cases.
Real-world usage examples from code using ICU’s `UnicodeSet`, which implements a pattern syntax similar to regex character classes (modified here to use `\p{Perl syntax for properties}` rather than `[:POSIX syntax for properties:]`; `UnicodeSet` supports both):
- Code that looks for non-ASCII digits, to convert them to ASCII digits:
  `[\p{Decimal_Number}--[0-9]]`
- Looking for spans of "word/identifier letters" of specific scripts:
  `[\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]`
- Looking for “breaking spaces”:
  `[\p{White_Space}--\p{Line_Break=Glue}]`
  Note that ECMAScript currently doesn’t support `\p{Line_Break=…}`; this is an illustrative example regardless.
- Looking for emoji characters except for the ASCII ones:
  `[\p{Emoji}--[#*0-9]]` or `[\p{Emoji}--\p{ASCII}]`
- Looking for non-script-specific combining marks:
  `[\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]`
- Looking for “invisible characters” except for ASCII space:
  `[[\p{Other}\p{Separator}\p{White_Space}\p{Default_Ignorable_Code_Point}]--\x20]`
- Looking for “first letter in each script” starting from:
  `[\P{NFC_Quick_Check=No}--\p{Script=Common}--\p{Script=Inherited}--\p{Script=Unknown}]`
  Note that ECMAScript currently doesn’t support `\p{NFC_Quick_Check=…}`; this is an illustrative example regardless.
- All Greek code points that are either a letter, a mark (e.g. diacritic), or a decimal number:
  `[\p{Script_Extensions=Greek}&&[\p{Letter}\p{Mark}\p{Decimal_Number}]]`
- All code points, except for those in the “Other” `General_Category`, but add back control characters:
  `[[\p{Any}--\p{Other}]\p{Control}]`
- All assigned code points, except for separators:
  `[\p{Assigned}--\p{Separator}]`
- All right-to-left and Arabic Letter code points, but remove unassigned code points:
  `[[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]`
  Note that ECMAScript currently doesn’t support `\p{Bidi_Class=…}`; this is an illustrative example regardless.
- All right-to-left and Arabic Letter code points with `General_Category` “Letter”:
  `[\p{Letter}&&[\p{Bidi_Class=R}\p{Bidi_Class=AL}]]`
  Note that ECMAScript currently doesn’t support `\p{Bidi_Class=…}`; this is an illustrative example regardless.
- All characters in the “Other” `General_Category` EXCEPT for format and control characters (or, equivalently, all surrogate, private use, and unassigned code points):
  `[\p{Other}--\p{Format}--\p{Control}]`
It is an explicit goal of this proposal to not break backwards compatibility. Concretely, we don’t want to change behavior of any regular expression pattern that currently does not throw an exception. There needs to be some way to indicate that the new syntax is in use.
We considered 4 options:
- A new flag outside the expression itself.
- A modifier inside the expression, of the form `(?L)` where `L` is one ASCII letter. (Several regex engines support various modifiers like this.)
- A prefix like `\U…` that is not valid under the current `u` flag (Unicode mode), but note that `\U` without the `u` flag is just the same as `U` itself.
  - (Banning the use of unknown escape sequences in `u` RegExps was a conscious choice, made to enable this kind of extension.)
- A prefix like `(?[` that is not valid in existing patterns regardless of flags.
The idea to use a prefix was suggested in an early TC39 meeting, so we were working with variations of that, for example:
UnicodeCharacterClass = '\UniSet{' ClassContents '}'
However, we found that this is not very developer-friendly.
In particular, one would have to write the prefix and use the `u` flag. Waldemar pointed out that the prefix looks like it should be enough, and therefore a developer may well accidentally omit adding the `u` flag. Although this aspect could be addressed by using a more complicated prefix that is currently invalid with and without the `u` flag (like `(?[`), doing so would come at the cost of readability.
Also, the use of a backslash-letter prefix would want to enclose the new syntax in `{curly braces}` because other such syntax (`\p{property}`, `\u{12345}`, …) uses curly braces, but not using `[square brackets]` for the outermost level of a character class looks strange.
Finally, when an expression has several new-syntax character classes, the prefix would have to be used on each one, which is clunky.
An in-expression modifier is an attractive alternative, but ECMAScript does not yet use any such modifiers.
Therefore, a new flag is the simplest, most user-friendly, and syntactically and semantically cleanest way to indicate the new character class syntax. It should imply and build on the `u` flag.
We suggest using flag `v` for the next letter after `u`.
We also suggest that the proposed properties of strings require use of this same new flag.
In other words, the new flag would indicate several connected changes related to properties and character classes:
- properties of strings
- character classes may contain multi-character-string elements, via string literals or certain properties
- nested classes
- set operators
- simpler parsing of dashes and square brackets
- fixed/improved IgnoreCase matching
For more discussion see issue 2.
Knowing how the `v` flag differs from the `u` flag can be useful when “upgrading” existing `u` RegExps to use `v`. Here’s an overview of the differences:
- (This is the obvious part.) Previously invalid patterns making use of the new syntax (see above) now become valid, e.g. `[\p{ASCII_Hex_Digit}--[Ff]]`, `\p{RGI_Emoji}`, and `[_\q{a|bc|def}]`.
- Some previously valid patterns are now errors, specifically those with a character class including either an unescaped special character `(` `)` `[` `{` `}` `/` `-` `|` (note: `\` and `]` also require escaping inside a character class, but this is already true with the `u` flag) or a double punctuator:
  `[(]` `[)]` `[[]` `[{]` `[}]` `[/]` `[-]` `[|]` `[&&]` `[!!]` `[##]` `[$$]` `[%%]` `[**]` `[++]` `[,,]` `[..]` `[::]` `[;;]` `[<<]` `[==]` `[>>]` `[??]` `[@@]` `` [``] `` `[~~]` `[^^^]` `[_^^]`
- The `u` flag suffers from confusing case-insensitive matching behavior. The `v` flag has different, improved semantics. See the explainer for an overview, or issue #30 for more details.
Several other regex engines support some or all of the proposed extensions in some form:
language/implementation | union | subtraction | intersection | nested classes | symmetric difference |
---|---|---|---|---|---|
ICU regex | ✅ | ✅ | ✅ | ✅ | ❌ |
`java.util.regex.Pattern` | ✅ | 🤷 * | ✅ | ✅ | ❌ |
Perl (“experimental feature available starting in 5.18”) | ✅ | ✅ | ✅ | ✅ | ✅ |
.Net | ✅ | ✅ | ❌ | ✅ | ❌ |
XML Schema | ✅ | ✅ | ❌ | ✅ | ❌ |
Apache Xerces2 XPath regex | ✅ | ✅ | ✅ | ✅ | ❌ |
Python regex module (not built-in "re") | ✅ | ✅ | ✅ | ✅ | ✅ |
Ruby Regexp | ✅ | ❌ | ✅ | ❌ | ❌ |
ECMAScript prior to this proposal | ✅ | ❌ | ❌ | ❌ | ❌ |
ECMAScript with this proposal | ✅ | ✅ | ✅ | ✅ | ❌ |
* Subtraction is documented as intersection with negation. With only support for negation + nested classes, you already have the functional equivalent of intersection & subtraction: `[^[^ab][^cd]] === [[ab]&&[cd]]` and `[^[^ab][cd]] === [[ab]--[cd]]`. This is just not very readable. For this reason, our proposal includes dedicated syntax for intersection and subtraction as well.
These all differ somewhat in syntax and semantics (e.g. operator precedence). References:
- regular expression flavors that support character class subtraction
- regular expression flavors that support character class intersection
How does this interact with properties of strings a.k.a. the sequence properties proposal?
We described the exact interactions between the two proposals on the path to stage 2. (See issue #3 for background.)
We propose to require the new flag in order to enable properties-of-strings as well as allowing new-syntax character classes to contain multi-character-string elements (from string literals or properties-of-strings used inside a class).
Short answer: no, a property of characters cannot change into a property of strings (or vice versa).
Long answer: We brought this up with the Unicode Technical Committee (UTC) in May 2019 (see L2/19-168 + meeting notes), and later (in April 2021) proposed a concrete new stability policy (see L2/21-091 + meeting notes). The UTC reached consensus to approve our proposal. The domain of a normative or informative Unicode property must never change. In particular, a property of characters must never be changed into a property of strings, and vice versa.
Short answer: no, this proposal does not add matchers for infinite sets of strings.
This proposal, just like the original properties of strings proposal, adds support for certain properties of strings, each of which expands to a finite, well-defined set of strings (`Basic_Emoji` also applies to many single characters); and this proposal adds syntax for character classes with explicitly enumerated strings, which also creates a finite set. This is a natural extension from finite properties of characters and finite character classes/sets of characters.
For example, in UTS #51 there is a very clear distinction between
- an emoji zwj sequence, defined via a regular expression that matches an infinite set of strings
- the RGI emoji ZWJ sequence set (= the RGI_Emoji_ZWJ_Sequence property) which is a finite set of strings listed in a data file
It is theoretically possible to support named matchers for infinite sets of strings, that is, a kind of named sub-regular-expression. That is decidedly not part of this proposal, nor is any speculation about possible syntax and semantics of such hypothetical expressions part of this proposal.
There is enough reserved syntax (e.g., curly braces) to enable wide-ranging extensions in the future, but we don’t plan to build something specific into the proposed spec changes.
This proposal ensures longest strings are matched first, so that a prefix like `'xy'` does not hide a longer string like `'xyz'`. For example, the pattern `[a-c\q{W|xy|xyz}]` applies to the strings `'a'`, `'b'`, `'c'`, `'W'`, `'xy'`, and `'xyz'`. This pattern behaves like `xyz|xy|a|b|c|W` or `xyz|xy|[a-cW]`.
Matching the longest strings first is key to the integration with properties of strings like `\p{RGI_Emoji}`. A Unicode property defines a set of characters/strings in the mathematical sense; in particular, it has no order. Thus, there is no order of the strings in e.g. `[\p{RGI_Emoji}--\q{🇧🇪}]` that we could preserve.
For more details on the rationale for matching longest strings first, see issue #25.
A character class may contain multiple strings of the same length: e.g. `[xyz]` contains three strings consisting of a single character, and `[\q{xx|yy|zz}]` (using the new string literal syntax) contains three strings consisting of two characters. There is no inherent or observable match order for those same-length strings. The committee discussed and decided that character classes are mathematical sets with no inherent order. Similar to how there is no observable match order difference between `[xyz]` and `[zyx]`, there is no match order difference between `[\q{xx|yy|zz}]` and `[\q{zz|yy|xx}]`. This nuance enables implementers to use sets (i.e. implementations of mathematical sets) and tries (retrieval trees) for runtime optimizations.
No, properties of strings are not atomic. As shown in the previous FAQ entry, `\p{PropertyOfStrings}` desugars into a plain disjunction, rather than an atomic group containing a disjunction. We believe this behavior is the most future-proof, for the following reasons.
If, as part of a separate proposal, atomic groups are added to ECMAScript following the syntactic precedent in other languages, users can make their own choices, i.e.
- use `(?>\p{PropertyOfStrings})` if atomic behavior is desired
- use `\p{PropertyOfStrings}` if non-atomic behavior is desired
If, on the other hand, we forced properties of strings to be atomic, there’d be no way for users to opt-out of the atomic behavior without inventing a new “non-atomic” regular expression operator for which no precedent exists in other regular expression flavors.
See issue #50 for details.
As mentioned in the answer to the previous question, according to both the current ECMAScript specification and other regular expression implementations, character classes are mathematical sets. As such, the removal of strings that are not present in the original set is not an error, but rather a no-op. Example (note that `RGI_Emoji` includes the string `🇧🇪`, but `RGI_Emoji_ZWJ_Sequence` does not):
# Proper subset.
[\p{RGI_Emoji}--\q{🇧🇪}]
# Not a proper subset.
[\p{RGI_Emoji_ZWJ_Sequence}--\q{🇧🇪}]
It would be confusing and counterproductive if one of these patterns threw an exception.
Several of the real-world illustrative examples in this explainer rely on this useful `A--B` pattern, and it is crucial that we support it. See issue #32 for more background.
We considered also proposing an operator for symmetric difference (see issue #5), but we did not find a good use case and wanted to keep the proposal simple.
Instead, we are proposing to reserve doubled ASCII punctuation and symbols for future use. That will allow for future proposals to add `~~`, for example, as suggested in UTS #18, for symmetric difference.
No, this proposal does not affect ECMAScript lexers. It’s an explicit goal of our proposal that a correct ECMAScript lexer before this proposal remains a correct ECMAScript lexer after this proposal.
- November 2020 (slides)
- January 2021 (slides)
- March 2021 (slides)
- April 2021 Incubator Call (slides)
- April 2021 (slides)
- May 2021 (slides)
- August 2021 (slides)
- December 2021 (slides)
- March 2022 (slides)
- May 2023 (slides)
(We initially developed the draft spec changes in a Google Doc, but later moved all of the changes from there to the pull request.)
Integration with other standards:
- Integration with the `pattern` attribute in the HTML Standard: whatwg/html#7908
- Integration with the `URLPattern` API: whatwg/urlpattern#178
- SpiderMonkey/Firefox, shipping in Firefox 116
- V8/Chrome, enabled by default in V8 v11.2 / Chrome 112 (behind the `--harmony-regexp-unicode-sets` flag in earlier versions)
- JavaScriptCore/Safari, enabled by default in Safari Technology Preview 166 & Safari 17
- Babel via regexpu-core
- ICU class UnicodeSet can be built from a string with syntax like a regular expression character class. UnicodeSet has long supported set operations and multi-character strings, and recently (in ICU 70) added support for emoji properties of strings.
- C++ SRELL (`std::regex`-like library)
Support for the `v` flag in the HTML `pattern` attribute is available in: