IgnoreCase vs. complement vs. nested class #30

markusicu · 2021-05-27T21:43:25Z

Proposal

The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

More specifically:

Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)

For each single code point c in A, add every other code point d where scf(d)==c or where scf(c)==d

When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:

literal character c: create a CharSet A with just c and return SimpleCaseClosure(A)
range a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
- Note: The result will often consist of two or more ranges. SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
\p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
\P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
same with \w \W \s \S etc.: look up the property CharSet, compute the case closure, return the code point complement for backslash-uppercase escapes
set operations work as usual
[^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false

Problem description

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.

Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.

In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, and discussed doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.

Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).

Looking at the existing spec:

The spec already allows one level of nested classes in the form of CharacterClassEscapes inside CharacterClasses.
The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
The complement of a CharacterClass ([^ ClassRanges ]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.
When IgnoreCase=true, then Canonicalize (which the CharacterSetMatcher invokes) applies the Unicode Simple_Case_Folding (scf) before applying the "invert" complement.

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.

Example:

re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu

Naïvely, I expected these to behave the same. Actual results:

"aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
- The CharSet contains all lowercase letters.
- invert=false
- For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n).
- No match for digits, punctuation, etc.
"aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
- The CharSet contains all Unicode code points other than lowercase letters: Uppercase/titlecase/other letters, digits, symbols, punctuation, unassigned, etc.
- invert=true
- For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n), but "invert" negates "found".
- No match for any letters.
- No match for digits/punctuation/etc. either because they are in the CharSet directly and "invert" negates "found".
- In other words, under IgnoreCase \P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ].

We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.

The text was updated successfully, but these errors were encountered:

macchiati · 2021-06-03T05:22:28Z

I propose that we adopt a uniform approach to case insensitively, where:

if C is a character class, and "a" represents a character that has case variants, and "1" represents a character that has no case variants, then:

if /C/ matches "a", then /C/i matches "a" and all case variants of "a"
if /C/ doesn't match "a" or any case variant of "a", then /C/i doesn't match "a" or any case variant of "a"
if /C/ matches "1", then /C/i matches "1"
if /C/ doesn't match "1", then /C/i doesn't match "1"

And a uniform approach to complement, where [^[^X]] == X, and [^\p{X}] == [\P{X}]

aheninger · 2021-06-03T20:46:20Z

In the earlier email thread on this question I had looked into the behavior of other regex implementations. Copying those results here, for the record ...

RE                /[^ab]/i    /[[a-m]&&[F-T]]/i   /[[a-m]--[F-T]]/i
-------------------------------------------------------------------
Java              deep        deep                na
ICU / Android     deep        deep                deep
Python RE         deep        na (future)         na
Python regex      deep        deep                deep
.NET              deep        na                  deep
Rust regex        deep        deep                deep
Perl              deep        deep                deep 
Oniguruma / Ruby  deep        shallow             na

ECMAScript        deep        na                  na
PCRE / PHP        deep        na                  na
RE2 / Go          deep        na                  na
grep ERE          deep        na                  nz

Here are the expected results for the patterns shown and the test strings "abcdefghi" "ABCDEFGHI". Matched letters are shown as themselves; non-matches as ".".The logic for what matches in each case is exactly as Markus described in his previous email.

Pattern                Deep Case Close       Shallow Case Close
----------------------------------------------------------------
/[^ab]/i               ..cdefghi ..CDEFGHI   abcdefghi ABCDEFGHI
/[[a-m]&&[F-T]]/i      .....fghi .....FGHI   ......... .........
/[[a-m]--[F-T]]/i      abcde.... ABCDE....   abcdefghi ABCDEFGHI

Notes:

Perl's support for set expression is experimental, and subject to change. I missed it in the previous iteration. See https://perldoc.perl.org/perlrecharclass#Extended-Bracketed-Character-Classes
It uses (?[ ... ])to delimit set expressions with a new, extended syntax.

Symmetric difference (exclusive OR) is supported by Perl, Python and Rust. '^' for Perl, '~' for the others.

PCRE set expression syntax.

.NET character class subtraction.

Rust character class syntax

markusicu · 2021-06-04T18:01:35Z

Thanks, Mark & Andy.

These discussions about desired and observed outcomes are useful, but now that we are at stage 2, what I am most looking for is concrete proposals for how to change the ECMAScript spec, together with whether & how that changes u flag behavior, and what the resulting behavior looks like for v expressions that are the same / similar / use features not available in u.

For example, if a proposal involves changing the current semantics of CharSet+invert for a CharacterClass vs. CharSet-only for a CharacterClassEscape, then we need to spell that out, and need to say whether we want to change semantics for existing code, or else document that the same expression will behave differently depending on the v flag.

macchiati · 2021-06-04T18:08:34Z

I suggest that we only propose a change for v. That maintains complete backwards compatibility. If then the committee wants to extend the changes to non-V it's fairly easy to do. I'd be glad to compare what it looks like for the simple option that I outlined. But that would be after I get back from vacation so starting around the 14th

mathiasbynens · 2021-06-10T16:32:30Z

@schuay @msaboff, any opinions or preferences from an implementation point of view?

schuay · 2021-06-14T09:16:51Z

Currently no opinion since I'm not too familiar with the implementation. cc @hashseed

hashseed · 2021-06-14T09:28:39Z

Intuitively I would also expect adding ignoreCase to match more, not less. That's just my personal opinion though.

In the current implementation in V8, iirc, we apply case folding close over for individual \p, which of course means that with inversion we end up excluding more. The solution would be to wait with closing over for case folding after the character class has been fully defined.

markusicu · 2021-06-17T17:11:14Z

Discussed in today's meeting with @mathiasbynens @macchiati @sffc --

We agreed that the behavior of /[^\P{Ll}]/giu is unintuitive/surprising, and that /[^\P{Ll}]/giv (note the v flag) should behave the same as /\p{Ll}/giu and /\p{Ll}/giv.

Therefore, we propose to make v behavior of IgnoreCase different from u behavior. Specifically, under v a CharacterClass always (in both a top-level class and a nested class) resolves the ^ by complementing the CharSet (just like a CharacterClassEscape with an uppercase letter), and it always keeps invert==false.

With this example:

"aAbBcC4#".replaceAll(/\p{Ll}/giu, 'X') outputs "XXXXXX4#"
"aAbBcC4#".replaceAll(/[^\P{Ll}]/giu, 'X') outputs "aAbBcC4#"
"aAbBcC4#".replaceAll(/\p{Ll}/giv, 'X') outputs "XXXXXX4#"
"aAbBcC4#".replaceAll(/[^\P{Ll}]/giv, 'X') outputs "XXXXXX4#"

More generally, this makes [^\p{X}] == [\P{X}] == \P{X} and [^\P{X}] == [\p{X}] == \p{X}, whether IgnoreCase or not.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

waldemarhorwat · 2021-06-29T00:18:00Z

The invert flag is there for a reason. Doing the above change would break fundamental regexps:

"aAbBcC".replaceAll(/[b]/giu, "X") outputs "aAXXcC"
"aAbBcC".replaceAll(/[^b]/giu, "X") outputs "XXbBXX"
"aAbBcC".replaceAll(/[b]/giv, "X") outputs "aAXXcC"
"aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

markusicu · 2021-07-08T17:10:23Z

"aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

[^b] includes B so it makes sense, under IgnoreCase, for this class to match both b and B.
Same as [\p{L}--b]: Includes B so should IgnoreCase-match both b and B.
Or any other class that excludes some but not all IgnoreCase kinds of "b".
[^b] should be exactly the same as [[\0-\u{10FFFF}]--b] and [[^b]] etc.
Different ways of arriving at the same class should behave consistently.

aheninger · 2021-07-11T23:03:56Z

The model I had in my head for ICU case insensitive regex matching is that it should produce the same results that you would get by case-folding both the input text and the pattern, and then doing a regular match with those.

Folding the pattern is tricky, though. Transform it by

folding literals immediately
folding ranges after establishing an unfolded start and end.
folding positive properties \p{X} into an equivalent set expression as soon as they are encountered.
negated properties, \P{X}: fetch the positive property, fold, then negate, yielding a set expression.

Then process the transformed pattern normally (case sensitive, no more folding, no case closing).

ICU follows this behavior, with the the fact that the actual implementation uses case-closed sets and case-insensitive string matching being an optimization.

Taking the example from above,

"aAbBcC".replaceAll(/[^b]/giv, "X")

The transformed pattern and input string become

"aabbcc".replaceAll(/[^b]/gv, "X") which should output XXbbXX

The case-sensitive matching of [^b] would also match a B, but that will never occur because the input string is folded.

One twist in ICU's implementation is that sets use simple case folding/closure, while string literals in the pattern match with full Unicode casing.

markusicu · 2021-07-12T00:04:46Z

Thanks, Andy, that's useful. It would replace the separate invert flag with early case closure of the set. More expensive while parsing the pattern, less while matching. And if we use the same early closure (before complement) for \P{X} outside of character classes, then everything should be consistent. It would be a change to the behavior of those, of course.

It still seems weird that [^b] looks like a set that should match B but under IgnoreCase does not match either b or B. Waldemar seems to like it that way.

So first we should decide what the various expressions mentioned here should match. I hope we can agree on equivalent expressions matching equivalently. I think that means getting rid of the invert flag one way or the other.

@aheninger What exactly do you mean with “full Unicode casing”? Full string case folding, where "ß" becomes "ss"? If so, do you do that on the fly while matching, or do you full-case-fold the whole input string and maintain a mapping from folded-string offsets back to input-string offsets? How do you handle partial matches, like one pattern s against one half of an input string "ß"?

markusicu · 2021-07-12T00:26:19Z

Note: When we case-fold a class/set for IgnoreCase, we could add the full-case-folding strings, like [ß] → [ß(ss)].
However, we said we will forbid the complement of a set with strings, and we will validate early using property metadata, not actual property data.
It would probably surprise people that /[^ß]/iv should throw a SyntaxError, and the validation logic would need access to the full-case-folding mappings (or at least the set of characters that full-case-fold to multi-character strings).
The validation logic would also need property data; for example, \p{Ll} which contains ß.
It could also make an expression that works with one Unicode version fail with the next one -- when a code point that was unassigned is assigned a character with a multi-character case folding.

So best to not do that.

markusicu · 2021-07-15T18:45:29Z

Note: If we wanted to do early case closure to emulate the current behavior of /[^b]/iu without the buggy invert boolean, that would be like the existing behavior of /\w/iu which also does early case closure of \w. I consider it a defect in the current spec that there is no early case closure of other CharacterClassEscapes.

markusicu · 2021-07-15T20:25:17Z

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

consistent behavior as in our earlier proposal in IgnoreCase vs. complement vs. nested class #30 (comment)
deep, early case closure replacing the invert boolean as Andy described in IgnoreCase vs. complement vs. nested class #30 (comment)

Ok?

There is still an open question about Andy's approach of full case folding for matching of substrings (not code points in CharSets), see #30 (comment)

mathiasbynens · 2021-07-16T08:09:11Z

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

consistent behavior as in our earlier proposal in IgnoreCase vs. complement vs. nested class #30 (comment)

deep, early case closure replacing the invert boolean as Andy described in IgnoreCase vs. complement vs. nested class #30 (comment)

Ok?

SGTM

mathiasbynens · 2021-08-31T18:21:54Z

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something. Waldemar, do you have more details?

markusicu · 2021-08-31T18:58:04Z

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something.

Yes, I am baffled.

Current behavior of /[^x]/iu: (see CharacterSetMatcher)

set contains only 'x'
invert = true
matching:
- step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
- step i: there is a member a='x' in the set with a==cc, so found=true
- step k: invert==true && found==true: no match
if the text is 't' then found=false but invert==true: yes match

Proposed behavior of /[^x]/iv using deep case closure:

set contains 'x'
case closure adds 'X' because Canonicalize('X')=='x'
complement: set contains all characters except 'x' and 'X'
invert = false (always)
matching:
- step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
- step i: there is not a member a in the set with a==cc, so found=false
- step j: invert==false && found==false: no match
if the text is 't' then found=true but invert==false: yes match

That is, the behavior for any expression /[^...]/i should be equivalent.

What we are fixing is cases like /\p{Ll}/giu (all letters) ≠ /[^\P{Ll}]/giu (no letters)

waldemarhorwat · 2021-09-01T15:57:16Z

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

markusicu · 2021-09-01T16:58:50Z

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

We addressed those. The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

markusicu · 2021-09-02T17:34:46Z

I spelled out the proposed algorithm in this issue's description (text at the top).
PTAL
@macchiati @aheninger @mathiasbynens @gibson042 @bmeck @waldemarhorwat @sffc @schuay @msaboff @hashseed

sffc · 2021-09-08T00:28:41Z

The algorithm sounds good and well-defined; can you provide a few example regexes and what the behavior of those regexes would be with your proposed algorithm?

markusicu · 2021-09-09T16:49:04Z

Met today: Waldemar, Richard, Mark, Markus.
Waldemar and Richard agree that this proposal is good.
(After a small amendment to make the case closure symmetric. Case closure needs to add Kelvin for k, and k for Kelvin.)

markusicu · 2021-09-24T01:56:04Z

I updated the draft spec changes with the agreed algorithm, modulo inadvertent bugs. Review would be good. (If you have edit access to the doc, see the version history.)

markusicu mentioned this issue Jun 17, 2021

Flag implication: v implies u? #23

Closed

markusicu closed this as completed Sep 24, 2021

slevithan mentioned this issue Jul 29, 2024

Breaking: Throw for [^\p{…}] with flag i in environments without native flag v slevithan/regex#10

Closed

NWilson mentioned this issue Sep 26, 2024

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} PCRE2Project/pcre2#497

Open

JLHwung mentioned this issue Oct 3, 2024

Apply case expansion to unicode property escapes within i modifier mathiasbynens/regexpu-core#97

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IgnoreCase vs. complement vs. nested class #30

IgnoreCase vs. complement vs. nested class #30

markusicu commented May 27, 2021 •

edited

Loading

macchiati commented Jun 3, 2021

aheninger commented Jun 3, 2021

markusicu commented Jun 4, 2021

macchiati commented Jun 4, 2021 via email •

edited by markusicu

Loading

mathiasbynens commented Jun 10, 2021

schuay commented Jun 14, 2021

hashseed commented Jun 14, 2021

markusicu commented Jun 17, 2021 •

edited

Loading

waldemarhorwat commented Jun 29, 2021

markusicu commented Jul 8, 2021 •

edited by mathiasbynens

Loading

aheninger commented Jul 11, 2021

markusicu commented Jul 12, 2021

markusicu commented Jul 12, 2021

markusicu commented Jul 15, 2021

markusicu commented Jul 15, 2021

mathiasbynens commented Jul 16, 2021

mathiasbynens commented Aug 31, 2021

markusicu commented Aug 31, 2021

waldemarhorwat commented Sep 1, 2021

markusicu commented Sep 1, 2021

markusicu commented Sep 2, 2021

sffc commented Sep 8, 2021

markusicu commented Sep 9, 2021

markusicu commented Sep 24, 2021

IgnoreCase vs. complement vs. nested class #30

IgnoreCase vs. complement vs. nested class #30

Comments

markusicu commented May 27, 2021 • edited Loading

Proposal

Problem description

macchiati commented Jun 3, 2021

aheninger commented Jun 3, 2021

markusicu commented Jun 4, 2021

macchiati commented Jun 4, 2021 via email • edited by markusicu Loading

mathiasbynens commented Jun 10, 2021

schuay commented Jun 14, 2021

hashseed commented Jun 14, 2021

markusicu commented Jun 17, 2021 • edited Loading

waldemarhorwat commented Jun 29, 2021

markusicu commented Jul 8, 2021 • edited by mathiasbynens Loading

aheninger commented Jul 11, 2021

markusicu commented Jul 12, 2021

markusicu commented Jul 12, 2021

markusicu commented Jul 15, 2021

markusicu commented Jul 15, 2021

mathiasbynens commented Jul 16, 2021

mathiasbynens commented Aug 31, 2021

markusicu commented Aug 31, 2021

waldemarhorwat commented Sep 1, 2021

markusicu commented Sep 1, 2021

markusicu commented Sep 2, 2021

sffc commented Sep 8, 2021

markusicu commented Sep 9, 2021

markusicu commented Sep 24, 2021

markusicu commented May 27, 2021 •

edited

Loading

macchiati commented Jun 4, 2021 via email •

edited by markusicu

Loading

markusicu commented Jun 17, 2021 •

edited

Loading

markusicu commented Jul 8, 2021 •

edited by mathiasbynens

Loading