Skip to content
This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

IgnoreCase vs. complement vs. nested class #30

Closed
markusicu opened this issue May 27, 2021 · 25 comments
Closed

IgnoreCase vs. complement vs. nested class #30

markusicu opened this issue May 27, 2021 · 25 comments

Comments

@markusicu
Copy link
Collaborator

markusicu commented May 27, 2021

Proposal

The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

More specifically:

Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)

  • For each single code point c in A, add every other code point d where scf(d)==c or where scf(c)==d

When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:

  • literal character c: create a CharSet A with just c and return SimpleCaseClosure(A)
  • range a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
    • Note: The result will often consist of two or more ranges. SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
  • \p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
  • \P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
  • same with \w \W \s \S etc.: look up the property CharSet, compute the case closure, return the code point complement for backslash-uppercase escapes
  • set operations work as usual
  • [^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false

Problem description

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.

Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.

In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, and discussed doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.

Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).

Looking at the existing spec:

  • The spec already allows one level of nested classes in the form of CharacterClassEscapes inside CharacterClasses.
  • The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
  • The complement of a CharacterClass ([^ ClassRanges ]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.
  • When IgnoreCase=true, then Canonicalize (which the CharacterSetMatcher invokes) applies the Unicode Simple_Case_Folding (scf) before applying the "invert" complement.

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.

Example:

re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu

Naïvely, I expected these to behave the same. Actual results:

  1. "aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
    • The CharSet contains all lowercase letters.
    • invert=false
    • For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n).
    • No match for digits, punctuation, etc.
  2. "aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
    • The CharSet contains all Unicode code points other than lowercase letters: Uppercase/titlecase/other letters, digits, symbols, punctuation, unassigned, etc.
    • invert=true
    • For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n), but "invert" negates "found".
    • No match for any letters.
    • No match for digits/punctuation/etc. either because they are in the CharSet directly and "invert" negates "found".
    • In other words, under IgnoreCase \P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ].

We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.

@macchiati
Copy link
Collaborator

I propose that we adopt a uniform approach to case insensitively, where:

if C is a character class, and "a" represents a character that has case variants, and "1" represents a character that has no case variants, then:

  1. if /C/ matches "a", then /C/i matches "a" and all case variants of "a"

  2. if /C/ doesn't match "a" or any case variant of "a", then /C/i doesn't match "a" or any case variant of "a"

  3. if /C/ matches "1", then /C/i matches "1"

  4. if /C/ doesn't match "1", then /C/i doesn't match "1"


And a uniform approach to complement, where [^[^X]] == X, and [^\p{X}] == [\P{X}]

@aheninger
Copy link

In the earlier email thread on this question I had looked into the behavior of other regex implementations. Copying those results here, for the record ...

RE                /[^ab]/i    /[[a-m]&&[F-T]]/i   /[[a-m]--[F-T]]/i
-------------------------------------------------------------------
Java              deep        deep                na
ICU / Android     deep        deep                deep
Python RE         deep        na (future)         na
Python regex      deep        deep                deep
.NET              deep        na                  deep
Rust regex        deep        deep                deep
Perl              deep        deep                deep 
Oniguruma / Ruby  deep        shallow             na

ECMAScript        deep        na                  na
PCRE / PHP        deep        na                  na
RE2 / Go          deep        na                  na
grep ERE          deep        na                  nz

Here are the expected results for the patterns shown and the test strings "abcdefghi" "ABCDEFGHI". Matched letters are shown as themselves; non-matches as ".".The logic for what matches in each case is exactly as Markus described in his previous email.

Pattern                Deep Case Close       Shallow Case Close
----------------------------------------------------------------
/[^ab]/i               ..cdefghi ..CDEFGHI   abcdefghi ABCDEFGHI
/[[a-m]&&[F-T]]/i      .....fghi .....FGHI   ......... .........
/[[a-m]--[F-T]]/i      abcde.... ABCDE....   abcdefghi ABCDEFGHI

Notes:

Perl's support for set expression is experimental, and subject to change. I missed it in the previous iteration. See https://perldoc.perl.org/perlrecharclass#Extended-Bracketed-Character-Classes
It uses (?[ ... ])to delimit set expressions with a new, extended syntax.

Symmetric difference (exclusive OR) is supported by Perl, Python and Rust. '^' for Perl, '~' for the others.

PCRE set expression syntax.

.NET character class subtraction.

Rust character class syntax

@markusicu
Copy link
Collaborator Author

Thanks, Mark & Andy.

These discussions about desired and observed outcomes are useful, but now that we are at stage 2, what I am most looking for is concrete proposals for how to change the ECMAScript spec, together with whether & how that changes u flag behavior, and what the resulting behavior looks like for v expressions that are the same / similar / use features not available in u.

For example, if a proposal involves changing the current semantics of CharSet+invert for a CharacterClass vs. CharSet-only for a CharacterClassEscape, then we need to spell that out, and need to say whether we want to change semantics for existing code, or else document that the same expression will behave differently depending on the v flag.

@macchiati
Copy link
Collaborator

macchiati commented Jun 4, 2021 via email

@mathiasbynens
Copy link
Member

@schuay @msaboff, any opinions or preferences from an implementation point of view?

@schuay
Copy link

schuay commented Jun 14, 2021

Currently no opinion since I'm not too familiar with the implementation. cc @hashseed

@hashseed
Copy link

Intuitively I would also expect adding ignoreCase to match more, not less. That's just my personal opinion though.

In the current implementation in V8, iirc, we apply case folding close over for individual \p, which of course means that with inversion we end up excluding more. The solution would be to wait with closing over for case folding after the character class has been fully defined.

@markusicu
Copy link
Collaborator Author

markusicu commented Jun 17, 2021

Discussed in today's meeting with @mathiasbynens @macchiati @sffc --

We agreed that the behavior of /[^\P{Ll}]/giu is unintuitive/surprising, and that /[^\P{Ll}]/giv (note the v flag) should behave the same as /\p{Ll}/giu and /\p{Ll}/giv.

Therefore, we propose to make v behavior of IgnoreCase different from u behavior. Specifically, under v a CharacterClass always (in both a top-level class and a nested class) resolves the ^ by complementing the CharSet (just like a CharacterClassEscape with an uppercase letter), and it always keeps invert==false.

With this example:

  • "aAbBcC4#".replaceAll(/\p{Ll}/giu, 'X') outputs "XXXXXX4#"
  • "aAbBcC4#".replaceAll(/[^\P{Ll}]/giu, 'X') outputs "aAbBcC4#"
  • "aAbBcC4#".replaceAll(/\p{Ll}/giv, 'X') outputs "XXXXXX4#"
  • "aAbBcC4#".replaceAll(/[^\P{Ll}]/giv, 'X') outputs "XXXXXX4#"

More generally, this makes [^\p{X}] == [\P{X}] == \P{X} and [^\P{X}] == [\p{X}] == \p{X}, whether IgnoreCase or not.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

@waldemarhorwat
Copy link

The invert flag is there for a reason. Doing the above change would break fundamental regexps:

  • "aAbBcC".replaceAll(/[b]/giu, "X") outputs "aAXXcC"
  • "aAbBcC".replaceAll(/[^b]/giu, "X") outputs "XXbBXX"
  • "aAbBcC".replaceAll(/[b]/giv, "X") outputs "aAXXcC"
  • "aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

@markusicu
Copy link
Collaborator Author

markusicu commented Jul 8, 2021

  • "aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

[^b] includes B so it makes sense, under IgnoreCase, for this class to match both b and B.
Same as [\p{L}--b]: Includes B so should IgnoreCase-match both b and B.
Or any other class that excludes some but not all IgnoreCase kinds of "b".
[^b] should be exactly the same as [[\0-\u{10FFFF}]--b] and [[^b]] etc.
Different ways of arriving at the same class should behave consistently.

@aheninger
Copy link

The model I had in my head for ICU case insensitive regex matching is that it should produce the same results that you would get by case-folding both the input text and the pattern, and then doing a regular match with those.

Folding the pattern is tricky, though. Transform it by

  • folding literals immediately
  • folding ranges after establishing an unfolded start and end.
  • folding positive properties \p{X} into an equivalent set expression as soon as they are encountered.
  • negated properties, \P{X}: fetch the positive property, fold, then negate, yielding a set expression.

Then process the transformed pattern normally (case sensitive, no more folding, no case closing).

ICU follows this behavior, with the the fact that the actual implementation uses case-closed sets and case-insensitive string matching being an optimization.

Taking the example from above,

"aAbBcC".replaceAll(/[^b]/giv, "X")

The transformed pattern and input string become

"aabbcc".replaceAll(/[^b]/gv, "X") which should output XXbbXX

The case-sensitive matching of [^b] would also match a B, but that will never occur because the input string is folded.

One twist in ICU's implementation is that sets use simple case folding/closure, while string literals in the pattern match with full Unicode casing.

@markusicu
Copy link
Collaborator Author

Thanks, Andy, that's useful. It would replace the separate invert flag with early case closure of the set. More expensive while parsing the pattern, less while matching. And if we use the same early closure (before complement) for \P{X} outside of character classes, then everything should be consistent. It would be a change to the behavior of those, of course.

It still seems weird that [^b] looks like a set that should match B but under IgnoreCase does not match either b or B. Waldemar seems to like it that way.

So first we should decide what the various expressions mentioned here should match. I hope we can agree on equivalent expressions matching equivalently. I think that means getting rid of the invert flag one way or the other.

@aheninger What exactly do you mean with “full Unicode casing”? Full string case folding, where "ß" becomes "ss"? If so, do you do that on the fly while matching, or do you full-case-fold the whole input string and maintain a mapping from folded-string offsets back to input-string offsets? How do you handle partial matches, like one pattern s against one half of an input string "ß"?

@markusicu
Copy link
Collaborator Author

Note: When we case-fold a class/set for IgnoreCase, we could add the full-case-folding strings, like [ß][ß(ss)].
However, we said we will forbid the complement of a set with strings, and we will validate early using property metadata, not actual property data.
It would probably surprise people that /[^ß]/iv should throw a SyntaxError, and the validation logic would need access to the full-case-folding mappings (or at least the set of characters that full-case-fold to multi-character strings).
The validation logic would also need property data; for example, \p{Ll} which contains ß.
It could also make an expression that works with one Unicode version fail with the next one -- when a code point that was unassigned is assigned a character with a multi-character case folding.

So best to not do that.

@markusicu
Copy link
Collaborator Author

Note: If we wanted to do early case closure to emulate the current behavior of /[^b]/iu without the buggy invert boolean, that would be like the existing behavior of /\w/iu which also does early case closure of \w. I consider it a defect in the current spec that there is no early case closure of other CharacterClassEscapes.

@markusicu
Copy link
Collaborator Author

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

Ok?

There is still an open question about Andy's approach of full case folding for matching of substrings (not code points in CharSets), see #30 (comment)

@mathiasbynens
Copy link
Member

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

Ok?

SGTM

@mathiasbynens
Copy link
Member

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something. Waldemar, do you have more details?

@markusicu
Copy link
Collaborator Author

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something.

Yes, I am baffled.

Current behavior of /[^x]/iu: (see CharacterSetMatcher)

  • set contains only 'x'
  • invert = true
  • matching:
    • step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
    • step i: there is a member a='x' in the set with a==cc, so found=true
    • step k: invert==true && found==true: no match
  • if the text is 't' then found=false but invert==true: yes match

Proposed behavior of /[^x]/iv using deep case closure:

  • set contains 'x'
  • case closure adds 'X' because Canonicalize('X')=='x'
  • complement: set contains all characters except 'x' and 'X'
  • invert = false (always)
  • matching:
    • step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
    • step i: there is not a member a in the set with a==cc, so found=false
    • step j: invert==false && found==false: no match
  • if the text is 't' then found=true but invert==false: yes match

That is, the behavior for any expression /[^...]/i should be equivalent.

What we are fixing is cases like /\p{Ll}/giu (all letters) ≠ /[^\P{Ll}]/giu (no letters)

@waldemarhorwat
Copy link

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

@markusicu
Copy link
Collaborator Author

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

We addressed those. The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

@markusicu
Copy link
Collaborator Author

I spelled out the proposed algorithm in this issue's description (text at the top).
PTAL
@macchiati @aheninger @mathiasbynens @gibson042 @bmeck @waldemarhorwat @sffc @schuay @msaboff @hashseed

@sffc
Copy link
Collaborator

sffc commented Sep 8, 2021

The algorithm sounds good and well-defined; can you provide a few example regexes and what the behavior of those regexes would be with your proposed algorithm?

@markusicu
Copy link
Collaborator Author

Met today: Waldemar, Richard, Mark, Markus.
Waldemar and Richard agree that this proposal is good.
(After a small amendment to make the case closure symmetric. Case closure needs to add Kelvin for k, and k for Kelvin.)

@markusicu
Copy link
Collaborator Author

I updated the draft spec changes with the agreed algorithm, modulo inadvertent bugs. Review would be good. (If you have edit access to the doc, see the version history.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants