IgnoreCase vs. complement vs. nested class #30
Comments
I propose that we adopt a uniform approach to case insensitivity, where: if C is a character class, and "a" represents a character that has case variants, and "1" represents a character that has no case variants, then:
And a uniform approach to complement, where [^[^X]] == X, and [^\p{X}] == [\P{X}] |
In the earlier email thread on this question I had looked into the behavior of other regex implementations. Copying those results here, for the record ...
Here are the expected results for the patterns shown and the test strings
Notes: Perl's support for set expressions is experimental and subject to change. I missed it in the previous iteration. See https://perldoc.perl.org/perlrecharclass#Extended-Bracketed-Character-Classes Symmetric difference (exclusive OR) is supported by Perl, Python and Rust: '^' for Perl, '~' for the others. |
Thanks, Mark & Andy. These discussions about desired and observed outcomes are useful, but now that we are at stage 2, what I am most looking for is concrete proposals for how to change the ECMAScript spec, together with whether & how that changes For example, if a proposal involves changing the current semantics of CharSet+invert for a CharacterClass vs. CharSet-only for a CharacterClassEscape, then we need to spell that out, and need to say whether we want to change semantics for existing code, or else document that the same expression will behave differently depending on the |
I suggest that we only propose a change for v. That maintains complete backwards compatibility. If the committee then wants to extend the changes to non-v, it's fairly easy to do.
I'd be glad to compare what it looks like for the simple option that I outlined, but that would be after I get back from vacation, so starting around the 14th.
|
Currently no opinion since I'm not too familiar with the implementation. cc @hashseed |
Intuitively I would also expect adding In the current implementation in V8, iirc, we apply case folding close over for individual |
Discussed in today's meeting with @mathiasbynens @macchiati @sffc -- We agreed that the behavior of Therefore, we propose to make With this example:
More generally, this makes Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff |
The invert flag is there for a reason. Doing the above change would break fundamental regexps:
|
|
The model I had in my head for ICU case insensitive regex matching is that it should produce the same results that you would get by case-folding both the input text and the pattern, and then doing a regular match with those. Folding the pattern is tricky, though. Transform it by
Then process the transformed pattern normally (case sensitive, no more folding, no case closing). ICU follows this behavior; the fact that the actual implementation uses case-closed sets and case-insensitive string matching is an optimization. Taking the example from above,
The transformed pattern and input string become
The case-sensitive matching of One twist in ICU's implementation is that sets use simple case folding/closure, while string literals in the pattern match with full Unicode casing. |
Thanks, Andy, that's useful. It would replace the separate invert flag with early case closure of the set. More expensive while parsing the pattern, less while matching. And if we use the same early closure (before complement) for It still seems weird that So first we should decide what the various expressions mentioned here should match. I hope we can agree on equivalent expressions matching equivalently. I think that means getting rid of the invert flag one way or the other. @aheninger What exactly do you mean with “full Unicode casing”? Full string case folding, where "ß" becomes "ss"? If so, do you do that on the fly while matching, or do you full-case-fold the whole input string and maintain a mapping from folded-string offsets back to input-string offsets? How do you handle partial matches, like one pattern |
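The difference between simple and full case handling that the question above turns on can be seen with a small sketch. It only uses standard JavaScript behavior and does not model ICU; the variable names are illustrative.

```javascript
// With the i and u flags, ECMAScript compares code points after simple case
// folding. Simple folding maps one code point to one code point, so a single
// "ß" cannot match the two-character string "ss".
const simpleMatch = /^ß$/iu.test('ss');

// Full case mappings may change the length of a string: the default
// upper-casing of "ß" is the two-character string "SS".
const upper = 'ß'.toUpperCase();

console.log(simpleMatch, upper, upper.length);
```

A matcher that used full case folding for string literals would therefore have to deal with one pattern character corresponding to several input characters, which is the offset-mapping concern raised in the comment.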
Note: When we case-fold a class/set for IgnoreCase, we could add the full-case-folding strings, like So best to not do that. |
Note: If we wanted to do early case closure to emulate the current behavior of |
Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:
Ok? There is still an open question about Andy's approach of full case folding for matching of substrings (not code points in CharSets), see #30 (comment) |
SGTM |
During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like |
Yes, I am baffled. Current behavior of
Proposed behavior of
That is, the behavior for any expression What we are fixing is cases like |
I was referring to #30 (comment) and #30 (comment) above. |
We addressed those. The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding) |
I spelled out the proposed algorithm in this issue's description (text at the top). |
The algorithm sounds good and well-defined; can you provide a few example regexes and what the behavior of those regexes would be with your proposed algorithm? |
Met today: Waldemar, Richard, Mark, Markus. |
I updated the draft spec changes with the agreed algorithm, modulo inadvertent bugs. Review would be good. (If you have edit access to the doc, see the version history.) |
Proposal
The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)
More specifically:
Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)
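As a rough way to explore what such a closure contains, here is a minimal sketch. It assumes that SimpleCaseClosure(A) adds every code point whose simple case folding equals that of some member of A, and it borrows the engine's existing `iu` matching of a non-inverted class as a stand-in for scf. The helper name `simpleCaseClosure` and its `maxCodePoint` parameter are hypothetical, and the search is limited to the BMP to keep it fast.

```javascript
// Hypothetical helper: approximates SimpleCaseClosure for a list of
// [lo, hi] code point ranges by asking the engine which code points a
// non-inverted /iu class built from those ranges would match.
function simpleCaseClosure(ranges, maxCodePoint = 0xffff) {
  const body = ranges
    .map(([lo, hi]) => `\\u{${lo.toString(16)}}-\\u{${hi.toString(16)}}`)
    .join('');
  const re = new RegExp(`^[${body}]$`, 'iu');
  const closed = [];
  for (let cp = 0; cp <= maxCodePoint; cp++) {
    if (re.test(String.fromCodePoint(cp))) closed.push(cp);
  }
  return closed;
}

// Closure of [a-z], restricted to the BMP.
const closed = simpleCaseClosure([[0x61, 0x7a]]);
const has = (cp) => closed.includes(cp);

// "A", LATIN SMALL LETTER LONG S (U+017F) and KELVIN SIGN (U+212A) fold to
// lowercase ASCII letters; the digit "1" has no case variants.
console.log(has(0x41), has(0x17f), has(0x212a), has(0x31));
```

This is only an exploration aid: it leans on the current non-inverted matching behavior rather than on a table of Simple_Case_Folding mappings.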
When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:
c: create a CharSet A with just c and return SimpleCaseClosure(A)
a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
\p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
\P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
\w \W \s \S etc.: look up the property CharSet, compute the case closure, return the code point complement for backslash-uppercase escapes
[^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false
Problem description
The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.
Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.
In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, and discussed doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.
Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).
Looking at the existing spec:
The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
The complement of a CharacterClass ([^ClassRanges]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.
In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.
Example:
Naïvely, I expected these to behave the same. Actual results:
"aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
"aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
\P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.
In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ClassRanges].
We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.
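To make the option of early case closure plus a real code point complement concrete, here is a toy model. It is a sketch only: the universe is ASCII, `toLowerCase` stands in for scf, and the names (`closure`, `complement`, `replaceAllModel`) are hypothetical rather than taken from the spec.

```javascript
// Toy universe: the 128 ASCII code points.
const UNIVERSE = Array.from({ length: 128 }, (_, cp) => String.fromCodePoint(cp));
const scf = (ch) => ch.toLowerCase();        // stand-in for scf on ASCII
const isLl = (ch) => ch >= 'a' && ch <= 'z'; // Ll restricted to ASCII

// Early case closure: add every character that folds like some member.
const closure = (set) =>
  new Set(UNIVERSE.filter((ch) => [...set].some((m) => scf(m) === scf(ch))));

// Code point complement within the toy universe; no deferred invert flag.
const complement = (set) => new Set(UNIVERSE.filter((ch) => !set.has(ch)));

const Ll = new Set(UNIVERSE.filter(isLl));
const pLl = closure(Ll);                   // \p{Ll}: closure of the property set
const upperPLl = complement(closure(Ll));  // \P{Ll}: close first, then complement
const negated = complement(upperPLl);      // [^\P{Ll}]: complement the CharSet itself

// Because the sets are already closed, matching is plain membership.
const replaceAllModel = (input, set) =>
  [...input].map((ch) => (set.has(ch) ? 'X' : ch)).join('');

const sameSets = pLl.size === negated.size && [...pLl].every((ch) => negated.has(ch));

console.log(replaceAllModel('aAbBcC4#', pLl));     // XXXXXX4#
console.log(replaceAllModel('aAbBcC4#', negated)); // XXXXXX4#
console.log(sameSets);                             // true
```

In this model a double complement returns the original set, so equivalent expressions match equivalently; the cost is that the result for a complemented class differs from what the deferred invert flag produces today.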