Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525
</tbody>
</table>
</figure>
1. If _Unicode_ is *true*, then
We don't need to restrict to "Unicode is true", since the following steps will add no character when Unicode is false.
I thought about eliminating the Unicode check, however it shows that the change is there for Unicode Word character matching.
I think that this should be expressed by the "Assert" I'm proposing below, instead of by adding an explicit check. The relevant distinction between "Unicode" and "non-Unicode" is already done in the Canonicalize() abstract operation; following the DRY principle, it is better not to repeat that distinction.
(You should have said: "Fixes #512" in your comment, in order to autolink that Pull Request with Issue 512 (the original Issue).)
I'm fine with eliminating the If Unicode check and adding the Assert.
This is going to cause a difference between what Rather than expanding We can then redefine |
@goyakin There was a proposal to make |
@mathiasbynens I wasn't aware of that. Seems like the reason was backward compatibility. Since this PR is a breaking change in that regard, if we're going down this route, I think we should go ahead and accept all alphanumeric characters instead. If we want to keep backward compatibility though, it might be better to disallow canonicalizing characters |
While I'm not entirely happy with this proposal, I don't see a backward compatibility issue if this is gated by /u, given that /u has not been around for very long yet anyway.
To be clear, there was already an issue before this change, i.e.:
Which gotcha is the worst one, I don’t know. It is possible to also modify |
I feel like modifying \b would be the right choice here. It's a vastly smaller change than allowing all Unicode characters, right?
I'm fine with also modifying |
We have consensus on this PR with the addition aligning \b and \B. @msaboff can you update this PR with the appropriate changes to \b?
@mathiasbynens thanks for raising this issue!
I gave the matching of \b and \B a read and it seems to me that they are currently not actually affected by case-insensitive canonicalization at all:
And
So contrary to what was believed in yesterday's meeting, \b and \B actually work fine right now, even though /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\W)(?=\w))/ui (if we actually had lookbehind assertions). The Kelvin sign and sharp-S are not considered word characters wrt \b and \B. At the very least, the existing definition of \b and \B would have to be rewritten to be coupled to \w and \W in the first place. Or do you simply want to add sharp-S and Kelvin sign to the list for IsWordChar for /ui regexps?
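The decoupling described above can be illustrated with a minimal sketch of the position test behind \b. The helper names `isWordCharAt` and `isWordBoundary` and the explicit word-set parameter are assumptions made for illustration, not spec text. With the fixed 63-character list, U+017F (the character the thread calls sharp-S) never produces a boundary, whatever the flags; it only does so once the set used by the assertion is extended.

```javascript
// The 63 basic word characters: a-z, A-Z, 0-9 and _.
const BASIC_WORD_CHARS = new Set(
  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_'
);

// Hypothetical helper: is there a word character at index e?
// Positions outside the input count as non-word, as for \b at the edges.
function isWordCharAt(input, e, wordChars) {
  if (e < 0 || e >= input.length) return false;
  return wordChars.has(input[e]);
}

// \b holds at position e when exactly one side is a word character.
function isWordBoundary(input, e, wordChars = BASIC_WORD_CHARS) {
  return isWordCharAt(input, e - 1, wordChars) !== isWordCharAt(input, e, wordChars);
}

// Assumed extended set: what \b would consult if it were coupled to \w under /ui.
const EXTENDED_WORD_CHARS = new Set([...BASIC_WORD_CHARS, '\u017F', '\u212A']);

const withFixedList = isWordBoundary('\u017F', 0);                        // false
const withCoupledSet = isWordBoundary('\u017F', 0, EXTENDED_WORD_CHARS);  // true
```

Passing the set as a parameter is only a device for comparing the two definitions side by side.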
@hashseed imo the goal is to make the following true, in all modes:
Rephrased, if Kelvin and sharp-S are matched by (someone please correct me if I've misstated the relationship between |
@ljharb \b and \B are zero-length assertions, they don’t match characters.
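A short illustration of that point, using only standard RegExp and String methods: \b succeeds at positions between characters and consumes nothing, whereas \w consumes one character.

```javascript
const sample = 'a b';

// Insert a marker at every position where \b holds.
const marked = sample.replace(/\b/g, '|');        // '|a| |b|'

// A successful \b match is the empty string.
const boundaryMatch = /\b/.exec(sample);
const boundaryLength = boundaryMatch[0].length;   // 0

// By contrast, \w consumes a character.
const consumed = /\w/.exec(sample)[0];            // 'a'
```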
I’m confused, because I thought the issue was precisely that /\b/ui is not equivalent to /(?:(?<=\w)(?=\W))|(?:(?<=\W)(?=\w))/ui. If not, what was the issue supposed to be?
@claudepache lol thank you, i was sure i was going to get that wrong. hopefully someone can come in and state it better than i. Suffice to say, within the same mode, |
From our discussion yesterday, for /ui patterns, we agreed that small sharp-S and kelvin ARE word characters. For all other flags, they are not word characters.
@claudepache that's what I meant. \w and \b are not consistent currently for /ui.
I updated the spec to reflect what we agreed on at the May meeting. I created a common abstract operation, WordCharacters, and changed the spec so that both the word asserts and the \w and \W CharacterClassEscapes use it.
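A rough sketch of how such a shared operation could look: start from the 63 basic characters and add every other character whose canonical form lands in that set. This is not the spec text. The `canonicalize` function below is a heavily reduced, hypothetical stand-in for the Canonicalize() abstract operation; its two-entry fold table covers only the characters discussed in this thread, and the candidate list is an assumption standing in for a scan of all characters.

```javascript
const BASIC = new Set(
  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_'
);

// Assumption: a tiny excerpt of simple case folding, enough for this example.
const SIMPLE_FOLD_EXCERPT = new Map([['\u017F', 's'], ['\u212A', 'k']]);

// Hypothetical, reduced stand-in for Canonicalize().
function canonicalize(ch, { unicode, ignoreCase }) {
  if (!ignoreCase) return ch;
  if (unicode) {
    // Approximates simple case folding for the characters used here.
    return SIMPLE_FOLD_EXCERPT.get(ch) ?? ch.toLowerCase();
  }
  // Non-Unicode mode: uppercase, but never map non-ASCII onto ASCII.
  const upper = ch.toUpperCase();
  if (upper.length !== 1) return ch;
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;
  return upper;
}

// Sketch of the shared set: basic characters plus anything that
// canonicalizes into them under the given flags.
function wordCharacters(flags, candidates) {
  const result = new Set(BASIC);
  for (const ch of candidates) {
    if (!BASIC.has(ch) && BASIC.has(canonicalize(ch, flags))) result.add(ch);
  }
  return result;
}

// Assumed candidate list; a real implementation would consider every character.
const CANDIDATES = ['\u017F', '\u212A', '\u00DF', '\u00E9'];

const underUI = wordCharacters({ unicode: true, ignoreCase: true }, CANDIDATES);
const underI = wordCharacters({ unicode: false, ignoreCase: true }, CANDIDATES);
const underU = wordCharacters({ unicode: true, ignoreCase: false }, CANDIDATES);
```

In this sketch only the combination of both flags adds characters, which is what the Assert proposed earlier in the review is meant to express.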
ping @littledan, @efaust. I'll take a look tomorrow! |
@msaboff can you also rebase?
Created a new abstract operation "WordCharacters()" that is used by both IsWordChar() for word assertions and \w/\W CharacterClassEscapes.
Rebased.
@bterlson @littledan @efaust Can I get closure on this change?
I'll try to review this today (still out on a wedding break but festivities are now over :))
Sorry for the delay. Looks good to me (other than some minor ecmarkup/spec convention things which I'll just fix up in a subsequent commit).
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`. This is no longer the case now that tc39/ecma262#525 is merged. Ref. mathiasbynens/regexpu-core#8.
Per ES6, `/\W/iu` matched U+017F, U+212A, and, surprisingly, `K` and `S`. This is no longer the case now that tc39/ecma262#525 is merged. Ref. #8. Ref. mathiasbynens/regexpu-fixtures@81eeb14.
Bug tickets:
(Also listed here: https://mathiasbynens.be/notes/es6-unicode-regex#support)
The current specification of CharacterClassEscapes in Regular Expressions introduces surprising behavior of \W when both the unicode and ignoreCase flags are provided.
There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W, with the exception of /\W/ui and the characters 'k', 'K', 's' and 'S'.
The proposal here is to increase the characters in the set returned by the \w CharacterClassEscape when unicode is specified. Given the current CaseFolding.txt contents, this will add \u017f (lowercase long s) and \u212a (Kelvin symbol) to the returned set. It has the corresponding effect that those two characters will not be in the set returned by the \W CharacterClassEscape rule.
This change will make \u017f (lowercase long s) and \u212a (Kelvin symbol) word characters for unicode regular expressions. It also means that \w == !\W and !\w == \W regardless of the flags provided to the regular expression.
This is an improvement to pull request 516. It resolves the inconsistency without the problem that pull request has with explicit character classes ([] syntax) that include CharacterClassEscapes.
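The complement property can be checked empirically. The sketch below assumes an engine that already implements the merged change; the sample characters and the helper name `findComplementViolations` are chosen here for illustration. It reports every flag and character pair for which \w and \W either both match or both fail. Under the original ES6 semantics, 'k', 'K', 's' and 'S' would be expected to show up for the combined u and i flags.

```javascript
const FLAG_SETS = ['', 'i', 'u', 'iu'];
const SAMPLES = ['k', 'K', 's', 'S', '\u017F', '\u212A', 'a', '0', '_', '-', ' '];

// Collect cases where \w and \W are not exact complements of each other.
function findComplementViolations() {
  const found = [];
  for (const flags of FLAG_SETS) {
    const word = new RegExp('\\w', flags);
    const nonWord = new RegExp('\\W', flags);
    for (const ch of SAMPLES) {
      if (word.test(ch) === nonWord.test(ch)) {
        found.push({ flags, codePoint: ch.codePointAt(0).toString(16) });
      }
    }
  }
  return found;
}

const violations = findComplementViolations();
console.log(violations.length === 0 ? 'consistent' : violations);
```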
Note that much of the spec diff is due to indentation changes of the character table.