Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests for regular expression modifiers #3756

Closed
30 tasks done
cjtenny opened this issue Dec 20, 2022 · 6 comments
Closed
30 tasks done

Tests for regular expression modifiers #3756

cjtenny opened this issue Dec 20, 2022 · 6 comments
Assignees

Comments

@cjtenny
Copy link
Contributor

cjtenny commented Dec 20, 2022

Explainer: https://github.com/tc39/proposal-regexp-modifiers
Spec text: https://tc39.es/proposal-regexp-modifiers/

regexp-modifiers testing plan

  • Syntax errors: Throw both in parsing and when constructed with new Regexp("...")
    • Basic regular expression flags (n.b. source text refers to matched text for "regular expression flags" production of grammar)
      • Source text contains other code points than i, m, s
      • Source text contains combining codepoints alongside i, m, s
      • Source text contains other non-display codepoints alongside i, m, s
      • Source text contains ZWNJ, ZWJ, ZWNBSP alongside i, m, s (I think this is right? https://tc39.es/ecma262/#sec-unicode-format-control-characters)
      • Source text contains i, m, and/or s more than once
      • Source text in a case-ignoring context contains code points that case fold to i, m, s e.g. I, M, S
      • Source text contains code points outside the basic latin range that, were they canonicalized by a unicode-mapping regex, would map to e.g. i, m, or s (e.g. ſ (U+017F) would map to s, U+0130 to i) (ref. https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt)
        • (e.g. /foo(?\u{017F}:bar)/u is a syntax error, /foo(?s:bar)/u is not)
    • Arithmetic regular expression flags
      • First or second source text exhibits any of the 'basic regular expression flags' errors
      • Both source texts are empty
      • Code point matched by first flags is also contained in source text matched by second flags
      • Various forms of (?ims-ims) - no colon - is a syntax error
      • Source text cannot use unicode escape sequences to express code points i, m, s
  • Valid syntax
    • Basic regular expression flags parse correctly
    • Source text with any valid combination of flags or arithmetic flags - reasonable to enumerate
  • Behavior
    • Disabling flag in subexpression behaves correctly when corresponding top-level flag is and isn't already set
    • Enabling flag in subexpression behaves correctly when corresponding top-level flag is and isn't already set
    • Constructing a RegExp from a literal but changing flags by an argument to the RegExp constructor does (or does not) correctly change behavior of a subexpression that enables or removes flags.
    • i
      • Ignore case applies appropriately inside subexpression, but not outside; when turned on, off, and when nested inside a subexpression that has previously modified behavior
      • Behavior as normal when other flags modified but i flag not modified
      • Callers of Canonicalize:
        • Backreferences ignore case in captures
        • Individual characters ignore case
        • Character sets ignore case
        • Character escapes ignore case
        • Character class escapes ignore case
        • \w class, \b, \B all ignore case
    • m
      • ^ and $ apply appropriately inside subexpression, but not outside; when turned on, off, and when nested inside a subexpression that has previously modified behavior
    • s
      • . applies appropriately inside subexpression, but not outside; when turned on, off, and when nested inside a subexpression that has previously modified behavior
    • Subexpressions with flags set do not cause RegExp()....flags or /.../.flags to have the flags set, e.g. (new RegExp("(?i:a)")).flags does not include i.
    • ^ for RegExp.prototype.dotAll, .multiline, ignoreCase
@rbuckton
Copy link
Contributor

Can you clarify what this means?

Source text uses unicode escape sequences to express code points i, m, s

Are you describing a RegExp like /(?\u0069:a)/? Since it isn't clear from the test plan, my expectation is that this would be a syntax error.

@rbuckton
Copy link
Contributor

rbuckton commented Feb 14, 2024

Are there any examples of cases where the result of \b or \B change based on whether the i flag is present?

@rbuckton
Copy link
Contributor

I've found only two instances where \b does not agree in both modes:

  • U+017f - ſ LATIN SMALL LETTER LONG S
  • U+212a - K KELVIN SIGN

I find it strange that such an asymmetry exists. A quick test of the same pattern in both modes in C# shows no asymmetry, so it's not clear to me if this is expected or is a bug in \b.

@ioannad
Copy link
Contributor

ioannad commented Feb 29, 2024

Edited test plan to change "Source text uses unicode escape sequences to express code points i, m, s" to "Source text cannot use unicode escape sequences to express code points i, m, s" as suggested by @rbuckton . Added tests for this in #4016, along with all the other syntax error tests from @cjtenny and @guijemont 's PRs.

@ptomato
Copy link
Contributor

ptomato commented Mar 7, 2024

I find it strange that such an asymmetry exists. A quick test of the same pattern in both modes in C# shows no asymmetry, so it's not clear to me if this is expected or is a bug in \b.

This seems like it is expected. It comes from WordCharacters.

From what I can tell, most other languages' regexps' unicode flag changes the definition of word characters to [\p{L}\p{D}_] as far as \b, \B, \w, \W are concerned. In JS, this is not the case, but instead the definition of word characters is changed to "the ASCII word characters plus any Unicode code points whose canonicalization is a single ASCII word character". That seems to be only the two that you found. (FWIW, I find this odd too.)

@ptomato
Copy link
Contributor

ptomato commented Mar 8, 2024

The merged PRs have covered the entire testing plan, so this is done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants