Tests for regular expression modifiers #3756

cjtenny · 2022-12-20T20:17:56Z

Explainer: https://github.com/tc39/proposal-regexp-modifiers
Spec text: https://tc39.es/proposal-regexp-modifiers/

regexp-modifiers testing plan

Syntax errors: Throw both in parsing and when constructed with new Regexp("...")
- Basic regular expression flags (n.b. source text refers to matched text for "regular expression flags" production of grammar)
  - Source text contains other code points than i, m, s
  - Source text contains combining codepoints alongside i, m, s
  - Source text contains other non-display codepoints alongside i, m, s
  - Source text contains ZWNJ, ZWJ, ZWNBSP alongside i, m, s (I think this is right? https://tc39.es/ecma262/#sec-unicode-format-control-characters)
  - Source text contains i, m, and/or s more than once
  - Source text in a case-ignoring context contains code points that case fold to i, m, s e.g. I, M, S
  - Source text contains code points outside the basic latin range that, were they canonicalized by a unicode-mapping regex, would map to e.g. i, m, or s (e.g. ſ (U+017F) would map to s, U+0130 to i) (ref. https://www.unicode.org/Public/12.1.0/ucd/CaseFolding.txt)
    - (e.g. /foo(?\u{017F}:bar)/u is a syntax error, /foo(?s:bar)/u is not)
- Arithmetic regular expression flags
  - First or second source text exhibits any of the 'basic regular expression flags' errors
  - Both source texts are empty
  - Code point matched by first flags is also contained in source text matched by second flags
  - Various forms of (?ims-ims) - no colon - is a syntax error
  - Source text cannot use unicode escape sequences to express code points i, m, s
Valid syntax
- Basic regular expression flags parse correctly
- Source text with any valid combination of flags or arithmetic flags - reasonable to enumerate
Behavior
- Disabling flag in subexpression behaves correctly when corresponding top-level flag is and isn't already set
- Enabling flag in subexpression behaves correctly when corresponding top-level flag is and isn't already set
- Constructing a RegExp from a literal but changing flags by an argument to the RegExp constructor does (or does not) correctly change behavior of a subexpression that enables or removes flags.
- i
  - Ignore case applies appropriately inside subexpression, but not outside; when turned on, off, and when nested inside a subexpression that has previously modified behavior
  - Behavior as normal when other flags modified but i flag not modified
  - Callers of Canonicalize:
    - Backreferences ignore case in captures
    - Individual characters ignore case
    - Character sets ignore case
    - Character escapes ignore case
    - Character class escapes ignore case
    - \w class, \b, \B all ignore case
- m
  - ^ and $ apply appropriately inside subexpression, but not outside; when turned on, off, and when nested inside a subexpression that has previously modified behavior
- s
  - . applies appropriately inside subexpression, but not outside; when turned on, off, and when nested inside a subexpression that has previously modified behavior
- Subexpressions with flags set do not cause RegExp()....flags or /.../.flags to have the flags set, e.g. (new RegExp("(?i:a)")).flags does not include i.
- ^ for RegExp.prototype.dotAll, .multiline, ignoreCase

The text was updated successfully, but these errors were encountered:

rbuckton · 2024-02-13T22:47:14Z

Can you clarify what this means?

Source text uses unicode escape sequences to express code points i, m, s

Are you describing a RegExp like /(?\u0069:a)/? Since it isn't clear from the test plan, my expectation is that this would be a syntax error.

rbuckton · 2024-02-14T01:56:19Z

Are there any examples of cases where the result of \b or \B change based on whether the i flag is present?

rbuckton · 2024-02-14T02:18:45Z

I've found only two instances where \b does not agree in both modes:

U+017f - ſ LATIN SMALL LETTER LONG S
U+212a - K KELVIN SIGN

I find it strange that such an asymmetry exists. A quick test of the same pattern in both modes in C# shows no asymmetry, so it's not clear to me if this is expected or is a bug in \b.

ioannad · 2024-02-29T22:33:25Z

Edited test plan to change "Source text uses unicode escape sequences to express code points i, m, s" to "Source text cannot use unicode escape sequences to express code points i, m, s" as suggested by @rbuckton . Added tests for this in #4016, along with all the other syntax error tests from @cjtenny and @guijemont 's PRs.

ptomato · 2024-03-07T21:02:08Z

I find it strange that such an asymmetry exists. A quick test of the same pattern in both modes in C# shows no asymmetry, so it's not clear to me if this is expected or is a bug in \b.

This seems like it is expected. It comes from WordCharacters.

From what I can tell, most other languages' regexps' unicode flag changes the definition of word characters to [\p{L}\p{D}_] as far as \b, \B, \w, \W are concerned. In JS, this is not the case, but instead the definition of word characters is changed to "the ASCII word characters plus any Unicode code points whose canonicalization is a single ASCII word character". That seems to be only the two that you found. (FWIW, I find this odd too.)

ptomato · 2024-03-08T17:05:12Z

The merged PRs have covered the entire testing plan, so this is done!

cjtenny self-assigned this Dec 26, 2022

cjtenny mentioned this issue Dec 27, 2022

proposal-regexp-modifiers: Tests for syntax errors #3758

Closed

guijemont mentioned this issue Mar 31, 2023

proposal-regexp-modifiers: Tests for syntax errors #3807

Closed

ptomato mentioned this issue Aug 9, 2023

add tests for Promise.withResolvers proposal #3867

Merged

ptomato mentioned this issue Jan 18, 2024

Add tests for the RegExp Modifiers proposal #3960

Merged

ioannad mentioned this issue Feb 29, 2024

Regexp-modifiers syntax error tests #4016

Merged

ptomato closed this as completed Mar 8, 2024

ptomato assigned ioannad and rbuckton Mar 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests for regular expression modifiers #3756

Tests for regular expression modifiers #3756

cjtenny commented Dec 20, 2022 •

edited by ptomato

Loading

rbuckton commented Feb 13, 2024

rbuckton commented Feb 14, 2024 •

edited

Loading

rbuckton commented Feb 14, 2024

ioannad commented Feb 29, 2024 •

edited

Loading

ptomato commented Mar 7, 2024

ptomato commented Mar 8, 2024

Tests for regular expression modifiers #3756

Tests for regular expression modifiers #3756

Comments

cjtenny commented Dec 20, 2022 • edited by ptomato Loading

regexp-modifiers testing plan

rbuckton commented Feb 13, 2024

rbuckton commented Feb 14, 2024 • edited Loading

rbuckton commented Feb 14, 2024

ioannad commented Feb 29, 2024 • edited Loading

ptomato commented Mar 7, 2024

ptomato commented Mar 8, 2024

cjtenny commented Dec 20, 2022 •

edited by ptomato

Loading

rbuckton commented Feb 14, 2024 •

edited

Loading

ioannad commented Feb 29, 2024 •

edited

Loading