
Conversation

@jakebailey (Member) commented Nov 5, 2025

This PR implements regexp literal syntax checking. The main difference from the old compiler is that it's no longer a scanning step, as that would make these parser errors, which would mean we would have to go back to having separate ASTs depending on the language version being used (something we worked so hard to stop doing).

Instead, the regex checks are grammar checks that happen during checking. Additionally, the code doesn't live in the scanner package at all, as it is entirely self-contained (much to the original code's credit).

This is a really complicated piece of code, largely due to the sheer scale of these checks, but also due to annoying edge cases stemming from the old compiler's reliance on TS being written in JS and therefore getting identical string behavior when decoding. This required a load of fiddling to get working.

The bulk of this is Copilot-ported, with a bunch of cleanups and fixes. The result is that every regexp-related test matches, except for regularExpressionCharacterClassRangeOrder.errors.txt.diff and regularExpressionWithNonBMPFlags.errors.txt.diff, which I believe are improving because of better handling of positioning around weird characters, e.g.:

     // See https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
     const regexes: RegExp[] = [
     	/[𝘈-𝘡][𝘡-𝘈]/,
-    	   ~~~
+    	  ~~~
 !!! error TS1517: Range out of order in character class.
-    	          ~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     	/[𝘈-𝘡][𝘡-𝘈]/u,
-    	         ~~~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     	/[𝘈-𝘡][𝘡-𝘈]/v,
-    	         ~~~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     
     	/[\u{1D608}-\u{1D621}][\u{1D621}-\u{1D608}]/,

But they probably aren't real improvements; it's just that the offsets are UTF-8 now and these are weird chars. tsserver in the editor shows similar ranges, so this was probably just a bug in the old baseline squiggler.

…elines for out-of-range escapes

- Check parsed hex code point in regexp escape sequences and emit error if > 0x10FFFF
- Add test baselines for unicodeExtendedEscapesInRegularExpressions07 and 12 (ES5/ES6) showing TS1198 for out-of-range values
- Remove obsolete .diff baseline artifacts
When reporting "Range out of order in character class", extend the diagnostic length
to include a pending low surrogate (using surrogateState.utf8Size). This ensures
the error highlight covers the full surrogate pair in non-Unicode mode. Update
baselines to reflect the corrected highlight spans.
Copilot AI review requested due to automatic review settings November 5, 2025 21:10
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR updates baseline/reference files for regular expression test cases in the compiler. The changes involve removing .diff files (which previously showed differences between expected and actual outputs) and populating corresponding .errors.txt files with the expected error output. This indicates that the Go port's regular expression error handling has converged with the TypeScript reference implementation.

Key Changes

  • Regular expression validation errors are now being correctly reported, matching the TypeScript reference behavior
  • Unicode escape sequence validation in regex patterns
  • Character class range ordering validation
  • Regex flag validation (including unicode flags, duplicate flags, and unknown flags)

Reviewed Changes

Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.

File                                 Description
Multiple .errors.txt.diff files      Removed diff files showing previous discrepancies in error reporting
Multiple .errors.txt files           Added expected error outputs that now match TypeScript's behavior
Various regex feature tests          Unicode escapes, character classes, flags, backreferences, quantifiers, and more

@jakebailey (Member, Author) commented Nov 5, 2025

Based on the little differences I've noticed, the testing of this feature is not entirely complete...

@jakebailey (Member, Author) commented

@graphemecluster I'm not sure if you are around and willing to look at this, but since you are looking at regex stuff again, maybe you'd be willing to take a look.

Not 100% confident in it, and I don't like the duplication caused by needing some weird scanning logic to deal with things the new scanner doesn't much care about...


// Table 67: Binary Unicode property aliases and their canonical property names
// https://tc39.es/ecma262/#table-binary-unicode-properties
var binaryUnicodeProperties = collections.NewSetFromItems(
Member

It's annoying that we can't just use unicode.Categories for this because of differences like Bidi_Control and Bidi_C existing.

@jakebailey jakebailey marked this pull request as draft November 7, 2025 15:59
@graphemecluster

Let me see if I can get some time this weekend :D

@graphemecluster left a comment

I don't think I should overwhelm this page with nitpicking reviews, so I've decided to put the remaining suggested changes in a separate PR #2107 for your convenience.

Comment on lines +968 to +984
minCodePoint := decodeCodePoint(atom)
maxCodePoint := decodeCodePoint(rangeEnd)

// Get the expected sizes (in UTF-16 code units)
minExpectedSize := charSize(minCodePoint)
maxExpectedSize := charSize(maxCodePoint)

// Check if both are "complete" characters in JavaScript's UTF-16 representation.
// A character is complete if its UTF-16 length matches the expected size.
// In JavaScript, string.length returns the UTF-16 code unit count.
minUTF16Length := utf16Length(atom)
maxUTF16Length := utf16Length(rangeEnd)

minIsComplete := minUTF16Length == minExpectedSize
maxIsComplete := maxUTF16Length == maxExpectedSize

if minIsComplete && maxIsComplete && minCodePoint > maxCodePoint {


The original intention is merely to determine whether an atom has only a single character (codepoint) for a valid order comparison. I believe utf8.RuneCountInString(atom) == 1 is already sufficient (and is actually an equivalent regardless of Unicode mode, since it is impossible to get a non-BMP atom in non-Unicode mode), but I might be wrong. (In the original TypeScript code, I could have written [...atom].length === 1 – comparing by charSize is purely for avoiding array allocation.)


And this is always equivalent to r, n := utf8.DecodeRuneInString(s); len(s) == n, so you can just do what you did in scanClassSetExpression.

// >= U+10000 are treated as surrogate pairs and consumed across two sequential calls.
func (v *regExpValidator) scanSourceCharacter() string {
// Check if we have a pending low surrogate from the previous call
if v.surrogateState != nil {


Just for safety, I would suggest checking the expected position too (by adding a property to the state)

Comment on lines +67 to +70
if message.Category() == diagnostics.CategoryMessage && lastError != nil &&
start == lastError.Pos() && length == lastError.Len() {
err := ast.NewDiagnostic(nil, core.NewTextRange(start, start+length), message, args...)
lastError.AddRelatedInfo(err)


This was kind of a workaround, since I couldn't extend the scanner effectively in Strada. But now that regExpValidator is a standalone struct, I would suggest adding a dedicated method for related info (spelling suggestions).

Comment on lines +117 to +134
// regExpChar represents a single "character" in a regex pattern.
// In Unicode mode, this is a single code point.
// In non-Unicode mode, this matches JavaScript's UTF-16 representation,
// where supplementary characters are represented as surrogate pairs.
type regExpChar struct {
// The code point value. For surrogates, this is the surrogate value itself (0xD800-0xDFFF).
codePoint rune
// The UTF-16 length (1 for most characters, 2 for supplementary characters in Unicode mode)
utf16Length int
}

// makeRegExpChar creates a regExpChar from a code point
func makeRegExpChar(codePoint rune) regExpChar {
return regExpChar{
codePoint: codePoint,
utf16Length: charSize(codePoint),
}
}


I don't think they are referenced anywhere, do you intend to use them at a later time?

candidates = slices.AppendSeq(candidates, maps.Keys(generalCategoryValues.M))
candidates = slices.AppendSeq(candidates, maps.Keys(binaryUnicodeProperties.M))
candidates = slices.AppendSeq(candidates, maps.Keys(binaryUnicodePropertiesOfStrings.M))
suggestion := core.GetSpellingSuggestion(propertyNameOrValue, candidates, core.Identity)


Did you consider making GetSpellingSuggestion accept an iterator to avoid allocation? (I am not particularly familiar with Go; is the succession of hidden for loops unavoidable after all?)
