
Conversation

@jakebailey (Member) commented Nov 5, 2025

This PR implements regexp literal syntax checking. The main difference from the old compiler is that it's no longer a scanning step, as that would make these parser errors, which would mean we would have to go back to having separate ASTs depending on the language version being used (something we worked so hard to stop doing).

Instead, the regex checks are grammar checks that happen during checking. Additionally, the code doesn't live in the scanner package at all, as it is entirely self-contained (much to the original code's credit).

This is a really complicated piece of code, largely due to the sheer scale of these checks, but also due to annoying edge cases stemming from the old compiler's reliance on TS being written in JS and therefore getting identical string behavior when decoding. This required a load of fiddling to get working.

The bulk of this is Copilot-ported, with a bunch of cleanups and fixes. The result is that every regexp-related test matches, except for regularExpressionCharacterClassRangeOrder.errors.txt.diff and regularExpressionWithNonBMPFlags.errors.txt.diff, which I believe are improving because of better handling of positioning around weird characters, e.g.:

     // See https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
     const regexes: RegExp[] = [
     	/[𝘈-𝘡][𝘡-𝘈]/,
-    	   ~~~
+    	  ~~~
 !!! error TS1517: Range out of order in character class.
-    	          ~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     	/[𝘈-𝘡][𝘡-𝘈]/u,
-    	         ~~~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     	/[𝘈-𝘡][𝘡-𝘈]/v,
-    	         ~~~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     
     	/[\u{1D608}-\u{1D621}][\u{1D621}-\u{1D608}]/,

But they probably aren't real improvements; it's just that the offsets are UTF-8 now and these are weird chars. tsserver in the editor shows similar ranges, so this was probably just a bug in the old baseline squiggler.

…elines for out-of-range escapes

- Check parsed hex code point in regexp escape sequences and emit error if > 0x10FFFF
- Add test baselines for unicodeExtendedEscapesInRegularExpressions07 and 12 (ES5/ES6) showing TS1198 for out-of-range values
- Remove obsolete .diff baseline artifacts
When reporting "Range out of order in character class", extend the diagnostic length
to include a pending low surrogate (using surrogateState.utf8Size). This ensures
the error highlight covers the full surrogate pair in non-Unicode mode. Update
baselines to reflect the corrected highlight spans.
Copilot AI review requested due to automatic review settings November 5, 2025 21:10
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR updates baseline/reference files for regular expression test cases in the compiler. The changes involve removing .diff files (which previously showed differences between expected and actual outputs) and populating corresponding .errors.txt files with the expected error output. This indicates that the Go port's regular expression error handling has converged with the TypeScript reference implementation.

Key Changes

  • Regular expression validation errors are now being correctly reported, matching the TypeScript reference behavior
  • Unicode escape sequence validation in regex patterns
  • Character class range ordering validation
  • Regex flag validation (including unicode flags, duplicate flags, and unknown flags)

Reviewed Changes

Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.

File                                 Description
Multiple .errors.txt.diff files      Removed diff files showing previous discrepancies in error reporting
Multiple .errors.txt files           Added expected error outputs that now match TypeScript's behavior
Various regex feature tests          Unicode escapes, character classes, flags, backreferences, quantifiers, and more

@jakebailey (Member, Author) commented Nov 5, 2025

Based on the little differences I've noticed, the testing of this feature is not entirely complete...

@jakebailey (Member, Author) commented

@graphemecluster I'm not sure if you are around and willing to look at this, but since you are looking at regex stuff again, maybe you'd be willing to take a look.

Not 100% confident in it, and I don't like the duplication caused by needing some weird scanning logic to deal with things the new scanner doesn't much care about...


// Table 67: Binary Unicode property aliases and their canonical property names
// https://tc39.es/ecma262/#table-binary-unicode-properties
var binaryUnicodeProperties = collections.NewSetFromItems(
Member

It's annoying that we can't just use unicode.Categories for this because of differences like Bidi_Control and Bidi_C existing.

@jakebailey jakebailey marked this pull request as draft November 7, 2025 15:59
@graphemecluster

Let me see if I can get some time this weekend :D

@graphemecluster left a comment

I don't think I should overwhelm this page with nitpicking reviews, so I've decided to put the remaining suggested changes in a separate PR #2107 for your convenience.

Comment on lines +968 to +984
minCodePoint := decodeCodePoint(atom)
maxCodePoint := decodeCodePoint(rangeEnd)

// Get the expected sizes (in UTF-16 code units)
minExpectedSize := charSize(minCodePoint)
maxExpectedSize := charSize(maxCodePoint)

// Check if both are "complete" characters in JavaScript's UTF-16 representation.
// A character is complete if its UTF-16 length matches the expected size.
// In JavaScript, string.length returns the UTF-16 code unit count.
minUTF16Length := utf16Length(atom)
maxUTF16Length := utf16Length(rangeEnd)

minIsComplete := minUTF16Length == minExpectedSize
maxIsComplete := maxUTF16Length == maxExpectedSize

if minIsComplete && maxIsComplete && minCodePoint > maxCodePoint {


The original intention is merely to determine whether an atom has only a single character (codepoint) for a valid order comparison. I believe utf8.RuneCountInString(atom) == 1 is already sufficient (and is actually an equivalent regardless of Unicode mode, since it is impossible to get a non-BMP atom in non-Unicode mode), but I might be wrong. (In the original TypeScript code, I could have written [...atom].length === 1 – comparing by charSize is purely for avoiding array allocation.)


And this is always equivalent to r, n := utf8.DecodeRuneInString(s); len(s) == n, so you can just do what you did in scanClassSetExpression.

// >= U+10000 are treated as surrogate pairs and consumed across two sequential calls.
func (v *regExpValidator) scanSourceCharacter() string {
// Check if we have a pending low surrogate from the previous call
if v.surrogateState != nil {


Just for safety, I would suggest checking the expected position too (by adding a property to the state)

Comment on lines +67 to +70
if message.Category() == diagnostics.CategoryMessage && lastError != nil &&
start == lastError.Pos() && length == lastError.Len() {
err := ast.NewDiagnostic(nil, core.NewTextRange(start, start+length), message, args...)
lastError.AddRelatedInfo(err)


This was kind of a workaround, since I couldn't extend the scanner effectively in Strada. But now that regExpValidator is a standalone struct, I would suggest adding a dedicated method for related info (spelling suggestions).

Comment on lines +117 to +134
// regExpChar represents a single "character" in a regex pattern.
// In Unicode mode, this is a single code point.
// In non-Unicode mode, this matches JavaScript's UTF-16 representation,
// where supplementary characters are represented as surrogate pairs.
type regExpChar struct {
// The code point value. For surrogates, this is the surrogate value itself (0xD800-0xDFFF).
codePoint rune
// The UTF-16 length (1 for most characters, 2 for supplementary characters in Unicode mode)
utf16Length int
}

// makeRegExpChar creates a regExpChar from a code point
func makeRegExpChar(codePoint rune) regExpChar {
return regExpChar{
codePoint: codePoint,
utf16Length: charSize(codePoint),
}
}


I don't think they are referenced anywhere, do you intend to use them at a later time?

candidates = slices.AppendSeq(candidates, maps.Keys(generalCategoryValues.M))
candidates = slices.AppendSeq(candidates, maps.Keys(binaryUnicodeProperties.M))
candidates = slices.AppendSeq(candidates, maps.Keys(binaryUnicodePropertiesOfStrings.M))
suggestion := core.GetSpellingSuggestion(propertyNameOrValue, candidates, core.Identity)


Did you consider making GetSpellingSuggestion accept an iterator to avoid allocation? (I am not particularly familiar with Go; is the succession of hidden for loops unavoidable after all?)
