fix(parser/html): regex literals in frontmatter by ematipico · Pull Request #9531 · biomejs/biome

ematipico · 2026-03-17T18:06:22Z

Summary

Closes #9187

Asked Claude Code to fix it

Test Plan

Added tests from the report

Docs

N/A

changeset-bot · 2026-03-17T18:06:44Z

🦋 Changeset detected

Latest commit: 838ed8e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 13 packages

Name	Type
@biomejs/biome	Patch
@biomejs/cli-win32-x64	Patch
@biomejs/cli-win32-arm64	Patch
@biomejs/cli-darwin-x64	Patch
@biomejs/cli-darwin-arm64	Patch
@biomejs/cli-linux-x64	Patch
@biomejs/cli-linux-arm64	Patch
@biomejs/cli-linux-x64-musl	Patch
@biomejs/cli-linux-arm64-musl	Patch
@biomejs/wasm-web	Patch
@biomejs/wasm-bundler	Patch
@biomejs/wasm-nodejs	Patch
@biomejs/backend-jsonrpc	Patch

codspeed-hq · 2026-03-17T18:15:39Z

Merging this PR will not alter performance

✅ 64 untouched benchmarks
⏩ 152 skipped benchmarks¹

Comparing fix/regex-frontmatter (838ed8e) with main (61f53ee)²

¹ 152 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
² No successful run was found on main (bc709f6) during the generation of this report, so 61f53ee was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

coderabbitai · 2026-03-17T18:16:59Z

Walkthrough

This PR fixes Astro frontmatter parsing failures when regex literals contain quotes or dashes (e.g., /'/, /""/, /---/). It extends the QuotesSeen state tracker within the HTML lexer to track regex literals and maintain a record of the previous non-whitespace byte. This enables better detection of whether a slash starts a regex, a comment, or serves as a division operator. The implementation includes regex body consumption logic with proper escape handling and corresponding test coverage.
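The idea in the walkthrough can be illustrated with a small standalone scanner. This is only a sketch under stated assumptions, not Biome's `QuotesSeen` implementation: the function name `find_closing_fence` is hypothetical, comments and template-literal interpolation are ignored, and the slash check is a simplified previous-byte heuristic.

```rust
/// Hypothetical helper, not Biome's API: returns the byte offset of the
/// closing `---` fence of an Astro-style frontmatter block, skipping any
/// `---` that sits inside a quoted string or a regex literal.
fn find_closing_fence(src: &str) -> Option<usize> {
    let bytes = src.as_bytes();
    // The block must open with a fence; scanning starts right after it.
    if !src.starts_with("---") {
        return None;
    }
    let mut i = 3;
    let mut quote: Option<u8> = None; // currently open quote byte, if any
    let mut in_regex = false;
    let mut prev_non_ws: Option<u8> = None;

    while i < bytes.len() {
        let b = bytes[i];
        if let Some(q) = quote {
            // Inside a string: skip escaped bytes, close on the same quote.
            if b == b'\\' {
                i += 2;
                continue;
            }
            if b == q {
                quote = None;
                prev_non_ws = Some(b);
            }
            i += 1;
            continue;
        }
        if in_regex {
            // Inside a regex: skip escaped bytes, close on an unescaped `/`.
            if b == b'\\' {
                i += 2;
                continue;
            }
            if b == b'/' {
                in_regex = false;
                prev_non_ws = Some(b);
            }
            i += 1;
            continue;
        }
        match b {
            b'"' | b'\'' | b'`' => quote = Some(b),
            b'/' => {
                // Assumption: after an identifier byte, digit, `)` or `]`
                // the slash is division; anywhere else it opens a regex.
                let division = matches!(
                    prev_non_ws,
                    Some(p) if p.is_ascii_alphanumeric() || matches!(p, b'_' | b'$' | b')' | b']')
                );
                in_regex = !division;
            }
            b'-' if bytes[i..].starts_with(b"---") => return Some(i),
            _ => {}
        }
        if !b.is_ascii_whitespace() {
            prev_non_ws = Some(b);
        }
        i += 1;
    }
    None
}
```

Given a frontmatter block containing `/---/`, `/'/` and `"---"`, this sketch returns the offset of the final fence rather than stopping at the dashes inside the regex or the string.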

Possibly related PRs

fix(html): fix comment detection in Astro frontmatter #7989 — Modifies the same frontmatter lexer QuotesSeen logic to correctly distinguish slash-related constructs (comments versus regex) inside quoted frontmatter.
fix(html): parse frontmatter inside quotes #7907 — Introduces the QuotesSeen helper in the HTML lexer to improve Astro frontmatter parsing through quote-tracking mechanisms.
fix(parse/html): astro fence #9124 — Evolves the QuotesSeen-based state machine for Astro frontmatter to handle regex literals and associated heuristics.

Suggested reviewers

dyc3

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(parser/html): regex literals in frontmatter' accurately and concisely describes the main change—fixing parser handling of regex literals in Astro frontmatter.
Description check	✅ Passed	The description clearly relates to the changeset, identifying the linked issue, disclosing AI assistance, and documenting the test plan.
Linked Issues check	✅ Passed	The PR fully addresses the requirements from issue `#9187`: regex literals with quotes or dashes in Astro frontmatter now parse correctly, and tests validating these cases have been added.
Out of Scope Changes check	✅ Passed	All changes are narrowly scoped to fixing the frontmatter regex parsing issue; the changeset, lexer enhancements, and test file are directly aligned with the stated objective.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_html_parser/src/lexer/mod.rs`:
- Around line 1773-1793: slash_starts_regex currently relies only on
prev_non_ws_byte and misclassifies cases like "return /re/" and "x + /re/";
change it to consult a lexer-level syntactic flag (e.g., allow_expression_start
or lexical_goal) that tracks whether the grammar currently expects an
expression. Update the lexer to set/clear that flag when emitting tokens that
begin or end expression contexts (for example in the logic that emits tokens for
keywords like return, punctuators like '(', '{', '[', operators like '+', '-',
and after semicolons), replace slash_starts_regex to return true when
allow_expression_start is true (falling back to the existing byte check only if
you must), and add regression tests specifically for "return /re/" and "x +
/re/" to ensure regexes are recognized correctly; reference the existing
function slash_starts_regex and the state prev_non_ws_byte when making the
change.
- Around line 1587-1588: The lexer currently uses the boolean flag in_regex to
decide when a regex literal ends but does not track character-class state, so
unescaped '/' inside [...] prematurely terminates the regex; update the regex
handling to track character classes and escapes (e.g., replace or augment the
in_regex boolean with a small state machine or additional flag like
in_regex_char_class) inside the lexer routine that handles regex closing (the
code that checks for unescaped '/' to exit regex mode). Specifically, when
entering a '[' inside a regex set a char-class state, ignore '/' (and handle
backslash-escaped characters) until a matching unescaped ']' is seen, then
resume normal regex-close logic that treats only unescaped '/' as the
terminator; ensure escape sequences (backslashes) are honored both inside and
outside character classes.

📥 Commits

Reviewing files that changed from the base of the PR and between bc709f6 and 838ed8e.

⛔ Files ignored due to path filters (1)

crates/biome_html_parser/tests/html_specs/ok/astro/issue_9187.astro.snap is excluded by !**/*.snap and included by **

📒 Files selected for processing (3)

.changeset/fix-astro-regex-frontmatter.md
crates/biome_html_parser/src/lexer/mod.rs
crates/biome_html_parser/tests/html_specs/ok/astro/issue_9187.astro

coderabbitai · 2026-03-17T18:17:02Z

crates/biome_html_parser/src/lexer/mod.rs

+    /// Whether we are currently inside a regex literal (`/…/`).
+    in_regex: bool,


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In ECMAScript regular expression literals, may an unescaped forward slash appear inside a character class [...], for example /[/']/, without terminating the regex literal?

💡 Result:

Yes. In an ECMAScript regular expression literal (/.../), an unescaped / may appear inside a character class [...] without terminating the literal, because the lexical grammar for RegularExpressionClassChar excludes ] and \—but does not exclude /. Therefore /[/']/ is tokenized as a single regex literal and its character class matches either / or '. [1][2]

🏁 Script executed:

# Examine the lexer file around the flagged lines
head -n 1650 crates/biome_html_parser/src/lexer/mod.rs | tail -n +1550 | cat -n

Repository: biomejs/biome

Length of output: 5071

🏁 Script executed:

# Search for the in_regex field and slash_starts_regex function
rg -n "slash_starts_regex|in_regex" crates/biome_html_parser/src/lexer/mod.rs | head -20

Repository: biomejs/biome

Length of output: 444

🏁 Script executed:

# Check for regex-related tests
fd -t f "test" crates/biome_html_parser --exec rg -l "regex|slash" {} \;

Repository: biomejs/biome

Length of output: 100

🏁 Script executed:

# Read the regex closing logic and surrounding context
sed -n '1650,1680p' crates/biome_html_parser/src/lexer/mod.rs | cat -n

Repository: biomejs/biome

Length of output: 1428

🏁 Script executed:

# Read the slash_starts_regex function
sed -n '1777,1810p' crates/biome_html_parser/src/lexer/mod.rs | cat -n

Repository: biomejs/biome

Length of output: 1314

🏁 Script executed:

# Search for character class tests or regex tests
rg -A 5 -B 5 "in_regex|character.*class|slash" crates/biome_html_parser/src/lexer/tests.rs | head -60

Repository: biomejs/biome

Length of output: 908

Character classes inside regex literals are not tracked, causing early termination.

The regex closing logic (lines 1656–1661) terminates on any unescaped /, but ECMAScript regex literals permit / inside character classes without closing the pattern—/[/']/ is valid. This causes the lexer to exit regex mode too early, leaving subsequent quotes or dashes to confuse frontmatter fence detection. Tracking character classes (detecting [, ], and escape sequences within them) would prevent this false termination.
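A minimal sketch of the character-class tracking described here, assuming a standalone helper rather than the actual lexer state (the name `regex_literal_end` and its signature are hypothetical):

```rust
/// Hypothetical helper, not Biome's API. `start` is the index of the opening
/// `/`. Returns the index of the closing `/`, or `None` if the literal is
/// not terminated before the end of the line or input.
fn regex_literal_end(src: &[u8], start: usize) -> Option<usize> {
    let mut in_class = false;
    let mut i = start + 1;
    while i < src.len() {
        match src[i] {
            // A backslash escapes the next byte, inside or outside a class.
            b'\\' => i += 1,
            // Regex literals cannot span lines.
            b'\n' | b'\r' => return None,
            // `[` opens a character class; `]` closes it.
            b'[' => in_class = true,
            b']' => in_class = false,
            // Only an unescaped `/` outside a class terminates the literal.
            b'/' if !in_class => return Some(i),
            _ => {}
        }
        i += 1;
    }
    None
}
```

With this rule, `/[/']/` is consumed as a single literal, so the quote inside the class never reaches the quote-tracking logic.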

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/biome_html_parser/src/lexer/mod.rs` around lines 1587 - 1588, The lexer currently uses the boolean flag in_regex to decide when a regex literal ends but does not track character-class state, so unescaped '/' inside [...] prematurely terminates the regex; update the regex handling to track character classes and escapes (e.g., replace or augment the in_regex boolean with a small state machine or additional flag like in_regex_char_class) inside the lexer routine that handles regex closing (the code that checks for unescaped '/' to exit regex mode). Specifically, when entering a '[' inside a regex set a char-class state, ignore '/' (and handle backslash-escaped characters) until a matching unescaped ']' is seen, then resume normal regex-close logic that treats only unescaped '/' as the terminator; ensure escape sequences (backslashes) are honored both inside and outside character classes.

This seems important

coderabbitai · 2026-03-17T18:17:03Z

crates/biome_html_parser/src/lexer/mod.rs

+    /// Returns whether a deferred `/` starts a regex literal based on
+    /// `prev_non_ws_byte`. After an identifier character, closing
+    /// paren/bracket, number, or `++`/`--` suffix, `/` is division. In all
+    /// other positions `/` starts a regex.
+    fn slash_starts_regex(&self) -> bool {
+        match self.prev_non_ws_byte {
+            None => true,
+            Some(b) => !matches!(
+                b,
+                b'a'..=b'z'
+                    | b'A'..=b'Z'
+                    | b'0'..=b'9'
+                    | b'_'
+                    | b'$'
+                    | b')'
+                    | b']'
+                    | b'+'
+                    | b'-'
+            ),
+        }
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In ECMAScript, does a / begin a regular expression literal in contexts like return /re/ and x + /re/, or are those parsed as division?

💡 Result:

Yes—both return /re/ and x + /re/ are parsed with /re/ as a RegularExpressionLiteral, not as division.

ECMAScript avoids ambiguity by using different lexical goal symbols: in syntactic positions where a regex literal is permitted, the scanner uses InputElementRegExp; otherwise it uses InputElementDiv. The spec notes there is no syntactic context where both a leading division operator and a leading RegularExpressionLiteral are permitted, so the choice is determined by grammar context. In particular, after return (when the expression is on the same line) and after +, the grammar expects an expression, so / starts a regex literal. [1]

By contrast, when the grammar expects an operator continuing an expression (e.g., after an identifier or literal), / is taken as division (the spec gives a = b newline /hi/... as a case that becomes division, not regex, because the context allows division). [1][2]

One-byte look-behind cannot distinguish valid regex starters in expression contexts.

slash_starts_regex() fails for return /re/ and x + /re/ because the previous non-whitespace byte alone cannot capture ECMAScript's requirement that / begins a regex when the grammar expects an expression. The spec uses syntactic context (lexical goal symbols) to make this choice, not character-level heuristics. A more coarse-grained state tracking which positions allow expression starters is needed, along with regression tests for these cases.
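The misclassification can be reproduced outside the lexer. The sketch below copies the byte check quoted in this comment into a free function that takes the previous byte as a parameter; `prev_byte_before_slash` is a hypothetical helper added only for the demonstration.

```rust
/// Standalone copy of the byte check quoted in the review, taking the
/// previous non-whitespace byte as a parameter instead of reading lexer state.
fn slash_starts_regex(prev_non_ws_byte: Option<u8>) -> bool {
    match prev_non_ws_byte {
        None => true,
        Some(b) => !matches!(
            b,
            b'a'..=b'z'
                | b'A'..=b'Z'
                | b'0'..=b'9'
                | b'_'
                | b'$'
                | b')'
                | b']'
                | b'+'
                | b'-'
        ),
    }
}

/// Hypothetical helper: the last non-whitespace byte before the first `/`.
fn prev_byte_before_slash(src: &str) -> Option<u8> {
    let slash = src.find('/')?;
    src.as_bytes()[..slash]
        .iter()
        .rev()
        .copied()
        .find(|b| !b.is_ascii_whitespace())
}
```

For `const re = /x/` the previous byte is `=`, so the check reports a regex. For `return /re/` the previous byte is `n`, and for `x + /re/` it is `+`, so both are reported as division even though ECMAScript treats them as regex literals.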

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/biome_html_parser/src/lexer/mod.rs` around lines 1773 - 1793, slash_starts_regex currently relies only on prev_non_ws_byte and misclassifies cases like "return /re/" and "x + /re/"; change it to consult a lexer-level syntactic flag (e.g., allow_expression_start or lexical_goal) that tracks whether the grammar currently expects an expression. Update the lexer to set/clear that flag when emitting tokens that begin or end expression contexts (for example in the logic that emits tokens for keywords like return, punctuators like '(', '{', '[', operators like '+', '-', and after semicolons), replace slash_starts_regex to return true when allow_expression_start is true (falling back to the existing byte check only if you must), and add regression tests specifically for "return /re/" and "x + /re/" to ensure regexes are recognized correctly; reference the existing function slash_starts_regex and the state prev_non_ws_byte when making the change.

dyc3

I actually think the better fix is to check if the --- had a new line right before it.

ematipico · 2026-03-17T21:54:51Z

> I actually think the better fix is to check if the --- had a new line right before it.

No, it doesn't work; that was one of the first bugs reported, and I fixed it.

---
let a = "
---
"
---

This was the bug reported.
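To see why a newline check alone is not enough, here is a sketch of one possible reading of that rule (the function `naive_closing_fence` is hypothetical and ignores quotes entirely):

```rust
/// Hypothetical naive rule: the closing fence is the first `---` that
/// directly follows a newline after the opening fence.
fn naive_closing_fence(src: &str) -> Option<usize> {
    let body = src.strip_prefix("---")?;
    // `i` is the offset of the newline within `body`; add the 3 bytes of the
    // opening fence and 1 for the newline itself to get the fence offset.
    body.find("\n---").map(|i| i + 3 + 1)
}
```

On the input above, this returns the offset of the `---` on the third line, which sits inside the string, instead of the real closing fence on the last line.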

dyc3

Oh, I see. Never mind then.

fix(parser/html): regex literals in frontmatter

838ed8e

ematipico requested review from a team March 17, 2026 18:06

github-actions bot added the A-Parser (Area: parser) and L-HTML (Language: HTML and super languages) labels Mar 17, 2026

coderabbitai bot reviewed Mar 17, 2026

dyc3 requested changes Mar 17, 2026

dyc3 approved these changes Mar 17, 2026

ematipico merged commit 1302740 into main Mar 18, 2026
17 checks passed

ematipico deleted the fix/regex-frontmatter branch March 18, 2026 06:01

github-actions bot mentioned this pull request Mar 18, 2026

ci: release #9517

Merged
