Skip to content

fix(parser/html): regex literals in frontmatter#9531

Merged
ematipico merged 1 commit intomainfrom
fix/regex-frontmatter
Mar 18, 2026
Merged

fix(parser/html): regex literals in frontmatter#9531
ematipico merged 1 commit intomainfrom
fix/regex-frontmatter

Conversation

@ematipico
Copy link
Member

Summary

Closes #9187

Asked Claude Code to fix it

Test Plan

Added tests from the report

Docs

N/A

@ematipico ematipico requested review from a team March 17, 2026 18:06
@changeset-bot
Copy link

changeset-bot bot commented Mar 17, 2026

🦋 Changeset detected

Latest commit: 838ed8e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 13 packages
Name Type
@biomejs/biome Patch
@biomejs/cli-win32-x64 Patch
@biomejs/cli-win32-arm64 Patch
@biomejs/cli-darwin-x64 Patch
@biomejs/cli-darwin-arm64 Patch
@biomejs/cli-linux-x64 Patch
@biomejs/cli-linux-arm64 Patch
@biomejs/cli-linux-x64-musl Patch
@biomejs/cli-linux-arm64-musl Patch
@biomejs/wasm-web Patch
@biomejs/wasm-bundler Patch
@biomejs/wasm-nodejs Patch
@biomejs/backend-jsonrpc Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions github-actions bot added A-Parser Area: parser L-HTML Language: HTML and super languages labels Mar 17, 2026
@codspeed-hq
Copy link

codspeed-hq bot commented Mar 17, 2026

Merging this PR will not alter performance

✅ 64 untouched benchmarks
⏩ 152 skipped benchmarks1


Comparing fix/regex-frontmatter (838ed8e) with main (61f53ee)2

Open in CodSpeed

Footnotes

  1. 152 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (bc709f6) during the generation of this report, so 61f53ee was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

Walkthrough

This PR fixes Astro frontmatter parsing failures when regex literals contain quotes or dashes (e.g., /'/, /""/, /---/). It extends the QuotesSeen state tracker within the HTML lexer to track regex literals and maintain a record of the previous non-whitespace byte. This enables better detection of whether a slash starts a regex, a comment, or serves as a division operator. The implementation includes regex body consumption logic with proper escape handling and corresponding test coverage.

Possibly related PRs

Suggested reviewers

  • dyc3
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(parser/html): regex literals in frontmatter' accurately and concisely describes the main change—fixing parser handling of regex literals in Astro frontmatter.
Description check ✅ Passed The description clearly relates to the changeset, identifying the linked issue, disclosing AI assistance, and documenting the test plan.
Linked Issues check ✅ Passed The PR fully addresses the requirements from issue #9187: regex literals with quotes or dashes in Astro frontmatter now parse correctly, and tests validating these cases have been added.
Out of Scope Changes check ✅ Passed All changes are narrowly scoped to fixing the frontmatter regex parsing issue; the changeset, lexer enhancements, and test file are directly aligned with the stated objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/regex-frontmatter
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can customize the high-level summary generated by CodeRabbit.

Configure the reviews.high_level_summary_instructions setting to provide custom instructions for generating the high-level summary.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_html_parser/src/lexer/mod.rs`:
- Around line 1773-1793: slash_starts_regex currently relies only on
prev_non_ws_byte and misclassifies cases like "return /re/" and "x + /re/";
change it to consult a lexer-level syntactic flag (e.g., allow_expression_start
or lexical_goal) that tracks whether the grammar currently expects an
expression. Update the lexer to set/clear that flag when emitting tokens that
begin or end expression contexts (for example in the logic that emits tokens for
keywords like return, punctuators like '(', '{', '[', operators like '+', '-',
and after semicolons), replace slash_starts_regex to return true when
allow_expression_start is true (falling back to the existing byte check only if
you must), and add regression tests specifically for "return /re/" and "x +
/re/" to ensure regexes are recognized correctly; reference the existing
function slash_starts_regex and the state prev_non_ws_byte when making the
change.
- Around line 1587-1588: The lexer currently uses the boolean flag in_regex to
decide when a regex literal ends but does not track character-class state, so
unescaped '/' inside [...] prematurely terminates the regex; update the regex
handling to track character classes and escapes (e.g., replace or augment the
in_regex boolean with a small state machine or additional flag like
in_regex_char_class) inside the lexer routine that handles regex closing (the
code that checks for unescaped '/' to exit regex mode). Specifically, when
entering a '[' inside a regex set a char-class state, ignore '/' (and handle
backslash-escaped characters) until a matching unescaped ']' is seen, then
resume normal regex-close logic that treats only unescaped '/' as the
terminator; ensure escape sequences (backslashes) are honored both inside and
outside character classes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1401cf85-22d2-4e5f-a9b8-8da653b11e43

📥 Commits

Reviewing files that changed from the base of the PR and between bc709f6 and 838ed8e.

⛔ Files ignored due to path filters (1)
  • crates/biome_html_parser/tests/html_specs/ok/astro/issue_9187.astro.snap is excluded by !**/*.snap and included by **
📒 Files selected for processing (3)
  • .changeset/fix-astro-regex-frontmatter.md
  • crates/biome_html_parser/src/lexer/mod.rs
  • crates/biome_html_parser/tests/html_specs/ok/astro/issue_9187.astro

Comment on lines +1587 to +1588
/// Whether we are currently inside a regex literal (`/…/`).
in_regex: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In ECMAScript regular expression literals, may an unescaped forward slash appear inside a character class [...], for example /[/']/, without terminating the regex literal?

💡 Result:

Yes. In an ECMAScript regular expression literal (/.../), an unescaped / may appear inside a character class [...] without terminating the literal, because the lexical grammar for RegularExpressionClassChar excludes ] and \—but does not exclude /. Therefore /[/']/ is tokenized as a single regex literal and its character class matches either / or '. [1][2]


🏁 Script executed:

# Examine the lexer file around the flagged lines
head -n 1650 crates/biome_html_parser/src/lexer/mod.rs | tail -n +1550 | cat -n

Repository: biomejs/biome

Length of output: 5071


🏁 Script executed:

# Search for the in_regex field and slash_starts_regex function
rg -n "slash_starts_regex|in_regex" crates/biome_html_parser/src/lexer/mod.rs | head -20

Repository: biomejs/biome

Length of output: 444


🏁 Script executed:

# Check for regex-related tests
fd -t f "test" crates/biome_html_parser --exec rg -l "regex|slash" {} \;

Repository: biomejs/biome

Length of output: 100


🏁 Script executed:

# Read the regex closing logic and surrounding context
sed -n '1650,1680p' crates/biome_html_parser/src/lexer/mod.rs | cat -n

Repository: biomejs/biome

Length of output: 1428


🏁 Script executed:

# Read the slash_starts_regex function
sed -n '1777,1810p' crates/biome_html_parser/src/lexer/mod.rs | cat -n

Repository: biomejs/biome

Length of output: 1314


🏁 Script executed:

# Search for character class tests or regex tests
rg -A 5 -B 5 "in_regex|character.*class|slash" crates/biome_html_parser/src/lexer/tests.rs | head -60

Repository: biomejs/biome

Length of output: 908


Character classes inside regex literals are not tracked, causing early termination.

The regex closing logic (lines 1656–1661) terminates on any unescaped /, but ECMAScript regex literals permit / inside character classes without closing the pattern—/[/']/ is valid. This causes the lexer to exit regex mode too early, leaving subsequent quotes or dashes to confuse frontmatter fence detection. Tracking character classes (detecting [, ], and escape sequences within them) would prevent this false termination.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_html_parser/src/lexer/mod.rs` around lines 1587 - 1588, The
lexer currently uses the boolean flag in_regex to decide when a regex literal
ends but does not track character-class state, so unescaped '/' inside [...]
prematurely terminates the regex; update the regex handling to track character
classes and escapes (e.g., replace or augment the in_regex boolean with a small
state machine or additional flag like in_regex_char_class) inside the lexer
routine that handles regex closing (the code that checks for unescaped '/' to
exit regex mode). Specifically, when entering a '[' inside a regex set a
char-class state, ignore '/' (and handle backslash-escaped characters) until a
matching unescaped ']' is seen, then resume normal regex-close logic that treats
only unescaped '/' as the terminator; ensure escape sequences (backslashes) are
honored both inside and outside character classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems important

Comment on lines +1773 to +1793
/// Returns whether a deferred `/` starts a regex literal based on
/// `prev_non_ws_byte`. After an identifier character, closing
/// paren/bracket, number, or `++`/`--` suffix, `/` is division. In all
/// other positions `/` starts a regex.
fn slash_starts_regex(&self) -> bool {
match self.prev_non_ws_byte {
None => true,
Some(b) => !matches!(
b,
b'a'..=b'z'
| b'A'..=b'Z'
| b'0'..=b'9'
| b'_'
| b'$'
| b')'
| b']'
| b'+'
| b'-'
),
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In ECMAScript, does a /begin a regular expression literal in contexts likereturn /re/andx + /re/, or are those parsed as division?

💡 Result:

Yes—both return /re/ and x + /re/ are parsed with /re/ as a RegularExpressionLiteral, not as division.

ECMAScript avoids ambiguity by using different lexical goal symbols: in syntactic positions where a regex literal is permitted, the scanner uses InputElementRegExp; otherwise it uses InputElementDiv. The spec notes there is no syntactic context where both a leading division operator and a leading RegularExpressionLiteral are permitted, so the choice is determined by grammar context. In particular, after return (when the expression is on the same line) and after +, the grammar expects an expression, so / starts a regex literal. [1]

By contrast, when the grammar expects an operator continuing an expression (e.g., after an identifier or literal), / is taken as division (the spec gives a = b newline /hi/... as a case that becomes division, not regex, because the context allows division). [1][2]


One-byte look-behind cannot distinguish valid regex starters in expression contexts.

slash_starts_regex() fails for return /re/ and x + /re/ because the previous non-whitespace byte alone cannot capture ECMAScript's requirement that / begins a regex when the grammar expects an expression. The spec uses syntactic context (lexical goal symbols) to make this choice, not character-level heuristics. A more coarse-grained state tracking which positions allow expression starters is needed, along with regression tests for these cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_html_parser/src/lexer/mod.rs` around lines 1773 - 1793,
slash_starts_regex currently relies only on prev_non_ws_byte and misclassifies
cases like "return /re/" and "x + /re/"; change it to consult a lexer-level
syntactic flag (e.g., allow_expression_start or lexical_goal) that tracks
whether the grammar currently expects an expression. Update the lexer to
set/clear that flag when emitting tokens that begin or end expression contexts
(for example in the logic that emits tokens for keywords like return,
punctuators like '(', '{', '[', operators like '+', '-', and after semicolons),
replace slash_starts_regex to return true when allow_expression_start is true
(falling back to the existing byte check only if you must), and add regression
tests specifically for "return /re/" and "x + /re/" to ensure regexes are
recognized correctly; reference the existing function slash_starts_regex and the
state prev_non_ws_byte when making the change.

Copy link
Contributor

@dyc3 dyc3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think the better fix is to check if the --- had a new line right before it.

@ematipico
Copy link
Member Author

I actually think the better fix is to check if the --- had a new line right before it.

No it doesn't, it was one of the first bugs reported that I fixed.

---
let a = "
---
"
---

This was the bug reported.

Copy link
Contributor

@dyc3 dyc3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Nevermind then.

@ematipico ematipico merged commit 1302740 into main Mar 18, 2026
17 checks passed
@ematipico ematipico deleted the fix/regex-frontmatter branch March 18, 2026 06:01
@github-actions github-actions bot mentioned this pull request Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Parser Area: parser L-HTML Language: HTML and super languages

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛 Astro: various things result in frontmatter parse: Unterminated regex literal

2 participants