fix(parser/html): regex literals in frontmatter#9531
Conversation
🦋 Changeset detectedLatest commit: 838ed8e The changes in this PR will be included in the next version bump. This PR includes changesets to release 13 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Merging this PR will not alter performance
Comparing Footnotes
|
WalkthroughThis PR fixes Astro frontmatter parsing failures when regex literals contain quotes or dashes (e.g., Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip You can customize the high-level summary generated by CodeRabbit.Configure the |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_html_parser/src/lexer/mod.rs`:
- Around line 1773-1793: slash_starts_regex currently relies only on
prev_non_ws_byte and misclassifies cases like "return /re/" and "x + /re/";
change it to consult a lexer-level syntactic flag (e.g., allow_expression_start
or lexical_goal) that tracks whether the grammar currently expects an
expression. Update the lexer to set/clear that flag when emitting tokens that
begin or end expression contexts (for example in the logic that emits tokens for
keywords like return, punctuators like '(', '{', '[', operators like '+', '-',
and after semicolons), replace slash_starts_regex to return true when
allow_expression_start is true (falling back to the existing byte check only if
you must), and add regression tests specifically for "return /re/" and "x +
/re/" to ensure regexes are recognized correctly; reference the existing
function slash_starts_regex and the state prev_non_ws_byte when making the
change.
- Around line 1587-1588: The lexer currently uses the boolean flag in_regex to
decide when a regex literal ends but does not track character-class state, so
unescaped '/' inside [...] prematurely terminates the regex; update the regex
handling to track character classes and escapes (e.g., replace or augment the
in_regex boolean with a small state machine or additional flag like
in_regex_char_class) inside the lexer routine that handles regex closing (the
code that checks for unescaped '/' to exit regex mode). Specifically, when
entering a '[' inside a regex set a char-class state, ignore '/' (and handle
backslash-escaped characters) until a matching unescaped ']' is seen, then
resume normal regex-close logic that treats only unescaped '/' as the
terminator; ensure escape sequences (backslashes) are honored both inside and
outside character classes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1401cf85-22d2-4e5f-a9b8-8da653b11e43
⛔ Files ignored due to path filters (1)
crates/biome_html_parser/tests/html_specs/ok/astro/issue_9187.astro.snapis excluded by!**/*.snapand included by**
📒 Files selected for processing (3)
.changeset/fix-astro-regex-frontmatter.mdcrates/biome_html_parser/src/lexer/mod.rscrates/biome_html_parser/tests/html_specs/ok/astro/issue_9187.astro
| /// Whether we are currently inside a regex literal (`/…/`). | ||
| in_regex: bool, |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In ECMAScript regular expression literals, may an unescaped forward slash appear inside a character class [...], for example /[/']/, without terminating the regex literal?
💡 Result:
Yes. In an ECMAScript regular expression literal (/.../), an unescaped / may appear inside a character class [...] without terminating the literal, because the lexical grammar for RegularExpressionClassChar excludes ] and \—but does not exclude /. Therefore /[/']/ is tokenized as a single regex literal and its character class matches either / or '. [1][2]
🏁 Script executed:
# Examine the lexer file around the flagged lines
head -n 1650 crates/biome_html_parser/src/lexer/mod.rs | tail -n +1550 | cat -nRepository: biomejs/biome
Length of output: 5071
🏁 Script executed:
# Search for the in_regex field and slash_starts_regex function
rg -n "slash_starts_regex|in_regex" crates/biome_html_parser/src/lexer/mod.rs | head -20Repository: biomejs/biome
Length of output: 444
🏁 Script executed:
# Check for regex-related tests
fd -t f "test" crates/biome_html_parser --exec rg -l "regex|slash" {} \;Repository: biomejs/biome
Length of output: 100
🏁 Script executed:
# Read the regex closing logic and surrounding context
sed -n '1650,1680p' crates/biome_html_parser/src/lexer/mod.rs | cat -nRepository: biomejs/biome
Length of output: 1428
🏁 Script executed:
# Read the slash_starts_regex function
sed -n '1777,1810p' crates/biome_html_parser/src/lexer/mod.rs | cat -nRepository: biomejs/biome
Length of output: 1314
🏁 Script executed:
# Search for character class tests or regex tests
rg -A 5 -B 5 "in_regex|character.*class|slash" crates/biome_html_parser/src/lexer/tests.rs | head -60Repository: biomejs/biome
Length of output: 908
Character classes inside regex literals are not tracked, causing early termination.
The regex closing logic (lines 1656–1661) terminates on any unescaped /, but ECMAScript regex literals permit / inside character classes without closing the pattern—/[/']/ is valid. This causes the lexer to exit regex mode too early, leaving subsequent quotes or dashes to confuse frontmatter fence detection. Tracking character classes (detecting [, ], and escape sequences within them) would prevent this false termination.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/biome_html_parser/src/lexer/mod.rs` around lines 1587 - 1588, The
lexer currently uses the boolean flag in_regex to decide when a regex literal
ends but does not track character-class state, so unescaped '/' inside [...]
prematurely terminates the regex; update the regex handling to track character
classes and escapes (e.g., replace or augment the in_regex boolean with a small
state machine or additional flag like in_regex_char_class) inside the lexer
routine that handles regex closing (the code that checks for unescaped '/' to
exit regex mode). Specifically, when entering a '[' inside a regex set a
char-class state, ignore '/' (and handle backslash-escaped characters) until a
matching unescaped ']' is seen, then resume normal regex-close logic that treats
only unescaped '/' as the terminator; ensure escape sequences (backslashes) are
honored both inside and outside character classes.
| /// Returns whether a deferred `/` starts a regex literal based on | ||
| /// `prev_non_ws_byte`. After an identifier character, closing | ||
| /// paren/bracket, number, or `++`/`--` suffix, `/` is division. In all | ||
| /// other positions `/` starts a regex. | ||
| fn slash_starts_regex(&self) -> bool { | ||
| match self.prev_non_ws_byte { | ||
| None => true, | ||
| Some(b) => !matches!( | ||
| b, | ||
| b'a'..=b'z' | ||
| | b'A'..=b'Z' | ||
| | b'0'..=b'9' | ||
| | b'_' | ||
| | b'$' | ||
| | b')' | ||
| | b']' | ||
| | b'+' | ||
| | b'-' | ||
| ), | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In ECMAScript, does a /begin a regular expression literal in contexts likereturn /re/andx + /re/, or are those parsed as division?
💡 Result:
Yes—both return /re/ and x + /re/ are parsed with /re/ as a RegularExpressionLiteral, not as division.
ECMAScript avoids ambiguity by using different lexical goal symbols: in syntactic positions where a regex literal is permitted, the scanner uses InputElementRegExp; otherwise it uses InputElementDiv. The spec notes there is no syntactic context where both a leading division operator and a leading RegularExpressionLiteral are permitted, so the choice is determined by grammar context. In particular, after return (when the expression is on the same line) and after +, the grammar expects an expression, so / starts a regex literal. [1]
By contrast, when the grammar expects an operator continuing an expression (e.g., after an identifier or literal), / is taken as division (the spec gives a = b newline /hi/... as a case that becomes division, not regex, because the context allows division). [1][2]
One-byte look-behind cannot distinguish valid regex starters in expression contexts.
slash_starts_regex() fails for return /re/ and x + /re/ because the previous non-whitespace byte alone cannot capture ECMAScript's requirement that / begins a regex when the grammar expects an expression. The spec uses syntactic context (lexical goal symbols) to make this choice, not character-level heuristics. A more coarse-grained state tracking which positions allow expression starters is needed, along with regression tests for these cases.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/biome_html_parser/src/lexer/mod.rs` around lines 1773 - 1793,
slash_starts_regex currently relies only on prev_non_ws_byte and misclassifies
cases like "return /re/" and "x + /re/"; change it to consult a lexer-level
syntactic flag (e.g., allow_expression_start or lexical_goal) that tracks
whether the grammar currently expects an expression. Update the lexer to
set/clear that flag when emitting tokens that begin or end expression contexts
(for example in the logic that emits tokens for keywords like return,
punctuators like '(', '{', '[', operators like '+', '-', and after semicolons),
replace slash_starts_regex to return true when allow_expression_start is true
(falling back to the existing byte check only if you must), and add regression
tests specifically for "return /re/" and "x + /re/" to ensure regexes are
recognized correctly; reference the existing function slash_starts_regex and the
state prev_non_ws_byte when making the change.
dyc3
left a comment
There was a problem hiding this comment.
I actually think the better fix is to check if the --- had a new line right before it.
No it doesn't, it was one of the first bugs reported that I fixed. This was the bug reported. |
Summary
Closes #9187
Asked Claude Code to fix it
Test Plan
Added tests from the report
Docs
N/A