fix(markdown_parser): recognize setext heading inside blockquote#9782
Conversation
|
Merging this PR will not alter performance
Comparing Footnotes
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis change adds a "re-lex as if at line start" capability used by the markdown lexer and parser. It implements Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_markdown_parser/src/syntax/mod.rs`:
- Around line 871-875: The quote-entry path in parse_quote_block_list (after
emit_quote_prefix_node / after consume_quote_prefix) does not call
p.force_relex_at_line_start(), so a quoted line like "> ---" is tokenized as
MINUS tokens instead of MD_THEMATIC_BREAK_LITERAL; add a call to
p.force_relex_at_line_start() immediately after
emit_quote_prefix_node()/consume_quote_prefix in parse_quote_block_list (or in
the first-block dispatch that handles the entry path) so the
paragraph/thematic-break lexer runs at line start, and add a unit test asserting
that a standalone blockquote thematic break (e.g., "> ---") is parsed as a
thematic break inside the blockquote to prevent regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aff0ab95-7158-4eab-a64f-ca5c0b4a4555
⛔ Files ignored due to path filters (2)
crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_edge_cases.md.snapis excluded by!**/*.snapand included by**crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_in_blockquote.md.snapis excluded by!**/*.snapand included by**
📒 Files selected for processing (7)
crates/biome_markdown_parser/src/lexer/tests.rscrates/biome_markdown_parser/src/parser.rscrates/biome_markdown_parser/src/syntax/mod.rscrates/biome_markdown_parser/src/token_source.rscrates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_in_blockquote.mdcrates/biome_markdown_parser/tests/spec_test.rscrates/biome_parser/src/lexer.rs
| || (p.at(MD_TEXTUAL_LITERAL) | ||
| && p.cur_text() | ||
| .chars() | ||
| .all(|c| c == ' ' || c == '\t' || c == '-' || c == '*' || c == '_')); |
There was a problem hiding this comment.
Let's use the lookup table for faster access to bytes.
Also, I believe this logic is incorrect: we're checking a union of characters, which means text like _*- matches the all() function , which I believe it's not correct
| /// After consuming a quote prefix, selectively re-lex the current token as if | ||
| /// it were at line start when the remaining line could form a thematic break. | ||
| /// | ||
| /// Re-lexing unconditionally perturbs ordinary quoted text tokenization by | ||
| /// splitting leading spaces into separate tokens. We only need line-start | ||
| /// semantics here for thematic-break candidates like `> ---`. |
There was a problem hiding this comment.
While I understand the good and technical comment, it doesn't actually explain the criteria of what we check for the thematic line break.
I suggest rewording the docstring with a more concrete approach, or having some inline comments in the weird parts of the code. For example the all() function usage is weird to me, and probably wrong (I might be wrong, but alas that's why I ask for a more down to earth comment)
|
@ematipico feedback addressed in c83efe6. After morning coffee, all() was buggy and has been replaced. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_markdown_parser/src/syntax/quote.rs`:
- Line 100: The re-lex for thematic breaks after consuming a quote prefix
(currently in force_relex_thematic_break_after_quote_prefix(p)) misses cases
where parse_code_block_newline() consumes the quote prefix and returns parked at
the next line, so create a shared helper (e.g.,
mark_quote_prefix_consumed_and_relex(p)) and call it whenever any code path
consumes a quote prefix — replace direct calls to
force_relex_thematic_break_after_quote_prefix(p) and add a call from
parse_code_block_newline() (and the other spot referenced around 328) so the
next token is re-lexed and indented-code hand-off (e.g., `> code\n> ---`)
runs through the same re-lex hook.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e9c205ae-4952-4b5e-b5dc-3d577f7a37d0
📒 Files selected for processing (1)
crates/biome_markdown_parser/src/syntax/quote.rs
After consuming a blockquote prefix (`> `), the lexer's `after_newline` flag is false, so `---` is lexed as MINUS tokens instead of MD_THEMATIC_BREAK_LITERAL. This prevented setext heading detection inside blockquotes. Add `force_relex_at_line_start` to the buffered lexer which re-lexes the current token with `after_line_break = true`. Use it in `classify_quote_break_after_newline` (lookahead) and `break_for_quote_prefix_after_inline_newline` (parse path) so the lexer produces the correct block-level tokens after a quote prefix.
…andidate, add tests
Address review feedback: use `biome_unicode_table` dispatch variants (MIN, MUL, IDT) instead of raw byte literals for thematic break character matching in `is_thematic_break_candidate_text`.
831dca1 to
e9eca12
Compare
Note
This PR was created with AI assistance (Claude Code).
Summary
Fixes setext heading detection inside blockquotes.
After consuming a blockquote prefix (
>), the lexer no longer considered the following token to be at line start, so---was lexed asMINUSinstead ofMD_THEMATIC_BREAK_LITERAL. As a result, input like:was parsed as a paragraph instead of a setext heading. Per CommonMark §5.1 and §4.3, blockquote content should still participate in setext heading parsing after the quote prefix is removed.
This adds
force_relex_at_line_startto re-lex the current token as if it were at line start, and uses it in the blockquote/setext detection path.Test Plan
force_relex_at_line_start_produces_thematic_breaksetext_heading_in_blockquote.mdjust test-crate biome_markdown_parserjust test-markdown-conformancejust fjust lDocs
N/A