fix(markdown_parser): recognize setext heading inside blockquote #9782

Merged
ematipico merged 7 commits into biomejs:main from jfmcdowell:fix/md-setext-heading-in-blockquote
Apr 6, 2026

Conversation

@jfmcdowell
Contributor

Note

This PR was created with AI assistance (Claude Code).

Summary

Fixes setext heading detection inside blockquotes.

After consuming a blockquote prefix (> ), the lexer no longer considered the following token to be at line start, so --- was lexed as MINUS instead of MD_THEMATIC_BREAK_LITERAL. As a result, input like:

> Foo
> ---

was parsed as a paragraph instead of a setext heading. Per CommonMark §5.1 and §4.3, blockquote content should still participate in setext heading parsing after the quote prefix is removed.

This adds force_relex_at_line_start to re-lex the current token as if it were at line start, and uses it in the blockquote/setext detection path.
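
To make the mechanism concrete, here is a minimal sketch in Rust of a line-start-gated lexer and a forced re-lex. It is a toy model under stated assumptions, not Biome's code: `Kind`, `lex_token`, and `ToyLexer` are hypothetical names, and only `force_relex_at_line_start` echoes a name from this PR.

```rust
// Toy model only. `Kind`, `lex_token`, and `ToyLexer` are hypothetical
// and do not reproduce Biome's lexer or its token kinds.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    ThematicBreak, // stands in for MD_THEMATIC_BREAK_LITERAL
    Minus,         // stands in for MINUS
    Text,
}

/// Lex one token from `rest`. Block-level tokens are only produced
/// when `at_line_start` is true.
fn lex_token(rest: &str, at_line_start: bool) -> (Kind, usize) {
    let line = rest.lines().next().unwrap_or("");
    let dashes = line.chars().filter(|c| *c == '-').count();
    let only_dashes_and_ws = line.chars().all(|c| c == '-' || c == ' ' || c == '\t');
    if at_line_start && dashes >= 3 && only_dashes_and_ws {
        return (Kind::ThematicBreak, line.len());
    }
    if line.starts_with('-') {
        return (Kind::Minus, 1);
    }
    (Kind::Text, line.len())
}

struct ToyLexer<'a> {
    src: &'a str,
    start: usize,           // byte offset where the current token begins
    current: (Kind, usize), // kind and byte length of the current token
}

impl<'a> ToyLexer<'a> {
    /// Build a lexer positioned just after a consumed `> ` prefix.
    /// At this point the lexer does not consider itself at line start.
    fn after_quote_prefix(src: &'a str, start: usize) -> Self {
        let current = lex_token(&src[start..], false);
        Self { src, start, current }
    }

    fn kind(&self) -> Kind {
        self.current.0
    }

    /// Discard the current token and lex again from the same offset,
    /// this time treating the position as the start of a line.
    fn force_relex_at_line_start(&mut self) -> Kind {
        self.current = lex_token(&self.src[self.start..], true);
        self.current.0
    }
}
```

In this model, a lexer built with `ToyLexer::after_quote_prefix("> ---", 2)` first reports `Kind::Minus`; calling `force_relex_at_line_start` replaces the current token with `Kind::ThematicBreak`, which is the token a setext-underline check would need to see.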

Test Plan

  • New lexer unit test: force_relex_at_line_start_produces_thematic_break
  • New fixture: setext_heading_in_blockquote.md
  • just test-crate biome_markdown_parser
  • just test-markdown-conformance
  • just f
  • just l

Docs

N/A

@changeset-bot

changeset-bot bot commented Apr 3, 2026

⚠️ No Changeset found

Latest commit: e9eca12

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@github-actions github-actions bot added the A-Parser (Area: parser) and L-Markdown (Language: Markdown) labels Apr 3, 2026
@codspeed-hq

codspeed-hq bot commented Apr 3, 2026

Merging this PR will not alter performance

✅ 58 untouched benchmarks
⏩ 196 skipped benchmarks [1]


Comparing jfmcdowell:fix/md-setext-heading-in-blockquote (e9eca12) with main (1d09f0f)

Footnotes

  1. 196 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

@jfmcdowell jfmcdowell marked this pull request as ready for review April 3, 2026 00:24
@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This change adds a "re-lex as if at line start" capability used by the markdown lexer and parser. It implements BufferedLexer::force_relex_at_line_start, exposes MarkdownTokenSource::force_relex_at_line_start and MarkdownParser::force_relex_at_line_start, and invokes that re-lexing at specific points after consuming > quote prefixes so line-start‑gated tokens (for example MD_THEMATIC_BREAK_LITERAL / setext underlines) are recognised inside blockquotes. Tests and fixtures for setext headings and thematic breaks in blockquotes were added.

Suggested reviewers

  • dyc3
  • ematipico
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: fixing setext heading detection inside blockquotes, which is the core problem addressed by the PR.
Description check | ✅ Passed | The description clearly explains the problem (setext headings not recognised in blockquotes due to lexer re-lexing), the solution (force_relex_at_line_start), and provides test coverage details.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/src/syntax/mod.rs`:
- Around line 871-875: The quote-entry path in parse_quote_block_list (after
emit_quote_prefix_node / after consume_quote_prefix) does not call
p.force_relex_at_line_start(), so a quoted line like "> ---" is tokenized as
MINUS tokens instead of MD_THEMATIC_BREAK_LITERAL; add a call to
p.force_relex_at_line_start() immediately after
emit_quote_prefix_node()/consume_quote_prefix in parse_quote_block_list (or in
the first-block dispatch that handles the entry path) so the
paragraph/thematic-break lexer runs at line start, and add a unit test asserting
that a standalone blockquote thematic break (e.g., "> ---") is parsed as a
thematic break inside the blockquote to prevent regressions.

📥 Commits

Reviewing files that changed from the base of the PR and between b22f31a and ad20351.

⛔ Files ignored due to path filters (2)
  • crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_edge_cases.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_in_blockquote.md.snap is excluded by !**/*.snap and included by **
📒 Files selected for processing (7)
  • crates/biome_markdown_parser/src/lexer/tests.rs
  • crates/biome_markdown_parser/src/parser.rs
  • crates/biome_markdown_parser/src/syntax/mod.rs
  • crates/biome_markdown_parser/src/token_source.rs
  • crates/biome_markdown_parser/tests/md_test_suite/ok/setext_heading_in_blockquote.md
  • crates/biome_markdown_parser/tests/spec_test.rs
  • crates/biome_parser/src/lexer.rs

Comment thread crates/biome_markdown_parser/src/syntax/mod.rs
|| (p.at(MD_TEXTUAL_LITERAL)
    && p.cur_text()
        .chars()
        .all(|c| c == ' ' || c == '\t' || c == '-' || c == '*' || c == '_'));
Member

Let's use the lookup table for faster access to bytes.

Also, I believe this logic is incorrect: we're checking a union of characters, which means text like _*- satisfies the all() check, and I don't think that is correct.
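
The concern can be reproduced in isolation. The sketch below contrasts the union check from the snippet above with a stricter check that requires a single marker character, following the CommonMark rule that a thematic break is three or more matching `-`, `_`, or `*` characters, optionally separated by spaces or tabs. Both function names are hypothetical, the stricter check ignores the indentation limit, and this is not the code that landed in the PR.

```rust
/// The union check as written in the reviewed snippet: any mix of
/// spaces, tabs, `-`, `*`, and `_` is accepted.
fn union_check(text: &str) -> bool {
    text.chars()
        .all(|c| c == ' ' || c == '\t' || c == '-' || c == '*' || c == '_')
}

/// Hypothetical stricter check: every marker must be the same character,
/// and at least three markers are required. Spaces and tabs are skipped.
fn single_marker_check(text: &str) -> bool {
    let mut marker: Option<char> = None;
    let mut count = 0usize;
    for c in text.chars() {
        match c {
            ' ' | '\t' => {}
            '-' | '*' | '_' => {
                // Remember the first marker seen; reject any different one.
                if *marker.get_or_insert(c) != c {
                    return false;
                }
                count += 1;
            }
            _ => return false,
        }
    }
    count >= 3
}
```

With these definitions, `union_check("_*-")` returns `true` while `single_marker_check("_*-")` returns `false`; `single_marker_check("- - -")` still returns `true`. The union check also accepts whitespace-only text, which the stricter check rejects.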

Comment on lines +129 to +134
/// After consuming a quote prefix, selectively re-lex the current token as if
/// it were at line start when the remaining line could form a thematic break.
///
/// Re-lexing unconditionally perturbs ordinary quoted text tokenization by
/// splitting leading spaces into separate tokens. We only need line-start
/// semantics here for thematic-break candidates like `> ---`.
Member

@ematipico ematipico Apr 3, 2026

The comment is good and technical, but it doesn't actually explain the criteria we check to decide whether the line is a thematic break.

I suggest rewording the docstring to be more concrete, or adding inline comments on the unusual parts of the code. For example, the all() usage looks odd to me and is probably wrong (I might be mistaken, which is why I'm asking for a more down-to-earth comment).

@jfmcdowell
Contributor Author

@ematipico feedback addressed in c83efe6. After morning coffee, I agree: all() was buggy and has been replaced.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/src/syntax/quote.rs`:
- Line 100: The re-lex for thematic breaks after consuming a quote prefix
(currently in force_relex_thematic_break_after_quote_prefix(p)) misses cases
where parse_code_block_newline() consumes the quote prefix and returns parked at
the next line, so create a shared helper (e.g.,
mark_quote_prefix_consumed_and_relex(p)) and call it whenever any code path
consumes a quote prefix — replace direct calls to
force_relex_thematic_break_after_quote_prefix(p) and add a call from
parse_code_block_newline() (and the other spot referenced around 328) so the
next token is re-lexed and indented-code hand-off (e.g., `>     code\n> ---`)
runs through the same re-lex hook.

📥 Commits

Reviewing files that changed from the base of the PR and between c83efe6 and ce5b6a4.

📒 Files selected for processing (1)
  • crates/biome_markdown_parser/src/syntax/quote.rs

Comment thread crates/biome_markdown_parser/src/syntax/quote.rs Outdated
Comment thread crates/biome_markdown_parser/src/syntax/quote.rs Outdated
jfmcdowell and others added 7 commits April 5, 2026 19:42
After consuming a blockquote prefix (`> `), the lexer's `after_newline`
flag is false, so `---` is lexed as MINUS tokens instead of
MD_THEMATIC_BREAK_LITERAL. This prevented setext heading detection
inside blockquotes.

Add `force_relex_at_line_start` to the buffered lexer which re-lexes
the current token with `after_line_break = true`. Use it in
`classify_quote_break_after_newline` (lookahead) and
`break_for_quote_prefix_after_inline_newline` (parse path) so the
lexer produces the correct block-level tokens after a quote prefix.

Address review feedback: use `biome_unicode_table` dispatch variants
(MIN, MUL, IDT) instead of raw byte literals for thematic break
character matching in `is_thematic_break_candidate_text`.
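
As an illustration of the table-dispatch idea, here is a self-contained sketch of a 256-entry byte lookup table. It is only a toy: the `Class` enum, `TABLE`, `classify`, and `is_marker_byte` are hypothetical stand-ins that borrow the MIN, MUL, and IDT names from the commit message, and they do not reproduce the actual table, variant meanings, or API of `biome_unicode_table`.

```rust
// Hypothetical byte classes; the names only echo the variants mentioned
// in the commit message and carry no claim about the real crate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    Min,   // '-'
    Mul,   // '*'
    Idt,   // '_' (in this toy table only)
    Whs,   // space or tab
    Other, // everything else
}

/// Build the table at compile time so classification is a single index.
const fn build_table() -> [Class; 256] {
    let mut table = [Class::Other; 256];
    table[b'-' as usize] = Class::Min;
    table[b'*' as usize] = Class::Mul;
    table[b'_' as usize] = Class::Idt;
    table[b' ' as usize] = Class::Whs;
    table[b'\t' as usize] = Class::Whs;
    table
}

static TABLE: [Class; 256] = build_table();

/// Classify a byte with one array lookup instead of a chain of comparisons.
fn classify(byte: u8) -> Class {
    TABLE[byte as usize]
}

/// True when the byte is one of the three thematic-break marker characters.
fn is_marker_byte(byte: u8) -> bool {
    matches!(classify(byte), Class::Min | Class::Mul | Class::Idt)
}
```

The design choice is that each byte is classified by indexing a precomputed array, so a caller can `match` on the class rather than compare against raw byte literals.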
@jfmcdowell jfmcdowell force-pushed the fix/md-setext-heading-in-blockquote branch from 831dca1 to e9eca12 Compare April 5, 2026 23:43

Labels

A-Parser Area: parser L-Markdown Language: Markdown

2 participants