Skip to content

refactor(markdown-parser): align newline/prescan paragraph-break checks#9197

Merged
ematipico merged 5 commits intobiomejs:mainfrom
jfmcdowell:refactor/newline-prescan-parity
Mar 3, 2026
Merged

refactor(markdown-parser): align newline/prescan paragraph-break checks#9197
ematipico merged 5 commits intobiomejs:mainfrom
jfmcdowell:refactor/newline-prescan-parity

Conversation

@jfmcdowell
Copy link
Contributor

Note

AI Assistance Disclosure: This PR was developed with assistance from Claude Code.

Summary

  • Extract shared at_paragraph_break predicate to consolidate duplicate setext, thematic break, fence, block interrupt, and textual list marker checks between handle_inline_newline and inline_list_source_len.
  • Add a setext/thematic check after list-indent stripping in inline_list_source_len, tightening parse/prescan parity for list-indented continuation lines.
  • Align classify_quote_break_after_newline so both callers consistently consider textual list markers, then remove the now-redundant include_textual_markers parameter.
  • Add a focused regression fixture and snapshot (quote_textual_marker_parity.md) covering quote-newline handling for textual - / 1. markers and normal quote continuation.

No user-facing behavior change is intended; this aligns parse/prescan decision points and adds regression coverage to lock parity.
Scope is limited to crates/biome_markdown_parser/src/syntax/mod.rs plus crates/biome_markdown_parser/tests/md_test_suite/ok/quote_textual_marker_parity.md and its snapshot.

Test Plan

  • cargo test -p biome_markdown_parser
  • cargo insta test -p biome_markdown_parser
  • just test-markdown-conformance
  • just f && just l

Docs

N/A

Consolidate duplicate setext underline, thematic break, fence, block
interrupt, and textual list marker checks from handle_inline_newline
and inline_list_source_len into a shared at_paragraph_break predicate.

This also adds the previously missing textual_looks_like_list_marker
check to the prescan (inline_list_source_len), aligning it with the
parse path.
The prescan (inline_list_source_len) was missing a setext underline
check after stripping the list item's required indent. This parity
gap with handle_inline_newline was harmless — the prescan caught it
on the next iteration — but made the code harder to reason about.

Add a direct check after indent stripping and document the remaining
block-interrupt parity gap (handled on the next iteration).
Both callers of classify_quote_break_after_newline now pass true for
include_textual_markers (the prescan was previously passing false, but
at_paragraph_break already checks textual_looks_like_list_marker in
the subsequent step). Remove the parameter entirely and always check
textual markers, simplifying the API.
@changeset-bot
Copy link

changeset-bot bot commented Feb 23, 2026

⚠️ No Changeset found

Latest commit: 1971de8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@github-actions github-actions bot added the A-Parser Area: parser label Feb 23, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 23, 2026

Walkthrough

Removed the include_textual_markers parameter from classify_quote_break_after_newline and updated its call sites. Added a new crate-visible predicate at_paragraph_break(p, has_content) to centralise paragraph-interruption checks (setext underline, thematic break, fence starts, block interrupts, textual list markers). Updated handle_inline_newline and inline_list_source_len to use the new predicate and revised quote-break logic; re-checks for setext/thematic markers after stripping list indentation were added. Adjusted fence handling for textual fences and added a new Markdown test covering quoted blocks with nested lists and paragraphs.

Suggested labels: A-Parser

Suggested reviewers

  • ematipico
  • dyc3
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarises the main refactoring objective: aligning paragraph-break checks between newline and prescan logic paths in the markdown parser.
Description check ✅ Passed The description clearly relates to the changeset, detailing the extracted at_paragraph_break predicate, parameter removal, setext/thematic checks, and the regression test fixture.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/syntax/mod.rs (2)

851-867: Doc comment is missing textual list markers.

Line 866 returns true for textual_looks_like_list_marker, but the doc only lists setext underline, thematic break, fence, and block interrupt.

📝 Proposed doc fix
-/// Check if the current position is a paragraph break (setext underline,
-/// thematic break, fence, or block interrupt).
+/// Check if the current position is a paragraph break (setext underline,
+/// thematic break, fence, block interrupt, or textual list marker).

As per coding guidelines: "Use inline rustdoc documentation for rules, assists, and their options."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/mod.rs` around lines 851 - 867, The
doc comment for function at_paragraph_break omits "textual list markers" even
though the function checks textual_looks_like_list_marker; update the Rustdoc to
list all break conditions including setext underline, thematic break, fence,
block interrupt, and textual list markers (or "textual list marker" detection)
so the documentation matches the implementation in at_paragraph_break and
references the textual_looks_like_list_marker check.

1235-1235: Prescan always treats has_content as true — minor parity gap with parse path.

Two related spots:

  1. Line 1235 — at_paragraph_break(p, true): setext/thematic-break detection is always active, whereas handle_inline_newline passes the actual has_content value (and the parse path at lines 965, 1068–1082 gates on has_content).
  2. Lines 1273–1278 — the post-indent setext/thematic re-check also has no has_content guard (cf. parse path at line 965: if is_setext && has_content).

Net effect: the prescan may terminate earlier than the parse path for continuation lines that lead with a setext/thematic marker without preceding paragraph content, producing a shorter emphasis-context span. Nothing catastrophic, but it's worth tracking as a known parity gap.

Also applies to: 1266-1278

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/mod.rs` at line 1235, Prescan
currently calls at_paragraph_break(p, true) and re-checks setext/thematic
markers unconditionally, causing prescan to always treat has_content as true and
potentially stop earlier than the parse path; modify the prescan logic so it
uses the actual has_content flag (propagate the same has_content value used by
handle_inline_newline) when calling at_paragraph_break and add the same guard
around the post-indent setext/thematic re-check (mirror the parse-path
condition: only check is_setext/thematic if has_content is true) so prescan
parity with the parse path (symbols: at_paragraph_break, handle_inline_newline,
is_setext, has_content) is restored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/biome_markdown_parser/src/syntax/mod.rs`:
- Around line 851-867: The doc comment for function at_paragraph_break omits
"textual list markers" even though the function checks
textual_looks_like_list_marker; update the Rustdoc to list all break conditions
including setext underline, thematic break, fence, block interrupt, and textual
list markers (or "textual list marker" detection) so the documentation matches
the implementation in at_paragraph_break and references the
textual_looks_like_list_marker check.
- Line 1235: Prescan currently calls at_paragraph_break(p, true) and re-checks
setext/thematic markers unconditionally, causing prescan to always treat
has_content as true and potentially stop earlier than the parse path; modify the
prescan logic so it uses the actual has_content flag (propagate the same
has_content value used by handle_inline_newline) when calling at_paragraph_break
and add the same guard around the post-indent setext/thematic re-check (mirror
the parse-path condition: only check is_setext/thematic if has_content is true)
so prescan parity with the parse path (symbols: at_paragraph_break,
handle_inline_newline, is_setext, has_content) is restored.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/biome_markdown_parser/src/syntax/mod.rs (2)

1275-1275: Nit: prefer is_dash_only_thematic_break(p) for consistency.

Every other call site in this file uses the thin wrapper is_dash_only_thematic_break(p); only line 1275 reaches into is_dash_only_thematic_break_text(p.cur_text()) directly.

🔧 Trivial fix
-                        && is_dash_only_thematic_break_text(p.cur_text()))
+                        && is_dash_only_thematic_break(p))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/mod.rs` at line 1275, Replace the
direct call to is_dash_only_thematic_break_text(p.cur_text()) with the existing
wrapper is_dash_only_thematic_break(p) for consistency with other call sites;
update the expression that currently references p.cur_text() so it instead
passes the parser/token context object p into is_dash_only_thematic_break to
match usages elsewhere in this module.

856-867: at_paragraph_break omits the < 4 indent guard seen in the inline loop, but this is safe.

The inline-loop checks at lines 1068–1084 guard with real_line_indent_from_source(p) < INDENT_CODE_BLOCK_SPACES; at_paragraph_break does not. In practice this isn't reachable with 4+ indent because (a) for required_indent == 0 the earlier setext check at line 942 fires first and when it falls through the parser is still positioned at the whitespace token, not the underline; (b) for required_indent > 0, allow_setext_heading returns false when indent < required_indent. Worth a brief comment to explain why the guard is deliberately absent here, so the next reader doesn't add it unnecessarily.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/mod.rs` around lines 856 - 867, The
function at_paragraph_break omits the real_line_indent_from_source(p) <
INDENT_CODE_BLOCK_SPACES guard present in the inline loop, which could confuse
future readers; add a brief comment inside or just above at_paragraph_break
explaining that the <4-indent check is intentionally omitted because (1) when
required_indent == 0 the earlier setext check handles the case and the parser is
positioned at the whitespace token, and (2) when required_indent > 0
allow_setext_heading already returns false for indent < required_indent, so the
guard is unnecessary—mention at_paragraph_break and allow_setext_heading by name
so maintainers know why not to reintroduce the <4 check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/src/syntax/mod.rs`:
- Line 1235: The prescan currently hardcodes has_content=true in calls like
at_paragraph_break(p, true), causing setext/thematic guards to be
unconditionally enabled; add a local mutable has_content flag (initialized
false) in the prescan loop (mirror parse_inline_item_list) and set it true
whenever you encounter a non-newline content token during the prescan, then
replace the hardcoded true in at_paragraph_break and the later re-check (the
post-list-indent strip) with this has_content variable and ensure the same
is_setext && has_content gating used in the parse path (e.g.,
handle_inline_newline / inline_list_source_len checks) is applied so
setext/thematic logic only runs when real content was seen.

---

Nitpick comments:
In `@crates/biome_markdown_parser/src/syntax/mod.rs`:
- Line 1275: Replace the direct call to
is_dash_only_thematic_break_text(p.cur_text()) with the existing wrapper
is_dash_only_thematic_break(p) for consistency with other call sites; update the
expression that currently references p.cur_text() so it instead passes the
parser/token context object p into is_dash_only_thematic_break to match usages
elsewhere in this module.
- Around line 856-867: The function at_paragraph_break omits the
real_line_indent_from_source(p) < INDENT_CODE_BLOCK_SPACES guard present in the
inline loop, which could confuse future readers; add a brief comment inside or
just above at_paragraph_break explaining that the <4-indent check is
intentionally omitted because (1) when required_indent == 0 the earlier setext
check handles the case and the parser is positioned at the whitespace token, and
(2) when required_indent > 0 allow_setext_heading already returns false for
indent < required_indent, so the guard is unnecessary—mention at_paragraph_break
and allow_setext_heading by name so maintainers know why not to reintroduce the
<4 check.

}

if at_block_interrupt(p) {
if at_paragraph_break(p, true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Prescan doesn't track has_content; both setext/thematic guards are unconditionally enabled.

at_paragraph_break(p, true) at line 1235 hardcodes has_content = true. The parse path (handle_inline_newline) gates the setext/thematic break on the actual has_content flag accumulated during parse. The re-check added at lines 1273–1278 (post-list-indent stripping) has the same gap — its parse-path mirror at lines 965–969 is guarded by if is_setext && has_content.

In practice inline_list_source_len is never called when the inline list starts with a NEWLINE (paragraphs can't start with one), so the risk is limited to emphasis-context sizing in edge cases. But this is a direct parity gap contrary to the PR's stated goal of aligning decision points.

A has_content local flag (mirroring how parse_inline_item_list tracks it) would close the gap:

💡 Sketch — add `has_content` tracking to the prescan
 fn inline_list_source_len(p: &mut MarkdownParser) -> usize {
     p.lookahead(|p| {
         let mut len = 0usize;
+        let mut has_content = false;

         loop {
             // …
             if p.at(NEWLINE) {
                 // …
-                if at_paragraph_break(p, true) {
+                if at_paragraph_break(p, has_content) {
                     break;
                 }
                 // …
                 // After stripping list indent, re-check setext/thematic markers
-                if p.at(MD_SETEXT_UNDERLINE_LITERAL)
-                    || (p.at(MD_THEMATIC_BREAK_LITERAL)
-                        && is_dash_only_thematic_break_text(p.cur_text()))
-                {
-                    break;
-                }
+                if has_content
+                    && (p.at(MD_SETEXT_UNDERLINE_LITERAL)
+                        || (p.at(MD_THEMATIC_BREAK_LITERAL) && is_dash_only_thematic_break(p)))
+                {
+                    break;
+                }
                 continue;
             }
             // …
             len += p.cur_text().len();
             p.bump(p.cur());
+            // mirror has_content tracking from parse_inline_item_list
+            let inline_end: usize = p.cur_range().start().into();
+            // (simplified: any non-NEWLINE token counts as content for prescan purposes)
+            has_content = true;
         }
         len
     })
 }

Also applies to: 1273-1278

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/syntax/mod.rs` at line 1235, The prescan
currently hardcodes has_content=true in calls like at_paragraph_break(p, true),
causing setext/thematic guards to be unconditionally enabled; add a local
mutable has_content flag (initialized false) in the prescan loop (mirror
parse_inline_item_list) and set it true whenever you encounter a non-newline
content token during the prescan, then replace the hardcoded true in
at_paragraph_break and the later re-check (the post-list-indent strip) with this
has_content variable and ensure the same is_setext && has_content gating used in
the parse path (e.g., handle_inline_newline / inline_list_source_len checks) is
applied so setext/thematic logic only runs when real content was seen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this relevant @jfmcdowell ?

Copy link
Contributor Author

@jfmcdowell jfmcdowell Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this issue was real and is fixed in #9313: the prescan mismatch in inline_list_source_len is now gated with has_content for setext/thematic checks (aligned with parse_inline_item_list). Revalidated with parser tests and CommonMark conformance passing. It slipped past the keeper.

@ematipico ematipico merged commit 1e82da1 into biomejs:main Mar 3, 2026
13 checks passed
@jfmcdowell jfmcdowell deleted the refactor/newline-prescan-parity branch March 3, 2026 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Parser Area: parser

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants