Skip to content

refactor(markdown-parser): promote thematic break skipped trivia to explicit CST nodes#9337

Merged
dyc3 merged 3 commits intobiomejs:mainfrom
jfmcdowell:refactor/thematic-break-skipped-trivia
Mar 9, 2026
Merged

refactor(markdown-parser): promote thematic break skipped trivia to explicit CST nodes#9337
dyc3 merged 3 commits intobiomejs:mainfrom
jfmcdowell:refactor/thematic-break-skipped-trivia

Conversation

@jfmcdowell
Copy link
Contributor

@jfmcdowell jfmcdowell commented Mar 5, 2026

Note

AI Assistance Disclosure: This PR was developed with assistance from Claude Code.

Summary

  • Add MdThematicBreakChar, AnyMdThematicBreakPart, and MdThematicBreakPartList to the ungram grammar, replacing the single md_thematic_break_literal token model with a structured parts list.
  • Add ThematicBreakParts lexer context that emits single-char STAR/MINUS/UNDERSCORE tokens and MD_INDENT_CHAR for inter-marker whitespace, instead of aggregating into MD_THEMATIC_BREAK_LITERAL.
  • Implement parse_thematic_break_parts with re-lex happy path (decomposing MD_THEMATIC_BREAK_LITERAL) and fallback path for list-item contexts where tokens are already individual.
  • Add FormatNodeRule stubs for MdThematicBreakChar, AnyMdThematicBreakPart, and MdThematicBreakPartList.
  • Fix infinite loop in list parser when thematic break detection succeeds but parsing returns Absent (pre-existing bug exposed by new test fixture).
  • Add thematic_break_in_list.md test fixture for fallback-path coverage (thematic breaks inside list items).
  • Update all thematic break parser snapshots to reflect the new CST shape.

Follow-up to #9321. All three parse_as_skipped_trivia_tokens call sites in thematic break parsing have been eliminated. Break characters (*, -, _) and inter-marker whitespace are now real CST nodes visible to the formatter harness.

No user-facing behavior change. Parsed semantics are preserved; only the internal CST representation changes.

Test Plan

  • just test-crate biome_markdown_parser
  • cargo insta test -p biome_markdown_parser
  • cargo clippy -p biome_markdown_parser -p biome_markdown_formatter
  • cargo test -p biome_cli
  • just test-markdown-conformance

Docs

N/A — internal structural change, no new user-facing features.

@changeset-bot
Copy link

changeset-bot bot commented Mar 5, 2026

⚠️ No Changeset found

Latest commit: 9d16839

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@github-actions github-actions bot added A-Parser Area: parser A-Formatter Area: formatter A-Tooling Area: internal tools labels Mar 5, 2026
@jfmcdowell jfmcdowell force-pushed the refactor/thematic-break-skipped-trivia branch 4 times, most recently from 78cab77 to 50cddf6 Compare March 5, 2026 12:51
@jfmcdowell jfmcdowell marked this pull request as ready for review March 5, 2026 13:26
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 5, 2026

Walkthrough

This PR converts thematic breaks from a single literal token to a parts-based representation. It adds new grammar kinds (MdThematicBreakChar, AnyMdThematicBreakPart, MdThematicBreakPartList), updates the lexer to emit a ThematicBreakParts context and tokenisation (marker chars and indent tokens), changes parser logic to re-lex and parse thematic-break parts, and wires formatter implementations for the new node types. Control flow primarily delegates to existing formatting and parsing paths with a parts-based parsing/re-lexing path for complex cases.

Possibly related PRs

Suggested reviewers

  • ematipico
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main objective: refactoring thematic break parsing by promoting skipped trivia (whitespace/markers) to explicit CST nodes.
Description check ✅ Passed The description comprehensively documents the changeset, covering grammar changes, lexer contexts, parser implementations, formatting rules, bug fixes, test fixtures, and the rationale for promoting skipped trivia to explicit nodes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/src/lexer/mod.rs`:
- Around line 233-240: The numbered comment describing whitespace handling is
stale: the code now checks ThematicBreakParts before CodeSpan but the list still
shows the old order; update the comment block that documents the dispatch order
(the multiline comment above the whitespace handling logic) to reflect the
current sequence used by the lexer—mention ThematicBreakParts prior to CodeSpan
and keep the rest of steps consistent with the implementation around the
whitespace dispatch in mod.rs (look for references to ThematicBreakParts and
CodeSpan in the surrounding code).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c522adf5-a34d-4adf-8d15-77ecdc3ef68b

📥 Commits

Reviewing files that changed from the base of the PR and between 20a64c4 and 50cddf6.

⛔ Files ignored due to path filters (10)
  • crates/biome_markdown_factory/src/generated/node_factory.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_factory/src/generated/syntax_factory.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/lazy_continuation.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/paragraph_interruption.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/thematic_break_block.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_parser/tests/md_test_suite/ok/thematic_break_in_list.md.snap is excluded by !**/*.snap and included by **
  • crates/biome_markdown_syntax/src/generated/kind.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_syntax/src/generated/macros.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_syntax/src/generated/nodes.rs is excluded by !**/generated/**, !**/generated/** and included by **
  • crates/biome_markdown_syntax/src/generated/nodes_mut.rs is excluded by !**/generated/**, !**/generated/** and included by **
📒 Files selected for processing (15)
  • crates/biome_markdown_formatter/src/generated.rs
  • crates/biome_markdown_formatter/src/markdown/any/mod.rs
  • crates/biome_markdown_formatter/src/markdown/any/thematic_break_part.rs
  • crates/biome_markdown_formatter/src/markdown/auxiliary/mod.rs
  • crates/biome_markdown_formatter/src/markdown/auxiliary/thematic_break_char.rs
  • crates/biome_markdown_formatter/src/markdown/lists/mod.rs
  • crates/biome_markdown_formatter/src/markdown/lists/thematic_break_part_list.rs
  • crates/biome_markdown_parser/src/lexer/mod.rs
  • crates/biome_markdown_parser/src/parser.rs
  • crates/biome_markdown_parser/src/syntax/list.rs
  • crates/biome_markdown_parser/src/syntax/thematic_break_block.rs
  • crates/biome_markdown_parser/src/token_source.rs
  • crates/biome_markdown_parser/tests/md_test_suite/ok/thematic_break_in_list.md
  • xtask/codegen/markdown.ungram
  • xtask/codegen/src/markdown_kinds_src.rs

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/biome_markdown_parser/src/lexer/mod.rs (1)

706-715: Consider de-duplicating marker→token mapping.

The same * / _ / - mapping is repeated in a few places; a tiny helper would reduce drift risk.

Possible tidy-up
+    #[inline]
+    fn marker_to_token(marker: u8) -> MarkdownSyntaxKind {
+        match marker {
+            b'*' => STAR,
+            b'_' => UNDERSCORE,
+            b'-' => MINUS,
+            _ => MINUS,
+        }
+    }
+
@@
         if matches!(context, MarkdownLexContext::ThematicBreakParts) {
             self.advance(1);
-            return match start_char {
-                b'*' => STAR,
-                b'_' => UNDERSCORE,
-                _ => MINUS,
-            };
+            return Self::marker_to_token(start_char);
         }
@@
         if matches!(context, MarkdownLexContext::EmphasisInline) {
             self.advance(1);
-            return match start_char {
-                b'*' => STAR,
-                b'_' => UNDERSCORE,
-                b'-' => MINUS,
-                _ => unreachable!(),
-            };
+            return Self::marker_to_token(start_char);
         }
@@
-        match start_char {
-            b'*' => STAR,
-            b'_' => UNDERSCORE,
-            b'-' => MINUS,
-            _ => unreachable!(),
-        }
+        Self::marker_to_token(start_char)

Also applies to: 773-779, 794-799

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/biome_markdown_parser/src/lexer/mod.rs` around lines 706 - 715, Create
a small helper function (e.g., marker_to_token(start_char: u8) -> TokenType)
that maps b'*'→STAR, b'_'→UNDERSCORE, and default→MINUS, and replace the
duplicated match expressions in MarkdownLexContext::ThematicBreakParts and the
other locations (the matches around lines 773–779 and 794–799) with a call to
this helper; ensure the existing advance(1) logic and return semantics are
preserved so you only swap the match block for marker_to_token(start_char).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/biome_markdown_parser/src/lexer/mod.rs`:
- Around line 706-715: Create a small helper function (e.g.,
marker_to_token(start_char: u8) -> TokenType) that maps b'*'→STAR,
b'_'→UNDERSCORE, and default→MINUS, and replace the duplicated match expressions
in MarkdownLexContext::ThematicBreakParts and the other locations (the matches
around lines 773–779 and 794–799) with a call to this helper; ensure the
existing advance(1) logic and return semantics are preserved so you only swap
the match block for marker_to_token(start_char).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bc6e3e4-3dce-487f-99e8-fd5092727d98

📥 Commits

Reviewing files that changed from the base of the PR and between 50cddf6 and d498012.

📒 Files selected for processing (1)
  • crates/biome_markdown_parser/src/lexer/mod.rs

…xplicit CST nodes

Replace all `parse_as_skipped_trivia_tokens` calls in thematic break
parsing with explicit `MdThematicBreakChar` and `MdIndentToken` CST
nodes. Break characters (*, -, _) and inter-marker whitespace are now
structurally represented via `MdThematicBreakPartList` instead of being
hidden in trivia.

- Add `MdThematicBreakChar`, `AnyMdThematicBreakPart`, and
  `MdThematicBreakPartList` to the ungram grammar
- Add `ThematicBreakParts` lexer context for single-char token emission
- Implement `parse_thematic_break_parts` with re-lex happy path and
  fallback path for list-item contexts
- Fix infinite loop in list parser when thematic break detection
  succeeds but parsing returns Absent
- Add `thematic_break_in_list.md` test fixture for fallback path coverage
Add ThematicBreakParts to the numbered comment documenting the
whitespace handling dispatch order in the lexer.
@jfmcdowell jfmcdowell force-pushed the refactor/thematic-break-skipped-trivia branch from d498012 to 9d16839 Compare March 9, 2026 14:21
@dyc3 dyc3 merged commit 5978824 into biomejs:main Mar 9, 2026
14 checks passed
@jfmcdowell jfmcdowell deleted the refactor/thematic-break-skipped-trivia branch March 10, 2026 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Formatter Area: formatter A-Parser Area: parser A-Tooling Area: internal tools

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants