Skip to content

fix(markdown_parser): prefer list item over thematic break for - ---#9946

Closed
jfmcdowell wants to merge 4 commits intobiomejs:mainfrom
jfmcdowell:fix/md-thematic-break-list-precedence
Closed

fix(markdown_parser): prefer list item over thematic break for - ---#9946
jfmcdowell wants to merge 4 commits intobiomejs:mainfrom
jfmcdowell:fix/md-thematic-break-list-precedence

Conversation

@jfmcdowell
Copy link
Copy Markdown
Contributor

Note

This PR was created with AI assistance (Claude Code).

Summary

When the lexer produces MD_THEMATIC_BREAK_LITERAL for a line like - ---, the thematic break check in the block dispatcher fires before the list item check, so the line is parsed as a top-level <hr /> instead of a list item containing <hr />.

Per CommonMark §5.2/§4.1 (verified against commonmark.js + markdown-it): when stripping a bullet marker + space from the token text leaves a consecutive run of 3+ matching break characters, the list item interpretation wins:

  • - --- → list item containing <hr /> (consecutive --- after marker)
  • * *** → list item containing <hr />
  • + ___ → list item containing <hr />
  • - - - → thematic break (spaced chars after marker — stays a break)
  • * * * → thematic break

The fix adds a parser-side guard (thematic_break_hides_list_item) that inspects the token text. When triggered, the token is re-lexed via ThematicBreakParts context to expose the individual marker tokens, then list item parsing proceeds normally.

Also fixes 2 pre-existing CommonMark conformance failures (examples 53, 54) — conformance is now 652/652 (100%).

Test Plan

  • just test-crate biome_markdown_parser
  • just test-markdown-conformance
  • 9 targeted disambiguation tests added to spec_test.rs
  • 1 CST snapshot updated (thematic_break_in_list.md)

Docs

N/A

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Apr 12, 2026

⚠️ No Changeset found

Latest commit: bd64d69

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@github-actions github-actions bot added A-Parser Area: parser L-Markdown Language: Markdown labels Apr 12, 2026
@jfmcdowell jfmcdowell force-pushed the fix/md-thematic-break-list-precedence branch from 54712fd to efd3e6a Compare April 12, 2026 18:37
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 12, 2026

Merging this PR will not alter performance

✅ 28 untouched benchmarks
⏩ 228 skipped benchmarks1


Comparing jfmcdowell:fix/md-thematic-break-list-precedence (81aaa78) with main (bcd6508)

Open in CodSpeed

Footnotes

  1. 228 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@jfmcdowell jfmcdowell marked this pull request as ready for review April 12, 2026 19:26
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 104187e8-0452-43b3-a6f3-8d0e0b5f59c6

📥 Commits

Reviewing files that changed from the base of the PR and between 13597b9 and bd64d69.

⛔ Files ignored due to path filters (1)
  • crates/biome_markdown_parser/tests/md_test_suite/ok/thematic_break_in_list.md.snap is excluded by !**/*.snap and included by **
📒 Files selected for processing (3)
  • crates/biome_markdown_parser/src/syntax/mod.rs
  • crates/biome_markdown_parser/src/syntax/thematic_break_block.rs
  • crates/biome_markdown_parser/tests/spec_test.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/biome_markdown_parser/tests/spec_test.rs
  • crates/biome_markdown_parser/src/syntax/thematic_break_block.rs

Walkthrough

This PR changes thematic-break recognition and parsing to disambiguate cases where a bullet marker (-, *, +) + space is immediately followed by a run of three or more identical thematic characters. It adds a thematic_break_hides_list_item predicate and byte-oriented checks for break markers, and updates lexer/parse dispatch so such sequences force re-lexing and are parsed as list items containing an <hr /> instead of a top-level thematic break. Tests were updated to reflect the new precedence.

Possibly related PRs

Suggested reviewers

  • ematipico
  • dyc3
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarises the main fix: establishing correct parsing precedence where list items take priority over thematic breaks for patterns like - ---.
Description check ✅ Passed The description is well-related to the changeset, explaining the CommonMark precedence issue, the fix strategy, conformance improvements, and test coverage in detail.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment on lines +65 to +80
if !matches!(bytes[0], b'-' | b'*' | b'+') {
return false;
}
if !matches!(bytes[1], b' ' | b'\t') {
return false;
}

// The payload (after marker + space) must be 3+ consecutive matching
// break characters, optionally followed by trailing whitespace only.
let payload = text[2..].trim_end_matches([' ', '\t']);
let payload_bytes = payload.as_bytes();
if payload_bytes.len() < THEMATIC_BREAK_MIN_CHARS {
return false;
}
let break_char = payload_bytes[0];
if !matches!(break_char, b'-' | b'*' | b'_') {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to use the lookup table for known characters

jfmcdowell and others added 3 commits April 13, 2026 08:32
When the lexer produces `MD_THEMATIC_BREAK_LITERAL` for a line like
`- ---`, the thematic break interpretation won because it was checked
before list items in the block dispatcher.

Per CommonMark §5.2/§4.1 (and verified against commonmark.js +
markdown-it), when stripping a bullet marker + space from the token
leaves content that is itself a valid thematic break (3+ matching
chars), the list item interpretation should win. E.g.:
  - `- ---` → list item containing <hr /> (3 chars remain)
  - `- - -` → thematic break (only 2 chars remain after marker)

The fix adds a parser-side guard (`thematic_break_hides_list_item`)
that inspects the token text. When triggered, the token is re-lexed
via `ThematicBreakParts` context to expose the individual marker
tokens, then list item parsing proceeds normally.
…classification

Route `*`, `-`, and `_` classification through `biome_unicode_table::lookup_byte`
via a shared `is_break_marker` helper, following the project convention. Whitespace
checks (`' '`/`'\t'`) are kept explicit since `WHS` is semantically broader than
what CommonMark requires here.
@jfmcdowell jfmcdowell force-pushed the fix/md-thematic-break-list-precedence branch from 024a7e4 to bd64d69 Compare April 13, 2026 12:32
@github-actions github-actions bot added the A-Project Area: project label Apr 13, 2026
@jfmcdowell jfmcdowell marked this pull request as draft April 13, 2026 13:08
@jfmcdowell
Copy link
Copy Markdown
Contributor Author

After re-examining the CommonMark spec (§4.1 Thematic breaks) this is the wrong approach.

@jfmcdowell jfmcdowell closed this Apr 13, 2026
@jfmcdowell jfmcdowell deleted the fix/md-thematic-break-list-precedence branch April 13, 2026 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Parser Area: parser A-Project Area: project L-Markdown Language: Markdown

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants