test(markdown_parser): add differential fuzzer against commonmark.js #9784

Merged: ematipico merged 4 commits into biomejs:main from jfmcdowell:test/md-differential-fuzzer on Apr 6, 2026

Conversation

jfmcdowell (Contributor) commented Apr 3, 2026

Note

This PR was created with AI assistance (Claude Code).

Summary

Adds a differential fuzzer that generates random markdown from construct combinators and compares Biome's document_to_html output against commonmark.js reference output.

The generator is biased toward interaction patterns that have produced parser bugs: headers inside lists, setext headings in blockquotes, inline HTML near blockquote markers, mixed list markers, and lazy continuation at various indent levels.

The checked-in seed corpus contains only passing cases, so any failure is either a regression or a newly discovered mismatch worth triaging. That makes the differential test viable for CI against the seed corpus. Larger corpora can be generated locally or on a schedule for discovery, and a discovered case can be promoted into the seed once the underlying mismatch is fixed.
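
As a rough illustration of the comparison described above, the core loop can be sketched as follows. This is a minimal sketch, not the code in this PR: `Case`, `Mismatch`, and `run_differential` are hypothetical names, and the renderer is passed in as a closure so the example does not depend on Biome's actual `document_to_html` signature.

```rust
/// One corpus entry: a markdown input plus the reference HTML.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone)]
struct Case {
    markdown: String,
    html: String,
}

/// A case whose rendered output differs from the reference.
#[derive(Debug, PartialEq)]
struct Mismatch {
    index: usize,
    expected: String,
    actual: String,
}

/// Render every case with `render` and collect the ones that differ
/// from the reference HTML. An empty result means the corpus passes.
fn run_differential<F>(cases: &[Case], render: F) -> Vec<Mismatch>
where
    F: Fn(&str) -> String,
{
    cases
        .iter()
        .enumerate()
        .filter_map(|(index, case)| {
            let actual = render(&case.markdown);
            (actual != case.html).then(|| Mismatch {
                index,
                expected: case.html.clone(),
                actual,
            })
        })
        .collect()
}
```

Because the seed corpus is expected to pass in full, a CI run only needs to assert that the returned vector is empty; a discovery run over a generated corpus would instead inspect each `Mismatch`.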

Test Plan

  • cargo test -p biome_markdown_parser --test fuzz_differential -- --ignored --nocapture
  • just test-crate biome_markdown_parser
  • just f
  • just l

Docs

N/A

changeset-bot bot commented Apr 3, 2026

⚠️ No Changeset found

Latest commit: f5bdd2c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

github-actions bot added the A-Parser (Area: parser) and L-Markdown (Language: Markdown) labels Apr 3, 2026

codspeed-hq bot commented Apr 3, 2026

Merging this PR will not alter performance

✅ 28 untouched benchmarks
⏩ 228 skipped benchmarks [1]

Comparing jfmcdowell:test/md-differential-fuzzer (f5bdd2c) with main (f3d60a6) [2]

Footnotes

  1. 228 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (d5ca672) during the generation of this report, so f3d60a6 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

jfmcdowell force-pushed the test/md-differential-fuzzer branch from adc9177 to e692a76 on April 3, 2026 02:30

jfmcdowell (Contributor, Author)

I'd recommend adding commonmark as a root devDependency for this workflow. Biome already has root Node tooling and root devDependencies, so this wouldn't be introducing a new kind of project dependency, and commonmark is unusually justified here because it is the reference implementation the differential test compares against. It would also make the generator workflow simpler and easier to run manually. @ematipico, does that dependency choice seem acceptable here?

For now the PR uses a temp-dir npm install approach to avoid forcing that decision, but it adds ceremony to the justfile recipe that a root devDependency would eliminate.

jfmcdowell marked this pull request as ready for review on April 3, 2026 02:42

coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 854ebbe2-1b92-4476-a586-3c2a7e5155dd

📥 Commits

Reviewing files that changed from the base of the PR and between 97dadc2 and f5bdd2c.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml and included by **
📒 Files selected for processing (5)
  • crates/biome_markdown_parser/tests/fuzz_corpus/seed.jsonl
  • crates/biome_markdown_parser/tests/fuzz_differential.rs
  • crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs
  • justfile
  • package.json
✅ Files skipped from review due to trivial changes (5)
  • package.json
  • justfile
  • crates/biome_markdown_parser/tests/fuzz_corpus/seed.jsonl
  • crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs
  • crates/biome_markdown_parser/tests/fuzz_differential.rs

Walkthrough

Adds fuzzing infrastructure for the markdown parser: a checked‑in JSONL seed corpus (crates/biome_markdown_parser/tests/fuzz_corpus/seed.jsonl) containing 102 { markdown, html } pairs; an ignored differential test (fuzz_differential.rs) that parses each markdown, renders HTML via the crate renderer, normalises lines (preserving <pre> blocks), compares against the reference HTML, records failures using a deterministic 64‑bit FNV‑1a hash, and can write per‑failure artifacts. Adds a Node.js corpus generator (fuzz_generate_corpus.cjs) that uses commonmark to produce references, a devDependency on commonmark, and two just recipes to generate the corpus and run the differential test.
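
Two of the steps named in the walkthrough, line normalisation that preserves `<pre>` blocks and a deterministic 64-bit FNV-1a hash, can be sketched as below. The hash uses the standard FNV-1a constants. The normalisation rules are an assumption, since the thread does not spell them out: this sketch trims and drops blank lines outside `<pre>` and keeps lines that open or fall inside `<pre>` verbatim. The function names are hypothetical.

```rust
/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
/// Deterministic across runs and platforms, unlike a randomly keyed hasher.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Normalise HTML line by line (assumed rules, see the lead-in):
/// outside <pre>, trim each line and drop empty ones;
/// lines that open a <pre> or sit inside one are kept untouched.
fn normalize_html(html: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_pre = false;
    for line in html.lines() {
        let open = line.rfind("<pre");
        let close = line.rfind("</pre>");
        if in_pre || open.is_some() {
            out.push(line.to_string());
        } else {
            let trimmed = line.trim();
            if !trimmed.is_empty() {
                out.push(trimmed.to_string());
            }
        }
        // The last <pre or </pre> tag on the line decides the new state.
        in_pre = match (open, close) {
            (Some(o), Some(c)) => o > c,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => in_pre,
        };
    }
    out.join("\n")
}
```

A failure identifier could then be derived with something like `format!("{:016x}", fnv1a_64(markdown.as_bytes()))`, which gives the same name for the same input on every run.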

Suggested reviewers

  • ematipico
  • dyc3

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarises the main addition: a differential fuzzer for the markdown parser comparing against commonmark.js, which aligns directly with the changeset. |
| Description check | ✅ Passed | The description clearly explains the purpose, approach, and testing plan of the differential fuzzer, relating directly to the changeset. |

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/tests/fuzz_differential.rs`:
- Around line 55-61: run_corpus currently swallows I/O and JSON errors which
lets a corrupted seed.jsonl hide as “All cases passed”; change run_corpus to
fail fast by returning or panicking on unreadable files and malformed JSON
instead of skipping them, and replace ad-hoc serde_json::Value handling with a
typed deserialisable struct (e.g., SeedCase with markdown and html fields) when
parsing each line so missing or wrong fields cause an immediate error; update
the parsing logic (the code that reads lines and the block referenced around
lines 72–82) to deserialize into SeedCase and propagate or surface any
deserialization/I/O errors rather than continuing silently.

In `@crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs`:
- Around line 24-25: The PRNG can get stuck when seeded with 0 and can produce
rand() === 1.0 causing randInt() to return max+1 and make pick() yield
undefined; to fix, ensure the xorshift32 state is never initialized to zero (if
seed === 0 set it to 1) and change the floating-point normalization in rand() to
divide by 0x100000000 (4294967296) instead of 0xffffffff so rand() is always in
[0,1). Update the implementations of xorshift32 (state init), rand(), randInt(),
and pick() accordingly so randInt() uses the corrected rand() and pick() cannot
index out of bounds.
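
The two PRNG fixes requested above can be illustrated with a small sketch. The generator in this PR is JavaScript (`fuzz_generate_corpus.cjs`); the version below is a Rust port written only to show the idea, and the `XorShift32` type and its method names are hypothetical rather than taken from that file.

```rust
/// Minimal xorshift32 generator (13, 17, 5 shift triple).
struct XorShift32 {
    state: u32,
}

impl XorShift32 {
    /// A zero state is a fixed point of xorshift, so remap seed 0 to 1.
    fn new(seed: u32) -> Self {
        Self { state: if seed == 0 { 1 } else { seed } }
    }

    fn next_u32(&mut self) -> u32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state = x;
        x
    }

    /// Divide by 2^32 (0x100000000), not 0xffffffff, so the result is
    /// always strictly below 1.0 even when the raw value is u32::MAX.
    fn rand(&mut self) -> f64 {
        f64::from(self.next_u32()) / 4_294_967_296.0
    }

    /// Integer in the inclusive range [min, max]; relies on rand() < 1.0.
    /// Assumes max >= min and max - min < u32::MAX.
    fn rand_int(&mut self, min: u32, max: u32) -> u32 {
        min + (self.rand() * f64::from(max - min + 1)) as u32
    }

    /// Pick one element of a non-empty slice; cannot index past the end
    /// because rand_int never returns more than len - 1.
    fn pick<'a, T>(&mut self, items: &'a [T]) -> &'a T {
        &items[self.rand_int(0, items.len() as u32 - 1) as usize]
    }
}
```

With the old `0xffffffff` divisor, a raw value of `0xffffffff` would map to exactly 1.0 and `rand_int(0, n)` would return `n + 1`; with the `2^32` divisor the largest possible result is `0xffffffff / 2^32`, which is below 1.0.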

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: febdfe05-717f-4f2e-aa6a-7e2b8384d080

📥 Commits

Reviewing files that changed from the base of the PR and between b22f31a and e692a76.

📒 Files selected for processing (4)
  • crates/biome_markdown_parser/tests/fuzz_corpus/seed.jsonl
  • crates/biome_markdown_parser/tests/fuzz_differential.rs
  • crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs
  • justfile

Comment thread crates/biome_markdown_parser/tests/fuzz_differential.rs Outdated
Comment thread crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/biome_markdown_parser/tests/fuzz_differential.rs`:
- Around line 136-140: The loop over all_failures silently swallows fs::write
errors (base.with_extension(...).ok()) and reuses hash-only basenames (base =
dir.join(&failure.hash)) which can hide missing artefacts and cause overwrites;
change the loop to enumerate(all_failures) to generate a unique basename per
failure (e.g., combine failure.hash with the index or timestamp) and replace the
.ok() calls with error-propagating writes (use ? or expect with a clear message,
or collect and panic on errors) for the three writes that use
base.with_extension("md"), base.with_extension("expected.html"), and
base.with_extension("actual.html") so failures surface loudly and filenames are
unique.
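
The shape of the suggested change, unique basenames plus writes that propagate errors, might look like the sketch below. It is an assumption-laden illustration rather than the code in `fuzz_differential.rs`: the `Failure` struct, the `write_artifacts` name, and the `NNNN-hash` naming scheme are all hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// One recorded mismatch (hypothetical shape for illustration).
struct Failure {
    hash: String,
    markdown: String,
    expected: String,
    actual: String,
}

/// Write three files per failure. The enumerate() index makes each
/// basename unique even when two failures share a hash, and `?`
/// surfaces any I/O error instead of discarding it with `.ok()`.
fn write_artifacts(dir: &Path, failures: &[Failure]) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    for (index, failure) in failures.iter().enumerate() {
        let stem = format!("{:04}-{}", index, failure.hash);
        fs::write(dir.join(format!("{}.md", stem)), &failure.markdown)?;
        fs::write(dir.join(format!("{}.expected.html", stem)), &failure.expected)?;
        fs::write(dir.join(format!("{}.actual.html", stem)), &failure.actual)?;
    }
    Ok(())
}
```

Building the file names with `format!` rather than `Path::with_extension` also avoids surprises if a basename ever contains a dot, since `with_extension` replaces whatever follows the last dot.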

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d1e887e6-2076-4c70-af2c-95d8555c29c8

📥 Commits

Reviewing files that changed from the base of the PR and between 133c81c and 7d72775.

📒 Files selected for processing (1)
  • crates/biome_markdown_parser/tests/fuzz_differential.rs

Comment thread crates/biome_markdown_parser/tests/fuzz_differential.rs

ematipico (Member)

Oh yeah that's totally reasonable, go ahead.

Also, I'm not familiar with the terminology here: what do "seed corpus" and "corpora" mean?

jfmcdowell (Contributor, Author)

Good question.

  • Seed corpus: the checked-in baseline inputs the differential test always runs. Ours is seed.jsonl, and it should contain only passing { markdown, html } pairs.
  • Corpus / corpora: the full collection of inputs. Besides the seed corpus, you can generate a larger disposable corpus locally to look for new mismatches.

So: seed corpus = stable baseline, generated corpus = exploratory fuzz input.

Also pushed a commit adding commonmark as a root devDependency and simplifying the just recipe.

Generates random markdown from construct combinators (lists, blockquotes,
headings, inline HTML, fenced code, link definitions) biased toward
interaction patterns that have produced parser bugs. Compares rendered
HTML from document_to_html against commonmark.js reference output.

- Generator: tests/fuzz_generate_corpus.mjs (seeded, reproducible)
- Seed corpus: tests/fuzz_corpus/seed.jsonl (102 passing cases)
- Differential test: tests/fuzz_differential.rs (#[ignore] by default)
- Justfile: fuzz-markdown-generate, fuzz-markdown-differential
- fuzz_differential.rs: panic on unreadable files and malformed JSON
  instead of silently skipping; use typed SeedCase struct so missing
  markdown/html fields fail at deserialization.
- fuzz_generate_corpus.cjs: guard seed=0 (xorshift32 gets stuck at zero)
  and use 0x100000000 divisor so rand() returns [0,1) not [0,1].

jfmcdowell force-pushed the test/md-differential-fuzzer branch from 97dadc2 to f5bdd2c on April 6, 2026 13:23
ematipico merged commit 626344e into biomejs:main on Apr 6, 2026
16 checks passed
jfmcdowell deleted the test/md-differential-fuzzer branch on April 13, 2026 12:29