test(markdown_parser): add differential fuzzer against commonmark.js #9784
ematipico merged 4 commits into biomejs:main
Merging this PR will not alter performance
adc9177 to e692a76
I'd recommend adding … For now the PR uses a temp-dir …
Walkthrough
Adds fuzzing infrastructure for the markdown parser: a checked-in JSONL seed corpus (…
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_markdown_parser/tests/fuzz_differential.rs`:
- Around line 55-61: run_corpus currently swallows I/O and JSON errors which
lets a corrupted seed.jsonl hide as “All cases passed”; change run_corpus to
fail fast by returning or panicking on unreadable files and malformed JSON
instead of skipping them, and replace ad-hoc serde_json::Value handling with a
typed deserialisable struct (e.g., SeedCase with markdown and html fields) when
parsing each line so missing or wrong fields cause an immediate error; update
the parsing logic (the code that reads lines and the block referenced around
lines 72–82) to deserialize into SeedCase and propagate or surface any
deserialization/I/O errors rather than continuing silently.
In `@crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs`:
- Around line 24-25: The PRNG can get stuck when seeded with 0 and can produce
rand() === 1.0 causing randInt() to return max+1 and make pick() yield
undefined; to fix, ensure the xorshift32 state is never initialized to zero (if
seed === 0 set it to 1) and change the floating-point normalization in rand() to
divide by 0x100000000 (4294967296) instead of 0xffffffff so rand() is always in
[0,1). Update the implementations of xorshift32 (state init), rand(), randInt(),
and pick() accordingly so randInt() uses the corrected rand() and pick() cannot
index out of bounds.
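Both fixes come down to arithmetic that can be checked in isolation. The sketch below ports the idea to Rust purely for illustration: the actual generator is a CommonJS script, and while the names `rand`, `rand_int`, and `pick` mirror the ones in the comment, the struct `XorShift32` and all function bodies are assumptions, not the PR's code.

```rust
/// Minimal xorshift32 generator (hypothetical Rust port, illustration only).
pub struct XorShift32 {
    state: u32,
}

impl XorShift32 {
    /// Builds a generator; a zero seed is replaced with 1 because
    /// xorshift32 maps the state 0 to 0 forever.
    pub fn new(seed: u32) -> Self {
        Self { state: if seed == 0 { 1 } else { seed } }
    }

    /// Advances the state with the classic 13/17/5 shift triple.
    pub fn next_u32(&mut self) -> u32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state = x;
        x
    }

    /// Float in [0, 1): dividing by 2^32 (0x100000000) keeps even
    /// u32::MAX strictly below 1.0, unlike dividing by 0xffffffff.
    pub fn rand(&mut self) -> f64 {
        f64::from(self.next_u32()) / 4_294_967_296.0
    }

    /// Integer in the inclusive range [min, max]; requires min <= max.
    pub fn rand_int(&mut self, min: u32, max: u32) -> u32 {
        let span = f64::from(max) - f64::from(min) + 1.0;
        // rand() < 1.0, so the truncated product is at most max - min.
        min + (self.rand() * span) as u32
    }

    /// Picks one element; panics on an empty slice.
    pub fn pick<'a, T>(&mut self, items: &'a [T]) -> &'a T {
        assert!(!items.is_empty(), "pick() needs at least one item");
        let index = self.rand_int(0, (items.len() - 1) as u32);
        &items[index as usize]
    }
}
```

With the old divisor, a raw value of `0xffffffff` would normalize to exactly 1.0 and push `rand_int` one past `max`; with 2^32 as the divisor that value cannot occur.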
📒 Files selected for processing (4)
- crates/biome_markdown_parser/tests/fuzz_corpus/seed.jsonl
- crates/biome_markdown_parser/tests/fuzz_differential.rs
- crates/biome_markdown_parser/tests/fuzz_generate_corpus.cjs
- justfile
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/biome_markdown_parser/tests/fuzz_differential.rs`:
- Around line 136-140: The loop over all_failures silently swallows fs::write
errors (base.with_extension(...).ok()) and reuses hash-only basenames (base =
dir.join(&failure.hash)) which can hide missing artefacts and cause overwrites;
change the loop to enumerate(all_failures) to generate a unique basename per
failure (e.g., combine failure.hash with the index or timestamp) and replace the
.ok() calls with error-propagating writes (use ? or expect with a clear message,
or collect and panic on errors) for the three writes that use
base.with_extension("md"), base.with_extension("expected.html"), and
base.with_extension("actual.html") so failures surface loudly and filenames are
unique.
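One way the suggestion in this comment could look is sketched below, using only the standard library. The `Failure` struct and the `write_failures` function are hypothetical; only the `hash` field and the three artefact kinds come from the comment, and the index suffix is one of the two uniqueness options it mentions.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical shape of one mismatch; only `hash` is named in the review.
pub struct Failure {
    pub hash: String,
    pub markdown: String,
    pub expected: String,
    pub actual: String,
}

/// Writes three artefacts per failure and returns the first I/O error
/// instead of discarding it with `.ok()`.
pub fn write_failures(dir: &Path, failures: &[Failure]) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    for (index, failure) in failures.iter().enumerate() {
        // Hash plus index: two failures with the same hash get distinct files.
        let stem = format!("{}-{:04}", failure.hash, index);
        fs::write(dir.join(format!("{stem}.md")), &failure.markdown)?;
        fs::write(dir.join(format!("{stem}.expected.html")), &failure.expected)?;
        fs::write(dir.join(format!("{stem}.actual.html")), &failure.actual)?;
    }
    Ok(())
}
```

A test can call `write_failures(&dir, &all_failures).expect("failed to write fuzz artefacts")` so that a missing artefact fails the run loudly rather than passing unnoticed.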
📒 Files selected for processing (1)
crates/biome_markdown_parser/tests/fuzz_differential.rs
Oh yeah that's totally reasonable, go ahead. Also, I'm not familiar with the terminology here: what do "seed corpus" and "corpora" mean?
Good question.
So: seed corpus = stable baseline, generated corpus = exploratory fuzz input. Also pushed a commit adding …
Generates random markdown from construct combinators (lists, blockquotes, headings, inline HTML, fenced code, link definitions) biased toward interaction patterns that have produced parser bugs. Compares rendered HTML from document_to_html against commonmark.js reference output.
- Generator: tests/fuzz_generate_corpus.mjs (seeded, reproducible)
- Seed corpus: tests/fuzz_corpus/seed.jsonl (102 passing cases)
- Differential test: tests/fuzz_differential.rs (#[ignore] by default)
- Justfile: fuzz-markdown-generate, fuzz-markdown-differential
- fuzz_differential.rs: panic on unreadable files and malformed JSON instead of silently skipping; use typed SeedCase struct so missing markdown/html fields fail at deserialization.
- fuzz_generate_corpus.cjs: guard seed=0 (xorshift32 gets stuck at zero) and use 0x100000000 divisor so rand() returns [0,1) not [0,1].
Fixes clippy::implicit_clone warnings.
97dadc2 to f5bdd2c
Note
This PR was created with AI assistance (Claude Code).
Summary
Adds a differential fuzzer that generates random markdown from construct combinators and compares Biome's `document_to_html` output against `commonmark.js` reference output.
The generator is biased toward interaction patterns that have produced parser bugs: headers inside lists, setext headings in blockquotes, inline HTML near blockquote markers, mixed list markers, and lazy continuation at various indent levels.
The checked-in seed corpus contains only passing cases, so any failure is either a regression or a newly discovered mismatch worth triaging. That makes the differential test viable for CI against the seed corpus, while larger generated corpora can remain a local or scheduled discovery workflow, with fixed cases promoted into the seed.
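The core comparison can be sketched in a few lines of Rust. This is a simplified illustration, not the test in this PR: `SeedCase` uses the `markdown` and `html` field names from the review, but `Mismatch`, `run_differential`, the `render` parameter (a stand-in for the renderer under test), and the exact-string comparison are all assumptions.

```rust
/// One corpus entry: a markdown input and the reference HTML for it.
pub struct SeedCase {
    pub markdown: String,
    pub html: String,
}

/// A case where the renderer under test disagreed with the reference.
#[derive(Debug, PartialEq)]
pub struct Mismatch {
    pub index: usize,
    pub expected: String,
    pub actual: String,
}

/// Renders every case and collects the ones whose output differs from
/// the stored reference HTML. An empty result means the corpus passes.
pub fn run_differential<F>(cases: &[SeedCase], render: F) -> Vec<Mismatch>
where
    F: Fn(&str) -> String,
{
    cases
        .iter()
        .enumerate()
        .filter_map(|(index, case)| {
            let actual = render(&case.markdown);
            if actual == case.html {
                None
            } else {
                Some(Mismatch {
                    index,
                    expected: case.html.clone(),
                    actual,
                })
            }
        })
        .collect()
}
```

Because the seed corpus is expected to produce an empty result, a CI job can simply assert that the returned vector is empty, while a discovery run over a generated corpus would inspect the mismatches instead.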
Test Plan
- `cargo test -p biome_markdown_parser --test fuzz_differential -- --ignored --nocapture`
- `just test-crate biome_markdown_parser`
- `just f`
- `just l`
Docs
N/A