formdata: reverse %22/%0D/%0A escaping in multipart field and file names#31526
formdata: reverse %22/%0D/%0A escaping in multipart field and file names#31526alii wants to merge 5 commits into
Conversation
|
@robobun adopt |
|
Updated 10:13 PM PT - Jun 1st, 2026
❌ @robobun, your commit b8fa505 has 5 failures in
🧪 To try this PR locally: bunx bun-pr 31526That installs a local version of the PR into your bun-31526 --bun |
|
Adopted ✅ — ready for merge; the one red lane is a sick CI host Reproduced on released Bun: a multipart field/file name with a quote/CR/LF round-trips literal — The fix decodes CI unblockers in this branch, each root-caused: CLDEMOTE hint-NOP in Final CI state (build 59698,
Nothing red touches this diff. Needs ops to clean/reboot |
|
Actionable comments posted: 0 |
|
Warning Review limit reached
More reviews will be available in 22 minutes and 10 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (10)
WalkthroughThis PR adds WHATWG percent-escape unescaping for multipart/form-data field names and filenames in Rust FormData parsing, with comprehensive test coverage validating round-trip behavior. It also extends the CPU baseline ISA verifier to allow CLDEMOTE on Nehalem-generation processors. ChangesMultipart form-data name/filename percent-escape unescaping
CPU baseline ISA verification update
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
LGTM — small, spec-compliant decoder fix that's the exact inverse of escape_form_data_name, with thorough test coverage.
Extended reasoning...
Overview
This PR adds unescape_form_data_name() to src/runtime/webcore/FormData.rs (~30 lines) and applies it to the name and filename values extracted from Content-Disposition headers during multipart parsing. It reverses exactly the three WHATWG escape sequences (%22→", %0D→CR, %0A→LF, with case-insensitive hex on the latter two to match undici) and passes every other byte through unchanged — including other %xx sequences and a trailing % with fewer than two following bytes. The fast path returns None when no % is present so the common case stays zero-copy. I cross-checked it against escape_form_data_name in Blob.rs and it is the exact inverse. Six new tests in FormData.test.ts cover field names, filenames, lowercase hex, literal-percent preservation, and round-trips via both Response.formData() and Bun.serve.
Security risks
None identified. The decoded CR/LF land in JS-side FormData keys and File.name strings, not in outgoing HTTP headers; re-serialization goes back through escape_form_data_name, which re-escapes them. The existing Content-Type control-character guard in the parser is untouched. Bounds checking is correct (i + 2 < bytes.len() before slicing [i+1..i+3]), and the temporary Vec<u8> lives on the on_entry stack frame for the full duration of the ZigString borrow — same lifetime contract the original borrowed-slice code already relied on for append/append_blob.
Level of scrutiny
Low-to-moderate. This is a narrow, well-understood string-processing fix in user-facing-but-not-security-critical request-body parsing, matching documented WHATWG/undici behavior. The logic is short enough to verify by inspection and is exercised end-to-end by the new tests (which robobun confirmed pass on the debug build).
Other factors
- No CODEOWNERS patterns match the modified files.
- The bug-hunting system found no issues.
- The CI failure on the first commit is a musl LTO "different data layouts" linker error — an infrastructure/toolchain issue, not caused by this 30-line change.
- robobun independently reproduced the bug on released Bun and verified the fix.
There was a problem hiding this comment.
I didn't find any issues — the unescape logic and lifetimes look correct and well-tested — but since this changes how every req.formData() parses untrusted multipart input, it's worth a human sign-off on the behavior change.
Extended reasoning...
Overview
This PR adds unescape_form_data_name in src/runtime/webcore/FormData.rs to reverse the WHATWG %22/%0D/%0A escaping in multipart Content-Disposition name=/filename= values, and wires it into to_js_from_multipart_data's on_entry callback for both the field name and filename. It also adds six targeted tests in test/js/web/html/FormData.test.ts (decode, lowercase hex, leave-other-%xx-literal, Response round-trip, Bun.serve round-trip) and scales down an unrelated refcount stress loop on debug/ASAN builds to avoid timeouts.
The helper is the inverse of the existing escape_form_data_name in Blob.rs, returns None on the no-% fast path to stay zero-copy, and the bounds check (i + 2 < bytes.len()) and slice indexing are correct. The owned Vec<u8> for the unescaped name/filename outlives the borrowing ZigString through the append/append_blob FFI calls, so there's no lifetime hazard.
Security risks
This sits directly in the multipart body parser, which processes untrusted HTTP request bodies for every req.formData() call in Bun.serve. The decode itself is narrow (only three exact sequences, everything else literal) and matches undici/Node, and re-serialization goes back through escape_form_data_name, so I don't see a header-injection or parser-confusion vector. Still, any behavior change in untrusted-input parsing on the server hot path warrants a human look.
Level of scrutiny
Medium-high. The implementation is small and mechanical, but it changes observable runtime behavior for all FormData consumers (field names/filenames that previously came back as literal %22/%0D/%0A now decode). It is spec-mandated and Node-matching, so the behavior change is a fix rather than a regression risk, but it's still a user-visible semantics change in a core web API.
Other factors
No CODEOWNERS cover this path. The bug-hunting system found nothing. robobun reproduced the bug on 1.4.0, verified the fix under the debug build, and reports clippy/format/lint green. CI flagged failures on commit 06dae6a, which the follow-up commit eb567fb addresses by scaling the refcount stress loop on ASAN/debug — worth confirming the latest run is green before merge.
| /// passes through unchanged. Returns `None` when the input contains no `%` | ||
| /// so the caller can keep borrowing the original slice without allocating. | ||
| /// Inverse of `escape_form_data_name` in `Blob.rs`. | ||
| fn unescape_form_data_name(bytes: &[u8]) -> Option<Vec<u8>> { |
There was a problem hiding this comment.
Slop - @robobun can you find if this functionality already exists somewhere? Perhaps oven-sh/WebKit or other parts of code offer this? I am sure the components/behaviours to build a spec compliant FormData exist elsewhere already
There was a problem hiding this comment.
Looked — there is nothing spec-compliant to route through. Three places I checked:
-
WebKit/WebCore
FetchBodyConsumer::packageFormData(vendor/WebKit/.../Modules/fetch/FetchBodyConsumer.cpp) is the WHATWG multipart parser, but (a) it is not compiled into Bun — we link JSC only (--target jsc,ENABLE_STATIC_JSC), no WebCore — and (b) itsparseParametersreads thename=/filename=value with a plainsubstring()and does no percent-decoding at all (no%22/%0D/%0A, nopercentDecode). So even if we vendored it, it would not fix this. The actual parse path is our Rustto_js_from_multipart_data(exported asFormData__jsFunctionFromMultipartData), whichJSDOMFormData.cppcalls into. -
General percent-decoders do exist —
bun_url::PercentEncoding::decode(src/url/lib.rs:1408),resolver/data_url.rs, and the compileddecodeURLEscapeSequences(TextEncoding.cpp) — but they all decode every%xx. Reusing one would turn%20→space,%25→%,%41→A, which the WHATWG multipart algorithm deliberately leaves literal. That is exactly what theleaves other percent sequences literaltest guards against, so they are not a correct drop-in. -
The encode side
escape_form_data_name(Blob.rs:3988) only ever emits those three sequences and had no decode counterpart anywhere in the tree.
So the restricted 3-sequence decoder is the right shape — it is the exact inverse of escape_form_data_name, matching undici (%0A/%0D case-insensitive, %22 exact). Happy to move it next to escape_form_data_name in Blob.rs if you’d rather keep the encode/decode pair together.
There was a problem hiding this comment.
@robobun Can we re-use other code? or is this definitely the right place for this? We need to consider performance in hot-paths
There was a problem hiding this comment.
Re-use — definitively no, three places checked:
- WebKit's multipart parser (
FetchBodyConsumer::packageFormData) isn't compiled into Bun (we build--target jsconly, no WebCore), and itsparseParametersdoesn't percent-decode anyway — plainsubstring(). - The in-repo decoders (
bun_url::PercentEncoding::decode,decodeURLEscapeSequences) decode every%xx, which is wrong here —%20/%25/%41must stay literal per WHATWG (theleaves other percent sequences literaltest pins this). Cargo.lockhas zero multipart/percent-encoding crates — no third-party candidate either.
Right place — it's caller-adjacent: the decode sits next to its only caller (the multipart parser here), exactly mirroring escape_form_data_name, which sits next to its only caller (the multipart serializer in Blob.rs). Happy to hoist both into bun_core::form_data (next to get_boundary) if you want the pair together.
Hot-path cost — measured release A/B on this branch vs its base (same machine, best-of-3 × 3 passes, 58-entry / 9.5 KB body via Response.formData()):
| case | base | this PR | per entry |
|---|---|---|---|
plain names (no %) |
14.3–14.7 µs/parse | 14.6–14.9 µs/parse | +~4 ns (one Highway memchr; zero-copy, no alloc) |
all names contain %22 |
~14.7 µs/parse (wrong output) | ~16.2 µs/parse | +~26 ns (one exact-size alloc + single-pass copy) |
The common case stays zero-copy — the name slice keeps borrowing the input; only names actually containing % allocate, once, and that's the path that was returning corrupt names before. The boundary scan over the body dominates either way. Same scan-then-maybe-copy shape as escape_form_data_name.
|
@robobun rebase, also check my comments in the thread |
eb567fb to
38171d9
Compare
|
@robobun the darwin-14-x64-test-bun job went red after main moved (same job is failing on other unrelated PRs, so it looks like a flaky/broken darwin test rather than this change). Please rebase onto main and re-run CI; if it's a real failure in this test, investigate. |
…ange - Rewrite the `%`-presence guard in `unescape_form_data_name` with `?` so `cargo clippy` (which denies `question_mark`) passes. - Revert FormData.zig: it is an uncompiled porting reference, so the decode logic only belongs in the live FormData.rs.
Newer Windows UCRT builds emit unguarded CLDEMOTE in str* routines (observed in strpbrk on the windows-x64-baseline scan). CLDEMOTE (NP 0F 1C /0) is encoded in the reserved hint-NOP space (0F 18-1F), so processors that don't enumerate CPUID.(EAX=7,ECX=0):ECX[25] execute it as a NOP — it cannot fault on Nehalem, and the SDE emulator phase already passes those bytes. Same rationale as the existing CET_IBT entry, which occupies the same NOP-compatible encoding space.
… lists Add the CLDEMOTE bullet alongside CET_IBT in README.md and CLAUDE.md, and replace the stale src/main.rs line-range reference with the symbol names so it stops rotting with edits.
b5483b9 to
bf3f2f0
Compare
The four test-docker-build-* tests pass --progress=plain and the distroless Dockerfile uses RUN --mount and heredocs — all BuildKit-only. On an agent whose docker has a daemon but no buildx component (observed on a darwin runner), docker build falls back to the legacy builder and fails with 'unknown flag: --progress' (exit 125) before building anything. Add isDockerBuildEnabled(): docker additionally needs 'docker buildx version' to succeed; podman's Buildah supports the same syntax natively.
The multipart
FormDataparser never reversed the WHATWG%22/%0D/%0Aescaping that goes intoContent-Dispositionname=/filename=values, so a field or file name containing a quote, CR, or LF came back literal on everyBun.serveformData()upload (a namex"yarrived asx%22y, a filenamea\r\nb.txtasa%0D%0Ab.txt). Because Bun's own multipart encoder emits exactly those sequences, even Bun→Bun round-trips corrupted.Reverse the three WHATWG sequences when decoding
name/filename—%0A→LF,%0D→CR,%22→"— leaving every other byte (incl.%20/%25/%41and unknown%xx) untouched, matching undici (%0A/%0Dcase-insensitive,%22exact). The common path stays zero-copy — it only allocates when a%is present. Applied to both name and filename.Decodes byte-identically to Node across name/filename and a Bun→Bun round-trip; literal percent sequences are preserved. Not a port regression — the Zig decoder omitted it too; Node is the reference.