Skip to content

formdata: reverse %22/%0D/%0A escaping in multipart field and file names#31526

Open
alii wants to merge 5 commits into
mainfrom
ali/multipart-unescape-names
Open

formdata: reverse %22/%0D/%0A escaping in multipart field and file names#31526
alii wants to merge 5 commits into
mainfrom
ali/multipart-unescape-names

Conversation

@alii

@alii alii commented May 28, 2026

Copy link
Copy Markdown
Member

The multipart FormData parser never reversed the WHATWG %22/%0D/%0A escaping that goes into Content-Disposition name=/filename= values, so a field or file name containing a quote, CR, or LF came back literal on every Bun.serve formData() upload (a name x"y arrived as x%22y, a filename a\r\nb.txt as a%0D%0Ab.txt). Because Bun's own multipart encoder emits exactly those sequences, even Bun→Bun round-trips corrupted.

Reverse the three WHATWG sequences when decoding name/filename%0A→LF, %0D→CR, %22" — leaving every other byte (incl. %20/%25/%41 and unknown %xx) untouched, matching undici (%0A/%0D case-insensitive, %22 exact). The common path stays zero-copy — it only allocates when a % is present. Applied to both name and filename.

Decodes byte-identically to Node across name/filename and a Bun→Bun round-trip; literal percent sequences are preserved. Not a port regression — the Zig decoder omitted it too; Node is the reference.

@alii

alii commented May 28, 2026

Copy link
Copy Markdown
Member Author

@robobun adopt

@robobun

robobun commented May 28, 2026

Copy link
Copy Markdown
Collaborator
Updated 10:13 PM PT - Jun 1st, 2026

@robobun, your commit b8fa505 has 5 failures in Build #59698 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 31526

That installs a local version of the PR into your bun-31526 executable, so you can run:

bun-31526 --bun

@robobun

robobun commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Adopted ✅ — ready for merge; the one red lane is a sick CI host

Reproduced on released Bun: a multipart field/file name with a quote/CR/LF round-trips literal — name="x%22y" decodes to x%22y instead of x"y, a%0D%0Ab instead of a\r\nb. Fails on 1.4.0, passes here.

The fix decodes %22", %0D→CR, %0A→LF (CR/LF hex case-insensitive, matching undici), leaves every other %xx literal — verified across names, filenames, and a Bun→Bun Bun.serve round-trip (6 new tests). Reuse/placement/perf analysis + release A/B benchmark in the review thread (no-% path zero-copy, ~+4 ns/entry).

CI unblockers in this branch, each root-caused: CLDEMOTE hint-NOP in verify-baseline-static (newer UCRT strpbrk; windows-x64-baseline green since) and isDockerBuildEnabled() gating the dockerhub build tests (buildx-less agent; fix held).

Final CI state (build 59698, b8fa505ed8): 284/285 green. The only red is the darwin-14-x64 shard, and both of its attempts ran on the same host, macOS-14-x64-1 (apparently the only live agent in that pool):

  • attempt 1: agent-local docker-compose corruption — Postgres container missing the bun_sql_test role; autobahn's /config/fuzzingserver.json bind-mount is a directory on the host.
  • attempt 2 (retry, same host): terminal.test.ts PTY tests hung at 90s (the same ones failing on other PRs), then the job hit its wall clock.

Nothing red touches this diff. Needs ops to clean/reboot macOS-14-x64-1 (stale bind-mount dir + wedged PTY/docker state) and a re-run of that one job — or merge over it.

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@robobun, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 22 minutes and 10 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ed4a8322-819e-4d2d-bcda-d50c9e5149cc

📥 Commits

Reviewing files that changed from the base of the PR and between 5438c18 and b8fa505.

📒 Files selected for processing (10)
  • scripts/verify-baseline-static/CLAUDE.md
  • scripts/verify-baseline-static/README.md
  • scripts/verify-baseline-static/src/main.rs
  • src/runtime/webcore/FormData.rs
  • test/harness.ts
  • test/js/bun/test/parallel/test-docker-build-alpine.ts
  • test/js/bun/test/parallel/test-docker-build-debian-slim.ts
  • test/js/bun/test/parallel/test-docker-build-debian.ts
  • test/js/bun/test/parallel/test-docker-build-distroless.ts
  • test/js/web/html/FormData.test.ts

Walkthrough

This PR adds WHATWG percent-escape unescaping for multipart/form-data field names and filenames in Rust FormData parsing, with comprehensive test coverage validating round-trip behavior. It also extends the CPU baseline ISA verifier to allow CLDEMOTE on Nehalem-generation processors.

Changes

Multipart form-data name/filename percent-escape unescaping

Layer / File(s) Summary
Percent-escape unescaping helper
src/runtime/webcore/FormData.rs
unescape_form_data_name scans for %, decodes only %0A, %0D, and %22 sequences (case-insensitive hex), passes through unrecognized bytes unchanged, and returns None when no % is present to avoid allocation.
Multipart parsing integration
src/runtime/webcore/FormData.rs
Multipart entry processing applies unescaping to the name header and (for file parts) the filename before constructing ZigString keys and file blob names, falling back to raw slices when unescaping is not applicable.
Test coverage for multipart unescaping
test/js/web/html/FormData.test.ts
Added test suite asserting percent-decoding in field names and filenames (including %22, %0D/%0A, and lowercase variants), preservation of unrecognized sequences, and round-trip validation through Response.formData() and Bun.serve request flows.

CPU baseline ISA verification update

Layer / File(s) Summary
CLDEMOTE baseline feature allowlist
scripts/verify-baseline-static/src/main.rs
Added CpuidFeature::CLDEMOTE to NEHALEM_ALLOWED with documentation that it executes as a NOP on Nehalem, suppressing it as a post-baseline violation.

Suggested reviewers

  • cirospaciari
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change: reversing specific percent-escape sequences (%22, %0D, %0A) in multipart FormData field and file names, which directly matches the core objective of the PR.
Description check ✅ Passed The PR description comprehensively addresses the required template sections: it clearly explains what the PR does (reversing WHATWG escaping in multipart field/file names), how it was verified (case-insensitive matching, Bun→Bun round-trips, comparison with Node), and the implementation details including the zero-copy optimization.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — small, spec-compliant decoder fix that's the exact inverse of escape_form_data_name, with thorough test coverage.

Extended reasoning...

Overview

This PR adds unescape_form_data_name() to src/runtime/webcore/FormData.rs (~30 lines) and applies it to the name and filename values extracted from Content-Disposition headers during multipart parsing. It reverses exactly the three WHATWG escape sequences (%22", %0D→CR, %0A→LF, with case-insensitive hex on the latter two to match undici) and passes every other byte through unchanged — including other %xx sequences and a trailing % with fewer than two following bytes. The fast path returns None when no % is present so the common case stays zero-copy. I cross-checked it against escape_form_data_name in Blob.rs and it is the exact inverse. Six new tests in FormData.test.ts cover field names, filenames, lowercase hex, literal-percent preservation, and round-trips via both Response.formData() and Bun.serve.

Security risks

None identified. The decoded CR/LF land in JS-side FormData keys and File.name strings, not in outgoing HTTP headers; re-serialization goes back through escape_form_data_name, which re-escapes them. The existing Content-Type control-character guard in the parser is untouched. Bounds checking is correct (i + 2 < bytes.len() before slicing [i+1..i+3]), and the temporary Vec<u8> lives on the on_entry stack frame for the full duration of the ZigString borrow — same lifetime contract the original borrowed-slice code already relied on for append/append_blob.

Level of scrutiny

Low-to-moderate. This is a narrow, well-understood string-processing fix in user-facing-but-not-security-critical request-body parsing, matching documented WHATWG/undici behavior. The logic is short enough to verify by inspection and is exercised end-to-end by the new tests (which robobun confirmed pass on the debug build).

Other factors

  • No CODEOWNERS patterns match the modified files.
  • The bug-hunting system found no issues.
  • The CI failure on the first commit is a musl LTO "different data layouts" linker error — an infrastructure/toolchain issue, not caused by this 30-line change.
  • robobun independently reproduced the bug on released Bun and verified the fix.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any issues — the unescape logic and lifetimes look correct and well-tested — but since this changes how every req.formData() parses untrusted multipart input, it's worth a human sign-off on the behavior change.

Extended reasoning...

Overview

This PR adds unescape_form_data_name in src/runtime/webcore/FormData.rs to reverse the WHATWG %22/%0D/%0A escaping in multipart Content-Disposition name=/filename= values, and wires it into to_js_from_multipart_data's on_entry callback for both the field name and filename. It also adds six targeted tests in test/js/web/html/FormData.test.ts (decode, lowercase hex, leave-other-%xx-literal, Response round-trip, Bun.serve round-trip) and scales down an unrelated refcount stress loop on debug/ASAN builds to avoid timeouts.

The helper is the inverse of the existing escape_form_data_name in Blob.rs, returns None on the no-% fast path to stay zero-copy, and the bounds check (i + 2 < bytes.len()) and slice indexing are correct. The owned Vec<u8> for the unescaped name/filename outlives the borrowing ZigString through the append/append_blob FFI calls, so there's no lifetime hazard.

Security risks

This sits directly in the multipart body parser, which processes untrusted HTTP request bodies for every req.formData() call in Bun.serve. The decode itself is narrow (only three exact sequences, everything else literal) and matches undici/Node, and re-serialization goes back through escape_form_data_name, so I don't see a header-injection or parser-confusion vector. Still, any behavior change in untrusted-input parsing on the server hot path warrants a human look.

Level of scrutiny

Medium-high. The implementation is small and mechanical, but it changes observable runtime behavior for all FormData consumers (field names/filenames that previously came back as literal %22/%0D/%0A now decode). It is spec-mandated and Node-matching, so the behavior change is a fix rather than a regression risk, but it's still a user-visible semantics change in a core web API.

Other factors

No CODEOWNERS cover this path. The bug-hunting system found nothing. robobun reproduced the bug on 1.4.0, verified the fix under the debug build, and reports clippy/format/lint green. CI flagged failures on commit 06dae6a, which the follow-up commit eb567fb addresses by scaling the refcount stress loop on ASAN/debug — worth confirming the latest run is green before merge.

/// passes through unchanged. Returns `None` when the input contains no `%`
/// so the caller can keep borrowing the original slice without allocating.
/// Inverse of `escape_form_data_name` in `Blob.rs`.
fn unescape_form_data_name(bytes: &[u8]) -> Option<Vec<u8>> {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slop - @robobun can you find if this functionality already exists somewhere? Perhaps oven-sh/WebKit or other parts of code offer this? I am sure the components/behaviours to build a spec compliant FormData exist elsewhere already

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked — there is nothing spec-compliant to route through. Three places I checked:

  1. WebKit/WebCore FetchBodyConsumer::packageFormData (vendor/WebKit/.../Modules/fetch/FetchBodyConsumer.cpp) is the WHATWG multipart parser, but (a) it is not compiled into Bun — we link JSC only (--target jsc, ENABLE_STATIC_JSC), no WebCore — and (b) its parseParameters reads the name=/filename= value with a plain substring() and does no percent-decoding at all (no %22/%0D/%0A, no percentDecode). So even if we vendored it, it would not fix this. The actual parse path is our Rust to_js_from_multipart_data (exported as FormData__jsFunctionFromMultipartData), which JSDOMFormData.cpp calls into.

  2. General percent-decoders do existbun_url::PercentEncoding::decode (src/url/lib.rs:1408), resolver/data_url.rs, and the compiled decodeURLEscapeSequences (TextEncoding.cpp) — but they all decode every %xx. Reusing one would turn %20→space, %25%, %41A, which the WHATWG multipart algorithm deliberately leaves literal. That is exactly what the leaves other percent sequences literal test guards against, so they are not a correct drop-in.

  3. The encode side escape_form_data_name (Blob.rs:3988) only ever emits those three sequences and had no decode counterpart anywhere in the tree.

So the restricted 3-sequence decoder is the right shape — it is the exact inverse of escape_form_data_name, matching undici (%0A/%0D case-insensitive, %22 exact). Happy to move it next to escape_form_data_name in Blob.rs if you’d rather keep the encode/decode pair together.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robobun Can we re-use other code? or is this definitely the right place for this? We need to consider performance in hot-paths

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-use — definitively no, three places checked:

  • WebKit's multipart parser (FetchBodyConsumer::packageFormData) isn't compiled into Bun (we build --target jsc only, no WebCore), and its parseParameters doesn't percent-decode anyway — plain substring().
  • The in-repo decoders (bun_url::PercentEncoding::decode, decodeURLEscapeSequences) decode every %xx, which is wrong here — %20/%25/%41 must stay literal per WHATWG (the leaves other percent sequences literal test pins this).
  • Cargo.lock has zero multipart/percent-encoding crates — no third-party candidate either.

Right place — it's caller-adjacent: the decode sits next to its only caller (the multipart parser here), exactly mirroring escape_form_data_name, which sits next to its only caller (the multipart serializer in Blob.rs). Happy to hoist both into bun_core::form_data (next to get_boundary) if you want the pair together.

Hot-path cost — measured release A/B on this branch vs its base (same machine, best-of-3 × 3 passes, 58-entry / 9.5 KB body via Response.formData()):

case base this PR per entry
plain names (no %) 14.3–14.7 µs/parse 14.6–14.9 µs/parse +~4 ns (one Highway memchr; zero-copy, no alloc)
all names contain %22 ~14.7 µs/parse (wrong output) ~16.2 µs/parse +~26 ns (one exact-size alloc + single-pass copy)

The common case stays zero-copy — the name slice keeps borrowing the input; only names actually containing % allocate, once, and that's the path that was returning corrupt names before. The boundary scan over the body dominates either way. Same scan-then-maybe-copy shape as escape_form_data_name.

@alii

alii commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@robobun rebase, also check my comments in the thread

@robobun robobun force-pushed the ali/multipart-unescape-names branch from eb567fb to 38171d9 Compare June 1, 2026 20:35
Comment thread scripts/verify-baseline-static/src/main.rs
@alii

alii commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@robobun the darwin-14-x64-test-bun job went red after main moved (same job is failing on other unrelated PRs, so it looks like a flaky/broken darwin test rather than this change). Please rebase onto main and re-run CI; if it's a real failure in this test, investigate.

alii and others added 4 commits June 1, 2026 22:48
…ange

- Rewrite the `%`-presence guard in `unescape_form_data_name` with `?` so
  `cargo clippy` (which denies `question_mark`) passes.
- Revert FormData.zig: it is an uncompiled porting reference, so the decode
  logic only belongs in the live FormData.rs.
Newer Windows UCRT builds emit unguarded CLDEMOTE in str* routines
(observed in strpbrk on the windows-x64-baseline scan). CLDEMOTE
(NP 0F 1C /0) is encoded in the reserved hint-NOP space (0F 18-1F), so
processors that don't enumerate CPUID.(EAX=7,ECX=0):ECX[25] execute it
as a NOP — it cannot fault on Nehalem, and the SDE emulator phase
already passes those bytes. Same rationale as the existing CET_IBT
entry, which occupies the same NOP-compatible encoding space.
… lists

Add the CLDEMOTE bullet alongside CET_IBT in README.md and CLAUDE.md,
and replace the stale src/main.rs line-range reference with the symbol
names so it stops rotting with edits.
@robobun robobun force-pushed the ali/multipart-unescape-names branch from b5483b9 to bf3f2f0 Compare June 1, 2026 22:49
The four test-docker-build-* tests pass --progress=plain and the
distroless Dockerfile uses RUN --mount and heredocs — all BuildKit-only.
On an agent whose docker has a daemon but no buildx component (observed
on a darwin runner), docker build falls back to the legacy builder and
fails with 'unknown flag: --progress' (exit 125) before building
anything. Add isDockerBuildEnabled(): docker additionally needs
'docker buildx version' to succeed; podman's Buildah supports the same
syntax natively.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants