Fix cross-request data leakage from base64 image cache collision by sooth · Pull Request #126 · waybarrios/vllm-mlx

sooth · 2026-02-28T20:02:15Z

Summary

save_base64_image() only hashed the first 1000 characters of base64 strings for its temp file cache key (md5(base64_string[:1000])). JPEG images from the same PDF renderer share identical headers (SOI marker, EXIF, quantization tables), causing different images to collide in the cache and return a previous request's temp file. The model then processes the wrong image entirely.
vision_embedding_cache.py only hashed the first 64KB of image file content, which could similarly collide for large images with shared headers.
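
The collision described above can be reproduced in isolation. This is a minimal sketch, not the project's code: `old_cache_key` is a hypothetical stand-in built from the `md5(base64_string[:1000])` expression quoted in the summary, the `.encode()` call is an assumption, and `SHARED_HEADER` is made-up filler standing in for the bytes two JPEGs from the same renderer have in common.

```python
import base64
import hashlib

# Assumption: 1024 filler bytes stand in for the shared JPEG header
# (SOI marker, EXIF, quantization tables). Real headers differ in content.
SHARED_HEADER = b"\xff\xd8" + bytes(1022)


def old_cache_key(base64_string: str) -> str:
    """Hypothetical pre-fix key: MD5 over only the first 1000 base64 chars."""
    return hashlib.md5(base64_string[:1000].encode()).hexdigest()


# Two different "images" that differ only after the shared header.
image_a = base64.b64encode(SHARED_HEADER + b"invoice A pixel data").decode()
image_b = base64.b64encode(SHARED_HEADER + b"invoice B pixel data").decode()

images_differ = image_a != image_b
keys_collide = old_cache_key(image_a) == old_cache_key(image_b)

print(images_differ, keys_collide)
```

Because both strings share their first 1000 characters, the truncated key cannot tell them apart, so a cache keyed this way would hand back whichever temp file was written first.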

Fix

Hash the full base64 string with SHA-256 instead of MD5 on a truncated prefix
Hash full file content in the vision embedding cache
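
A rough sketch of the first fix, assuming the key is derived with Python's `hashlib`. The function name `new_cache_key` and the sample strings are illustrative and do not come from the diff.

```python
import hashlib


def new_cache_key(base64_string: str) -> str:
    """Illustrative post-fix key: SHA-256 over the entire base64 string."""
    return hashlib.sha256(base64_string.encode()).hexdigest()


# Strings that agree on their first 1000 characters but differ afterwards.
common_prefix = "A" * 1000
key_a = new_cache_key(common_prefix + "image-A-tail")
key_b = new_cache_key(common_prefix + "image-B-tail")

print(key_a != key_b)  # the full-input hash separates the two strings
print(len(key_a))      # SHA-256 hex digests are 64 characters long
```

Since every character feeds the digest, two inputs only share a key if they are identical or if SHA-256 itself collides, which no one has demonstrated in practice.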

Reproduction

Send a vision request with base64 image A (e.g., a PDF invoice page)
Send a second request with base64 image B (different invoice, same PDF renderer/quality)
The second request returns data from image A — the model literally received image A's pixels

Test plan

Existing tests pass (111 passed)
Manual test: send two different invoice images sequentially, verify correct extraction for both

…sion

Cherry-pick upstream merged fixes + local improvements:
- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first 1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Thump604

I've reviewed the diff carefully. This is a legitimate security fix for cross-request data leakage.

The Bug

Original code hashed only the first 1000 characters of base64 strings. For JPEG images from the same PDF renderer/scanner, the first ~730 bytes of decoded data (SOI marker, EXIF, quantization tables) are often identical. After base64 encoding, this created hash collisions—causing the cache to return a previous request's temp file. The model received wrong image pixels entirely, a data leakage vulnerability.
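
The relationship between the 1000-character prefix and the roughly 730 decoded bytes follows from base64 arithmetic: every 4 characters encode 3 bytes. The sketch below works through it. The `data:image/jpeg;base64,` prefix is an assumption about what the string may have contained; the review does not say whether it was present.

```python
import base64

PREFIX_CHARS = 1000
# Assumption: the hashed string may still carry a data-URL prefix.
DATA_URL_PREFIX = "data:image/jpeg;base64,"


def decoded_bytes_covered(prefix_chars: int, non_payload_chars: int = 0) -> int:
    """Number of whole decoded bytes covered by a base64 prefix (4 chars -> 3 bytes)."""
    return (prefix_chars - non_payload_chars) * 3 // 4


bare = decoded_bytes_covered(PREFIX_CHARS)
with_data_url = decoded_bytes_covered(PREFIX_CHARS, len(DATA_URL_PREFIX))

print(bare)           # 750
print(with_data_url)  # 732

# Cross-check: 750 raw bytes encode to exactly 1000 base64 characters.
print(len(base64.b64encode(bytes(750))))  # 1000
```

Either way, the hashed prefix covers well under 1 KB of image data, which is easily filled by header material that images from the same renderer share.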

The Fix: Correct

mllm.py: Full base64 string with SHA-256 instead of md5(prefix). Different images now produce different cache keys, eliminating collision.

vision_embedding_cache.py: Full file content instead of first 64KB. Same logic, same result.

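Hash choice (SHA-256 vs MD5): The truncation was the root cause, so either hash over the full input would avoid these accidental collisions. SHA-256 is still the safer choice: MD5 has known practical collision attacks and the inputs here are user-supplied. The extra hashing cost is small next to image decoding and inference.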

Assessment

Cache key changes from 32 to 64 chars (negligible memory impact)
File-hashing reads full image into memory on cache miss (acceptable one-time cost, then cached)
Cache hits unchanged (existing checks for key presence + file existence still work)
No other leakage vectors: cache invalidation, temp cleanup, cross-model state all correct
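
On the memory point above: if reading a whole file at once ever became a concern, the same digest can be computed incrementally. This is an optional alternative sketched under assumptions, not what the PR does; `hash_file_chunked` and the 1 MiB chunk size are illustrative choices.

```python
import hashlib
import os
import tempfile


def hash_file_chunked(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's full content, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Check the chunked digest against a one-shot digest of the same bytes.
payload = os.urandom(3 * (1 << 20) + 17)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    chunked = hash_file_chunked(tmp.name)
    whole = hashlib.sha256(payload).hexdigest()
finally:
    os.unlink(tmp.name)

print(chunked == whole)
# Hex digest lengths behind the "32 to 64 chars" note.
print(len(hashlib.md5(b"").hexdigest()), len(hashlib.sha256(b"").hexdigest()))
```

The chunked and one-shot digests are identical, so switching between them would not invalidate cache keys.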

Test Coverage

Existing tests pass (111). The manual test plan (send two invoice images, verify correct extraction) is still unchecked. Consider adding a regression test that explicitly triggers the collision scenario—not blocking, but recommended to prevent this from creeping back.
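
One possible shape for such a regression test is sketched below. It is self-contained and does not import the project: `TempFileCache`, `make_image`, and both key functions are hypothetical stand-ins, so a real test would call `save_base64_image()` directly and compare the returned file contents instead.

```python
import base64
import hashlib


def truncated_md5_key(b64: str) -> str:
    """Stand-in for the pre-fix key (first 1000 chars, MD5)."""
    return hashlib.md5(b64[:1000].encode()).hexdigest()


def full_sha256_key(b64: str) -> str:
    """Stand-in for the post-fix key (full string, SHA-256)."""
    return hashlib.sha256(b64.encode()).hexdigest()


class TempFileCache:
    """Hypothetical cache: stores decoded bytes under a key derived from the input."""

    def __init__(self, key_fn):
        self._key_fn = key_fn
        self._store = {}

    def save(self, b64: str) -> bytes:
        key = self._key_fn(b64)
        if key not in self._store:  # cache miss: decode and remember
            self._store[key] = base64.b64decode(b64)
        return self._store[key]     # cache hit: return whatever was stored first


def make_image(body: bytes) -> str:
    """Fake image: a shared 1024-byte header followed by a distinct body."""
    header = b"\xff\xd8" + bytes(1022)
    return base64.b64encode(header + body).decode()


def test_full_hash_keeps_images_separate():
    cache = TempFileCache(full_sha256_key)
    a, b = make_image(b"invoice A"), make_image(b"invoice B")
    assert cache.save(a) == base64.b64decode(a)
    assert cache.save(b) == base64.b64decode(b)


def test_truncated_hash_reproduces_the_leak():
    cache = TempFileCache(truncated_md5_key)
    a, b = make_image(b"invoice A"), make_image(b"invoice B")
    cache.save(a)
    # With the old key, the second request gets the first request's bytes.
    assert cache.save(b) == base64.b64decode(a)
```

Both functions follow pytest naming, so they would be collected automatically if dropped into a test module.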

Approving. Ready to merge.

Thump604 · 2026-04-08T00:23:56Z

@waybarrios, @sooth: independent technical review.

Verification of the bug

The bug is real and reproducible. save_base64_image() previously hashed only base64_string[:1000] with MD5 to key the cache, and compute_image_hash() in vision_embedding_cache.py read only the first 64KB of file content. JPEG images from the same PDF renderer share identical SOI markers, EXIF blocks, and quantization tables at the start of the file, so different images can collide on the cache key and return a previous request's image. This is a real cross-request data leak in any deployment that handles user-supplied base64 images.

The fix is correct

save_base64_image(): hash the full base64 string with SHA-256 instead of the first 1000 chars with MD5. SHA-256 is also a stronger hash function with no known practical collision attacks.
vision_embedding_cache.py:compute_image_hash(): read the full file content instead of first 64KB. Same root cause, same fix shape.
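
The file-content half of the fix can be illustrated the same way. This is a sketch under assumptions: `hash_prefix` and `hash_full` are illustrative names, SHA-256 is used for both (the review does not say which hash `compute_image_hash()` uses), and the files are synthetic.

```python
import hashlib
import os
import tempfile

PREFIX_BYTES = 64 * 1024  # the 64KB window the old code read


def hash_prefix(path: str) -> str:
    """Illustrative pre-fix behaviour: hash only the first 64KB."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(PREFIX_BYTES)).hexdigest()


def hash_full(path: str) -> str:
    """Illustrative post-fix behaviour: hash the whole file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def write_temp(data: bytes) -> str:
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
        tmp.write(data)
    return tmp.name


# Two synthetic files that agree on the first 64KB and differ afterwards.
shared = b"\xff\xd8" + bytes(PREFIX_BYTES - 2)
path_a = write_temp(shared + b"tail of image A")
path_b = write_temp(shared + b"tail of image B")
try:
    prefix_collides = hash_prefix(path_a) == hash_prefix(path_b)
    full_differs = hash_full(path_a) != hash_full(path_b)
finally:
    os.unlink(path_a)
    os.unlink(path_b)

print(prefix_collides, full_differs)
```

Any two files whose first 64KB match would collide under the prefix hash regardless of which hash function is used, which is why the fix changes how much is read rather than the algorithm.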

The diff is 8 lines across 2 files. Backward compatibility: old cache entries become invalid (different hash), so the cache effectively resets on first use after the upgrade. That is the correct behavior because old entries are unsafe.

Status

PR currently shows CONFLICTING merge status. Likely a trivial textual conflict on vllm_mlx/models/mllm.py since the file has had other changes since this branch was created. Would need a quick rebase.

This is a real cross-request data leakage bug and should land. Recommend rebase + merge.

Thump604 · 2026-04-13T19:16:36Z

Hey @sooth — this is a good fix for a real data leakage vector. Ready to merge once rebased onto current main. The conflict should be straightforward. Can you rebase when you get a chance?

`save_base64_image()` cached temp file paths using `md5(base64_string[:1000])` — only the first 1000 characters of the base64 string. For JPEG images rendered by the same PDF converter or scanner, the first ~730 bytes of decoded image data (SOI marker, EXIF headers, quantization tables) are often identical. This caused different images to produce the same cache key, returning a previous request's temp file. The result was that the model received the wrong image pixels entirely, generating output based on a prior request's image — a data leakage bug. Fix: hash the full base64 string with SHA-256 instead of MD5 on a truncated prefix. Also fix `vision_embedding_cache.py` to hash full file content instead of the first 64KB.

sooth · 2026-04-14T14:37:41Z

It has been updated

Thump604 approved these changes Mar 31, 2026

Thump604 mentioned this pull request Apr 8, 2026: Security audit: authentication bypass, SSRF, and other vulnerabilities #68 (Open)

sooth force-pushed the fix/base64-image-cache-collision branch from e72af12 to eae87cc on April 14, 2026 14:36

Thump604 merged commit 22050b1 into waybarrios:main Apr 14, 2026
7 checks passed
