
Fix cross-request data leakage from base64 image cache collision#126

Merged
Thump604 merged 1 commit into waybarrios:main from sooth:fix/base64-image-cache-collision
Apr 14, 2026

Conversation

@sooth
Contributor

@sooth sooth commented Feb 28, 2026

Summary

  • save_base64_image() only hashed the first 1000 characters of base64 strings for its temp file cache key (md5(base64_string[:1000])). JPEG images from the same PDF renderer share identical headers (SOI marker, EXIF, quantization tables), causing different images to collide in the cache and return a previous request's temp file. The model then processes the wrong image entirely.
  • vision_embedding_cache.py only hashed the first 64KB of image file content, which could similarly collide for large images with shared headers.
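The collision can be reproduced without any real images; a minimal sketch with hypothetical payloads standing in for two JPEGs that share a renderer header:

```python
import hashlib

# Hypothetical illustration: two distinct base64 payloads that share a long
# common prefix, as JPEGs from the same PDF renderer do (identical SOI
# marker, EXIF block, and quantization tables).
shared_header = "/9j/" + "A" * 1200           # stands in for identical encoded headers
image_a = shared_header + "pixels-of-invoice-A"
image_b = shared_header + "pixels-of-invoice-B"

# Old cache key: MD5 over only the first 1000 characters.
key_a = hashlib.md5(image_a[:1000].encode()).hexdigest()
key_b = hashlib.md5(image_b[:1000].encode()).hexdigest()
assert key_a == key_b   # collision: request B is handed request A's temp file
```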

Fix

  • Hash the full base64 string with SHA-256 instead of MD5 on a truncated prefix
  • Hash full file content in the vision embedding cache
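A minimal sketch of the fixed keying scheme (`save_base64_image` is the function named in the PR; the module-level cache dict and temp-file details here are assumptions for illustration, not the PR's exact code):

```python
import base64
import hashlib
import tempfile
from pathlib import Path

_temp_file_cache: dict[str, str] = {}   # cache key -> temp file path (assumed structure)

def save_base64_image(base64_string: str) -> str:
    """Decode a base64 image to a temp file, keyed by SHA-256 of the FULL string."""
    key = hashlib.sha256(base64_string.encode()).hexdigest()
    cached = _temp_file_cache.get(key)
    if cached and Path(cached).exists():
        return cached                    # cache hit only for an identical full payload
    data = base64.b64decode(base64_string)
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        f.write(data)
    _temp_file_cache[key] = f.name
    return f.name
```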

Reproduction

  1. Send a vision request with base64 image A (e.g., a PDF invoice page)
  2. Send a second request with base64 image B (different invoice, same PDF renderer/quality)
  3. The second request returns data from image A — the model literally received image A's pixels

Test plan

  • Existing tests pass (111 passed)
  • Manual test: send two different invoice images sequentially, verify correct extraction for both

janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Mar 21, 2026
…sion

Cherry-pick upstream merged fixes + local improvements:

- PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal
  AGXAllocation handles; adaptive cache clear interval scales with concurrency
- PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent
  linear memory growth; add mx.clear_cache() after cleanup
- PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to
  MLLM scheduler (interval scales inversely with active sequences)
- PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths
  in SimpleEngine and MLXMultimodalLM (get_chat_template)
- PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first
  1000 chars to prevent cross-request image cache collisions

Additional fixes:
- batched.py: Disable thinking mode for coder models, promote MLLM stats
- mllm_batch_generator.py: Downgrade prompt size guard from error to warning
- qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@Thump604 Thump604 left a comment


I've reviewed the diff carefully. This is a legitimate security fix for cross-request data leakage.

The Bug

Original code hashed only the first 1000 characters of base64 strings. For JPEG images from the same PDF renderer/scanner, the first ~730 bytes of decoded data (SOI marker, EXIF, quantization tables) are often identical. After base64 encoding, this created hash collisions, causing the cache to return a previous request's temp file. The model received entirely the wrong image pixels, a data leakage vulnerability.

The Fix: Correct

mllm.py: Full base64 string with SHA-256 instead of md5(prefix). Different images now produce different cache keys, eliminating collision.

vision_embedding_cache.py: Full file content instead of first 64KB. Same logic, same result.

Hash choice (SHA-256 vs MD5): Both work here since the cache only needs to avoid accidental collisions, not resist an adversary. SHA-256 is safer regardless and adds negligible runtime cost.

Assessment

  • Cache key changes from 32 to 64 chars (negligible memory impact)
  • File-hashing reads full image into memory on cache miss (acceptable one-time cost, then cached)
  • Cache hits unchanged (existing checks for key presence + file existence still work)
  • No other leakage vectors: cache invalidation, temp cleanup, cross-model state all correct

Test Coverage

Existing tests pass (111). The manual test plan (send two invoice images, verify correct extraction) is still unchecked. Consider adding a regression test that explicitly triggers the collision scenario—not blocking, but recommended to prevent this from creeping back.
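Such a regression test could look like this (`cache_key` is a hypothetical stand-in for the PR's keying logic, written inline so the test is self-contained):

```python
import hashlib

def cache_key(base64_string: str) -> str:
    # Mirrors the fixed scheme: SHA-256 over the full string.
    return hashlib.sha256(base64_string.encode()).hexdigest()

def test_shared_prefix_images_do_not_collide():
    prefix = "/9j/" + "A" * 996          # 1000 chars of JPEG-like shared header
    img_a, img_b = prefix + "tailA", prefix + "tailB"
    # New scheme: distinct keys for distinct payloads.
    assert cache_key(img_a) != cache_key(img_b)
    # Old scheme (md5 of first 1000 chars) would have collided:
    old = lambda s: hashlib.md5(s[:1000].encode()).hexdigest()
    assert old(img_a) == old(img_b)

test_shared_prefix_images_do_not_collide()
```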

Approving. Ready to merge.

@Thump604
Collaborator

Thump604 commented Apr 8, 2026

@waybarrios, @sooth: independent technical review.

Verification of the bug

The bug is real and reproducible. save_base64_image() previously hashed only base64_string[:1000] with MD5 to key the cache, and compute_image_hash() in vision_embedding_cache.py only read the first 64KB of file content. JPEG images from the same PDF renderer share identical SOI markers, EXIF blocks, and quantization tables in their first ~1KB to 64KB, so different images collide on the cache key and return a previous request's image. This is real cross-request data leakage in any deployment that handles user-supplied base64 images.

The fix is correct

  • save_base64_image(): hash the FULL base64 string with SHA-256 instead of the first 1000 chars with MD5. SHA-256 is also a stronger hash, with no known practical collision attacks (unlike MD5).
  • vision_embedding_cache.py:compute_image_hash(): read the full file content instead of first 64KB. Same root cause, same fix shape.
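If reading the whole file into memory on a cache miss ever becomes a concern for very large images, the full-content hash can be computed in bounded memory; a sketch (the `compute_image_hash` name is from the PR, the chunked reading is an assumption, not the PR's implementation):

```python
import hashlib

def compute_image_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the FULL file content (not just the first 64KB) in fixed-size
    chunks so peak memory stays bounded regardless of image size."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```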

The diff is 8 lines across 2 files. Backward compatibility: old cache entries become invalid (different hash), so the cache effectively resets on first use after the upgrade. That is the correct behavior because old entries are unsafe.

Status

PR currently shows CONFLICTING merge status. Likely a trivial textual conflict on vllm_mlx/models/mllm.py since the file has had other changes since this branch was created. Would need a quick rebase.

This is a real cross-request data leakage bug and should land. Recommend rebase + merge.

@Thump604
Collaborator

Hey @sooth — this is a good fix for a real data leakage vector. Ready to merge once rebased onto current main. The conflict should be straightforward. Can you rebase when you get a chance?

`save_base64_image()` cached temp file paths using
`md5(base64_string[:1000])` — only the first 1000 characters of the
base64 string. For JPEG images rendered by the same PDF converter or
scanner, the first ~730 bytes of decoded image data (SOI marker, EXIF
headers, quantization tables) are often identical. This caused different
images to produce the same cache key, returning a previous request's
temp file.

The result was that the model received the wrong image pixels entirely,
generating output based on a prior request's image — a data leakage bug.

Fix: hash the full base64 string with SHA-256 instead of MD5 on a
truncated prefix. Also fix `vision_embedding_cache.py` to hash full
file content instead of the first 64KB.
@sooth sooth force-pushed the fix/base64-image-cache-collision branch from e72af12 to eae87cc on April 14, 2026 at 14:36
@sooth
Contributor Author

sooth commented Apr 14, 2026

It has been updated

@Thump604 Thump604 merged commit 22050b1 into waybarrios:main Apr 14, 2026
7 checks passed