Fix cross-request data leakage from base64 image cache collision#126
Conversation
…sion Cherry-pick upstream merged fixes + local improvements: - PR waybarrios#92: Eager eval batch.tokens after mx.concatenate() to release Metal AGXAllocation handles; adaptive cache clear interval scales with concurrency - PR waybarrios#154: Drain self.requests dict in MLLM _cleanup_finished() to prevent linear memory growth; add mx.clear_cache() after cleanup - PR waybarrios#157: Port adaptive periodic mx.clear_cache() from LLM scheduler to MLLM scheduler (interval scales inversely with active sequences) - PR waybarrios#124: Forward tool definitions through MLLM chat/stream_chat paths in SimpleEngine and MLXMultimodalLM (get_chat_template) - PR waybarrios#126: Hash full base64 string with SHA-256 instead of MD5 on first 1000 chars to prevent cross-request image cache collisions Additional fixes: - batched.py: Disable thinking mode for coder models, promote MLLM stats - mllm_batch_generator.py: Downgrade prompt size guard from error to warning - qwen3_parser.py: Treat tagless output as reasoning (max_tokens truncation) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thump604
left a comment
There was a problem hiding this comment.
I've reviewed the diff carefully. This is a legitimate security fix for cross-request data leakage.
The Bug
Original code hashed only the first 1000 characters of base64 strings. For JPEG images from the same PDF renderer/scanner, the first ~730 bytes of decoded data (SOI marker, EXIF, quantization tables) are often identical. After base64 encoding, this created hash collisions—causing the cache to return a previous request's temp file. The model received wrong image pixels entirely, a data leakage vulnerability.
The Fix: Correct
mllm.py: Full base64 string with SHA-256 instead of md5(prefix). Different images now produce different cache keys, eliminating collision.
vision_embedding_cache.py: Full file content instead of first 64KB. Same logic, same result.
Hash choice (SHA-256 vs MD5): Both work here since we only need collision resistance, not cryptographic security. SHA-256 is safer and adds zero runtime cost.
Assessment
- Cache key changes from 32 to 64 chars (negligible memory impact)
- File-hashing reads full image into memory on cache miss (acceptable one-time cost, then cached)
- Cache hits unchanged (existing checks for key presence + file existence still work)
- No other leakage vectors: cache invalidation, temp cleanup, cross-model state all correct
Test Coverage
Existing tests pass (111). The manual test plan (send two invoice images, verify correct extraction) is still unchecked. Consider adding a regression test that explicitly triggers the collision scenario—not blocking, but recommended to prevent this from creeping back.
Approving. Ready to merge.
|
@waybarrios, @sooth: independent technical review. Verification of the bugThe bug is real and reproducible. The fix is correct
The diff is 8 lines across 2 files. Backward compatibility: old cache entries become invalid (different hash), so the cache effectively resets on first use after the upgrade. That is the correct behavior because old entries are unsafe. StatusPR currently shows CONFLICTING merge status. Likely a trivial textual conflict on This is a real cross-request data leakage bug and should land. Recommend rebase + merge. |
|
Hey @sooth — this is a good fix for a real data leakage vector. Ready to merge once rebased onto current main. The conflict should be straightforward. Can you rebase when you get a chance? |
`save_base64_image()` cached temp file paths using `md5(base64_string[:1000])` — only the first 1000 characters of the base64 string. For JPEG images rendered by the same PDF converter or scanner, the first ~730 bytes of decoded image data (SOI marker, EXIF headers, quantization tables) are often identical. This caused different images to produce the same cache key, returning a previous request's temp file. The result was that the model received the wrong image pixels entirely, generating output based on a prior request's image — a data leakage bug. Fix: hash the full base64 string with SHA-256 instead of MD5 on a truncated prefix. Also fix `vision_embedding_cache.py` to hash full file content instead of the first 64KB.
e72af12 to
eae87cc
Compare
|
It has been updated |
Summary
save_base64_image()only hashed the first 1000 characters of base64 strings for its temp file cache key (md5(base64_string[:1000])). JPEG images from the same PDF renderer share identical headers (SOI marker, EXIF, quantization tables), causing different images to collide in the cache and return a previous request's temp file. The model then processes the wrong image entirely.vision_embedding_cache.pyonly hashed the first 64KB of image file content, which could similarly collide for large images with shared headers.Fix
Reproduction
Test plan