Improve multimodal hasher performance for re-used Image prompts #22825
Conversation
When processing batches of questions with multiple images each, MultiModalHasher re-computes the hash for every image in each prompt. This is quite expensive and it blocks the ingress thread.

This change allows applying a UUID tag to any Image. If one is detected, the hasher uses the UUID bytes rather than hashing the actual image.

To avoid conflicts, only real UUID objects are supported, though that could theoretically be opened up to arbitrary inputs as well. The ImageID EXIF tag used here could theoretically serve this purpose even then, since it should uniquely identify the image.

This PR does not modify any Image loading code to use this, but adds a test case to validate the functionality.

Signed-off-by: Staszek Pasko <[email protected]>
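For illustration, a minimal sketch of the intended usage (not code from this PR; it assumes Pillow's ExifTags.Base.ImageID enum and an in-memory PIL Image supplied directly to the engine, as in the unit test):

import uuid

from PIL import ExifTags, Image

img = Image.new("RGB", (512, 512))          # stand-in for a real, re-used image
exif = img.getexif()                        # Pillow's in-memory EXIF mapping
exif[ExifTags.Base.ImageID] = uuid.uuid4()  # a real UUID object, not a string

# On the hashing side, the tag can be read back and its 16 raw bytes used
# as the content key instead of hashing every pixel:
tag = img.getexif().get(ExifTags.Base.ImageID)
if isinstance(tag, uuid.UUID):
    key_material = tag.bytes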
Code Review
This pull request introduces a performance optimization for hashing multimodal image prompts by using a UUID from the image's EXIF data as a cache key, avoiding expensive re-computation of image hashes. The implementation is sound and the included tests validate the new functionality.
My main feedback is a critical issue regarding the handling of images that do not support EXIF data. The current implementation can cause a crash, which is a regression. I've provided a code suggestion to handle this case gracefully by catching the potential AttributeError.
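The suggested fallback is roughly of the following shape; this is only a sketch of the idea (the exact suggestion lives in the inline review), where obj is the image object being hashed:

# Some image-like objects do not support EXIF data and may not implement
# getexif(); fall back to an empty mapping so the full image content is
# hashed as before instead of raising AttributeError.
try:
    exif = obj.getexif()
except AttributeError:
    exif = {}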
Thanks for adding this support! cc @huachenheli Let's take some time to think about whether this would introduce any security vulnerabilities, since our processor cache, embedding cache and prefix cache all use this hash, and your PR would allow users to submit their own hash.
Also cc @russellb regarding this
In the current state these identifiers can only be provided as part of the Image, but cannot be loaded from storage (loading would render the tag as str/bytes and not UUID). So it should be quite contained from a security perspective.
I guess it is possible to have incorrect outputs if two different users use the same UUID for their image by manually editing the metadata. But it's arguable that this is the user's own fault for messing with the metadata like that, since that field is supposed to be unique when auto-generated.
The code checks whether the EXIF tags contain a UUID object, not just metadata that looks like one. The only way this works is with an engine-level integration where the user supplies the Image objects (see how this works in the unit test). I've added this integration to https://github.com/p88h/fake-vqa as well, which shows how this behaves with real batched multimodal scenarios (multiple prompts referring to the same set of Images).

In the API / interactive chat world, any images need to be fetched using the media connector, and even if you use the same image over time it gets re-downloaded, so hashing is not really important there. Hence the check for a UUID object: it's not possible to just edit metadata on the source file, since loading it will not produce one.
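To illustrate that last point (hypothetical file name, purely to show the type check):

import uuid

from PIL import ExifTags, Image

# When an image is loaded from storage, any ImageID metadata comes back as
# str/bytes, never as a uuid.UUID instance, so editing the file's EXIF
# cannot satisfy the isinstance(..., uuid.UUID) check used by the hasher.
with Image.open("edited_exif.jpg") as img:   # hypothetical file
    tag = img.getexif().get(ExifTags.Base.ImageID)
    print(isinstance(tag, uuid.UUID))        # expected: False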
DarkLight1337 left a comment
Yeah I think this should be fine then, thanks for contributing!
if Image.ExifTags.Base.ImageID in exif and isinstance(
        exif[Image.ExifTags.Base.ImageID], uuid.UUID):
    # If the image has exif ImageID tag, use that
    return exif[Image.ExifTags.Base.ImageID].bytes
nit: Explicit if-else separation for better readability:

if Image.ExifTags.Base.ImageID in exif and isinstance(
        exif[Image.ExifTags.Base.ImageID], uuid.UUID):
    # If the image has exif ImageID tag, use that
    return exif[Image.ExifTags.Base.ImageID].bytes
else:
    return cls.item_to_bytes(
        "image", np.asarray(convert_image_mode(obj, "RGBA")))
It's possible for different users to reuse the same UUID, and that can be useful in the case of repeated media across different requests, so supporting it makes sense. I think your concern about security is a good point! In the case of multi-tenant APIs this can potentially open vulnerabilities (i.e. one user spoofing a UUID to read another user's image). Maybe we should disable this feature by default and have it controlled by a flag?
Can you merge from main to fix the CI?
FYI, we have discussed this offline and determined that this is not an issue, because a potential attacker would already need access to the image in the first place in order to determine the hash to use.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.

Purpose
Speed up image hashing when re-using the same Images across multiple prompts.
Test Plan
Tested with https://github.com/p88h/fake-vqa to check multi-prompt processing performance.
Test Result
In local testing with 64 images per prompt, the time to just add the requests to the engine was about 2 seconds per prompt. After this change, it drops to about 0.1 seconds per prompt, roughly a 20x improvement.
(Optional) Documentation Update