Conversation

@p88h
Contributor

@p88h p88h commented Aug 13, 2025

When processing batches of questions with multiple images each, MultiModalHasher recomputes the hash for every image in each prompt. This is quite expensive, and it blocks the ingress thread.

This change allows applying a UUID tag to any Image. If one is detected, the hasher uses the UUID bytes rather than the actual image data.

To avoid conflicts, only real UUID objects are supported, though this could in theory be opened up to arbitrary inputs as well. The ImageID ExifTag used here could in theory serve that purpose even then, since it is supposed to uniquely identify the image.

This PR does not modify any Image loading code to use this, but adds a test case to validate the functionality.
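
As a rough illustration of the shortcut described above, the sketch below mimics the hasher's decision using only the standard library. `FakeImage`, `IMAGE_ID_TAG`, and `image_hash` are hypothetical stand-ins, not vLLM or Pillow names; the actual change reads the tag from the image's EXIF data via `Image.ExifTags.Base.ImageID`.

```python
import hashlib
import uuid
from dataclasses import dataclass, field

# Hypothetical key standing in for Pillow's Image.ExifTags.Base.ImageID.
IMAGE_ID_TAG = "ImageID"


@dataclass
class FakeImage:
    """Hypothetical stand-in for an image: raw pixel bytes plus EXIF-like tags."""
    pixels: bytes
    exif: dict = field(default_factory=dict)


def image_hash(img: FakeImage) -> str:
    """Hash the UUID bytes when a real uuid.UUID tag is present, else the pixels."""
    tag = img.exif.get(IMAGE_ID_TAG)
    if isinstance(tag, uuid.UUID):
        payload = tag.bytes   # 16 bytes, independent of image size
    else:
        payload = img.pixels  # full pixel data, the expensive path
    return hashlib.sha256(payload).hexdigest()


shared_id = uuid.uuid4()
tagged_a = FakeImage(pixels=b"\x00" * 1024, exif={IMAGE_ID_TAG: shared_id})
tagged_b = FakeImage(pixels=b"\xff" * 1024, exif={IMAGE_ID_TAG: shared_id})
untagged = FakeImage(pixels=b"\x00" * 1024)

# Same UUID gives the same hash, regardless of pixel content.
assert image_hash(tagged_a) == image_hash(tagged_b)
# Without a tag, the hash is derived from the pixels instead.
assert image_hash(untagged) != image_hash(tagged_a)
```

In this sketch, as in the PR, the caller is responsible for giving distinct images distinct UUIDs; the hasher does not verify that a tag matches the pixels.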

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Speed up image hashing when reusing the same Images across multiple prompts.

Test Plan

Tested with https://github.com/p88h/fake-vqa to check multi-prompt processing performance.

Test Result

In local testing with 64 images per prompt, just adding the requests to the engine took about 2 seconds per prompt. After the change, this drops to about 0.1 seconds per prompt, roughly a 20x improvement.

(Optional) Documentation Update

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Staszek Pasko <[email protected]>
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Aug 13, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a performance optimization for hashing multimodal image prompts by using a UUID from the image's EXIF data as a cache key, avoiding expensive re-computation of image hashes. The implementation is sound and the included tests validate the new functionality.

My main feedback is a critical issue regarding the handling of images that do not support EXIF data. The current implementation can cause a crash, which is a regression. I've provided a code suggestion to handle this case gracefully by catching the potential AttributeError.
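
The reviewer's actual code suggestion is not reproduced in this thread. As a hedged sketch of the idea only, the snippet below shows one way a lookup could fall back gracefully when an object has no EXIF accessor. `image_id_or_none`, `WithExif`, `WithoutExif`, and the string key `IMAGE_ID_TAG` are all hypothetical names used for illustration.

```python
import uuid
from typing import Optional

IMAGE_ID_TAG = "ImageID"  # hypothetical key standing in for the EXIF ImageID tag


class WithExif:
    """Hypothetical image-like object that exposes EXIF-style tags."""

    def __init__(self, tags: dict):
        self._tags = tags

    def getexif(self) -> dict:
        return self._tags


class WithoutExif:
    """Hypothetical image-like object with no getexif() method at all."""


def image_id_or_none(obj) -> Optional[bytes]:
    """Return the UUID bytes if available; never raise for missing EXIF support."""
    try:
        exif = obj.getexif()
    except AttributeError:
        # No EXIF support: signal the caller to hash the pixel data instead.
        return None
    value = exif.get(IMAGE_ID_TAG)
    return value.bytes if isinstance(value, uuid.UUID) else None


tag = uuid.uuid4()
assert image_id_or_none(WithExif({IMAGE_ID_TAG: tag})) == tag.bytes
assert image_id_or_none(WithExif({})) is None
assert image_id_or_none(WithoutExif()) is None
```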

p88h added 3 commits August 13, 2025 17:32
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
Signed-off-by: Staszek Pasko <[email protected]>
@DarkLight1337
Member

DarkLight1337 commented Aug 13, 2025

Thanks for adding this support! cc @huachenheli

Let's take some time to think about whether this would introduce any security vulnerabilities, since our processor cache, embedding cache, and prefix cache all use this hash, and your PR would allow users to submit their own hash.

@DarkLight1337
Member

Also cc @russellb regarding this

@p88h
Contributor Author

p88h commented Aug 13, 2025 via email

@DarkLight1337
Member

DarkLight1337 commented Aug 14, 2025

I guess it is possible to have incorrect outputs if two different users use the same UUID for their image by manually editing the metadata. But it's arguable that this is the user's own fault for messing with the metadata like that, since that field is supposed to be unique when auto-generated.

@p88h
Contributor Author

p88h commented Aug 14, 2025

The code checks whether the EXIF tags contain a UUID object, not just metadata that looks like one. The only way this works is with an Engine-level integration where the user supplies the Image objects (see how this works in the unit test).
It also needs some batching capability upstream of vLLM to be useful.

I've added this integration to https://github.com/p88h/fake-vqa as well, which shows how this behaves in real batch multimodal scenarios (multiple prompts referring to the same set of Images).

In the API / interactive chat world, any images need to be fetched using the media connector, and even if you use the same image over time it gets re-downloaded, so hashing is not really important there. Hence the check for a UUID object: it's not possible to just edit metadata on the source file, since that will not produce one.
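
The distinction drawn above, between a `uuid.UUID` object and metadata that merely looks like a UUID, can be sketched with the standard library. The helper name `trusted_image_id` and the plain-dict `exif` argument are illustrative assumptions, as is the premise that metadata read from a file arrives as text; the `isinstance` check mirrors the one in the PR's diff.

```python
import uuid
from typing import Optional

IMAGE_ID_TAG = "ImageID"  # hypothetical key standing in for the EXIF ImageID tag


def trusted_image_id(exif: dict) -> Optional[bytes]:
    """Return UUID bytes only if the tag holds an actual uuid.UUID instance."""
    value = exif.get(IMAGE_ID_TAG)
    if isinstance(value, uuid.UUID):
        return value.bytes
    # Strings or raw bytes are ignored, so the caller falls back to hashing pixels.
    return None


real_id = uuid.uuid4()
# Set in-process by code that holds the image object.
from_engine = {IMAGE_ID_TAG: real_id}
# Assumed shape of metadata edited into a file: text, not a UUID object.
from_file = {IMAGE_ID_TAG: str(real_id)}

assert trusted_image_id(from_engine) == real_id.bytes
assert trusted_image_id(from_file) is None
assert trusted_image_id({}) is None
```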

Member

@DarkLight1337 DarkLight1337 left a comment

Yeah I think this should be fine then, thanks for contributing!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 14, 2025 15:39
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 14, 2025
if Image.ExifTags.Base.ImageID in exif and isinstance(
        exif[Image.ExifTags.Base.ImageID], uuid.UUID):
    # If the image has exif ImageID tag, use that
    return exif[Image.ExifTags.Base.ImageID].bytes
Contributor

@huachenheli huachenheli Aug 14, 2025

nit: Explicit if-else separation for better readability:

if Image.ExifTags.Base.ImageID in exif and isinstance(exif[Image.ExifTags.Base.ImageID], uuid.UUID):
  # If the image has exif ImageID tag, use that
  return exif[Image.ExifTags.Base.ImageID].bytes
else:
  return cls.item_to_bytes("image", np.asarray(convert_image_mode(obj, "RGBA")))

@huachenheli
Contributor

huachenheli commented Aug 14, 2025

> I guess it is possible to have incorrect outputs if two different users use the same UUID for their image by manually editing the metadata. But it's arguable that this is the user's own fault for messing with the metadata like that, since that field is supposed to be unique when auto-generated.

It's possible for different users to reuse the same UUID, which can be useful when the same media is repeated across different requests, so it is worth supporting.

I think your concern on security is a good point! In the case of multi-tenant APIs this can potentially leave vulnerabilities (e.g., one user forging a UUID to read another user's image). Maybe we should disable this feature by default and have it controlled by a flag?

@DarkLight1337
Member

Can you merge from main to fix the CI?

@DarkLight1337 DarkLight1337 merged commit 22341b9 into vllm-project:main Aug 15, 2025
38 of 39 checks passed
@DarkLight1337
Member

> I think your concern on security is a good point! In the case of multi-tenant APIs this can potentially leave vulnerabilities (i.e. one user hacking uuid to read another user's image). Maybe we should by default disable this feature and have it controlled by a flag?

FYI, we have discussed this offline and determined this not to be an issue because a potential attacker needs to already have access to the image in the first place in order to determine the hash to use.

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025