[MM] Allow skipping memory profiling for multimodal models. #22950
Conversation
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Code Review
This pull request introduces a --skip-mm-profiling flag to accelerate engine startup for multimodal models by bypassing memory profiling. The implementation correctly propagates this new configuration from the command line to the model runners. However, I've identified a critical issue in both gpu_model_runner.py and tpu_model_runner.py where the code unconditionally accesses model_config.multimodal_config. This will lead to an AttributeError and crash the engine when running non-multimodal models, as multimodal_config will be None. I have provided code suggestions to fix this by ensuring the attribute is only accessed for multimodal models.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
vllm/v1/worker/tpu_model_runner.py
Outdated
    ) -> None:
        # Profile with multimodal encoder & encoder cache.
        if self.supports_mm_inputs:
            if self.skip_mm_profiling:
I think we can make the code more readable:
if self.skip_mm_profiling:
    logger.info("Skipping memory profiling for multimodal encoder and "
                "encoder cache.")
else:
    # self.supports_mm_inputs already checked above.
    ...
Note that self.skip_mm_profiling is defined to be True only if the flag is set and self.supports_mm_inputs is True - this means you would have to check self.supports_mm_inputs again anyway in that else branch.
The reason for this definition is that this particular check is really for showing the message only when users explicitly turn off profiling for a model that takes mm inputs; otherwise we would be showing this message for text-only models, which could be a bit confusing IMO.
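As a rough, standalone sketch of that gating (the helper name and signature here are hypothetical, not the actual PR code):

def should_skip_mm_profiling(supports_mm_inputs: bool, multimodal_config) -> bool:
    # True only when the model takes multimodal inputs, a multimodal config
    # exists (it is None for text-only models), and the user set the flag.
    return (supports_mm_inputs
            and multimodal_config is not None
            and getattr(multimodal_config, "skip_mm_profiling", False))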
See my change in fd90599 - probably more readable now?
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
    interleave_mm_strings: bool = False
    """Enable fully interleaved support for multimodal prompts, while using
    --chat-template-content-format=string. Defaults to False."""
    skip_mm_profiling: bool = False
I'm a big fan of "positive flag names". Empirically, enable_xyz=False causes less cognitive overhead than disable_xyz=True. Maybe we can name this enable_mm_profiling and default it to True instead?
We actually tried the positive-flag approach for a few things, but it ends up being a bit of a messy situation IMO (we have --enable-prefix-caching and --no-enable-prefix-caching).
Personally, I actually think vllm serve model_name --skip-mm-profiling is more intuitive than vllm serve model_name --enable-mm-profiling=False or vllm serve model_name --no-enable-mm-profiling, and it is more consistent with other negative flags we have (e.g., disable_sliding_window, skip_tokenizer_init, etc.) when we want the positive behavior to be the default. What do you think?
From a user perspective, I think --skip- is better than --no-enable-. But maybe we can adjust the argument parser to support both forms while keeping a positive name for the Python variable.
Ah okay. If the underlying parser uses the "--no-enable-xyz" style instead of "--enable-xyz=false" (similar to how C++ gflags work), then I guess "skip" is indeed cleaner.
--arg and --no-arg is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective.
https://docs.python.org/3/library/argparse.html#argparse.BooleanOptionalAction
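For reference, a minimal illustration of that argparse behaviour (the flag name here is made up for the example, not vLLM's actual CLI):

import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) generates both --mm-profiling and
# --no-mm-profiling from a single, positively named option.
parser.add_argument("--mm-profiling",
                    action=argparse.BooleanOptionalAction,
                    default=True)

print(parser.parse_args([]).mm_profiling)                      # True
print(parser.parse_args(["--no-mm-profiling"]).mm_profiling)   # False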
--arg and --no-arg is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective.
Yea - I meant more like: having the default behavior be an explicit positive flag (instead of just having --disable-prefix-caching) seems a bit weird to me.
Yeah I agree, I was just providing context :)
                **batched_dummy_mm_inputs)
            # Run multimodal encoder.
            dummy_encoder_outputs = \
                self.model.get_multimodal_embeddings(
Do we need a stream synchronize here? The TPU version seems to do an explicit sync. Is that not needed for CUDA?
I'm not too familiar with why TPU did that, but at least on CUDA the two models run on the same stream (this is by design, since we don't want the encoder to affect the decoder implicitly in any possible way).
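As a toy PyTorch illustration of that same-stream ordering point (this is not the vLLM profiling code; the tensor sizes are arbitrary):

import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x        # queued on the current (default) stream
    z = y.relu()     # same stream, so it runs strictly after the matmul
    # Kernels on one stream already execute in issue order on the device;
    # an explicit synchronize is only needed when the host must wait, e.g.
    # before reading memory stats from the CPU side.
    torch.cuda.synchronize()
    print(torch.cuda.memory_allocated())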
Purpose
Memory profiling for multimodal models involves processing dummy input data into features, encoding those features into embeddings, and storing them. Since most LMMs have a relatively small encoder, users may sometimes prefer to tune down --gpu-memory-utilization themselves and skip this profiling for faster startup time (especially in RL scenarios). This PR adds a --skip-mm-profiling option for users to do so.
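For offline use, something along these lines should work, assuming the option is also surfaced through the engine arguments (an assumption based on how other multimodal options are exposed, not verified in this PR):

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    gpu_memory_utilization=0.8,   # tuned down manually by the user
    skip_mm_profiling=True,       # assumed kwarg mirroring the new CLI flag
)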
Test Plan
Command:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct
Command:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --skip-mm-profiling
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.