
Conversation

@ywang96
Member

@ywang96 ywang96 commented Aug 15, 2025

Purpose

Memory profiling for multimodal models involves processing dummy input data into features, encoding those features into embeddings, and storing them. Since most LMMs have a relatively small encoder, users may sometimes prefer to tune down --gpu-memory-utilization themselves and skip this profiling for a faster startup time (especially in RL scenarios).

This PR adds the --skip-mm-profiling option so users can do exactly that.
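
For a rough picture of what gets skipped, here is a minimal sketch (not vLLM's actual code; build_dummy_mm_inputs is a hypothetical helper, while get_multimodal_embeddings is the real model interface touched later in this PR):

    # Sketch only: the profiling work that --skip-mm-profiling bypasses.
    def profile_multimodal_memory(runner) -> None:
        if runner.skip_mm_profiling:
            # Skip the dummy-encoder pass; the user is expected to leave
            # headroom via --gpu-memory-utilization instead.
            return
        # 1. Build dummy multimodal inputs at the maximum supported size.
        dummy_inputs = runner.build_dummy_mm_inputs()  # hypothetical helper
        # 2. Run the multimodal encoder on them.
        embeddings = runner.model.get_multimodal_embeddings(**dummy_inputs)
        # 3. Keep the embeddings alive so the peak-memory snapshot includes
        #    the encoder cache.
        runner.profiled_encoder_outputs = embeddings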

Test Plan

Command: vllm serve Qwen/Qwen2.5-VL-7B-Instruct

INFO 08-15 01:42:53 [core.py:199] init engine (profile, create kv cache, warmup model) took 26.52 seconds

Command: vllm serve Qwen/Qwen2.5-VL-7B-Instruct --skip-mm-profiling

INFO 08-15 01:47:42 [core.py:199] init engine (profile, create kv cache, warmup model) took 20.86 seconds

Test Result

(Optional) Documentation Update


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Roger Wang added 2 commits August 14, 2025 18:17
add
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels Aug 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a --skip-mm-profiling flag to accelerate engine startup for multimodal models by bypassing memory profiling. The implementation correctly propagates this new configuration from the command line to the model runners. However, I've identified a critical issue in both gpu_model_runner.py and tpu_model_runner.py where the code unconditionally accesses model_config.multimodal_config. This will lead to an AttributeError and crash the engine when running non-multimodal models, as multimodal_config will be None. I have provided code suggestions to fix this by ensuring the attribute is only accessed for multimodal models.
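
For context, the guard being suggested looks roughly like this (a sketch using the names from the review, not the exact diff):

    # multimodal_config is None for text-only models, so reading
    # .skip_mm_profiling from it unconditionally raises AttributeError.
    mm_config = model_config.multimodal_config
    skip_mm_profiling = (mm_config.skip_mm_profiling
                         if mm_config is not None else False)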

Roger Wang and others added 6 commits August 14, 2025 18:29
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
add
Signed-off-by: Roger Wang <[email protected]>
@ywang96 ywang96 requested review from DarkLight1337 and removed request for alexm-redhat, robertgshaw2-redhat and yewentao256 August 15, 2025 01:54
Signed-off-by: Roger Wang <[email protected]>
    ) -> None:
        # Profile with multimodal encoder & encoder cache.
        if self.supports_mm_inputs:
            if self.skip_mm_profiling:
Contributor

@huachenheli huachenheli Aug 15, 2025


I think we can make the code more readable:

        if self.skip_mm_profiling:
            logger.info("Skipping memory profiling for multimodal encoder and "
                        "encoder cache.")
        else:
            # self.supports_mm_inputs already checked above.
            ...

Member Author

@ywang96 ywang96 Aug 15, 2025


Note that self.skip_mm_profiling is defined to be True only if the flag is set and self.supports_mm_inputs is True - this means you would have to check self.supports_mm_inputs again anyway in that else branch.

The reason for this definition is that this particular check is really for showing the message only when users explicitly turned off profiling for a model that takes mm inputs; otherwise we would show it for text-only models too, which could be a bit confusing imo.
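
A rough sketch of the definition being described (approximate, not the exact diff; the short-circuit and keeps multimodal_config from being touched for text-only models):

    # True only when the flag is set AND the model actually takes mm inputs,
    # so the "skipping profiling" message is never logged for text-only models.
    self.skip_mm_profiling = (
        self.supports_mm_inputs
        and self.model_config.multimodal_config.skip_mm_profiling)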

Member Author


See my change in fd90599 - probably more readable now?

Signed-off-by: Roger Wang <[email protected]>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 15, 2025 04:44
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 15, 2025
Signed-off-by: Roger Wang <[email protected]>
interleave_mm_strings: bool = False
"""Enable fully interleaved support for multimodal prompts, while using
--chat-template-content-format=string. Defaults to False."""
skip_mm_profiling: bool = False
Contributor


I'm a big fan of "positive flag names". Empirically, enable_xyz=False causes less cognitive overhead compared to disable_xyz=True. Maybe we can set this instead as enable_mm_profiling and default it to True instead?

Member Author

@ywang96 ywang96 Aug 15, 2025


We actually tried positive flags for a few things, but it ends up being a bit messy IMO (we have --enable-prefix-caching and --no-enable-prefix-caching).

Personally, I think vllm serve model_name --skip-mm-profiling is more intuitive than vllm serve model_name --enable-mm-profiling=False or vllm serve model_name --no-enable-mm-profiling, and it is more consistent with the other negative flags we have (e.g., disable_sliding_window, skip_tokenizer_init, etc.) when we want the positive behavior to be the default. What do you think?

Member

@DarkLight1337 DarkLight1337 Aug 15, 2025


I think from a user's perspective it is better to use --skip- rather than --no-enable-. But maybe we can adjust the argument parser to support both spellings while keeping a positive name for the Python variable.
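
(As an illustration of that idea - a positive Python attribute fed by the negative CLI spelling; names here are only a sketch, not vLLM's actual parser:)

    import argparse

    parser = argparse.ArgumentParser()
    # Negative spelling on the CLI, positive name in Python.
    parser.add_argument("--skip-mm-profiling",
                        dest="mm_profiling",
                        action="store_false",
                        default=True)

    print(parser.parse_args([]).mm_profiling)                       # True
    print(parser.parse_args(["--skip-mm-profiling"]).mm_profiling)  # False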

Contributor


Ah okay. If the underlying parser is using the "--no-enable-xyz" style instead of "--enable-xyz=false" (similar to how C++ gflags work), then I guess "skip" is indeed cleaner.

Member


--arg and --no-arg is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective.

https://docs.python.org/3/library/argparse.html#argparse.BooleanOptionalAction
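
(For reference, the built-in behaviour being pointed at, with an illustrative option name:)

    import argparse

    parser = argparse.ArgumentParser()
    # BooleanOptionalAction registers both --mm-profiling and --no-mm-profiling.
    parser.add_argument("--mm-profiling",
                        action=argparse.BooleanOptionalAction,
                        default=True)

    print(parser.parse_args([]).mm_profiling)                     # True
    print(parser.parse_args(["--no-mm-profiling"]).mm_profiling)  # False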

Member Author

@ywang96 ywang96 Aug 15, 2025


--arg and --no-arg is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective.

Yea - I meant more that requiring an explicit positive flag for the default behavior (instead of just having --disable-prefix-caching) seems a bit weird to me.

Member


Yeah I agree, I was just providing context :)

            **batched_dummy_mm_inputs)
        # Run multimodal encoder.
        dummy_encoder_outputs = \
            self.model.get_multimodal_embeddings(
Contributor


Do we need a stream synchronize here? The TPU version seems to explicitly do a sync. Is that not needed for CUDA?

Member Author

@ywang96 ywang96 Aug 15, 2025


I'm not too familiar with why TPU did that, but at least on CUDA the two models run on the same stream (this is by design, since we don't want the encoder to affect the decoder implicitly in any way).
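
(A minimal illustration of the distinction; the explicit barrier below is what a TPU-style sync corresponds to on CUDA, not something this PR adds:)

    import torch

    # Work issued on the same CUDA stream executes in order, so the dummy
    # encoder forward and the subsequent memory measurement are already
    # serialized. An explicit host-side barrier, if one were wanted, would be:
    torch.cuda.synchronize()  # block until all queued GPU work has finished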

@DarkLight1337 DarkLight1337 merged commit 49252cf into vllm-project:main Aug 15, 2025
42 checks passed
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
…ject#22950)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
…ject#22950)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
…ject#22950)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Duncan Moss <[email protected]>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…ject#22950)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
…ject#22950)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xiao Yu <[email protected]>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…ject#22950)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants