[MM] Allow skipping memory profiling for multimodal models. #22950
Conversation
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Code Review
This pull request introduces a --skip-mm-profiling flag to accelerate engine startup for multimodal models by bypassing memory profiling. The implementation correctly propagates this new configuration from the command line to the model runners. However, I've identified a critical issue in both gpu_model_runner.py and tpu_model_runner.py where the code unconditionally accesses model_config.multimodal_config. This will lead to an AttributeError and crash the engine when running non-multimodal models, as multimodal_config will be None. I have provided code suggestions to fix this by ensuring the attribute is only accessed for multimodal models.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
vllm/v1/worker/tpu_model_runner.py
Outdated
    ) -> None:
        # Profile with multimodal encoder & encoder cache.
        if self.supports_mm_inputs:
            if self.skip_mm_profiling:
I think we can make the code more readable:
if self.skip_mm_profiling:
    logger.info("Skipping memory profiling for multimodal encoder and "
                "encoder cache.")
else:
    # self.supports_mm_inputs already checked above.
    ...
Note that self.skip_mm_profiling is defined to be True only if the flag is set and self.supports_mm_inputs is True - this means you would have to check self.supports_mm_inputs again anyway in that else branch.
The reason for this definition is that this particular check is really for showing the message only when users explicitly turn off profiling for a model that takes mm inputs; otherwise we would be showing this message for text-only models, which could be a bit confusing IMO.
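As a rough, standalone sketch of that gating (the helper name and signature here are hypothetical, not the actual PR code):

def should_skip_mm_profiling(supports_mm_inputs: bool, multimodal_config) -> bool:
    # True only when the model takes multimodal inputs, a multimodal config
    # exists (it is None for text-only models), and the user set the flag.
    return (supports_mm_inputs
            and multimodal_config is not None
            and getattr(multimodal_config, "skip_mm_profiling", False))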
See my change in fd90599 - probably more readable now?
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
    interleave_mm_strings: bool = False
    """Enable fully interleaved support for multimodal prompts, while using
    --chat-template-content-format=string. Defaults to False."""
    skip_mm_profiling: bool = False
I'm a big fan of "positive flag names". Empirically, enable_xyz=False causes less cognitive overhead than disable_xyz=True. Maybe we can name this enable_mm_profiling and default it to True instead?
We actually tried the positive-flag approach for a few things, but it ends up being a bit of a messy situation IMO (we have --enable-prefix-caching and --no-enable-prefix-caching).
Personally, I actually think vllm serve model_name --skip-mm-profiling is more intuitive than vllm serve model_name --enable-mm-profiling=False or vllm serve model_name --no-enable-mm-profiling, and it is more consistent with other negative flags we have (e.g., disable_sliding_window, skip_tokenizer_init, etc.) when we want the positive behavior to be the default. What do you think?
From a user perspective, I think --skip- is better than --no-enable-. But maybe we can adjust the argument parser to support both forms while keeping a positive name for the Python variable.
Ah okay. If the underlying parser uses the "--no-enable-xyz" style instead of "--enable-xyz=false" (similar to how C++ gflags work), then I guess "skip" is indeed cleaner.
--arg and --no-arg is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective.
https://docs.python.org/3/library/argparse.html#argparse.BooleanOptionalAction
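For reference, a minimal illustration of that argparse behaviour (the flag name here is made up for the example, not vLLM's actual CLI):

import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) generates both --mm-profiling and
# --no-mm-profiling from a single, positively named option.
parser.add_argument("--mm-profiling",
                    action=argparse.BooleanOptionalAction,
                    default=True)

print(parser.parse_args([]).mm_profiling)                      # True
print(parser.parse_args(["--no-mm-profiling"]).mm_profiling)   # False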
--arg and --no-arg is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective.
Yea - I meant more like: having the default behavior be an explicit positive flag (instead of just having --disable-prefix-caching) seems a bit weird to me.
Yeah I agree, I was just providing context :)
                **batched_dummy_mm_inputs)
            # Run multimodal encoder.
            dummy_encoder_outputs = \
                self.model.get_multimodal_embeddings(
Do we need a stream synchronize here? The TPU version seems to do an explicit sync. Is that not needed for CUDA?
I'm not too familiar with why TPU did that, but at least on CUDA the two models run on the same stream (this is by design, since we don't want the encoder to affect the decoder implicitly in any possible way).
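As a toy PyTorch illustration of that same-stream ordering point (this is not the vLLM profiling code; the tensor sizes are arbitrary):

import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x        # queued on the current (default) stream
    z = y.relu()     # same stream, so it runs strictly after the matmul
    # Kernels on one stream already execute in issue order on the device;
    # an explicit synchronize is only needed when the host must wait, e.g.
    # before reading memory stats from the CPU side.
    torch.cuda.synchronize()
    print(torch.cuda.memory_allocated())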
Purpose
Memory profiling for multimodal models involves processing dummy input data into features, encoding those features into embeddings, and storing them. Since most LMMs have a relatively small encoder, users may sometimes prefer to tune down --gpu-memory-utilization themselves and skip this profiling for faster startup time (especially in RL scenarios). This PR adds a --skip-mm-profiling option for users to do so.
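For offline use, something along these lines should work, assuming the option is also surfaced through the engine arguments (an assumption based on how other multimodal options are exposed, not verified in this PR):

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    gpu_memory_utilization=0.8,   # tuned down manually by the user
    skip_mm_profiling=True,       # assumed kwarg mirroring the new CLI flag
)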
Test Plan
Command:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct
Command:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --skip-mm-profiling
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.