[MM] Allow skipping memory profiling for multimodal models. #22950
```diff
@@ -2479,50 +2479,56 @@ def _dummy_pooler_run(
     def profile_run(self) -> None:
         # Profile with multimodal encoder & encoder cache.
         if self.supports_mm_inputs:
-            mm_budget = self.mm_budget
-            assert mm_budget is not None
-
-            # TODO: handle encoder-decoder models once we support them.
-            if (encoder_budget := mm_budget.get_encoder_budget()) > 0:
-                # NOTE: Currently model is profiled with a single non-text
-                # modality with the max possible input tokens even when
-                # it supports multiple.
-                (
-                    dummy_modality,
-                    max_tokens,
-                ) = mm_budget.get_modality_with_max_tokens()
-                (
-                    max_mm_items_per_prompt,
-                    max_mm_items_per_batch,
-                ) = mm_budget.get_max_items(dummy_modality, max_tokens)
-
-                logger.info(
-                    "Encoder cache will be initialized with a budget of "
-                    "%s tokens, and profiled with %s %s items of the maximum "
-                    "feature size.",
-                    encoder_budget,
-                    max_mm_items_per_batch,
-                    dummy_modality,
-                )
-
-                # Create dummy batch of multimodal inputs.
-                batched_dummy_mm_inputs = self._get_mm_dummy_batch(
-                    dummy_modality,
-                    max_mm_items_per_batch,
-                )
-
-                # Run multimodal encoder.
-                dummy_encoder_outputs = self.model.get_multimodal_embeddings(
-                    **batched_dummy_mm_inputs)
+            if self.model_config.multimodal_config.skip_mm_profiling:
+                logger.info(
+                    "Skipping memory profiling for multimodal encoder and "
+                    "encoder cache.")
+            else:
+                mm_budget = self.mm_budget
+                assert mm_budget is not None
+
+                # TODO: handle encoder-decoder models once we support them.
+                if (encoder_budget := mm_budget.get_encoder_budget()) > 0:
+                    # NOTE: Currently model is profiled with a single non-text
+                    # modality with the max possible input tokens even when
+                    # it supports multiple.
+                    (
+                        dummy_modality,
+                        max_tokens,
+                    ) = mm_budget.get_modality_with_max_tokens()
+                    (
+                        max_mm_items_per_prompt,
+                        max_mm_items_per_batch,
+                    ) = mm_budget.get_max_items(dummy_modality, max_tokens)
+
+                    logger.info(
+                        "Encoder cache will be initialized with a budget of "
+                        "%s tokens, and profiled with %s %s items of the "
+                        "maximum feature size.",
+                        encoder_budget,
+                        max_mm_items_per_batch,
+                        dummy_modality,
+                    )
+
+                    # Create dummy batch of multimodal inputs.
+                    batched_dummy_mm_inputs = self._get_mm_dummy_batch(
+                        dummy_modality,
+                        max_mm_items_per_batch,
+                    )
+
+                    # Run multimodal encoder.
+                    dummy_encoder_outputs = \
+                        self.model.get_multimodal_embeddings(
+                            **batched_dummy_mm_inputs)
```
Contributor: Do we need a stream synchronize here? The TPU version seems to explicitly do a sync. Is that not needed for CUDA?

Member (PR author): I'm not too familiar with why the TPU runner did that, but at least on CUDA the two models run on the same stream (this is by design, since we don't want the encoder to affect the decoder implicitly in any possible way), so no extra synchronization is needed here.
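To illustrate the same-stream point in the author's reply, here is a minimal PyTorch sketch (not vLLM code; the tensors and the encoder/decoder stand-ins are made up): kernels enqueued on one CUDA stream execute in launch order, so the second piece of work never sees the first half-finished, and a host-side synchronize is only needed when the CPU has to read a result back.

```python
import torch

# Minimal sketch, assuming a CUDA device is available (not vLLM code).
# Two back-to-back "models" launched on the same stream run in order,
# so no explicit device-side synchronization is needed between them.
if torch.cuda.is_available():
    x = torch.randn(8, 16, device="cuda")

    encoder_out = x @ x.T            # stand-in for the encoder forward pass
    decoder_out = encoder_out.sum()  # stand-in for decoder work, ordered after it

    # Synchronize only to read the value back on the host.
    torch.cuda.current_stream().synchronize()
    print(decoder_out.item())
```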
```diff
 
-                sanity_check_mm_encoder_outputs(
-                    dummy_encoder_outputs,
-                    expected_num_items=max_mm_items_per_batch,
-                )
+                    sanity_check_mm_encoder_outputs(
+                        dummy_encoder_outputs,
+                        expected_num_items=max_mm_items_per_batch,
+                    )
 
-                # Cache the dummy encoder outputs.
-                self.encoder_cache["tmp"] = dict(
-                    enumerate(dummy_encoder_outputs))
+                    # Cache the dummy encoder outputs.
+                    self.encoder_cache["tmp"] = dict(
+                        enumerate(dummy_encoder_outputs))
 
         # Add `is_profile` here to pre-allocate communication buffers
         hidden_states, last_hidden_states \
```
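For orientation, here is a hedged sketch of the kind of config field the new branch reads through `self.model_config.multimodal_config`; the class name, default value, and comments are assumptions for illustration, not code from this PR.

```python
from dataclasses import dataclass


@dataclass
class MultiModalConfig:  # class name assumed for illustration only
    # When True, profile_run() skips the dummy multimodal encoder forward
    # pass and the encoder-cache warm-up during memory profiling.
    # Assumed to default to False so profiling remains the default behaviour.
    skip_mm_profiling: bool = False
```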
Review discussion on the flag name (`skip_mm_profiling` vs. a positive `enable_mm_profiling`):

I'm a big fan of "positive flag names". Empirically, `enable_xyz=False` causes less cognitive overhead than `disable_xyz=True`. Maybe we can name this `enable_mm_profiling` instead and default it to True?
We actually tried positive flags for a few things, but it ends up a bit messy IMO (we have both `--enable-prefix-caching` and `--no-enable-prefix-caching`).

Personally, I think `vllm serve model_name --skip-mm-profiling` is more intuitive than `vllm serve model_name --enable-mm-profiling=False` or `vllm serve model_name --no-enable-mm-profiling`, and it is more consistent with the other negative flags we already have (e.g. `disable_sliding_window`, `skip_tokenizer_init`) when we want the positive behaviour to be the default. What do you think?
From the user's perspective I think `--skip-` is better than `--no-enable-`. But maybe we can adjust the argument parser to support both spellings while keeping a positive name for the Python variable.
Ah okay. If the underlying parser uses the `--no-enable-xyz` style rather than `--enable-xyz=false` (similar to how C++ gflags work), then I guess "skip" is indeed cleaner.
`--arg` and `--no-arg` is actually built-in Python behaviour, so we adopted it to be more standard from a Python perspective: https://docs.python.org/3/library/argparse.html#argparse.BooleanOptionalAction
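As a minimal sketch of the built-in `--flag` / `--no-flag` behaviour referenced above, `argparse.BooleanOptionalAction` (Python 3.9+) registers both spellings from a single declaration. The flag name here simply mirrors this PR's option for illustration; this is not vLLM's actual parser code, and the default shown is an assumption.

```python
import argparse

# Sketch only: BooleanOptionalAction registers both
# --skip-mm-profiling and --no-skip-mm-profiling.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--skip-mm-profiling",
    action=argparse.BooleanOptionalAction,
    default=False,  # assumed default: keep profiling unless explicitly skipped
)

print(parser.parse_args([]))                          # Namespace(skip_mm_profiling=False)
print(parser.parse_args(["--skip-mm-profiling"]))     # Namespace(skip_mm_profiling=True)
print(parser.parse_args(["--no-skip-mm-profiling"]))  # Namespace(skip_mm_profiling=False)
```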
Yea, I meant more that making the default behaviour an explicit positive flag (instead of just having `--disable-prefix-caching`) seems a bit weird to me.
Yeah I agree, I was just providing context :)