Conversation
/ok to test 3cad2ec
📝 Walkthrough
This pull request updates the VLM inference system to replace …
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py (1)
Lines 44-46: ⚠️ Potential issue | 🟡 Minor
`InferenceParams` assignment in `prep_inference_input` is unused and likely unnecessary. Line 46 creates `self.inference_params = InferenceParams(batch_size, seq_length)`, but this parameter is never passed to the model in `forward_pass_without_pipeline_parallel` (line 88). The model is called with only `attention_mask=None` and the `inference_input` dict containing `input_ids`, `pixel_values`, and `image_grid_thw`; neither `self.inference_params` nor `self.inference_context` reaches the model.
If `StaticInferenceContext` is passed at construction, the intended KV-cache management should use `self.inference_context` (from the base class). Since this assignment creates dead code and the model expects `inference_context` as a parameter (per `modeling_qwen25_vl.py`), remove line 46 unless it serves another purpose in the generation loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py` around lines 44-46: the assignment `self.inference_params = InferenceParams(batch_size, seq_length)` in `prep_inference_input` is dead code because `InferenceParams` is never used when calling the model in `forward_pass_without_pipeline_parallel`; remove that assignment and rely on `self.inference_context` provided by the base class (or, if KV-cache management was intended, pass the appropriate `inference_context` to the model call in `forward_pass_without_pipeline_parallel` to match the model signature in `modeling_qwen25_vl.py`). Locate `prep_inference_input` and `forward_pass_without_pipeline_parallel` and either delete the `InferenceParams` creation or update the model invocation to include the correct `inference_context` parameter.

src/megatron/bridge/inference/vlm/base.py (1)
Lines 115-124: ⚠️ Potential issue | 🟡 Minor
`vocab_size` exposure is asymmetric, and appears to be unused code.
`_expose_decoder_from_language_model` (line 124) is only invoked for `Qwen25VLModelProvider` (line 148), while `Qwen3VLModelProvider` (lines 149-150) skips it entirely. However, the exposed `vocab_size` attribute is never accessed anywhere in the inference pipeline. Additionally, Qwen3 models already carry `vocab_size` in their `language_transformer_config` (passed during language model initialization at line 150 of `modelling_qwen3_vl/model.py`), making the asymmetric exposure redundant for Qwen2.5 as well. Either remove the unused `vocab_size` exposure entirely, or clarify its purpose if required by a downstream dependency.

`inference_max_batch_size` parameter is not exposed through the public API.
The `inference_max_batch_size` parameter (line 133) is hardcoded to default 4 in `setup_model_and_tokenizer` (lines 104-110), which always calls `setup_inference_wrapper` without passing this argument. Callers cannot customize the KV cache batch size allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/inference/vlm/base.py` around lines 115 - 124, The helper _expose_decoder_from_language_model currently assigns current.vocab_size (unused and asymmetrically applied to Qwen25VLModelProvider but not Qwen3VLModelProvider); remove the vocab_size exposure from _expose_decoder_from_language_model and only expose the decoder there (leave language_model.decoder assignment intact) unless a downstream consumer needs vocab_size—if so, document and centralize that behavior for both Qwen25VLModelProvider and Qwen3VLModelProvider. Also make inference_max_batch_size configurable by adding an inference_max_batch_size parameter to setup_model_and_tokenizer and passing it through into setup_inference_wrapper (instead of hardcoding 4), updating the default to preserve existing behavior and ensuring callers can override it; touch the symbols setup_model_and_tokenizer and setup_inference_wrapper to propagate the argument.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/inference/vlm/base.py`:
- Line 133: The public function setup_model_and_tokenizer currently does not
accept or forward the new inference_max_batch_size parameter, causing callers to
be unable to set generate(max_batch_size=N) and risking StaticInferenceContext
KV-cache overflows; update setup_model_and_tokenizer to add an
inference_max_batch_size: int = 4 parameter and pass it through to
setup_inference_wrapper (which already accepts inference_max_batch_size) so
downstream calls to generate() can use the configured max_batch_size and avoid
KV-cache overflow.
In `@src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py`:
- Around line 34-35: The __init__ method for QWENVLINferenceWrapper lacks a type
annotation for the inference_context parameter and the class docstring Args
block is missing that parameter; update the signature of __init__ to annotate
inference_context with the correct nullable type (use the project's preferred
"SomeType | None" form, e.g. InferenceContext | None) and update the class
docstring Args section to document inference_context and its purpose; reference
the __init__ method and the class docstring in QWENVLINferenceWrapper when
making these edits.
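A minimal sketch of what the requested annotation and docstring change could look like. All names here are stand-ins (the real `StaticInferenceContext` and wrapper class live in Megatron / Megatron Bridge and are not imported), so treat this as an illustration of the `SomeType | None` form, not the actual implementation:

```python
from __future__ import annotations


class StaticInferenceContext:
    """Stand-in for Megatron's static KV-cache context (illustrative only)."""

    def __init__(self, max_batch_size: int, max_sequence_length: int):
        self.max_batch_size = max_batch_size
        self.max_sequence_length = max_sequence_length


class QwenVLInferenceWrapper:
    """Inference wrapper for Qwen VL models (hypothetical sketch).

    Args:
        model: The wrapped Megatron model.
        inference_context: Optional static KV-cache context used during
            generation; None means no cache context was supplied.
    """

    def __init__(self, model, inference_context: StaticInferenceContext | None = None):
        self.model = model
        self.inference_context = inference_context


wrapper = QwenVLInferenceWrapper(model=None)
print(wrapper.inference_context)  # → None
```

The point of the nullable annotation is that callers constructing the wrapper without a context still type-check cleanly, while the docstring makes the parameter's purpose discoverable.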
---
Outside diff comments:
In `@src/megatron/bridge/inference/vlm/base.py`:
- Around line 115-124: The helper _expose_decoder_from_language_model currently
assigns current.vocab_size (unused and asymmetrically applied to
Qwen25VLModelProvider but not Qwen3VLModelProvider); remove the vocab_size
exposure from _expose_decoder_from_language_model and only expose the decoder
there (leave language_model.decoder assignment intact) unless a downstream
consumer needs vocab_size—if so, document and centralize that behavior for both
Qwen25VLModelProvider and Qwen3VLModelProvider. Also make
inference_max_batch_size configurable by adding an inference_max_batch_size
parameter to setup_model_and_tokenizer and passing it through into
setup_inference_wrapper (instead of hardcoding 4), updating the default to
preserve existing behavior and ensuring callers can override it; touch the
symbols setup_model_and_tokenizer and setup_inference_wrapper to propagate the
argument.
In `@src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py`:
- Around line 44-46: The assignment self.inference_params =
InferenceParams(batch_size, seq_length) in prep_inference_input is dead code
because InferenceParams is never used when calling the model in
forward_pass_without_pipeline_parallel; remove that assignment and rely on
self.inference_context provided by the base class (or, if KV-cache management
was intended, instead pass the appropriate inference_context to the model call
in forward_pass_without_pipeline_parallel to match the model signature in
modeling_qwen25_vl.py). Locate prep_inference_input and
forward_pass_without_pipeline_parallel and either delete the InferenceParams
creation or update the model invocation to include the correct inference_context
parameter.
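A hedged sketch of the cleanup the comment above describes: `prep_inference_input` no longer allocates an unused `InferenceParams`, and KV-cache state lives solely in the `inference_context` set at construction. Class and method names are reduced stand-ins, not the real Megatron Bridge API:

```python
from __future__ import annotations


class StaticInferenceContext:
    # Stand-in for the real KV-cache context class.
    def __init__(self, max_batch_size: int, max_sequence_length: int):
        self.max_batch_size = max_batch_size
        self.max_sequence_length = max_sequence_length


class QwenVLInferenceWrapper:
    def __init__(self, model, inference_context: StaticInferenceContext | None = None):
        self.model = model
        self.inference_context = inference_context  # set once, reused per generate call

    def prep_inference_input(self, prompts_tokens: list[list[int]]) -> dict:
        # Previously: self.inference_params = InferenceParams(batch_size, seq_length)
        # That object never reached the model, so nothing is allocated here now.
        return {"input_ids": prompts_tokens}


ctx = StaticInferenceContext(max_batch_size=4, max_sequence_length=8192)
wrapper = QwenVLInferenceWrapper(model=None, inference_context=ctx)
inference_input = wrapper.prep_inference_input([[1, 2, 3]])
```

If KV-cache management was in fact intended, the alternative fix is to thread `self.inference_context` into the forward call rather than deleting the allocation.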
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 1000,
inference_max_seq_length: int = 8192,
inference_max_batch_size: int = 4,
`inference_max_batch_size` is not surfaced through `setup_model_and_tokenizer`.
`setup_inference_wrapper` gains the new `inference_max_batch_size` parameter (default 4), but `setup_model_and_tokenizer`, the public entry point, never accepts or forwards it (lines 104-110). Any caller who needs `generate(max_batch_size=N)` with N > 4 will hit a `StaticInferenceContext` KV-cache overflow at runtime, with no way to prevent it through the public API.
🔧 Suggested fix: surface the parameter in the public function
 def setup_model_and_tokenizer(
     megatron_model_path: str,
     tp: int = 1,
     pp: int = 1,
     params_dtype: torch.dtype = torch.bfloat16,
     inference_batch_times_seqlen_threshold: int = 1000,
     inference_max_seq_length: int = 8192,
+    inference_max_batch_size: int = 4,
 ):
     ...
     inference_wrapped_model = setup_inference_wrapper(
         model[0],
         processor.tokenizer,
         params_dtype=torch.bfloat16,
         inference_batch_times_seqlen_threshold=1000,
         inference_max_seq_length=inference_max_seq_length,
+        inference_max_batch_size=inference_max_batch_size,
     )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/inference/vlm/base.py` at line 133, The public function
setup_model_and_tokenizer currently does not accept or forward the new
inference_max_batch_size parameter, causing callers to be unable to set
generate(max_batch_size=N) and risking StaticInferenceContext KV-cache
overflows; update setup_model_and_tokenizer to add an inference_max_batch_size:
int = 4 parameter and pass it through to setup_inference_wrapper (which already
accepts inference_max_batch_size) so downstream calls to generate() can use the
configured max_batch_size and avoid KV-cache overflow.
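The pass-through pattern the review asks for can be sketched as follows. Both functions here are heavily reduced stand-ins for the real helpers in `base.py` (the actual signatures carry more parameters and load a real checkpoint); only the argument-threading relevant to KV-cache sizing is shown:

```python
def setup_inference_wrapper(model, tokenizer,
                            inference_max_seq_length: int = 8192,
                            inference_max_batch_size: int = 4) -> dict:
    # The real helper would size a StaticInferenceContext from these limits;
    # here we just record them so the pass-through is observable.
    return {
        "max_seq_length": inference_max_seq_length,
        "max_batch_size": inference_max_batch_size,
    }


def setup_model_and_tokenizer(megatron_model_path: str,
                              inference_max_seq_length: int = 8192,
                              inference_max_batch_size: int = 4) -> dict:
    model, tokenizer = None, None  # placeholder for checkpoint/tokenizer loading
    return setup_inference_wrapper(
        model,
        tokenizer,
        inference_max_seq_length=inference_max_seq_length,
        inference_max_batch_size=inference_max_batch_size,  # now caller-controlled
    )


wrapped = setup_model_and_tokenizer("/ckpt", inference_max_batch_size=8)
```

Defaulting the new parameter to 4 preserves existing behavior while letting callers who need `generate(max_batch_size=N)` with larger N size the cache up front.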
@meatybobby could you please check if this is relevant?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
MCore still has backward compatibility for this, but I just removed it to avoid confusion.
/ok to test aa1b9c3
/ok to test 5d0bf0e
athitten left a comment:
LGTM, thank you @meatybobby!
What does this PR do ?
Fix VLM inference engine
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information