
Fix VLM inference engine #2418

Merged
meatybobby merged 5 commits into main from bobchen/vlm_inference
Feb 20, 2026

Conversation

@meatybobby
Contributor

@meatybobby meatybobby commented Feb 17, 2026

What does this PR do ?

Fix VLM inference engine

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added configurable batch size parameter for inference operations (default: 4).
    • Enhanced inference context with bounded batch and sequence length management.
    • Exposed vocabulary size for improved language model compatibility.
  • Updates

    • Renamed inference parameter configuration for improved consistency.

@copy-pr-bot

copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@meatybobby
Contributor Author

/ok to test 3cad2ec

@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

This pull request updates the VLM inference system to replace CommonInferenceParams with SamplingParams, introduces support for bounded inference contexts via StaticInferenceContext, and exposes the language model's vocabulary size. The changes affect example code and inference wrapper initialization across three files.
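The call-site rename can be sketched with a stub engine. Everything here is a stand-in except the keyword change itself (`inference_params` to `sampling_params`), which comes from the PR summary; the real `SamplingParams` fields live in Megatron-Core and the dict below is purely illustrative:

```python
class StubEngine:
    """Stand-in for the inference engine; only the keyword rename matters here."""

    def generate(self, prompts, sampling_params=None):
        # After this PR the example passes sampling_params=..., where it
        # previously passed inference_params=... built from CommonInferenceParams.
        return [{"prompt": p, "params": sampling_params} for p in prompts]


# Illustrative values only; not the actual SamplingParams schema.
sampling_params = {"temperature": 1.0, "top_k": 1, "num_tokens_to_generate": 64}
results = StubEngine().generate(["describe the image"], sampling_params=sampling_params)
print(results[0]["params"]["top_k"])  # → 1
```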

Changes

  • Inference Parameter Configuration — examples/inference/vlm/vlm_inference.py: Replaces the CommonInferenceParams import with SamplingParams and updates the corresponding instance creation and keyword argument name in the generate() call from inference_params to sampling_params.
  • Inference Wrapper Setup — src/megatron/bridge/inference/vlm/base.py: Adds a StaticInferenceContext import, introduces an inference_max_batch_size parameter to setup_inference_wrapper(), constructs the inference context with max batch size and sequence length constraints, and exposes vocab_size from the language model in the decoder exposure method.
  • Wrapper Initialization — src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py: Updates QwenVLInferenceWrapper.__init__() to accept and forward an optional inference_context parameter to the superclass.
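The wrapper-initialization change reduces to a forwarded keyword argument. A minimal sketch, in which `BaseInferenceWrapper` is a hypothetical simplification of the Megatron base class, not its real signature:

```python
from typing import Any, Optional


class BaseInferenceWrapper:
    """Hypothetical stand-in for the Megatron base inference wrapper."""

    def __init__(self, model: Any, config: Any,
                 inference_context: Optional[Any] = None):
        self.model = model
        self.config = config
        # The base class holds the (possibly None) context used for KV caching.
        self.inference_context = inference_context


class QwenVLInferenceWrapper(BaseInferenceWrapper):
    """Accepts an optional inference_context and forwards it to the superclass."""

    def __init__(self, model: Any, config: Any,
                 inference_context: Optional[Any] = None):
        super().__init__(model, config, inference_context=inference_context)


wrapper = QwenVLInferenceWrapper(model=object(), config={}, inference_context="ctx")
print(wrapper.inference_context)  # → ctx
```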

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested labels

bug

Suggested reviewers

  • cuichenx
  • huvunvidia
  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title "Fix VLM inference engine" is vague and generic, using the non-descriptive term "Fix" without explaining what specific issue or feature is being addressed. Resolution: replace it with a more specific title, such as "Replace CommonInferenceParams with SamplingParams in VLM inference".
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Test Results For Major Changes — ✅ Passed: the PR contains only minor parameter handling and configuration changes (~14 lines) across 5 files, including parameter renaming and adding optional parameters. These targeted bug fixes do not constitute major changes requiring test documentation.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py (1)

44-46: ⚠️ Potential issue | 🟡 Minor

InferenceParams assignment in prep_inference_input is unused and likely unnecessary.

Line 46 creates self.inference_params = InferenceParams(batch_size, seq_length), but this parameter is never passed to the model in forward_pass_without_pipeline_parallel (line 88). The model is called with only attention_mask=None and the inference_input dict containing input_ids, pixel_values, and image_grid_thw—neither self.inference_params nor self.inference_context reach the model.

If StaticInferenceContext is passed at construction, the intended KV-cache management should use self.inference_context (from the base class). Since this assignment creates dead code and the model expects inference_context as a parameter (per modeling_qwen25_vl.py), remove line 46 unless it serves another purpose in the generation loop.
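The fix direction the comment suggests — forward the stored context to the model call rather than building a throwaway `InferenceParams` — can be sketched as follows, with `model_forward` as a plain-function stand-in for the real Qwen2.5-VL forward:

```python
def model_forward(input_ids, attention_mask=None, inference_context=None):
    # Stand-in forward; a real model would read its KV cache via inference_context.
    return {"n_tokens": len(input_ids), "used_context": inference_context is not None}


class Wrapper:
    """Toy wrapper showing the context flowing through to the model call."""

    def __init__(self, inference_context=None):
        self.inference_context = inference_context

    def forward_pass_without_pipeline_parallel(self, inference_input):
        # Pass the stored context through instead of creating an unused
        # InferenceParams object that never reaches the model.
        return model_forward(
            inference_input["input_ids"],
            attention_mask=None,
            inference_context=self.inference_context,
        )


out = Wrapper(inference_context=object()).forward_pass_without_pipeline_parallel(
    {"input_ids": [101, 102, 103]}
)
print(out)  # → {'n_tokens': 3, 'used_context': True}
```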

src/megatron/bridge/inference/vlm/base.py (1)

115-124: ⚠️ Potential issue | 🟡 Minor

vocab_size exposure is asymmetric — and appears to be unused code.

_expose_decoder_from_language_model (line 124) is only invoked for Qwen25VLModelProvider (line 148), while Qwen3VLModelProvider (lines 149–150) skips it entirely. However, the exposed vocab_size attribute is never accessed anywhere in the inference pipeline. Additionally, Qwen3 models already carry vocab_size in their language_transformer_config (passed during language model initialization at line 150 of modelling_qwen3_vl/model.py), making the asymmetric exposure redundant for Qwen2.5 as well.

Either remove the unused vocab_size exposure entirely, or clarify its purpose if required by a downstream dependency.

inference_max_batch_size parameter is not exposed through the public API.

The inference_max_batch_size parameter (line 133) is hardcoded to default 4 in setup_model_and_tokenizer (lines 104–110), which always calls setup_inference_wrapper without passing this argument. Callers cannot customize the KV cache batch size allocation.

Additional actionable comment:

In src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py, around lines 34-35: QwenVLInferenceWrapper.__init__ lacks a type annotation for the new inference_context parameter, and the class docstring's Args section does not document it. Annotate the parameter with the appropriate nullable type (e.g. InferenceContext | None) and add inference_context, with its purpose, to the docstring.


params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 1000,
inference_max_seq_length: int = 8192,
inference_max_batch_size: int = 4,
Contributor

@coderabbitai coderabbitai bot Feb 17, 2026


⚠️ Potential issue | 🟠 Major

inference_max_batch_size is not surfaced through setup_model_and_tokenizer.

setup_inference_wrapper gains the new inference_max_batch_size parameter (default 4), but setup_model_and_tokenizer — the public entry point — never accepts or forwards it (lines 104–110). Any caller who needs generate(max_batch_size=N) with N > 4 will hit a StaticInferenceContext KV-cache overflow at runtime, with no way to prevent it through the public API.

🔧 Suggested fix: surface the parameter in the public function
 def setup_model_and_tokenizer(
     megatron_model_path: str,
     tp: int = 1,
     pp: int = 1,
     params_dtype: torch.dtype = torch.bfloat16,
     inference_batch_times_seqlen_threshold: int = 1000,
     inference_max_seq_length: int = 8192,
+    inference_max_batch_size: int = 4,
 ):
     ...
     inference_wrapped_model = setup_inference_wrapper(
         model[0],
         processor.tokenizer,
         params_dtype=torch.bfloat16,
         inference_batch_times_seqlen_threshold=1000,
         inference_max_seq_length=inference_max_seq_length,
+        inference_max_batch_size=inference_max_batch_size,
     )
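The overflow risk described above can be illustrated with a toy bounded context. This is a deliberate simplification, not the real Megatron-Core StaticInferenceContext, which actually manages per-layer KV-cache memory; the point is only that a statically sized context must reject requests beyond its preallocated bounds:

```python
class ToyStaticContext:
    """Toy stand-in for a statically bounded inference context."""

    def __init__(self, max_batch_size: int, max_sequence_length: int):
        self.max_batch_size = max_batch_size
        self.max_sequence_length = max_sequence_length

    def reserve(self, batch_size: int, seq_length: int) -> None:
        # A request beyond the preallocated bounds must fail up front rather
        # than overflowing the KV cache mid-generation.
        if batch_size > self.max_batch_size:
            raise ValueError(f"batch_size {batch_size} exceeds max {self.max_batch_size}")
        if seq_length > self.max_sequence_length:
            raise ValueError(f"seq_length {seq_length} exceeds max {self.max_sequence_length}")


ctx = ToyStaticContext(max_batch_size=4, max_sequence_length=8192)
ctx.reserve(batch_size=4, seq_length=1024)       # fits within the bounds
try:
    ctx.reserve(batch_size=8, seq_length=1024)   # exceeds the hardcoded default of 4
except ValueError as err:
    print(err)  # → batch_size 8 exceeds max 4
```

This is why the reviewer asks for inference_max_batch_size to be exposed through the public entry point: with the bound fixed at 4, a caller requesting a larger batch has no way to enlarge the context.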

Contributor


@meatybobby could you please check if this is relevant?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor Author


MCore still has backward compatibility for this, but I just removed it to avoid confusion.

@meatybobby
Contributor Author

/ok to test aa1b9c3

@meatybobby
Contributor Author

/ok to test 5d0bf0e

Contributor

@athitten athitten left a comment


LGTM, thank you @meatybobby!

