
Fix VLM inference engine #2418

Merged
meatybobby merged 5 commits into main from bobchen/vlm_inference
Feb 20, 2026

Conversation

@meatybobby
Contributor

@meatybobby meatybobby commented Feb 17, 2026

What does this PR do ?

Fix VLM inference engine

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added configurable batch size parameter for inference operations (default: 4).
    • Enhanced inference context with bounded batch and sequence length management.
    • Exposed vocabulary size for improved language model compatibility.
  • Updates

    • Renamed inference parameter configuration for improved consistency.

@copy-pr-bot

copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@meatybobby
Contributor Author

/ok to test 3cad2ec

@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

This pull request updates the VLM inference system to replace CommonInferenceParams with SamplingParams, introduces support for bounded inference contexts via StaticInferenceContext, and exposes the language model's vocabulary size. The changes affect example code and inference wrapper initialization across three files.
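The call-site rename can be sketched with a stub engine. Everything here is a stand-in except the keyword change itself (`inference_params` to `sampling_params`), which comes from the PR summary; the real `SamplingParams` fields live in Megatron-Core and the dict below is purely illustrative:

```python
class StubEngine:
    """Stand-in for the inference engine; only the keyword rename matters here."""

    def generate(self, prompts, sampling_params=None):
        # After this PR the example passes sampling_params=..., where it
        # previously passed inference_params=... built from CommonInferenceParams.
        return [{"prompt": p, "params": sampling_params} for p in prompts]


# Illustrative values only; not the actual SamplingParams schema.
sampling_params = {"temperature": 1.0, "top_k": 1, "num_tokens_to_generate": 64}
results = StubEngine().generate(["describe the image"], sampling_params=sampling_params)
print(results[0]["params"]["top_k"])  # → 1
```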

Changes

  • Inference Parameter Configuration — examples/inference/vlm/vlm_inference.py: Replaces the CommonInferenceParams import with SamplingParams and updates the corresponding instance creation and keyword argument name in the generate() call from inference_params to sampling_params.
  • Inference Wrapper Setup — src/megatron/bridge/inference/vlm/base.py: Adds a StaticInferenceContext import, introduces an inference_max_batch_size parameter to setup_inference_wrapper(), constructs the inference context with max batch size and sequence length constraints, and exposes vocab_size from the language model in the decoder exposure method.
  • Wrapper Initialization — src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py: Updates QwenVLInferenceWrapper.__init__() to accept and forward an optional inference_context parameter to the superclass.
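The wrapper-initialization change reduces to a forwarded keyword argument. A minimal sketch, in which `BaseInferenceWrapper` is a hypothetical simplification of the Megatron base class, not its real signature:

```python
from typing import Any, Optional


class BaseInferenceWrapper:
    """Hypothetical stand-in for the Megatron base inference wrapper."""

    def __init__(self, model: Any, config: Any,
                 inference_context: Optional[Any] = None):
        self.model = model
        self.config = config
        # The base class holds the (possibly None) context used for KV caching.
        self.inference_context = inference_context


class QwenVLInferenceWrapper(BaseInferenceWrapper):
    """Accepts an optional inference_context and forwards it to the superclass."""

    def __init__(self, model: Any, config: Any,
                 inference_context: Optional[Any] = None):
        super().__init__(model, config, inference_context=inference_context)


wrapper = QwenVLInferenceWrapper(model=object(), config={}, inference_context="ctx")
print(wrapper.inference_context)  # → ctx
```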

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Suggested labels

bug

Suggested reviewers

  • cuichenx
  • huvunvidia
  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title "Fix VLM inference engine" is vague and generic, using the non-descriptive term "Fix" without explaining what specific issue or feature is being addressed. Resolution: replace it with a more specific title, such as "Replace CommonInferenceParams with SamplingParams in VLM inference".
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Test Results For Major Changes — ✅ Passed: the PR contains only minor parameter handling and configuration changes (~14 lines) across 5 files, including parameter renaming and adding optional parameters. These targeted bug fixes do not constitute major changes requiring test documentation.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py (1)

44-46: ⚠️ Potential issue | 🟡 Minor

InferenceParams assignment in prep_inference_input is unused and likely unnecessary.

Line 46 creates self.inference_params = InferenceParams(batch_size, seq_length), but this parameter is never passed to the model in forward_pass_without_pipeline_parallel (line 88). The model is called with only attention_mask=None and the inference_input dict containing input_ids, pixel_values, and image_grid_thw—neither self.inference_params nor self.inference_context reach the model.

If StaticInferenceContext is passed at construction, the intended KV-cache management should use self.inference_context (from the base class). Since this assignment creates dead code and the model expects inference_context as a parameter (per modeling_qwen25_vl.py), remove line 46 unless it serves another purpose in the generation loop.
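The fix direction the comment suggests — forward the stored context to the model call rather than building a throwaway `InferenceParams` — can be sketched as follows, with `model_forward` as a plain-function stand-in for the real Qwen2.5-VL forward:

```python
def model_forward(input_ids, attention_mask=None, inference_context=None):
    # Stand-in forward; a real model would read its KV cache via inference_context.
    return {"n_tokens": len(input_ids), "used_context": inference_context is not None}


class Wrapper:
    """Toy wrapper showing the context flowing through to the model call."""

    def __init__(self, inference_context=None):
        self.inference_context = inference_context

    def forward_pass_without_pipeline_parallel(self, inference_input):
        # Pass the stored context through instead of creating an unused
        # InferenceParams object that never reaches the model.
        return model_forward(
            inference_input["input_ids"],
            attention_mask=None,
            inference_context=self.inference_context,
        )


out = Wrapper(inference_context=object()).forward_pass_without_pipeline_parallel(
    {"input_ids": [101, 102, 103]}
)
print(out)  # → {'n_tokens': 3, 'used_context': True}
```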

src/megatron/bridge/inference/vlm/base.py (1)

115-124: ⚠️ Potential issue | 🟡 Minor

vocab_size exposure is asymmetric — and appears to be unused code.

_expose_decoder_from_language_model (line 124) is only invoked for Qwen25VLModelProvider (line 148), while Qwen3VLModelProvider (lines 149–150) skips it entirely. However, the exposed vocab_size attribute is never accessed anywhere in the inference pipeline. Additionally, Qwen3 models already carry vocab_size in their language_transformer_config (passed during language model initialization at line 150 of modelling_qwen3_vl/model.py), making the asymmetric exposure redundant for Qwen2.5 as well.

Either remove the unused vocab_size exposure entirely, or clarify its purpose if required by a downstream dependency.

inference_max_batch_size parameter is not exposed through the public API.

The inference_max_batch_size parameter (line 133) is hardcoded to default 4 in setup_model_and_tokenizer (lines 104–110), which always calls setup_inference_wrapper without passing this argument. Callers cannot customize the KV cache batch size allocation.

Additional actionable comment:

In src/megatron/bridge/inference/vlm/qwenvl_inference_wrapper.py, around lines 34-35: QwenVLInferenceWrapper.__init__ lacks a type annotation for the new inference_context parameter, and the class docstring's Args section does not document it. Annotate the parameter with the appropriate nullable type (e.g. InferenceContext | None) and add inference_context, with its purpose, to the docstring.


params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 1000,
inference_max_seq_length: int = 8192,
inference_max_batch_size: int = 4,
Contributor

@coderabbitai coderabbitai bot Feb 17, 2026


⚠️ Potential issue | 🟠 Major

inference_max_batch_size is not surfaced through setup_model_and_tokenizer.

setup_inference_wrapper gains the new inference_max_batch_size parameter (default 4), but setup_model_and_tokenizer — the public entry point — never accepts or forwards it (lines 104–110). Any caller who needs generate(max_batch_size=N) with N > 4 will hit a StaticInferenceContext KV-cache overflow at runtime, with no way to prevent it through the public API.

🔧 Suggested fix: surface the parameter in the public function
 def setup_model_and_tokenizer(
     megatron_model_path: str,
     tp: int = 1,
     pp: int = 1,
     params_dtype: torch.dtype = torch.bfloat16,
     inference_batch_times_seqlen_threshold: int = 1000,
     inference_max_seq_length: int = 8192,
+    inference_max_batch_size: int = 4,
 ):
     ...
     inference_wrapped_model = setup_inference_wrapper(
         model[0],
         processor.tokenizer,
         params_dtype=torch.bfloat16,
         inference_batch_times_seqlen_threshold=1000,
         inference_max_seq_length=inference_max_seq_length,
+        inference_max_batch_size=inference_max_batch_size,
     )
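The overflow risk described above can be illustrated with a toy bounded context. This is a deliberate simplification, not the real Megatron-Core StaticInferenceContext, which actually manages per-layer KV-cache memory; the point is only that a statically sized context must reject requests beyond its preallocated bounds:

```python
class ToyStaticContext:
    """Toy stand-in for a statically bounded inference context."""

    def __init__(self, max_batch_size: int, max_sequence_length: int):
        self.max_batch_size = max_batch_size
        self.max_sequence_length = max_sequence_length

    def reserve(self, batch_size: int, seq_length: int) -> None:
        # A request beyond the preallocated bounds must fail up front rather
        # than overflowing the KV cache mid-generation.
        if batch_size > self.max_batch_size:
            raise ValueError(f"batch_size {batch_size} exceeds max {self.max_batch_size}")
        if seq_length > self.max_sequence_length:
            raise ValueError(f"seq_length {seq_length} exceeds max {self.max_sequence_length}")


ctx = ToyStaticContext(max_batch_size=4, max_sequence_length=8192)
ctx.reserve(batch_size=4, seq_length=1024)       # fits within the bounds
try:
    ctx.reserve(batch_size=8, seq_length=1024)   # exceeds the hardcoded default of 4
except ValueError as err:
    print(err)  # → batch_size 8 exceeds max 4
```

This is why the reviewer asks for inference_max_batch_size to be exposed through the public entry point: with the bound fixed at 4, a caller requesting a larger batch has no way to enlarge the context.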

Contributor


@meatybobby could you please check if this is relevant?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor Author


MCore still has backward compatibility for this, but I just removed it to avoid confusion.

@meatybobby
Contributor Author

/ok to test aa1b9c3

@meatybobby
Contributor Author

/ok to test 5d0bf0e

Contributor

@athitten athitten left a comment


LGTM, thank you @meatybobby!

