
Adding CUDA Graph Support for Vision Encoder#2334

Open
tomlifu wants to merge 7 commits into NVIDIA-NeMo:main from shifangx:shifang/qwen3_vl_perf_cuda_graph

Conversation


@tomlifu tomlifu commented Feb 11, 2026

PRs merge order

Please merge the PRs in the following order.
(1) #2370
(2) #2372
(3) #2334

What does this PR do ?

This PR adds CUDA Graph support for vision encoder.
Previous PR: #2274 was deprecated.

Related Mcore PRs: NVIDIA/Megatron-LM#3293, NVIDIA/Megatron-LM#3294
Related TE PR: NVIDIA/TransformerEngine#2657

Example run command:

python ${MEGATRON_BRIDGE_PATH}/scripts/performance/run_script.py \
--account my-account \
--partition my-partition \
--container_image my-container-image \
--gpu gb200 \
--domain vlm \
--model_family_name qwen_vl \
--model_recipe_name qwen3_vl_235b_a22b \
--gpus_per_node 4 \
--num_gpus 64 \
--log_dir ${WORKSPACE}/../logs \
dataset.seq_length=4096 \
dataset.image_size=[392,392] \
model.seq_length=4096 \
model.freeze_language_model=False \
model.freeze_vision_model=False \
model.freeze_vision_projection=False \
model.pipeline_model_parallel_size=8 \
model.expert_model_parallel_size=8 \
model.account_for_embedding_in_pipeline_split=False \
model.account_for_loss_in_pipeline_split=False \
model.num_layers_in_first_pipeline_stage=4 \
model.num_layers_in_last_pipeline_stage=12 \
train.micro_batch_size=1 \
train.global_batch_size=1024 \
train.train_iters=20 \
train.eval_iters=0 \
train.manual_gc=true \
train.manual_gc_interval=100 \
logger.log_interval=1 \
model.cuda_graph_impl=transformer_engine \
model.cuda_graph_scope=[attn,moe_router,moe_preprocess] \
model.vision_cuda_graph_impl=transformer_engine \
model.vision_cuda_graph_scope=[attn,mlp] \
model.max_vision_cuda_graph_seq_length=784

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added vision encoder CUDA graph support for improved training performance.
    • Added configurable vision CUDA graph settings including implementation scope and maximum sequence length limits.
  • Improvements

    • Enhanced vision model with automatic sequence padding for CUDA graph compatibility.
  • Updates

    • Updated Megatron-LM submodule to latest version.


copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tomlifu tomlifu requested a review from shifangx February 11, 2026 19:32

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Walkthrough

This PR adds CUDA graph support for vision encoders in Qwen3VL models. Changes include vision CUDA graph configuration fields, per-layer detection logic in transformer blocks, padding/unpadding workflows in the vision model, training integration for graph initialization and cleanup, and updates to the Megatron-LM submodule pointer.

Changes

Cohort / File(s) Summary
Submodule Update
3rdparty/Megatron-LM
Updated commit reference from 347ad215a8ca2f46c9a599666b03465c475bf4eb to e02e4270655d47429b99c10001050edfdf8ef8a5.
Performance Utilities
scripts/performance/utils/overrides.py
Added _set_vision_cuda_graph_overrides helper to configure CUDA graph settings for vision encoder, including TE RNG tracking and scope validation.
Qwen3VL Configuration & Detection
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py
Added vision CUDA graph configuration fields (vision_cuda_graph_impl, vision_cuda_graph_scope, max_vision_cuda_graph_seq_length) to transformer config and provider; added utility to convert scope strings to enums.
Qwen3VL Model Compatibility
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
Added CUDA graph compatibility shim exposing position_embedding_type, rotary_pos_emb, and decoder attributes when CUDA graphs are enabled.
Qwen3VL Vision Processing
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py
Added helper methods to detect vision CUDA graphs and compute max sequence length; introduced padding/unpadding workflow for CUDA graph compatibility with attention mask construction.
Qwen3VL Transformer Block
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py
Added per-layer detection of TE CUDA graph replay and conditional suppression of packed_seq_params to prevent invalid inputs to CUDA graph replay.
Training Integration
src/megatron/bridge/training/train.py
Added vision CUDA graph initialization, post-warmup graph capture, hook setup, and cleanup logic alongside language model CUDA graphs.
VLM Batch Handling
src/megatron/bridge/training/vlm_step.py
Modified batch handling to force first/last pipeline stage detection and apply fixed-length padding unconditionally.
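The padding/unpadding workflow that the walkthrough describes for vision_model.py can be sketched as follows. This is a minimal illustration of the bookkeeping, not the PR's implementation: plain Python lists stand in for `[s, b, h]` tensors, and the function names are hypothetical.

```python
def pad_for_cuda_graph(seq, max_seq_len, pad_value=0):
    """Pad a sequence up to the fixed graph-capture length.

    CUDA graph replay requires static input shapes, so variable-length
    vision sequences are padded to max_seq_len before the graphed encode.
    Returns the padded sequence and the original length for later unpadding.
    """
    orig_len = len(seq)
    assert orig_len <= max_seq_len, "input exceeds graph capture length"
    padded = list(seq) + [pad_value] * (max_seq_len - orig_len)
    return padded, orig_len


def unpad_after_cuda_graph(seq, orig_len):
    """Drop padding after replay so downstream ops see only real tokens."""
    return seq[:orig_len]


# Typical round trip around the graphed vision encoder:
padded, n = pad_for_cuda_graph([1, 2, 3], 6)
restored = unpad_after_cuda_graph(padded, n)
```

In the real model the padded positions would also be masked out via the attention mask the walkthrough mentions, so they cannot influence the encoded outputs.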

Sequence Diagram(s)

sequenceDiagram
    participant Training as Training Loop
    participant VisionModel as Vision Model
    participant VisionGraphHelper as Vision CUDA<br/>Graph Helper
    participant TEBackend as Transformer Engine<br/>Backend
    
    Training->>VisionModel: Initialize (detect cuda_graph_impl)
    VisionModel-->>Training: cuda_graph_impl enabled
    Training->>VisionGraphHelper: Create & initialize
    Training->>Training: Warmup iterations
    Training->>VisionGraphHelper: Capture graphs (post-warmup)
    VisionGraphHelper->>TEBackend: Record CUDA graph operations
    
    loop Training Steps
        Training->>VisionModel: Forward pass
        VisionModel->>VisionModel: Pad to max_seq_len if CUDA graphs active
        VisionModel->>TEBackend: Encode (uses replayed graph if warmed)
        VisionModel->>VisionModel: Unpad to original length
        Training->>TEBackend: Setup vision graph hooks
        Training->>Training: Backward pass
    end
    
    Training->>VisionGraphHelper: Cleanup (delete graphs)
    VisionGraphHelper->>TEBackend: Destroy CUDA graphs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Run CICD

Suggested reviewers

  • malay-nagda
  • erhoo82
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Test Results For Major Changes: ⚠️ Warning. The PR introduces major CUDA Graph changes for vision encoders that could affect numerics and convergence, but includes no test results, performance benchmarks, or convergence data. Resolution: include test results demonstrating no numerical regressions, add before-and-after performance benchmarks, and fix the double-dereference bug in vision model initialization that prevents CUDA graph functionality.
✅ Passed checks (3 passed)
Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required threshold of 80.00%.
Title Check: ✅ Passed. The title 'Adding CUDA Graph Support for Vision Encoder' clearly and directly summarizes the primary change in the PR: adding CUDA graph functionality for vision encoders across multiple modified files.
Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/train.py`:
- Around line 264-286: The code double-dereferences the vision model:
get_attr_wrapped_model(..., 'vision_model') already returns the vision model,
but the block then checks hasattr(unwrapped, 'vision_model') and accesses
unwrapped.vision_model, preventing initialization; change the logic to treat
unwrapped as the vision model (check unwrapped is not None and has
attribute/config as needed), read vision_model_config = unwrapped.config (or use
unwrapped directly), then construct VisionTECudaGraphHelper with that
vision_model_config and set vision_cuda_graph_helper accordingly so it actually
initializes when the returned object indicates transformer_engine.
🧹 Nitpick comments (6)
src/megatron/bridge/training/vlm_step.py (1)

238-255: Avoid hardcoding PP stage flags and unconditional fixed-length padding

Hardcoding is_first/is_last = True and the if True: branch disables the dynamic-length path and forces labels/loss_mask onto every PP stage, which adds extra GPU transfers and leaves dead code. Consider restoring real PP-stage flags and gating fixed-length padding on PP size or CUDA-graph enablement.

♻️ Suggested gating for PP/CUDA-graph fixed-length padding
-    is_first = True
-    is_last = True
+    is_first = is_pp_first_stage(pg_collection.pp)
+    is_last = is_pp_last_stage(pg_collection.pp)

...
-        if True:
+        if (
+            getattr(cfg.model, "pipeline_model_parallel_size", 1) > 1
+            or getattr(cfg.model, "cuda_graph_impl", "none") != "none"
+        ):
scripts/performance/utils/overrides.py (1)

126-163: Align the new override signature with repo typing conventions

The new helper uses Optional/List in its signature. The repo guidelines prefer built-in generics and T | None. Consider updating the annotations accordingly.

♻️ Typing guideline-aligned signature
 def _set_vision_cuda_graph_overrides(
     recipe: ConfigContainer,
-    vision_cuda_graph_impl: Optional[str] = None,
-    vision_cuda_graph_scope: Optional[str | List[str]] = None,
+    vision_cuda_graph_impl: str | None = None,
+    vision_cuda_graph_scope: str | list[str] | None = None,
 ) -> ConfigContainer:
As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

183-191: Initialize CUDA-graph helper attributes even when graphs are disabled

These attributes are only set under the CUDA-graph condition. Initializing them unconditionally (e.g., to None) avoids attribute errors and aligns with the "externally visible members" guideline.

♻️ Initialize attributes defensively
-        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
-        if cuda_graph_enabled:
-            self.position_embedding_type = self.language_model.position_embedding_type
-            self.rotary_pos_emb = self.language_model.rotary_pos_emb
-            self.decoder = self.language_model.decoder
+        self.position_embedding_type = None
+        self.rotary_pos_emb = None
+        self.decoder = None
+        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
+        if cuda_graph_enabled:
+            self.position_embedding_type = self.language_model.position_embedding_type
+            self.rotary_pos_emb = self.language_model.rotary_pos_emb
+            self.decoder = self.language_model.decoder
As per coding guidelines, "Initialize all externally visible members of a class in the constructor".
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)

116-128: Deduplicate TE CUDA-graph detection logic

The same layer_uses_te_cudagraph block is repeated in both paths. Consider extracting a small helper (e.g., _uses_te_cudagraph(layer)) to keep the logic consistent and reduce drift.

Also applies to: 331-343

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)

56-60: Use int | None for the new max vision seq length field

The repo’s typing guidelines prefer T | None for nullable types. Consider switching the new field to int | None.

♻️ Typing guideline-aligned field
-    max_vision_cuda_graph_seq_length: Optional[int] = None
+    max_vision_cuda_graph_seq_length: int | None = None
As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'".
src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (1)

38-42: Use built-in generics and | None in the new vision CUDA-graph annotations

The new helper and fields still use List/Optional. The repo guidelines prefer built-in generics and T | None. Updating these will keep the typing consistent across the codebase.

♻️ Typing guideline-aligned annotations
-def _convert_cuda_graph_scope_to_enum(scope_list: List[str]) -> List[CudaGraphScope]:
+def _convert_cuda_graph_scope_to_enum(scope_list: list[str]) -> list[CudaGraphScope]:
     """Convert string list to CudaGraphScope enum list."""

...
-    vision_cuda_graph_scope: List[str] = field(default_factory=list)
+    vision_cuda_graph_scope: list[str] = field(default_factory=list)

...
-    max_vision_cuda_graph_seq_length: Optional[int] = None
+    max_vision_cuda_graph_seq_length: int | None = None
As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".

Also applies to: 120-127, 273-280

Comment on lines +264 to +286
vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
if vision_config == "transformer_engine":
    # Try to get vision config from the model
    try:
        for model_chunk in model:
            unwrapped = get_attr_wrapped_model(
                model_chunk, 'vision_model', allow_none=True, return_model_obj=True
            )
            if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
                vision_model_config = unwrapped.vision_model.config
                if vision_model_config.cuda_graph_impl == "transformer_engine":
                    vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
                    vision_cuda_graph_helper = VisionTECudaGraphHelper(
                        model=model,
                        vision_config=vision_model_config,
                        vision_seq_length=vision_seq_length,
                        micro_batch_size=config.train.micro_batch_size,
                        num_microbatches=get_num_microbatches(),
                    )
                    print_rank_0(
                        f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
                    )
                    break
⚠️ Potential issue | 🟠 Major

Vision CUDA-graph helper likely never initializes due to double-dereference

get_attr_wrapped_model(..., "vision_model") already returns the vision model. The subsequent hasattr(unwrapped, "vision_model") gate will typically fail, leaving vision_cuda_graph_helper unset. Use the returned object directly (or normalize once).

🐛 Use the returned vision model directly
-                unwrapped = get_attr_wrapped_model(
-                    model_chunk, 'vision_model', allow_none=True, return_model_obj=True
-                )
-                if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
-                    vision_model_config = unwrapped.vision_model.config
+                vision_model = get_attr_wrapped_model(
+                    model_chunk, "vision_model", allow_none=True, return_model_obj=True
+                )
+                if vision_model is not None:
+                    vision_model = getattr(vision_model, "vision_model", vision_model)
+                    if vision_model is None:
+                        continue
+                    vision_model_config = vision_model.config
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 264 - 286, The code
double-dereferences the vision model: get_attr_wrapped_model(...,
'vision_model') already returns the vision model, but the block then checks
hasattr(unwrapped, 'vision_model') and accesses unwrapped.vision_model,
preventing initialization; change the logic to treat unwrapped as the vision
model (check unwrapped is not None and has attribute/config as needed), read
vision_model_config = unwrapped.config (or use unwrapped directly), then
construct VisionTECudaGraphHelper with that vision_model_config and set
vision_cuda_graph_helper accordingly so it actually initializes when the
returned object indicates transformer_engine.

@shifangx shifangx changed the title Rebasing "Adding CUDA Graph Support for Vision Encoder" to main Adding CUDA Graph Support for Vision Encoder Feb 12, 2026
@tomlifu
Copy link
Contributor Author

tomlifu commented Feb 14, 2026

We need to make sure the LM loss remains the same after using CUDA graphs.

I have made a fix in my mcore PR: NVIDIA/Megatron-LM#3294

@shifangx shifangx force-pushed the shifang/qwen3_vl_perf_cuda_graph branch 8 times, most recently from 9e020f7 to 2d60349 Compare February 14, 2026 09:34
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
@shifangx shifangx force-pushed the shifang/qwen3_vl_perf_cuda_graph branch from 2d60349 to 9721bf2 Compare February 14, 2026 09:55
@tomlifu tomlifu self-assigned this Feb 17, 2026
Lifu Zhang and others added 4 commits February 19, 2026 11:20
Signed-off-by: Lifu Zhang <lifuz@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
…eated()

Non-vision PP stages never create graphs, so calling delete_cuda_graphs
unconditionally triggers an assertion in the parent class.

Signed-off-by: Lifu Zhang <lifuz@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
@shifangx shifangx requested a review from yaoyu-33 March 12, 2026 00:55