
Adding CUDA Graph Support for Vision Encoder#2334

Open
tomlifu wants to merge 7 commits into NVIDIA-NeMo:main from shifangx:shifang/qwen3_vl_perf_cuda_graph

Conversation


@tomlifu tomlifu commented Feb 11, 2026

PRs merge order

Please merge the PRs in the following order.
(1) #2370
(2) #2372
(3) #2334

What does this PR do ?

This PR adds CUDA Graph support for vision encoder.
Previous PR: #2274 was deprecated.

Related Mcore PRs: NVIDIA/Megatron-LM#3293, NVIDIA/Megatron-LM#3294
Related TE PR: NVIDIA/TransformerEngine#2657

Example run command:

python ${MEGATRON_BRIDGE_PATH}/scripts/performance/run_script.py \
--account my-account \
--partition my-partition \
--container_image my-container-image \
--gpu gb200 \
--domain vlm \
--model_family_name qwen_vl \
--model_recipe_name qwen3_vl_235b_a22b \
--gpus_per_node 4 \
--num_gpus 64 \
--log_dir ${WORKSPACE}/../logs \
dataset.seq_length=4096 \
dataset.image_size=[392,392] \
model.seq_length=4096 \
model.freeze_language_model=False \
model.freeze_vision_model=False \
model.freeze_vision_projection=False \
model.pipeline_model_parallel_size=8 \
model.expert_model_parallel_size=8 \
model.account_for_embedding_in_pipeline_split=False \
model.account_for_loss_in_pipeline_split=False \
model.num_layers_in_first_pipeline_stage=4 \
model.num_layers_in_last_pipeline_stage=12 \
train.micro_batch_size=1 \
train.global_batch_size=1024 \
train.train_iters=20 \
train.eval_iters=0 \
train.manual_gc=true \
train.manual_gc_interval=100 \
logger.log_interval=1 \
model.cuda_graph_impl=transformer_engine \
model.cuda_graph_scope=[attn,moe_router,moe_preprocess] \
model.vision_cuda_graph_impl=transformer_engine \
model.vision_cuda_graph_scope=[attn,mlp] \
model.max_vision_cuda_graph_seq_length=784

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added vision encoder CUDA graph support for improved training performance.
    • Added configurable vision CUDA graph settings including implementation scope and maximum sequence length limits.
  • Improvements

    • Enhanced vision model with automatic sequence padding for CUDA graph compatibility.
  • Updates

    • Updated Megatron-LM submodule to latest version.


copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tomlifu tomlifu requested a review from shifangx February 11, 2026 19:32

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Walkthrough

This PR adds CUDA graph support for vision encoders in Qwen3VL models. Changes include vision CUDA graph configuration fields, per-layer detection logic in transformer blocks, padding/unpadding workflows in the vision model, training integration for graph initialization and cleanup, and updates to the Megatron-LM submodule pointer.

Changes

Cohort / File(s) Summary
Submodule Update
3rdparty/Megatron-LM
Updated commit reference from 347ad215a8ca2f46c9a599666b03465c475bf4eb to e02e4270655d47429b99c10001050edfdf8ef8a5.
Performance Utilities
scripts/performance/utils/overrides.py
Added _set_vision_cuda_graph_overrides helper to configure CUDA graph settings for vision encoder, including TE RNG tracking and scope validation.
Qwen3VL Configuration & Detection
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py
Added vision CUDA graph configuration fields (vision_cuda_graph_impl, vision_cuda_graph_scope, max_vision_cuda_graph_seq_length) to transformer config and provider; added utility to convert scope strings to enums.
Qwen3VL Model Compatibility
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
Added CUDA graph compatibility shim exposing position_embedding_type, rotary_pos_emb, and decoder attributes when CUDA graphs are enabled.
Qwen3VL Vision Processing
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py
Added helper methods to detect vision CUDA graphs and compute max sequence length; introduced padding/unpadding workflow for CUDA graph compatibility with attention mask construction.
Qwen3VL Transformer Block
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py
Added per-layer detection of TE CUDA graph replay and conditional suppression of packed_seq_params to prevent invalid inputs to CUDA graph replay.
Training Integration
src/megatron/bridge/training/train.py
Added vision CUDA graph initialization, post-warmup graph capture, hook setup, and cleanup logic alongside language model CUDA graphs.
VLM Batch Handling
src/megatron/bridge/training/vlm_step.py
Modified batch handling to force first/last pipeline stage detection and apply fixed-length padding unconditionally.
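The padding/unpadding workflow that the walkthrough describes for vision_model.py can be sketched as follows. This is a minimal illustration of the bookkeeping, not the PR's implementation: plain Python lists stand in for `[s, b, h]` tensors, and the function names are hypothetical.

```python
def pad_for_cuda_graph(seq, max_seq_len, pad_value=0):
    """Pad a sequence up to the fixed graph-capture length.

    CUDA graph replay requires static input shapes, so variable-length
    vision sequences are padded to max_seq_len before the graphed encode.
    Returns the padded sequence and the original length for later unpadding.
    """
    orig_len = len(seq)
    assert orig_len <= max_seq_len, "input exceeds graph capture length"
    padded = list(seq) + [pad_value] * (max_seq_len - orig_len)
    return padded, orig_len


def unpad_after_cuda_graph(seq, orig_len):
    """Drop padding after replay so downstream ops see only real tokens."""
    return seq[:orig_len]


# Typical round trip around the graphed vision encoder:
padded, n = pad_for_cuda_graph([1, 2, 3], 6)
restored = unpad_after_cuda_graph(padded, n)
```

In the real model the padded positions would also be masked out via the attention mask the walkthrough mentions, so they cannot influence the encoded outputs.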

Sequence Diagram(s)

sequenceDiagram
    participant Training as Training Loop
    participant VisionModel as Vision Model
    participant VisionGraphHelper as Vision CUDA<br/>Graph Helper
    participant TEBackend as Transformer Engine<br/>Backend
    
    Training->>VisionModel: Initialize (detect cuda_graph_impl)
    VisionModel-->>Training: cuda_graph_impl enabled
    Training->>VisionGraphHelper: Create & initialize
    Training->>Training: Warmup iterations
    Training->>VisionGraphHelper: Capture graphs (post-warmup)
    VisionGraphHelper->>TEBackend: Record CUDA graph operations
    
    loop Training Steps
        Training->>VisionModel: Forward pass
        VisionModel->>VisionModel: Pad to max_seq_len if CUDA graphs active
        VisionModel->>TEBackend: Encode (uses replayed graph if warmed)
        VisionModel->>VisionModel: Unpad to original length
        Training->>TEBackend: Setup vision graph hooks
        Training->>Training: Backward pass
    end
    
    Training->>VisionGraphHelper: Cleanup (delete graphs)
    VisionGraphHelper->>TEBackend: Destroy CUDA graphs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Run CICD

Suggested reviewers

  • malay-nagda
  • erhoo82
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Test Results For Major Changes: ⚠️ Warning. The PR introduces major CUDA Graph changes for vision encoders that could affect numerics and convergence, but includes no test results, performance benchmarks, or convergence data. Resolution: include test results demonstrating no numerical regressions, add before-and-after performance benchmarks, and fix the double-dereference bug in vision model initialization that prevents CUDA graph functionality.
✅ Passed checks (3 passed)
Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required threshold of 80.00%.
Title Check: ✅ Passed. The title 'Adding CUDA Graph Support for Vision Encoder' clearly and directly summarizes the primary change in the PR: adding CUDA graph functionality for vision encoders across multiple modified files.
Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/train.py`:
- Around line 264-286: The code double-dereferences the vision model:
get_attr_wrapped_model(..., 'vision_model') already returns the vision model,
but the block then checks hasattr(unwrapped, 'vision_model') and accesses
unwrapped.vision_model, preventing initialization; change the logic to treat
unwrapped as the vision model (check unwrapped is not None and has
attribute/config as needed), read vision_model_config = unwrapped.config (or use
unwrapped directly), then construct VisionTECudaGraphHelper with that
vision_model_config and set vision_cuda_graph_helper accordingly so it actually
initializes when the returned object indicates transformer_engine.
🧹 Nitpick comments (6)
src/megatron/bridge/training/vlm_step.py (1)

238-255: Avoid hardcoding PP stage flags and unconditional fixed-length padding

Hardcoding is_first/is_last = True and the if True: branch disables the dynamic-length path and forces labels/loss_mask onto every PP stage, which adds extra GPU transfers and leaves dead code. Consider restoring real PP-stage flags and gating fixed-length padding on PP size or CUDA-graph enablement.

♻️ Suggested gating for PP/CUDA-graph fixed-length padding
-    is_first = True
-    is_last = True
+    is_first = is_pp_first_stage(pg_collection.pp)
+    is_last = is_pp_last_stage(pg_collection.pp)

...
-        if True:
+        if (
+            getattr(cfg.model, "pipeline_model_parallel_size", 1) > 1
+            or getattr(cfg.model, "cuda_graph_impl", "none") != "none"
+        ):
scripts/performance/utils/overrides.py (1)

126-163: Align the new override signature with repo typing conventions

The new helper uses Optional/List in its signature. The repo guidelines prefer built-in generics and T | None. Consider updating the annotations accordingly.

♻️ Typing guideline-aligned signature
 def _set_vision_cuda_graph_overrides(
     recipe: ConfigContainer,
-    vision_cuda_graph_impl: Optional[str] = None,
-    vision_cuda_graph_scope: Optional[str | List[str]] = None,
+    vision_cuda_graph_impl: str | None = None,
+    vision_cuda_graph_scope: str | list[str] | None = None,
 ) -> ConfigContainer:
As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

183-191: Initialize CUDA-graph helper attributes even when graphs are disabled

These attributes are only set under the CUDA-graph condition. Initializing them unconditionally (e.g., to None) avoids attribute errors and aligns with the "externally visible members" guideline.

♻️ Initialize attributes defensively
-        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
-        if cuda_graph_enabled:
-            self.position_embedding_type = self.language_model.position_embedding_type
-            self.rotary_pos_emb = self.language_model.rotary_pos_emb
-            self.decoder = self.language_model.decoder
+        self.position_embedding_type = None
+        self.rotary_pos_emb = None
+        self.decoder = None
+        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
+        if cuda_graph_enabled:
+            self.position_embedding_type = self.language_model.position_embedding_type
+            self.rotary_pos_emb = self.language_model.rotary_pos_emb
+            self.decoder = self.language_model.decoder
As per coding guidelines, "Initialize all externally visible members of a class in the constructor".
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)

116-128: Deduplicate TE CUDA-graph detection logic

The same layer_uses_te_cudagraph block is repeated in both paths. Consider extracting a small helper (e.g., _uses_te_cudagraph(layer)) to keep the logic consistent and reduce drift.

Also applies to: 331-343

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)

56-60: Use int | None for the new max vision seq length field

The repo’s typing guidelines prefer T | None for nullable types. Consider switching the new field to int | None.

♻️ Typing guideline-aligned field
-    max_vision_cuda_graph_seq_length: Optional[int] = None
+    max_vision_cuda_graph_seq_length: int | None = None
As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'".
src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (1)

38-42: Use built-in generics and | None in the new vision CUDA-graph annotations

The new helper and fields still use List/Optional. The repo guidelines prefer built-in generics and T | None. Updating these will keep the typing consistent across the codebase.

♻️ Typing guideline-aligned annotations
-def _convert_cuda_graph_scope_to_enum(scope_list: List[str]) -> List[CudaGraphScope]:
+def _convert_cuda_graph_scope_to_enum(scope_list: list[str]) -> list[CudaGraphScope]:
     """Convert string list to CudaGraphScope enum list."""

...
-    vision_cuda_graph_scope: List[str] = field(default_factory=list)
+    vision_cuda_graph_scope: list[str] = field(default_factory=list)

...
-    max_vision_cuda_graph_seq_length: Optional[int] = None
+    max_vision_cuda_graph_seq_length: int | None = None
As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".

Also applies to: 120-127, 273-280

Comment on lines +264 to +286
vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
if vision_config == "transformer_engine":
    # Try to get vision config from the model
    try:
        for model_chunk in model:
            unwrapped = get_attr_wrapped_model(
                model_chunk, 'vision_model', allow_none=True, return_model_obj=True
            )
            if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
                vision_model_config = unwrapped.vision_model.config
                if vision_model_config.cuda_graph_impl == "transformer_engine":
                    vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
                    vision_cuda_graph_helper = VisionTECudaGraphHelper(
                        model=model,
                        vision_config=vision_model_config,
                        vision_seq_length=vision_seq_length,
                        micro_batch_size=config.train.micro_batch_size,
                        num_microbatches=get_num_microbatches(),
                    )
                    print_rank_0(
                        f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
                    )
                    break
⚠️ Potential issue | 🟠 Major

Vision CUDA-graph helper likely never initializes due to double-dereference

get_attr_wrapped_model(..., "vision_model") already returns the vision model. The subsequent hasattr(unwrapped, "vision_model") gate will typically fail, leaving vision_cuda_graph_helper unset. Use the returned object directly (or normalize once).

🐛 Use the returned vision model directly
-                unwrapped = get_attr_wrapped_model(
-                    model_chunk, 'vision_model', allow_none=True, return_model_obj=True
-                )
-                if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
-                    vision_model_config = unwrapped.vision_model.config
+                vision_model = get_attr_wrapped_model(
+                    model_chunk, "vision_model", allow_none=True, return_model_obj=True
+                )
+                if vision_model is not None:
+                    vision_model = getattr(vision_model, "vision_model", vision_model)
+                    if vision_model is None:
+                        continue
+                    vision_model_config = vision_model.config
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 264 - 286, The code
double-dereferences the vision model: get_attr_wrapped_model(...,
'vision_model') already returns the vision model, but the block then checks
hasattr(unwrapped, 'vision_model') and accesses unwrapped.vision_model,
preventing initialization; change the logic to treat unwrapped as the vision
model (check unwrapped is not None and has attribute/config as needed), read
vision_model_config = unwrapped.config (or use unwrapped directly), then
construct VisionTECudaGraphHelper with that vision_model_config and set
vision_cuda_graph_helper accordingly so it actually initializes when the
returned object indicates transformer_engine.

@shifangx shifangx changed the title Rebasing "Adding CUDA Graph Support for Vision Encoder" to main Adding CUDA Graph Support for Vision Encoder Feb 12, 2026
@tomlifu
Copy link
Contributor Author

tomlifu commented Feb 14, 2026

We need to make sure the LM loss remains the same after using CUDA graphs.

I have made a fix in my mcore PR: NVIDIA/Megatron-LM#3294

@shifangx shifangx force-pushed the shifang/qwen3_vl_perf_cuda_graph branch 8 times, most recently from 9e020f7 to 2d60349 Compare February 14, 2026 09:34
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
@shifangx shifangx force-pushed the shifang/qwen3_vl_perf_cuda_graph branch from 2d60349 to 9721bf2 Compare February 14, 2026 09:55
@tomlifu tomlifu self-assigned this Feb 17, 2026
Lifu Zhang and others added 4 commits February 19, 2026 11:20
Signed-off-by: Lifu Zhang <lifuz@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
…eated()

Non-vision PP stages never create graphs, so calling delete_cuda_graphs
unconditionally triggers an assertion in the parent class.

Signed-off-by: Lifu Zhang <lifuz@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
@shifangx shifangx requested a review from yaoyu-33 March 12, 2026 00:55