Adding CUDA Graph Support for Vision Encoder #2334
tomlifu wants to merge 7 commits into NVIDIA-NeMo:main
Conversation
📝 Walkthrough

This PR adds CUDA graph support for vision encoders in Qwen3VL models. Changes include vision CUDA graph configuration fields, per-layer detection logic in transformer blocks, padding/unpadding workflows in the vision model, training integration for graph initialization and cleanup, and updates to the Megatron-LM submodule pointer.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Training as Training Loop
    participant VisionModel as Vision Model
    participant VisionGraphHelper as Vision CUDA<br/>Graph Helper
    participant TEBackend as Transformer Engine<br/>Backend
    Training->>VisionModel: Initialize (detect cuda_graph_impl)
    VisionModel-->>Training: cuda_graph_impl enabled
    Training->>VisionGraphHelper: Create & initialize
    Training->>Training: Warmup iterations
    Training->>VisionGraphHelper: Capture graphs (post-warmup)
    VisionGraphHelper->>TEBackend: Record CUDA graph operations
    loop Training Steps
        Training->>VisionModel: Forward pass
        VisionModel->>VisionModel: Pad to max_seq_len if CUDA graphs active
        VisionModel->>TEBackend: Encode (uses replayed graph if warmed)
        VisionModel->>VisionModel: Unpad to original length
        Training->>TEBackend: Setup vision graph hooks
        Training->>Training: Backward pass
    end
    Training->>VisionGraphHelper: Cleanup (delete graphs)
    VisionGraphHelper->>TEBackend: Destroy CUDA graphs
```
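CUDA-graph replay requires static input shapes, which is why the flow above pads the vision sequence to a fixed `max_seq_len` before encoding and unpads afterwards. A minimal sketch of that pad/unpad step (standalone illustration; the function names and zero-padding choice are assumptions, not the PR's actual API):

```python
import torch
import torch.nn.functional as F


def pad_to_max_seq_len(x: torch.Tensor, max_seq_len: int) -> tuple[torch.Tensor, int]:
    """Pad the sequence dim (dim 0) to a fixed length so a captured CUDA graph can be replayed."""
    orig_len = x.shape[0]
    if orig_len > max_seq_len:
        raise ValueError(f"sequence length {orig_len} exceeds graph capture length {max_seq_len}")
    pad = max_seq_len - orig_len
    if pad:
        # F.pad pads from the last dim backwards: (hidden_left, hidden_right, seq_left, seq_right)
        x = F.pad(x, (0, 0, 0, pad))
    return x, orig_len


def unpad(x: torch.Tensor, orig_len: int) -> torch.Tensor:
    """Drop the padded tail after the graph-captured encoder has run."""
    return x[:orig_len]
```

The padded tail is zeros here; the real implementation may need attention masking so padded tokens do not contribute to the encoder output.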
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/train.py`:
- Around line 264-286: The code double-dereferences the vision model:
get_attr_wrapped_model(..., 'vision_model') already returns the vision model,
but the block then checks hasattr(unwrapped, 'vision_model') and accesses
unwrapped.vision_model, preventing initialization; change the logic to treat
unwrapped as the vision model (check unwrapped is not None and has
attribute/config as needed), read vision_model_config = unwrapped.config (or use
unwrapped directly), then construct VisionTECudaGraphHelper with that
vision_model_config and set vision_cuda_graph_helper accordingly so it actually
initializes when the returned object indicates transformer_engine.
🧹 Nitpick comments (6)
src/megatron/bridge/training/vlm_step.py (1)
238-255: Avoid hardcoding PP stage flags and unconditional fixed-length padding

Hardcoding `is_first`/`is_last = True` and the `if True:` branch disables the dynamic-length path and forces labels/loss_mask onto every PP stage, which adds extra GPU transfers and leaves dead code. Consider restoring real PP-stage flags and gating fixed-length padding on PP size or CUDA-graph enablement.

♻️ Suggested gating for PP/CUDA-graph fixed-length padding

```diff
-    is_first = True
-    is_last = True
+    is_first = is_pp_first_stage(pg_collection.pp)
+    is_last = is_pp_last_stage(pg_collection.pp)
 ...
-    if True:
+    if (
+        getattr(cfg.model, "pipeline_model_parallel_size", 1) > 1
+        or getattr(cfg.model, "cuda_graph_impl", "none") != "none"
+    ):
```

scripts/performance/utils/overrides.py (1)
126-163: Align the new override signature with repo typing conventions

The new helper uses `Optional`/`List` in its signature. The repo guidelines prefer built-in generics and `T | None`. Consider updating the annotations accordingly.

As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".

♻️ Typing guideline-aligned signature

```diff
 def _set_vision_cuda_graph_overrides(
     recipe: ConfigContainer,
-    vision_cuda_graph_impl: Optional[str] = None,
-    vision_cuda_graph_scope: Optional[str | List[str]] = None,
+    vision_cuda_graph_impl: str | None = None,
+    vision_cuda_graph_scope: str | list[str] | None = None,
 ) -> ConfigContainer:
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)
183-191: Initialize CUDA-graph helper attributes even when graphs are disabled

These attributes are only set under the CUDA-graph condition. Initializing them unconditionally (e.g., to `None`) avoids attribute errors and aligns with the "externally visible members" guideline.

As per coding guidelines, "Initialize all externally visible members of a class in the constructor".

♻️ Initialize attributes defensively

```diff
-        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
-        if cuda_graph_enabled:
-            self.position_embedding_type = self.language_model.position_embedding_type
-            self.rotary_pos_emb = self.language_model.rotary_pos_emb
-            self.decoder = self.language_model.decoder
+        self.position_embedding_type = None
+        self.rotary_pos_emb = None
+        self.decoder = None
+        cuda_graph_enabled = getattr(self.language_model.config, "cuda_graph_impl", "none") != "none"
+        if cuda_graph_enabled:
+            self.position_embedding_type = self.language_model.position_embedding_type
+            self.rotary_pos_emb = self.language_model.rotary_pos_emb
+            self.decoder = self.language_model.decoder
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)
116-128: Deduplicate TE CUDA-graph detection logic

The same `layer_uses_te_cudagraph` block is repeated in both paths. Consider extracting a small helper (e.g., `_uses_te_cudagraph(layer)`) to keep the logic consistent and reduce drift.

Also applies to: 331-343

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)
56-60: Useint | Nonefor the new max vision seq length fieldThe repo’s typing guidelines prefer
T | Nonefor nullable types. Consider switching the new field toint | None.As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'".♻️ Typing guideline-aligned field
- max_vision_cuda_graph_seq_length: Optional[int] = None + max_vision_cuda_graph_seq_length: int | None = Nonesrc/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (1)
38-42: Use built-in generics and `| None` in the new vision CUDA-graph annotations

The new helper and fields still use `List`/`Optional`. The repo guidelines prefer built-in generics and `T | None`. Updating these will keep the typing consistent across the codebase.

As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents".

♻️ Typing guideline-aligned annotations

```diff
-def _convert_cuda_graph_scope_to_enum(scope_list: List[str]) -> List[CudaGraphScope]:
+def _convert_cuda_graph_scope_to_enum(scope_list: list[str]) -> list[CudaGraphScope]:
     """Convert string list to CudaGraphScope enum list."""
 ...
-    vision_cuda_graph_scope: List[str] = field(default_factory=list)
+    vision_cuda_graph_scope: list[str] = field(default_factory=list)
 ...
-    max_vision_cuda_graph_seq_length: Optional[int] = None
+    max_vision_cuda_graph_seq_length: int | None = None
```

Also applies to: 120-127, 273-280
```python
vision_config = getattr(config.model, 'vision_cuda_graph_impl', None)
if vision_config == "transformer_engine":
    # Try to get vision config from the model
    try:
        for model_chunk in model:
            unwrapped = get_attr_wrapped_model(
                model_chunk, 'vision_model', allow_none=True, return_model_obj=True
            )
            if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
                vision_model_config = unwrapped.vision_model.config
                if vision_model_config.cuda_graph_impl == "transformer_engine":
                    vision_seq_length = get_vision_cuda_graph_seq_length(vision_model_config)
                    vision_cuda_graph_helper = VisionTECudaGraphHelper(
                        model=model,
                        vision_config=vision_model_config,
                        vision_seq_length=vision_seq_length,
                        micro_batch_size=config.train.micro_batch_size,
                        num_microbatches=get_num_microbatches(),
                    )
                    print_rank_0(
                        f"Vision encoder CUDA graph enabled with seq_length={vision_seq_length}"
                    )
                    break
```
Vision CUDA-graph helper likely never initializes due to double-dereference
get_attr_wrapped_model(..., "vision_model") already returns the vision model. The subsequent hasattr(unwrapped, "vision_model") gate will typically fail, leaving vision_cuda_graph_helper unset. Use the returned object directly (or normalize once).
🐛 Use the returned vision model directly
```diff
-                unwrapped = get_attr_wrapped_model(
-                    model_chunk, 'vision_model', allow_none=True, return_model_obj=True
-                )
-                if unwrapped is not None and hasattr(unwrapped, 'vision_model') and unwrapped.vision_model is not None:
-                    vision_model_config = unwrapped.vision_model.config
+                vision_model = get_attr_wrapped_model(
+                    model_chunk, "vision_model", allow_none=True, return_model_obj=True
+                )
+                if vision_model is not None:
+                    vision_model = getattr(vision_model, "vision_model", vision_model)
+                if vision_model is None:
+                    continue
+                vision_model_config = vision_model.config
```
+ vision_model_config = vision_model.config🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/train.py` around lines 264 - 286, The code
double-dereferences the vision model: get_attr_wrapped_model(...,
'vision_model') already returns the vision model, but the block then checks
hasattr(unwrapped, 'vision_model') and accesses unwrapped.vision_model,
preventing initialization; change the logic to treat unwrapped as the vision
model (check unwrapped is not None and has attribute/config as needed), read
vision_model_config = unwrapped.config (or use unwrapped directly), then
construct VisionTECudaGraphHelper with that vision_model_config and set
vision_cuda_graph_helper accordingly so it actually initializes when the
returned object indicates transformer_engine.
I have made a fix in my mcore PR: NVIDIA/Megatron-LM#3294
…eated(): Non-vision PP stages never create graphs, so calling delete_cuda_graphs unconditionally triggers an assertion in the parent class.
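The guard this commit describes can be sketched as follows (class and method names are assumptions based on the discussion above, not the actual Megatron-Bridge API):

```python
class VisionTECudaGraphHelper:
    """Illustrative skeleton; only the cleanup guard is the point here."""

    def __init__(self) -> None:
        self._graphs_created = False

    def capture(self) -> None:
        # ... capture CUDA graphs for the vision layers ...
        self._graphs_created = True

    def cuda_graphs_created(self) -> bool:
        return self._graphs_created

    def delete_cuda_graphs(self) -> None:
        # Only valid after capture; the parent class asserts otherwise.
        assert self._graphs_created, "no CUDA graphs to delete"
        self._graphs_created = False


def cleanup(helper: VisionTECudaGraphHelper) -> None:
    # Non-vision PP stages never capture graphs, so skip deletion there
    # instead of calling delete_cuda_graphs unconditionally.
    if helper.cuda_graphs_created():
        helper.delete_cuda_graphs()
```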
PRs merge order
Please merge the PRs in the following order.
(1) #2370
(2) #2372
(3) #2334
What does this PR do ?
This PR adds CUDA Graph support for vision encoder.
Previous PR: #2274 was deprecated.
Related Mcore PRs: NVIDIA/Megatron-LM#3293, NVIDIA/Megatron-LM#3294
Related TE PR: NVIDIA/TransformerEngine#2657
Example run command:
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Improvements
Updates