[Deprecated] Adding CUDA Graph Support for Vision Encoder #2274
Changes from all commits: 8fdb5a4, 3ba2474, a1d3e8f, e839630, 6f278c5, 8c67a8d, bee30d6, 299cb89, 5469e54, d88ad7f, 7fcc910, 5f7b646
Files changed:

- `megatron/core/transformer/cuda_graphs.py` (+60 −19)
- `megatron/core/transformer/module.py` (+4 −3)
- `megatron/core/transformer/transformer_block.py` (+1 −1)
New file, `examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py` (`@@ -0,0 +1,75 @@`):

```python
#!/usr/bin/env python3
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
==============================================================================
Example: Qwen3_VL Pretraining with Decentralized Process Groups (Simple)
==============================================================================

This example demonstrates the simplest way to enable decentralized process
groups: just use an existing recipe and set `cfg.dist.use_decentralized_pg = True`.

The setup() function inside pretrain() will automatically create the
ProcessGroupCollection using HyperCommGrid based on the parallelism settings.

How to Run
----------
# 8 GPUs: EP8
uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py
"""

import torch

from megatron.bridge.recipes.qwen_vl.qwen3_vl import qwen3_vl_30b_a3b_pretrain_config
from megatron.bridge.training.pretrain import pretrain
from megatron.bridge.training.vlm_step import forward_step


def main() -> None:
    """Run Qwen3 pretraining with decentralized process groups enabled."""
    # Get the standard Qwen3-VL 30B-A3B pretrain config with overrides
    cfg = qwen3_vl_30b_a3b_pretrain_config(
        # Use mock data for demo
        mock=True,
        # Parallelism
        expert_model_parallel_size=8,
        # Training settings (small for demo)
        train_iters=100,
        seq_length=1024,
        global_batch_size=32,
        micro_batch_size=1,
        # LR schedule (must fit within train_iters)
        lr_warmup_iters=10,
        lr_decay_iters=100,
    )
    # Known issue with share_embeddings_and_output_weights
    cfg.model.share_embeddings_and_output_weights = False

    # =========================================================================
    # KEY: Enable decentralized process groups
    # =========================================================================
    cfg.dist.use_decentralized_pg = True
    cfg.dist.use_gloo_process_groups = False  # Gloo not supported with decentralized PG

    pretrain(config=cfg, forward_step_func=forward_step)

    # Cleanup
    if torch.distributed.is_initialized():
        torch.distributed.barrier()
        torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```
`scripts/performance/utils/overrides.py` (`@@ -127,6 +127,46 @@ def _set_cuda_graph_overrides`):

```python
    return recipe


def _set_vision_cuda_graph_overrides(
    recipe: ConfigContainer,
    vision_cuda_graph_impl: Optional[str] = None,
    vision_cuda_graph_scope: Optional[str | List[str]] = None,
) -> ConfigContainer:
    """Set the vision encoder CUDA graph overrides.

    This enables TE CUDA graph for the vision encoder separately from the language model.

    Args:
        recipe: The config container
        vision_cuda_graph_impl: Vision encoder CUDA graph implementation
            ("none" or "transformer_engine")
        vision_cuda_graph_scope: Vision encoder CUDA graph scope (e.g., ["attn"])

    Returns:
        Updated config container
    """
    if isinstance(vision_cuda_graph_scope, str):
        vision_cuda_graph_scope = [vision_cuda_graph_scope]

    if vision_cuda_graph_impl is not None:
        recipe.model.vision_cuda_graph_impl = vision_cuda_graph_impl

    if vision_cuda_graph_impl == "transformer_engine":
        # Ensure TE RNG tracker is enabled for CUDA graph compatibility
        recipe.rng.te_rng_tracker = recipe.model.use_te_rng_tracker = True

    valid_te_scopes = ["attn", "mlp"]  # Vision encoder typically only has attn and mlp
    if vision_cuda_graph_scope:
        assert all(scope in valid_te_scopes for scope in vision_cuda_graph_scope), (
            f"Invalid vision cuda graph scope: {vision_cuda_graph_scope}. "
            f"Valid options for vision encoder are: {valid_te_scopes}"
        )

    if vision_cuda_graph_scope is not None:
        recipe.model.vision_cuda_graph_scope = vision_cuda_graph_scope

    return recipe
```
Contributor comment on lines +130 to +167 (repository: NVIDIA-NeMo/Megatron-Bridge):

Scripts executed during review:

```shell
# Check if _set_vision_cuda_graph_overrides is called anywhere in the codebase
rg -n '_set_vision_cuda_graph_overrides' --type=py

# Also check the file to see the full context of both functions and their calls
cat -n scripts/performance/utils/overrides.py | head -270
```

The function is not integrated into …. Additionally, the function is missing the reset logic for ….
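The scope-normalization and validation logic added in the diff can be exercised standalone. The sketch below is a hypothetical reduction for illustration: `set_vision_scope` and the `SimpleNamespace` stand-in replace the real `_set_vision_cuda_graph_overrides` and `ConfigContainer`, which are not importable here.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import List, Optional


def set_vision_scope(model: SimpleNamespace,
                     scope: Optional[str | List[str]] = None) -> SimpleNamespace:
    """Hypothetical stand-in mirroring the diff's scope handling."""
    # Normalize a bare string into a single-element list, as the diff does.
    if isinstance(scope, str):
        scope = [scope]

    # Vision encoder blocks eligible for TE CUDA graph capture (per the diff).
    valid_te_scopes = ["attn", "mlp"]
    if scope:
        assert all(s in valid_te_scopes for s in scope), (
            f"Invalid vision cuda graph scope: {scope}. "
            f"Valid options for vision encoder are: {valid_te_scopes}"
        )

    if scope is not None:
        model.vision_cuda_graph_scope = scope
    return model


model = set_vision_scope(SimpleNamespace(), scope="attn")
print(model.vision_cuda_graph_scope)  # ['attn']
```

Note that a bare `"attn"` is normalized to `["attn"]`, while an out-of-scope value such as `"conv"` trips the assertion.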
The hunk's trailing context lines:

```python
def _set_recompute_overrides(
    recipe: ConfigContainer,
    cpu_offloading_num_layers: Optional[int] = None,
```
Contributor comment on the example file's cleanup block:

Remove manual process group cleanup; `pretrain()` handles this internally.

The `pretrain()` function already manages distributed cleanup via `_maybe_destroy_process_group()`, which destroys the process group only if it was created by `pretrain()` itself (lines 111, 191 in `src/megatron/bridge/training/pretrain.py`). The manual `barrier()` + `destroy_process_group()` call will cause a runtime error in the typical scenario where the process group is not initialized before `pretrain()` is called, since `pretrain()` will have already destroyed it upon return.

Delete lines 68-71 (or the equivalent cleanup block in similar example files).
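The ownership rule the reviewer describes — tear down a shared resource only if you created it yourself — can be sketched without torch. `DistContext` and `pretrain_like` below are hypothetical stand-ins for the process-group lifecycle inside `pretrain()`, not Megatron-Bridge APIs:

```python
class DistContext:
    """Hypothetical stand-in for a process-group lifecycle manager.

    Mirrors the behavior attributed to _maybe_destroy_process_group():
    destroy the shared group on exit only if this context created it.
    """

    initialized = False  # module-wide stand-in for torch.distributed state

    def __init__(self) -> None:
        self.created_here = False

    def setup(self) -> None:
        # Initialize the shared group only if no caller did so already.
        if not DistContext.initialized:
            DistContext.initialized = True
            self.created_here = True

    def teardown(self) -> None:
        # Destroy only what we created; a caller-owned group is left alone.
        if self.created_here:
            DistContext.initialized = False
            self.created_here = False


def pretrain_like() -> None:
    """Stand-in for pretrain(): owns setup and conditional teardown."""
    ctx = DistContext()
    ctx.setup()
    # ... training would run here ...
    ctx.teardown()


# Case 1: pretrain_like() created the group, so it also destroyed it;
# a second, unconditional destroy in the caller would then fail.
pretrain_like()
print(DistContext.initialized)  # False

# Case 2: the caller owns the group; pretrain_like() leaves it initialized.
DistContext.initialized = True
pretrain_like()
print(DistContext.initialized)  # True
```

This is why the example's trailing `destroy_process_group()` is redundant in case 1 and only defensible when the caller initialized the group itself.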