NVIDIA-NeMo · yaoyu-33 · Jan 28, 2026 · Oct 23, 2025
diff --git a/examples/recipes/decentralized_pg/README.md b/examples/recipes/decentralized_pg/README.md
@@ -0,0 +1,179 @@
+# Decentralized Process Groups Examples
+
+This directory contains examples demonstrating how to use **decentralized process groups** (`use_decentralized_pg=True`) in Megatron-Bridge for distributed training.
+
+## Overview
+
+Instead of relying on Megatron-Core's global parallel state module (`parallel_state`, often imported as `mpu`), you can pass a `ProcessGroupCollection` explicitly to every component. This gives you full control over the parallelism topology and is useful for:
+
+1. **Reinforcement Learning**: Multiple model instances (policy, value, reference) with different parallelism
+2. **Multi-Model Pipelines**: Complex workflows requiring explicit control over communication
+3. **Testing/Debugging**: Isolated process groups without global state side effects
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `pretrain_qwen3_simple.py` | **Simple**: Use a recipe and enable `use_decentralized_pg=True` |
+| `pretrain_qwen3_with_decentralized_pg.py` | **Advanced**: Manually create process groups with `HyperCommGrid` |
+
+## Quick Start
+
+### Simple Approach (Recommended)
+
+Just use an existing recipe and enable decentralized process groups:
+
+```bash
+# 8 GPUs: TP2 x PP2 x DP2
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+
+# 4 GPUs: TP2 x PP2 x DP1
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+```
+
+The key is the two `cfg.dist` settings at the end of this snippet:
+
+```python
+from megatron.bridge.recipes.qwen.qwen3 import qwen3_4b_pretrain_config
+
+cfg = qwen3_4b_pretrain_config(
+    tensor_model_parallel_size=2,
+    pipeline_model_parallel_size=2,
+    # ... other settings
+)
+
+# Enable decentralized process groups
+cfg.dist.use_decentralized_pg = True
+cfg.dist.use_gloo_process_groups = False  # Gloo not supported
+```
+
+### Advanced Approach (Manual Process Group Creation)
+
+For full control over process groups:
+
+```bash
+# 8 GPUs: TP2 x PP2 x DP2
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py
+
+# 4 GPUs: TP2 x PP2 x DP1
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/recipes/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py \
+    --tp-size 2 --pp-size 2
+
+# 2 GPUs: TP2 x PP1 x DP1
+uv run python -m torch.distributed.run --nproc_per_node=2 examples/recipes/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py \
+    --tp-size 2 --pp-size 1
+```
+
+## Manual Process Group Creation (Advanced)
+
+### Step 1: Initialize torch.distributed
+
+```python
+torch.distributed.init_process_group(backend="nccl", world_size=world_size, rank=rank)
+```
+
+### Step 2: Create ProcessGroupCollection with HyperCommGrid
+
+```python
+from megatron.core.hyper_comm_grid import HyperCommGrid
+from megatron.core.process_groups_config import ProcessGroupCollection
+
+# Create a grid with shape [TP, CP, DP, PP]
+grid = HyperCommGrid(
+    shape=[tp_size, cp_size, dp_size, pp_size],
+    dim_names=["tp", "cp", "dp", "pp"],
+    rank_offset=0,
+    backend="nccl",
+)
+
+# Create process groups by selecting dimensions
+tp_pg = grid.create_pg(["tp"])      # Ranks differ only in TP dimension
+pp_pg = grid.create_pg(["pp"])      # Ranks differ only in PP dimension
+dp_pg = grid.create_pg(["dp"])      # Ranks differ only in DP dimension
+mp_pg = grid.create_pg(["tp", "pp"])  # Model parallel spans TP x PP
+
+# Bundle into ProcessGroupCollection
+pg_collection = ProcessGroupCollection(
+    tp=tp_pg,
+    pp=pp_pg,
+    dp=dp_pg,
+    mp=mp_pg,
+    # ... more groups
+)
+```
+
+### Step 3: Set Random Seeds (REQUIRED)
+
+```python
+from megatron.core import tensor_parallel
+from megatron.core.utils import get_pg_rank
+
+# Get TP rank from our process group
+tp_rank = get_pg_rank(pg_collection.tp)
+
+# Initialize CUDA RNG tracker - REQUIRED before model creation!
+tensor_parallel.model_parallel_cuda_manual_seed(
+    seed=1234,
+    te_rng_tracker=False,
+    inference_rng_tracker=False,
+    use_cudagraphable_rng=False,
+    tp_rank=tp_rank,
+    ep_rank=0,
+    etp_rank=tp_rank,
+)
+```
+
+### Step 4: Pass pg_collection Explicitly to Components
+
+```python
+# Model creation
+model = cfg.model.provide_distributed_model(
+    pg_collection=pg_collection,  # <-- Pass here!
+    ...
+)
+
+# Optimizer setup
+optimizer, scheduler = setup_optimizer(
+    pg_collection=pg_collection,  # <-- Pass here!
+    ...
+)
+
+# Data loaders use the DP group
+train_data_iterator = setup_data_iterators(
+    dp_group=pg_collection.dp,  # <-- Use DP group for data sharding!
+    ...
+)
+
+# Training loop
+train(
+    pg_collection=pg_collection,  # <-- Pass here!
+    ...
+)
+```
+
+## HyperCommGrid Explained
+
+`HyperCommGrid` creates a multi-dimensional grid of ranks. The grid shape `[TP, CP, DP, PP]` defines how ranks are organized:
+
+```
+World Size = 8, Shape = [2, 1, 2, 2] means:
+  TP=2, CP=1, DP=2, PP=2
+
+Rank layout (the first dimension, TP, varies fastest):
+  TP=0,DP=0,PP=0: rank 0    TP=1,DP=0,PP=0: rank 1
+  TP=0,DP=1,PP=0: rank 2    TP=1,DP=1,PP=0: rank 3
+  TP=0,DP=0,PP=1: rank 4    TP=1,DP=0,PP=1: rank 5
+  TP=0,DP=1,PP=1: rank 6    TP=1,DP=1,PP=1: rank 7
+```
+
+When you call `grid.create_pg(["tp"])`, it creates groups of ranks that share the same CP, DP, and PP coordinates but differ in TP:
+- Group 1: [rank 0, rank 1] (DP=0, PP=0)
+- Group 2: [rank 2, rank 3] (DP=1, PP=0)
+- Group 3: [rank 4, rank 5] (DP=0, PP=1)
+- Group 4: [rank 6, rank 7] (DP=1, PP=1)
+
+## Limitations
+
+- Gloo process groups are not supported (NCCL only)
+- ModelOpt sharded checkpointing is disabled
+- Distillation tensor shape adjustment is disabled
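Note on the "HyperCommGrid Explained" section of the README above: the grouping it describes can be reproduced without GPUs. The sketch below is plain Python and does not call Megatron-Core. `enumerate_groups` is a hypothetical helper written for illustration, and it assumes the first entry of `dim_names` varies fastest across ranks. That ordering is an assumption about how `HyperCommGrid` lays out ranks, so verify it against your Megatron-Core version before relying on it.

```python
from itertools import product


def enumerate_groups(shape, dim_names, group_dims):
    """Return rank groups whose members differ only along `group_dims`.

    Hypothetical helper, not a Megatron-Core API.
    Assumption: the first entry of `dim_names` varies fastest across ranks.
    """
    # Stride of each dimension; the first dimension gets stride 1.
    strides = {}
    stride = 1
    for name, size in zip(dim_names, shape):
        strides[name] = stride
        stride *= size
    sizes = dict(zip(dim_names, shape))

    fixed_dims = [d for d in dim_names if d not in group_dims]
    groups = []
    # Walk the fixed coordinates slowest-first so groups come out in rank order.
    for fixed in product(*(range(sizes[d]) for d in reversed(fixed_dims))):
        base = sum(c * strides[d] for c, d in zip(fixed, reversed(fixed_dims)))
        members = []
        for coords in product(*(range(sizes[d]) for d in reversed(group_dims))):
            offset = sum(c * strides[d] for c, d in zip(coords, reversed(group_dims)))
            members.append(base + offset)
        groups.append(sorted(members))
    return groups


# World size 8 with shape [TP, CP, DP, PP] = [2, 1, 2, 2], as in the README.
tp_groups = enumerate_groups([2, 1, 2, 2], ["tp", "cp", "dp", "pp"], ["tp"])
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The four TP groups match the list in the README. Passing `["dp"]` or `["tp", "pp"]` as `group_dims` shows which ranks would share a data-parallel or model-parallel group under the same ordering assumption.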
diff --git a/examples/recipes/decentralized_pg/pretrain_qwen3_simple.py b/examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+==============================================================================
+Example: Qwen3 Pretraining with Decentralized Process Groups (Simple)
+==============================================================================
+
+This example demonstrates the simplest way to enable decentralized process groups:
+just use an existing recipe and set `cfg.dist.use_decentralized_pg = True`.
+
+The setup() function inside pretrain() will automatically create the
+ProcessGroupCollection using HyperCommGrid based on the parallelism settings.
+
+How to Run
+----------
+# 8 GPUs: TP2 x PP2 x DP2
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+
+# 4 GPUs: TP2 x PP2 x DP1
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+"""
+
+import torch
+
+from megatron.bridge.recipes.qwen.qwen3 import qwen3_4b_pretrain_config
+from megatron.bridge.training.gpt_step import forward_step
+from megatron.bridge.training.pretrain import pretrain
+
+
+def main() -> None:
+    """Run Qwen3 pretraining with decentralized process groups enabled."""
+    # Get the standard Qwen3 4B pretrain config with overrides
+    cfg = qwen3_4b_pretrain_config(
+        # Use mock data for demo
+        mock=True,
+        # Parallelism
+        tensor_model_parallel_size=2,
+        pipeline_model_parallel_size=2,
+        # Training settings (small for demo)
+        train_iters=100,
+        seq_length=1024,
+        global_batch_size=32,
+        micro_batch_size=1,
+        # LR schedule (must fit within train_iters)
+        lr_warmup_iters=10,
+        lr_decay_iters=100,
+    )
+    # Workaround for a known issue with share_embeddings_and_output_weights: disable weight tying
+    cfg.model.share_embeddings_and_output_weights = False
+
+    # =========================================================================
+    # KEY: Enable decentralized process groups
+    # =========================================================================
+    cfg.dist.use_decentralized_pg = True
+    cfg.dist.use_gloo_process_groups = False  # Gloo not supported with decentralized PG
+
+    pretrain(config=cfg, forward_step_func=forward_step)
+
+    # Cleanup
+    if torch.distributed.is_initialized():
+        torch.distributed.barrier()
+        torch.distributed.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
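Note on the run commands in `pretrain_qwen3_simple.py`: the "TP2 x PP2 x DP2" and "TP2 x PP2 x DP1" labels follow from the GPU count, because the data-parallel size is whatever remains after the model-parallel dimensions are accounted for. The sketch below works through that arithmetic for the settings in the script (`global_batch_size=32`, `micro_batch_size=1`). Both helper names are hypothetical and are not part of Megatron-Bridge.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Derive the DP size left over after TP, PP, and CP (hypothetical helper)."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


def num_microbatches(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Microbatches each DP replica processes per iteration (hypothetical helper)."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"micro_batch_size*DP={per_step}"
        )
    return global_batch_size // per_step


# The two launch configurations from the script's docstring.
for gpus in (8, 4):
    dp = data_parallel_size(gpus, tp=2, pp=2)
    print(gpus, dp, num_microbatches(32, 1, dp))
```

With 8 GPUs each replica accumulates over 16 microbatches per iteration, and with 4 GPUs over 32, so the global batch stays the same while per-iteration time changes.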