
cp: feat: add Megatron support for on-policy distillation (1324) into r0.4.0#1398

Merged
terrykong merged 3 commits into r0.4.0 from cherry-pick-1324-r0.4.0 on Oct 29, 2025
Conversation

@chtruong814 (Contributor) commented Oct 21, 2025

beep boop [🤖]: Hi @zpqiu 👋,

we've cherry-picked #1324 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added epoch-based training progress tracking to distillation workflows alongside step counters.
    • Enabled top-k logit computation support for Megatron backend configurations.
    • Introduced new distillation configuration examples with Megatron parallelism support.
  • Documentation

    • Removed outdated backend compatibility constraints from documentation.
  • Tests

    • Added functional and unit tests for Megatron distillation and top-k logits validation.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Signed-off-by: alexchiu <qiuzhaopeng@foxmail.com>
Signed-off-by: alexchiu <alexq@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@chtruong814 chtruong814 requested review from a team as code owners October 21, 2025 04:29
@chtruong814 chtruong814 requested a review from zpqiu October 21, 2025 04:29
@terrykong terrykong enabled auto-merge (squash) October 21, 2025 04:34
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 21, 2025
coderabbitai bot (Contributor) commented Oct 21, 2025

📝 Walkthrough

This PR adds Megatron on-policy distillation support by introducing epoch-based training tracking in the distillation algorithm, implementing top-k logits computation for the Megatron backend with context parallelism support, and providing comprehensive configuration examples and test coverage for Megatron-specific distillation workflows.
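For intuition, the two-stage top-k described above can be sketched in plain Python. This is an illustration only: the real distributed_vocab_topk operates on sharded torch tensors across tensor-parallel ranks, and its actual signature is not shown here; the function and variable names below are hypothetical.

```python
import heapq

def shard_topk(shard, k, offset):
    # Each vocab shard proposes its k best (value, global_index) candidates.
    return heapq.nlargest(k, ((v, offset + i) for i, v in enumerate(shard)))

def distributed_vocab_topk_sketch(shards, k):
    # Stage 1: local top-k per shard. Stage 2: reduce candidates to a global top-k.
    # Correctness rests on the fact that the global top-k is always contained
    # in the union of the per-shard local top-ks.
    candidates, offset = [], 0
    for shard in shards:
        candidates.extend(shard_topk(shard, k, offset))
        offset += len(shard)
    best = heapq.nlargest(k, candidates)
    return [v for v, _ in best], [i for _, i in best]
```

In a distributed setting each shard lives on a different rank and stage 2 becomes an all-gather of candidates followed by a final local top-k, but the candidate-union argument for correctness is the same.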

Changes

  • Documentation & Configuration (README.md, examples/configs/distillation_math.yaml): Removed on-policy distillation backend restrictions; expanded distillation_math.yaml with max_num_epochs and a detailed Megatron configuration block (parallelism, memory, optimization, routing, data-parallel settings).
  • New Megatron Distillation Configs (examples/configs/distillation_math_megatron.yaml, examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml): Added complete Megatron-based distillation configurations with checkpointing, policy settings, generation backend (vLLM), logging, teacher model overrides, and cluster resource allocation.
  • Core Algorithm & Policy (nemo_rl/algorithms/distillation.py, nemo_rl/models/policy/megatron_policy_worker.py): Introduced the max_num_epochs field and epoch/step tracking in DistillationConfig and DistillationSaveState; implemented get_topk_logits for the Megatron backend with distributed top-k computation and context-parallelism support; added imports for allgather_cp_sharded_tensor and distributed_vocab_topk.
  • Functional Test Infrastructure (tests/functional/L1_Functional_Tests_GPU.sh, tests/functional/distillation_megatron.sh, tests/test_suites/nightly.txt): Added the distillation_megatron.sh script to GPU functional tests; added a nightly test-suite entry for the Megatron distillation run.
  • Distillation Test Suite (tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh): Added a test-suite script for Qwen3 32B-to-1.7B distillation with Megatron parallelism, including configuration, TensorBoard log conversion, and metric validation.
  • Unit Tests (tests/unit/algorithms/test_distillation.py, tests/unit/models/policy/test_megatron_worker.py, tests/unit/test_recipes_and_test_suites.py): Added max_num_epochs to test configurations and an offload_after_refit method to DummyPolicy; introduced a topk_setup fixture and top-k logits validation tests for the Megatron backend, including context-parallelism agreement checks; updated the nightly GPU-hours threshold from 1030 to 1040.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Distillation Loop
    participant DistillationSaveState
    participant MegatronWorker
    
    User->>Distillation Loop: Start (max_num_epochs=10, max_num_steps=N)
    
    loop for each epoch (0 to max_num_epochs-1)
        loop for each step in epoch
            Distillation Loop->>MegatronWorker: get_topk_logits()
            
            rect rgb(200, 220, 255)
            Note over MegatronWorker: compute local top-k via distributed_vocab_topk
            end
            
            alt context_parallel > 1 and sequence_packing
                MegatronWorker->>MegatronWorker: allgather across CP groups
            end
            
            MegatronWorker-->>Distillation Loop: topk_logits, topk_indices
            
            Distillation Loop->>Distillation Loop: train step
            Distillation Loop->>DistillationSaveState: update(current_epoch, current_step, total_steps)
            
            alt total_steps >= max_num_steps
                Distillation Loop->>Distillation Loop: break (training complete)
            end
        end
    end
    
    Distillation Loop-->>User: Training complete
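The epoch/step bookkeeping in the diagram amounts to a nested loop with a global step budget. A minimal sketch follows; the names are illustrative, not the actual distillation API, and the real loop also handles validation, checkpointing, and resume.

```python
def run_loop(dataloader, max_num_epochs, max_num_steps, train_step, save_state):
    # Hypothetical sketch of the diagram's control flow: iterate epochs,
    # update the save state after every step, and stop early once the
    # global step budget is exhausted.
    total_steps = save_state.get("total_steps", 0)
    for epoch in range(save_state.get("current_epoch", 0), max_num_epochs):
        for step, batch in enumerate(dataloader):
            train_step(batch)
            total_steps += 1
            save_state.update(current_epoch=epoch, current_step=step + 1,
                              total_steps=total_steps)
            if total_steps >= max_num_steps:
                return save_state  # training complete
    return save_state
```

Note that both limits are active at once: whichever of max_num_epochs or max_num_steps is reached first terminates training.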

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The PR modifies core data structures (DistillationConfig, DistillationSaveState) affecting training loop state management; implements a substantial new feature (get_topk_logits for Megatron with distributed computation, context parallelism, and sequence packing logic); spans 13 heterogeneous files including core logic, configs, and extensive test coverage; requires understanding epoch-based tracking, distributed tensor operations, and validation of test suite metrics.

Suggested labels

r0.4.0, CI:L1

Suggested reviewers

  • zpqiu
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The PR title "cp: feat: add Megatron support for on-policy distillation (1324) into r0.4.0" clearly and specifically describes the main change in the changeset. Although the "cp:" prefix and "into r0.4.0" suffix are unconventional formatting that conveys cherry-pick metadata, the core message directly aligns with the substantial changes: implementing Megatron backend support for on-policy distillation across configuration files, the core distillation algorithm, policy workers, and comprehensive test coverage. The title is specific and concrete enough that a teammate scanning the history would immediately understand the primary change.
  • Test Results For Major Changes ✅ Passed: The PR description includes alignment experiments showing a convergence comparison between the DTensor baseline and multiple Megatron configurations (with varying parallelism strategies), demonstrating that the major changes do not introduce numeric regressions. The author has also confirmed in the pre-checks that unit and functional tests were run locally, and multiple new tests have been added (functional tests, unit tests for top-k logits, and nightly test entries). The PR includes a usage example and clearly documents the new Megatron support feature.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (17)
nemo_rl/algorithms/distillation.py (4)

78-80: Config and save-state shape look good; add brief key doc.

New max_num_epochs and multi-counter save state are consistent. Add a short doc comment on recommended defaults/usage per coding guidelines for TypedDict keys.

Also applies to: 88-107


253-255: Setup logging is helpful

The added progress prints (dataloader/val/cluster) aid operability without impacting behavior.

Consider gating verbose prints behind a logger verbosity level in future.

Also applies to: 270-273, 278-278, 295-297, 360-362


300-302: Clarify Megatron non‑colocated constraint in error

Good assertion. Consider appending “Use backend=vllm for non-colocated inference” to reduce config round‑trips.


578-580: Step header denominator can mislead late in training

Printing denominator as min(len(dataloader), max_steps) is coarse. Prefer remaining steps in this epoch.

-            print(
-                f"\n{'=' * 25} Step {current_step + 1}/{min(len(dataloader), max_steps)} {'=' * 25}",
-                flush=True,
-            )
+            steps_left = max_steps - total_steps
+            steps_this_epoch = min(len(dataloader), steps_left)
+            print(
+                f"\n{'=' * 25} Step {current_step + 1}/{steps_this_epoch} {'=' * 25}",
+                flush=True,
+            )
tests/unit/test_recipes_and_test_suites.py (1)

156-156: Threshold bump acknowledged; consider centralizing the limit.

Raising the cap to 1040 is fine. To avoid future rename churn, consider extracting 1040 to a module constant (e.g., MAX_NIGHTLY_GPU_HOURS) and reference it in the assertion/message; the test name can remain as-is. The new Megatron distillation driver is properly listed in nightly and present on disk.

nemo_rl/models/policy/megatron_policy_worker.py (1)

1409-1700: Address lint nits: add return type, remove unused variables, shorten exception message

Verification confirms the CP > 1 test with sequence packing (test_megatron_context_parallel_topk_agreement) exists and validates the slice math through comprehensive assertions. The implementation is sound. Apply these minor cleanup improvements:

  • Add return type annotation for clarity:
-    def get_topk_logits(
+    def get_topk_logits(
         self,
         *,
         data: BatchedDataDict[GenerationDatumSpec],
         k: int,
         micro_batch_size: Optional[int] = None,
-    ):
+    ) -> BatchedDataDict:
  • Remove unused variables (F841 Ruff warnings):

    • pp_size (line 1469): read from config but never used
    • input_ids_unpacked (line 1477): unpacked but never referenced
    • cu_seqlens (line 1480): unpacked but only cu_seqlens_padded is used
  • Shorten exception message (TRY003):

-                        raise RuntimeError(
-                            "Context Parallelism (CP>1) requires sequence packing to be enabled."
-                        )
+                        raise RuntimeError("CP>1 requires sequence packing enabled.")
tests/unit/algorithms/test_distillation.py (3)

552-554: Match interface: add return type annotation for DummyPolicy.offload_after_refit.

Annotate to align with interfaces.PolicyWorker.offload_after_refit(self) -> None.

-        def offload_after_refit(self):
-            return None
+        def offload_after_refit(self) -> None:
+            return None

91-96: Avoid fragile MagicMock.__iter__ overrides.

Assigning a plain function to __iter__ can bypass method binding. Use return_value/side_effect on the mock instead.

-    def train_iter(self):
-        return iter([mock_batch] * 10)
-    train_dataloader.__iter__ = train_iter
+    train_dataloader.__iter__.side_effect = lambda: iter([mock_batch] * 10)
     train_dataloader.__len__ = MagicMock(return_value=10)

-    def val_iter(self):
-        return iter([mock_batch] * 10)
-    val_dataloader.__iter__ = val_iter
+    val_dataloader.__iter__.side_effect = lambda: iter([mock_batch] * 10)
     val_dataloader.__len__ = MagicMock(return_value=10)

Also applies to: 99-104


123-131: Add an epoch-termination unit test.

You added distillation.max_num_epochs, but no test asserts epoch-based early exit. Recommend a focused test where max_num_epochs=1 and max_num_steps is large, verifying the loop stops at the epoch boundary. I can draft it if helpful.

Also applies to: 512-519, 620-627
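A minimal, self-contained sketch of the termination property such a test would pin down (run_epochs is a hypothetical stand-in, not the actual distillation loop or test harness):

```python
def run_epochs(dataloader_len, max_num_epochs, max_num_steps):
    # Stand-in for the distillation loop's termination logic: the loop
    # stops at whichever limit (epochs or steps) is reached first.
    total_steps = 0
    for _ in range(max_num_epochs):
        for _ in range(dataloader_len):
            total_steps += 1
            if total_steps >= max_num_steps:
                return total_steps
    return total_steps

def test_epoch_boundary_termination():
    # With max_num_epochs=1 and an effectively unbounded step budget,
    # the loop must stop at the epoch boundary.
    assert run_epochs(dataloader_len=10, max_num_epochs=1, max_num_steps=10_000) == 10
```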

tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh (1)

2-4: Export config vars sourced by helpers (silence SC2034).

If common.env consumes these via env, export them; else add a brief comment.

 source $SCRIPT_DIR/common.env
 # ===== BEGIN CONFIG =====
-NUM_NODES=1
-STEPS_PER_RUN=10
-MAX_STEPS=10
-NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))  # Round up
-NUM_MINUTES=60
+export NUM_NODES=1
+export STEPS_PER_RUN=10
+export MAX_STEPS=10
+export NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))  # Round up
+export NUM_MINUTES=60

Also applies to: 6-11

tests/unit/models/policy/test_megatron_worker.py (4)

316-319: Narrow or annotate broad exception handlers (BLE001).

Catching Exception masks real failures. Either catch specific errors (e.g., ray exceptions, CUDA OOM) or annotate to acknowledge intent in tests.

-    except Exception as e:
+    except Exception as e:  # noqa: BLE001 - broad to convert infra/env failures into skips in tests
         print(f"Error during training setup: {e}")
         pytest.skip(f"Training setup failed: {e}")
@@
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001
         print(f"Error during generation setup: {e}")
         pytest.skip(f"Generation setup failed: {e}")
@@
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001
         print(f"Error during logprob setup: {e}")
         pytest.skip(f"Logprob setup failed: {e}")
@@
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001
         print(f"Error during topk setup: {e}")
         pytest.skip(f"Topk setup failed: {e}")

Also applies to: 507-509, 657-660, 1413-1416


1460-1462: Silence unused cluster var (RUF059).

Prefix with underscore.

-    policy, cluster, data = topk_setup
+    policy, _cluster, data = topk_setup

1516-1521: Guard CP top‑k test on CUDA availability.

Prevents failures on CPU-only CI.

 def test_megatron_context_parallel_topk_agreement(tiny_qwen2_model_path):
     """Test that CP and non-CP models produce identical top-k logits with sequence packing enabled."""
-    num_gpus = 2
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for context-parallel top-k test")
+    num_gpus = 2

1861-1873: Fix stray tuple; use msg for clarity.

The current code builds an unused tuple. Pass msg= to assert_close.

-    (
-        torch.testing.assert_close(
-            logprobs_no_cp, logprobs_no_cp_no_packing, rtol=1e-3, atol=1e-3
-        ),
-        (
-            "Logprobs should match between non-CP and non-CP models with sequence packing"
-        ),
-    )
+    torch.testing.assert_close(
+        logprobs_no_cp,
+        logprobs_no_cp_no_packing,
+        rtol=1e-3,
+        atol=1e-3,
+        msg="Logprobs should match between packing and non-packing (non-CP)",
+    )
examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml (3)

12-24: Align policy parallelism with filename “tp2pp2cp2”.

File name implies TP=2, PP=2, CP=2, but policy.megatron_cfg doesn’t set these explicitly. Recommend pinning them to avoid surprises from defaults.

 policy:
@@
   megatron_cfg:
     enabled: true
+    tensor_model_parallel_size: 2
+    pipeline_model_parallel_size: 2
+    context_parallel_size: 2

25-37: Teacher parallelism naming mismatch; verify intent.

Recipe name suggests “tp2pp2cp2”, but teacher sets tp=4, cp=1 and omits PP. If intentional (e.g., different sharding for teacher), add a brief comment; otherwise align to tp2/pp2/cp2.

 teacher:
   model_name: Qwen/Qwen3-32B
@@
   megatron_cfg:
     enabled: true
-    tensor_model_parallel_size: 4
-    context_parallel_size: 1
+    tensor_model_parallel_size: 2
+    pipeline_model_parallel_size: 2
+    context_parallel_size: 2

12-17: Ensure policy.model_name is set to Qwen3‑1.7B‑Base.

The filename encodes “to‑1.7b‑base”, but policy.model_name isn’t set here. If it isn’t provided by defaults, set it explicitly to avoid composing the wrong model.

I can draft a quick OmegaConf loader script to print the composed policy/teacher names.

Also applies to: 25-27

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a06941c and 6f686d5.

📒 Files selected for processing (13)
  • README.md (0 hunks)
  • examples/configs/distillation_math.yaml (2 hunks)
  • examples/configs/distillation_math_megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml (1 hunks)
  • nemo_rl/algorithms/distillation.py (26 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (2 hunks)
  • tests/functional/L1_Functional_Tests_GPU.sh (1 hunks)
  • tests/functional/distillation_megatron.sh (1 hunks)
  • tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh (1 hunks)
  • tests/test_suites/nightly.txt (1 hunks)
  • tests/unit/algorithms/test_distillation.py (4 hunks)
  • tests/unit/models/policy/test_megatron_worker.py (1 hunks)
  • tests/unit/test_recipes_and_test_suites.py (2 hunks)
💤 Files with no reviewable changes (1)
  • README.md
🧰 Additional context used
📓 Path-based instructions (11)
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults. When adding a new config key, reflect its recommended default in the exemplar YAMLs under examples/configs/*.yaml.

Files:

  • examples/configs/distillation_math.yaml
  • examples/configs/distillation_math_megatron.yaml
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tests/functional/L1_Functional_Tests_GPU.sh
  • tests/functional/distillation_megatron.sh
  • tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites/*/ and list nightly tests in tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/distillation.py
  • tests/unit/test_recipes_and_test_suites.py
  • tests/unit/models/policy/test_megatron_worker.py
  • tests/unit/algorithms/test_distillation.py
  • nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/distillation.py
  • nemo_rl/models/policy/megatron_policy_worker.py
tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

  • tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: --ng-[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes/*/

Files:

  • examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
🧠 Learnings (2)
📚 Learning: 2025-09-20T14:58:45.492Z
Learnt from: CR
PR: NVIDIA-NeMo/RL#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-20T14:58:45.492Z
Learning: Applies to tests/test_suites/nightly.txt : Append the new driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.

Applied to files:

  • nemo_rl/algorithms/distillation.py
🧬 Code graph analysis (5)
nemo_rl/algorithms/distillation.py (6)
nemo_rl/utils/logger.py (1)
  • log_batched_dict_as_jsonl (893-917)
nemo_rl/models/policy/megatron_policy_worker.py (5)
  • offload_after_refit (2121-2143)
  • prepare_for_lp_inference (2046-2049)
  • get_topk_logits (1409-1699)
  • prepare_for_training (2051-2074)
  • train (873-1126)
nemo_rl/models/policy/dtensor_policy_worker.py (5)
  • offload_after_refit (1880-1903)
  • prepare_for_lp_inference (1831-1838)
  • get_topk_logits (1327-1622)
  • prepare_for_training (1841-1862)
  • train (529-895)
nemo_rl/models/policy/interfaces.py (4)
  • offload_after_refit (153-154)
  • get_topk_logits (84-95)
  • prepare_for_training (125-126)
  • train (98-115)
nemo_rl/models/policy/lm_policy.py (6)
  • offload_after_refit (740-743)
  • Policy (59-806)
  • prepare_for_lp_inference (638-642)
  • get_topk_logits (380-444)
  • prepare_for_training (633-636)
  • train (446-539)
nemo_rl/distributed/batched_data_dict.py (2)
  • BatchedDataDict (75-860)
  • size (814-823)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
  • tracker (221-252)
tests/unit/models/policy/test_megatron_worker.py (7)
nemo_rl/distributed/virtual_cluster.py (2)
  • RayVirtualCluster (177-435)
  • shutdown (407-426)
nemo_rl/algorithms/utils.py (1)
  • get_tokenizer (157-288)
nemo_rl/models/generation/__init__.py (1)
  • configure_generation_config (24-45)
nemo_rl/models/policy/lm_policy.py (3)
  • Policy (59-806)
  • shutdown (780-787)
  • get_topk_logits (380-444)
nemo_rl/distributed/batched_data_dict.py (2)
  • to (825-832)
  • BatchedDataDict (75-860)
nemo_rl/models/policy/megatron_policy_worker.py (2)
  • shutdown (2286-2288)
  • get_topk_logits (1409-1699)
tests/unit/conftest.py (1)
  • tiny_qwen2_model_path (514-538)
tests/unit/algorithms/test_distillation.py (5)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • offload_after_refit (2121-2143)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • offload_after_refit (1880-1903)
nemo_rl/models/policy/interfaces.py (1)
  • offload_after_refit (153-154)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • offload_after_refit (1841-1864)
nemo_rl/models/policy/lm_policy.py (1)
  • offload_after_refit (740-743)
nemo_rl/models/policy/megatron_policy_worker.py (2)
nemo_rl/distributed/model_utils.py (2)
  • allgather_cp_sharded_tensor (706-709)
  • distributed_vocab_topk (829-896)
nemo_rl/distributed/batched_data_dict.py (4)
  • BatchedDataDict (75-860)
  • to (825-832)
  • size (814-823)
  • from_batches (102-172)
🪛 Ruff (0.14.1)
tests/unit/models/policy/test_megatron_worker.py

1413-1413: Do not catch blind exception: Exception

(BLE001)


1461-1461: Unpacked variable cluster is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

nemo_rl/models/policy/megatron_policy_worker.py

1469-1469: Local variable pp_size is assigned to but never used

Remove assignment to unused variable pp_size

(F841)


1477-1477: Unpacked variable input_ids_unpacked is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1480-1480: Unpacked variable cu_seqlens is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


1601-1603: Avoid specifying long messages outside the exception class

(TRY003)

🪛 Shellcheck (0.11.0)
tests/functional/distillation_megatron.sh

[error] 52-52: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 29-29: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (13)
nemo_rl/algorithms/distillation.py (5)

193-214: DTensor SP + packing guard

Assertion guarding SP+packing with DTensor is appropriate; thanks for the explicit message and link.


377-384: Megatron train_iters alignment across teacher/student

Setting train_iters=min(max_steps, max_epochs*len(dataloader)) for both teacher and student is correct and avoids scheduler mismatch.

If resuming from checkpoint with a changed dataset length, confirm train_iters remains consistent with the original run to preserve scheduler behavior.

Also applies to: 429-436


525-544: Epoch/step counters wired correctly

current_epoch/current_step/total_steps pulled from save-state align with the new training loop.


708-711: Training loop/CKPT/metrics flow

  • is_last_step logic and periodic validation look correct.
  • Save-state increments (+1) before persisting avoids off‑by‑one on resume.
  • Metrics logged at step total_steps+1 are consistent with counters.

Also applies to: 714-736, 779-788, 802-808, 893-895, 909-912


924-937: Validation messaging improvements

Helpful user feedback on skipped validation and timing; no behavior change.

Also applies to: 1017-1033

nemo_rl/models/policy/megatron_policy_worker.py (1)

103-107: New utilities imported for TP/CP top‑k

Appropriate dependencies for distributed top‑k and CP allgather.
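The CP all-gather can be illustrated with the load-balanced sharding scheme Megatron-style context parallelism typically uses (each rank holds one chunk from the front of the sequence and its mirror from the back). The layout is an assumption for illustration, shown on plain lists rather than tensors:

```python
def cp_shard(seq, cp_size):
    # Split into 2*cp_size chunks; rank i holds chunks i and (2*cp_size - 1 - i),
    # so every rank sees a balanced mix of early and late positions.
    n = 2 * cp_size
    size = len(seq) // n
    chunks = [seq[j * size:(j + 1) * size] for j in range(n)]
    return [chunks[i] + chunks[n - 1 - i] for i in range(cp_size)]

def cp_allgather(shards):
    # Inverse of cp_shard: gather all shards and restore the original order
    # (front halves in rank order, back halves in reverse rank order).
    half = len(shards[0]) // 2
    front = [s[:half] for s in shards]
    back = [s[half:] for s in shards]
    return sum(front, []) + sum(reversed(back), [])
```

This is why the gather cannot be a plain concatenation: the per-rank back halves must be reordered to reconstruct the original sequence, which is what the production allgather_cp_sharded_tensor has to account for.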

tests/functional/L1_Functional_Tests_GPU.sh (1)

33-33: Include Megatron distillation functional test

Good addition; consistent with uv-run usage across the suite.

tests/test_suites/nightly.txt (1)

95-96: Verification confirms nightly entry is correct

All checks passed:

  • Script file exists and is executable
  • Entry is present in nightly.txt
  • Path format matches existing entries
tests/functional/distillation_megatron.sh (1)

20-27: Harden script: quote args/paths and guard cd (SC2068/SC2164).

Prevents re-splitting and failing silently on directory changes.

```diff
-rm -rf $EXP_DIR $LOG_DIR
-mkdir -p $EXP_DIR $LOG_DIR
+rm -rf "$EXP_DIR" "$LOG_DIR"
+mkdir -p "$EXP_DIR" "$LOG_DIR"

-cd $PROJECT_ROOT
+cd "$PROJECT_ROOT" || exit 1
 uv run coverage run -a --data-file=$PROJECT_ROOT/tests/.coverage --source=$PROJECT_ROOT/nemo_rl \
-    $PROJECT_ROOT/examples/run_distillation_math.py \
-    --config $PROJECT_ROOT/examples/configs/distillation_math_megatron.yaml \
+    "$PROJECT_ROOT/examples/run_distillation_math.py" \
+    --config "$PROJECT_ROOT/examples/configs/distillation_math_megatron.yaml" \
     policy.model_name=Qwen/Qwen3-0.6B-Base \
     teacher.model_name=Qwen/Qwen3-0.6B \
     cluster.gpus_per_node=2 \
@@
-    logger.log_dir=$LOG_DIR \
+    logger.log_dir="$LOG_DIR" \
@@
-    checkpointing.checkpoint_dir=/tmp/distillation_checkpoints \
-    $@ \
-    2>&1 | tee $RUN_LOG
+    checkpointing.checkpoint_dir=/tmp/distillation_checkpoints \
+    "$@" \
+    2>&1 | tee "$RUN_LOG"

-uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
+uv run tests/json_dump_tb_logs.py "$LOG_DIR" --output_path "$JSON_METRICS"

-uv run tests/check_metrics.py $JSON_METRICS \
+uv run tests/check_metrics.py "$JSON_METRICS" \
   'data["train/loss"]["3"] < 1.0'
```
Also applies to: 23-53, 55-58

⛔ Skipped due to learnings
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
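The difference the quoting fix makes can be shown in a few lines of portable shell (a standalone demonstration, not taken from the test scripts):

```shell
#!/bin/sh
# Demonstrates why "$@" preserves arguments while unquoted $@ re-splits them.
count_args() { echo "$#"; }

set -- "a b" "c"     # two arguments; the first contains a space
count_args "$@"      # prints 2: both arguments arrive intact
count_args $@        # prints 3: "a b" is re-split into "a" and "b"
```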
tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh (1)

16-31: Quote arguments and guard cd (SC2068, SC2164).

Prevents word-splitting and failure-on-cd issues.

```diff
-cd $PROJECT_ROOT
+cd "$PROJECT_ROOT" || exit 1
 uv run examples/run_distillation_math.py \
-    --config $CONFIG_PATH \
-    distillation.max_num_steps=$MAX_STEPS \
+    --config "$CONFIG_PATH" \
+    distillation.max_num_steps="$MAX_STEPS" \
     distillation.val_period=20 \
-    logger.log_dir=$LOG_DIR \
+    logger.log_dir="$LOG_DIR" \
     logger.wandb_enabled=True \
     logger.wandb.project=nemo-rl-distillation \
-    logger.wandb.name=$EXP_NAME \
+    logger.wandb.name="$EXP_NAME" \
     logger.monitor_gpus=True \
     logger.tensorboard_enabled=True \
     checkpointing.enabled=True \
-    checkpointing.checkpoint_dir=$CKPT_DIR \
-    $@ \
-    2>&1 | tee $RUN_LOG
+    checkpointing.checkpoint_dir="$CKPT_DIR" \
+    "$@" \
+    2>&1 | tee "$RUN_LOG"

-uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS
+uv run tests/json_dump_tb_logs.py "$LOG_DIR" --output_path "$JSON_METRICS"

-if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
-    uv run tests/check_metrics.py $JSON_METRICS \
+if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' "$JSON_METRICS") -ge "$MAX_STEPS" ]]; then
+    uv run tests/check_metrics.py "$JSON_METRICS" \
         'data["train/loss"]["1"] < 1.5' \
         'data["train/loss"]["10"] < 0.5' \
         'max(data["ray/node.0.gpu.0.mem_gb"]) < 75' \
         'mean(data["timing/train/total_step_time"], -6, -1) < 500'
 fi
```

Also applies to: 33-42

⛔ Skipped due to learnings
Learnt from: zpqiu
PR: NVIDIA-NeMo/RL#1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
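The jq pipeline in the gate above finds the highest step at which train/loss was logged; the same check in Python, assuming the metrics dump is a JSON object mapping metric names to {step-string: value} (an assumed shape, inferred from how the jq query indexes it):

```python
import json

# Hypothetical metrics dump; the exact shape is an assumption based on the
# jq query (metric name -> {step as string -> value}).
dump = '{"train/loss": {"1": 1.2, "10": 0.4}, "train/lr": {"1": 3e-5}}'
metrics = json.loads(dump)

# Equivalent of: jq '... select(.key == "train/loss") | .value | keys
#                    | map(tonumber) | max'
max_logged_step = max(int(step) for step in metrics["train/loss"])
print(max_logged_step)  # → 10
```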
examples/configs/distillation_math.yaml (1)

7-7: Documented defaults for new config keys are complete per guidelines.

The `max_num_epochs: 10` parameter and the entire `megatron_cfg` structure include explicit defaults. This satisfies the requirement that exemplar configs under `examples/configs/*.yaml` document recommended defaults for new config keys.

Also applies to: 84-150
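For reference, the shape being checked looks roughly like this (an illustrative excerpt with placeholder values; consult examples/configs/distillation_math.yaml for the authoritative settings):

```yaml
distillation:
  max_num_steps: 500      # illustrative value only
  max_num_epochs: 10      # new epoch-based bound with an explicit default
policy:
  megatron_cfg:
    enabled: false        # base config keeps the non-Megatron backend
```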

examples/configs/distillation_math_megatron.yaml (2)

38-104: MEGATRON_BASE configuration inherits and properly enables Megatron backend.

The `megatron_cfg` block is well defined, with all necessary settings for Megatron distributed training, context parallelism, and MoE routing. The `enabled: true` flag (line 39) correctly activates the Megatron backend, in contrast to the base config's `enabled: false` (distillation_math.yaml:85).


1-1: Disregard this review comment—the config is correct for Megatron.

The Megatron code path does not read `policy.optimizer` or `policy.scheduler`. When `megatron_cfg.enabled=true`, the training code instantiates `ConfigContainer` with the optimizer and scheduler pulled from `megatron_cfg` (lines 656–693 of megatron_policy_worker.py), which are explicitly defined in distillation_math_megatron.yaml (lines 54–96). The absent `policy.optimizer` and `policy.scheduler` keys are therefore not required; the configuration is correct and functional.

Likely an incorrect or invalid review comment.

Comment on lines +686 to 697
```diff
                 print("▶ Preparing for teacher logprob inference...", flush=True)
                 with timer.time("teacher_logprob_inference_prep"):
                     teacher_policy.prepare_for_lp_inference()

-                print("▶ Computing teacher logprobs...")
+                print("▶ Computing teacher logprobs...", flush=True)
                 with timer.time("teacher_logprob_inference"):
                     teacher_topk = teacher_policy.get_topk_logits(
                         train_data, k=master_config["distillation"]["topk_logits_k"]
                     )
                     train_data["teacher_topk_logits"] = teacher_topk["topk_logits"]
                     train_data["teacher_topk_indices"] = teacher_topk["topk_indices"]
```
train_data["teacher_topk_indices"] = teacher_topk["topk_indices"]


⚠️ Potential issue | 🟡 Minor

Fix log message: it’s top‑k logits, not logprobs

These prints say “logprob” but the code calls get_topk_logits. Adjust to avoid confusion.

```diff
-                print("▶ Preparing for teacher logprob inference...", flush=True)
+                print("▶ Preparing for teacher top‑k inference...", flush=True)
@@
-                print("▶ Computing teacher logprobs...", flush=True)
+                print("▶ Computing teacher top‑k logits...", flush=True)
```

Also applies to: 690-695

🤖 Prompt for AI Agents
In nemo_rl/algorithms/distillation.py around lines 686-697 (and similarly
690-695), the printed messages incorrectly refer to "logprob" while the code is
computing top-k logits via get_topk_logits; update the print statements to
accurately say "top-k logits" (e.g., "Preparing for teacher top-k logits
inference..." and "Computing teacher top-k logits...") so logs match the
operation and avoid confusion.
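Beyond the naming nit, the shape of what get_topk_logits feeds into can be sketched with stdlib Python (a toy per-position illustration; the real implementation operates on batched tensors with TP/CP sharding, and the actual distillation loss in the repo may differ):

```python
import heapq
import math

def topk_logits(logits: list[float], k: int) -> tuple[list[float], list[int]]:
    """Top-k values and vocabulary indices -- a stdlib stand-in for the
    torch.topk call a tensor implementation would make."""
    idx = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])
    return [logits[i] for i in idx], idx

def topk_kl(teacher: list[float], student: list[float], k: int) -> float:
    """KL(teacher || student) restricted to the teacher's top-k entries,
    with both distributions renormalized over those k entries."""
    def softmax(xs: list[float]) -> list[float]:
        m = max(xs)
        e = [math.exp(x - m) for x in xs]
        z = sum(e)
        return [v / z for v in e]

    t_vals, t_idx = topk_logits(teacher, k)
    s_vals = [student[i] for i in t_idx]  # gather student logits at teacher's indices
    p, q = softmax(t_vals), softmax(s_vals)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1, -1.0]
print(topk_kl(teacher, teacher, k=2))                    # → 0.0 (identical distributions)
print(topk_kl(teacher, [0.0, 2.0, 0.1, -1.0], k=2) > 0)  # mismatch gives positive KL
```

The student's logits are gathered at the teacher's top-k indices, mirroring how teacher_topk_indices is stored alongside teacher_topk_logits in the training batch above.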

@terrykong
Copy link
Collaborator

@zpqiu to take a look at test failure

@zpqiu zpqiu added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 28, 2025
@zpqiu zpqiu added Run CICD and removed Run CICD labels Oct 28, 2025
@terrykong terrykong removed the CI:L1 Run doctests, unit tests, and functional tests label Oct 28, 2025
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 28, 2025
@terrykong terrykong merged commit 85a87a0 into r0.4.0 Oct 29, 2025
40 of 41 checks passed
@terrykong terrykong deleted the cherry-pick-1324-r0.4.0 branch October 29, 2025 05:04
terrykong pushed a commit that referenced this pull request Nov 19, 2025
…to `r0.4.0` (#1398)

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Signed-off-by: alexchiu <qiuzhaopeng@foxmail.com>
Signed-off-by: alexchiu <alexq@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: alexchiu <alexq@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>