[skyrl-train] Megatron LoRA#743
Conversation
Code Review
This pull request introduces LoRA (Low-Rank Adaptation) support for Megatron, a significant feature for efficient model fine-tuning. The changes span Docker configuration, project dependencies, and the core Megatron worker implementation. The implementation correctly adds LoRA configuration and applies it via a pre-wrap hook. The tests have also been updated to cover the new LoRA functionality. My review includes a few suggestions to improve Dockerfile efficiency and minor code style fixes.
```dockerfile
RUN sudo apt-get update && sudo apt-get install -y \
    cmake \
    ninja-build \
    libnccl-dev=2.25.1-1+cuda12.8 \
    libnccl2=2.25.1-1+cuda12.8 \
    && sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
```
```dockerfile
RUN pip install --no-cache-dir \
    nvidia-mathdx \
    pybind11 \
    setuptools \
    wheel

RUN pip install --no-cache-dir --no-build-isolation "transformer_engine[pytorch]==2.9.0"
```
To reduce the number of layers in the Docker image and improve build efficiency, it's better to chain the `pip install` commands within a single `RUN` instruction using `&&`:
```dockerfile
RUN pip install --no-cache-dir \
    nvidia-mathdx \
    pybind11 \
    setuptools \
    wheel && \
    pip install --no-cache-dir --no-build-isolation "transformer_engine[pytorch]==2.9.0"
```
```python
        return lora_model

self.provider.register_pre_wrap_hook(lora_pre_wrap_hook)
```
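The registration above follows a pre-wrap hook pattern: the provider collects hooks and runs each one on the model just before it is wrapped for distributed training, which is the point where LoRA adapters can be attached. A minimal sketch of that pattern, with hypothetical `Provider` and hook names (not the actual Megatron-Bridge API):

```python
# Sketch of a pre-wrap hook registry. Hooks run in registration order and
# may return a transformed model; real code would wrap the result for
# distributed training afterwards.
class Provider:
    def __init__(self):
        self._pre_wrap_hooks = []

    def register_pre_wrap_hook(self, hook):
        self._pre_wrap_hooks.append(hook)

    def wrap(self, model):
        for hook in self._pre_wrap_hooks:
            model = hook(model)  # each hook may replace/augment the model
        return model


def lora_pre_wrap_hook(model):
    # stand-in for applying LoRA adapters; returns the adapted model
    return {"base": model, "lora": "applied"}


provider = Provider()
provider.register_pre_wrap_hook(lora_pre_wrap_hook)
print(provider.wrap("megatron_model"))  # {'base': 'megatron_model', 'lora': 'applied'}
```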
/gemini review |
Code Review
This pull request introduces significant new functionality by adding LoRA support for Megatron, including new example scripts, core logic changes for distributed training strategies and workers, and updated tests. The implementation for LoRA support appears solid. My review focuses on improving the robustness and clarity of the new example scripts and project configuration. I've identified a critical issue with undefined variables in one script that will cause it to fail, and several medium-severity issues related to configuration consistency, dependency management, and script robustness.
skyrl-train/examples/megatron/run_megatron_lora_qwen3-30b-a3b.sh (Outdated)
skyrl-train/pyproject.toml (Outdated)
```diff
  { url = "https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl", marker = "extra == 'sglang' and extra != 'mcore' and extra != 'vllm'" }
  ]
- megatron-bridge = {git = "https://github.com/NVIDIA-NeMo/Megatron-Bridge", rev = "22ef9ff9f9684ba2f2dbea14db974f5c31bbd683"}
+ megatron-bridge = {git = "https://github.com/NVIDIA-NeMo/Megatron-Bridge", branch = "feature/peft-recompute-hook"}
```
Pinning a dependency to a git branch can lead to non-reproducible builds and instability, as the branch can be updated or deleted. It's better to pin to a specific commit hash from that branch to ensure stability. The PR description mentions that this is awaiting a merge, which has now happened. Please update this to a specific commit hash or a released version.
```diff
- megatron-bridge = {git = "https://github.com/NVIDIA-NeMo/Megatron-Bridge", branch = "feature/peft-recompute-hook"}
+ megatron-bridge = {git = "https://github.com/NVIDIA-NeMo/Megatron-Bridge", rev = "<commit_hash_from_feature_branch>"}
```
skyrl-train/retry.sh (Outdated)
```shell
SCRIPT_PATH=$1
LOG_FILE=$2
state=error
error_string="unexpected system error"
echo "Clearing log file first $LOG_FILE"
> $LOG_FILE
while [ $state == 'error' ]; do
    echo "(Re)Running the script $SCRIPT_PATH"
    bash $SCRIPT_PATH >> $LOG_FILE 2>&1
    # script returned, let's check if there was an error
    last_lines=$(tail -n 150 $LOG_FILE)
    if [[ ! "$last_lines" =~ "$error_string" ]]; then
        state=noterror
    fi
done
```
This script is not robust against file paths with spaces. Variables like `$SCRIPT_PATH` and `$LOG_FILE` should be quoted when used to prevent word splitting, and the `while` loop condition should quote its variable as well.

For example:

- `SCRIPT_PATH="$1"`
- `> "$LOG_FILE"`
- `while [ "$state" == "error" ]` or `while [[ "$state" == "error" ]]`
- `bash "$SCRIPT_PATH" >> "$LOG_FILE" 2>&1`
Suggested change:

```shell
SCRIPT_PATH="$1"
LOG_FILE="$2"
state=error
error_string="unexpected system error"
echo "Clearing log file first $LOG_FILE"
> "$LOG_FILE"
while [[ "$state" == "error" ]]; do
    echo "(Re)Running the script $SCRIPT_PATH"
    bash "$SCRIPT_PATH" >> "$LOG_FILE" 2>&1
    # script returned, let's check if there was an error
    last_lines=$(tail -n 150 "$LOG_FILE")
    if [[ ! "$last_lines" =~ "$error_string" ]]; then
        state=noterror
    fi
done
```
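The quoting issue the review describes is easy to see in isolation. A quick demonstration (with a hypothetical path containing a space): an unquoted expansion is split on whitespace into two words, so commands receive two arguments instead of one path.

```shell
LOG_FILE="/tmp/my log.txt"

set -- $LOG_FILE                # unquoted: word splitting applies
echo "unquoted arg count: $#"   # prints 2

set -- "$LOG_FILE"              # quoted: the path stays one argument
echo "quoted arg count: $#"     # prints 1
```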
/gemini review |
Code Review
This pull request introduces LoRA training support for the Megatron backend, which is a significant enhancement. The changes are comprehensive, touching upon model configuration, distributed strategy, worker implementation, and testing. The custom checkpointing for LoRA adapters is well-implemented, with clear logic for saving and loading adapter weights separately. The approach of merging LoRA parameters into the base model for vLLM compatibility is a practical solution for now.
I've identified a couple of areas for improvement:

- In `skyrl-train/skyrl_train/workers/megatron/megatron_worker.py`, there's an opportunity to refactor the `configure_lora` method to reduce code duplication.
- The new test `test_megatron_lora_forward` in `skyrl-train/tests/gpu/test_megatron_worker.py` seems incomplete, as it calculates a difference but doesn't assert its value.

Overall, this is a great contribution. Addressing these minor points will improve the code's maintainability and test coverage.
```python
max_diff = torch.max(torch.abs(action_log_probs_full - action_log_probs_lora))
print(f"Max diff: {max_diff}")
```
The docstring for this test states that it checks for numerical equivalence, but there is no assertion to verify this. Since LoRA B weights are initialized to zero, the forward pass of the LoRA model should be identical to the base model before any training. Please add an assertion to confirm that the difference is zero or very close to it.
```diff
  max_diff = torch.max(torch.abs(action_log_probs_full - action_log_probs_lora))
  print(f"Max diff: {max_diff}")
+ assert torch.allclose(action_log_probs_full, action_log_probs_lora, atol=1e-6)
```
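The reason this assertion should hold can be shown with a toy example. In LoRA, the adapted forward pass is `y = Wx + (alpha/r) * B(Ax)`, and `B` is initialized to zero, so the adapter contribution vanishes before any training. A pure-Python sketch with made-up dimensions (not the actual test fixtures):

```python
# Toy LoRA forward pass: with B initialized to zero, the output is
# identical to the base model's output, so max_diff should be 0.
def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # adapter path: B @ (A @ x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 2.0], [3.0, 4.0]]          # base weight
A = [[0.5, -0.5], [0.25, 0.75]]       # A: arbitrary (random) init
B_zero = [[0.0, 0.0], [0.0, 0.0]]     # B: zero init, as in LoRA
x = [1.0, -2.0]

base_out = matvec(W, x)
lora_out = lora_forward(W, A, B_zero, x)
max_diff = max(abs(a - b) for a, b in zip(base_out, lora_out))
print(max_diff)  # 0.0 -- identical to the base model before training
```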
fixed it later
Enables LoRA training with the Megatron Backend. Currently waiting for NVIDIA-NeMo/Megatron-Bridge#1762 to be merged into main, so we can at least pin a commit rather than a branch for stability.

- Adds [LoRA](https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html) support via Megatron-Bridge
- Adds custom checkpointing for LoRA model parameters (until LoRA checkpointing logic is upstreamed to Megatron-Bridge).
- Weight syncing logic for Megatron + LoRA is handled by merging the LoRA parameters back into the base model before exporting to vLLM. This means that for Megatron LoRA (for now), LoRA does not have to be configured for vLLM.

## Examples

GSM8K for Qwen3-30B-MoE and Qwen3-0.6B converging:

<img width="1087" height="808" alt="image" src="https://github.com/user-attachments/assets/95e03b75-4a8c-4734-8f55-2cf535b04876" />

- Qwen3-30B-A3B previously required 2 H100 nodes for full-parameter fine-tuning; with LoRA we can increase the batch size compared to previous runs on just 1 H100 node!

### DAPO Qwen-4B

With TIS, the Megatron dense backend can match or exceed FSDP backend performance. TIS is especially important for the current version of LoRA. Canonical LoRA seems to perform worse than "performant LoRA", or may be more sensitive to learning rate.

<img width="1214" height="814" alt="image" src="https://github.com/user-attachments/assets/4c2d2b37-f835-4e53-ac54-7e54812b6006" />

Blockers/TODOs:

- [x] ~~For dense models, LoRA results in low grad norm / 0 ppo_clip_ratio unless pp > 1. Something in megatron-core or megatron-bridge is broken for dense models.~~ Issue tracked on Megatron-Bridge (NVIDIA-NeMo/Megatron-Bridge#1750), awaiting PR NVIDIA-NeMo/Megatron-Bridge#1762
- [x] Test out MoE models

## Future Work

- Once Megatron-Bridge support for exporting only LoRA parameters is done, we should support just syncing these to vLLM for lower communication cost
- Add support for other LoRA variants from Megatron-Bridge (canonical LoRA, QLoRA, DoRA).
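The merge-before-export weight-sync approach described above can be sketched in a few lines: fold the LoRA delta into the base weight as `W' = W + (alpha/r) * B @ A`, so the inference engine only ever sees plain dense weights. Names and shapes here are illustrative, not the actual Megatron-Bridge or skyrl-train API:

```python
# Toy merge of LoRA adapter weights into a base weight matrix.
# W: (out, in); B: (out, r); A: (r, in); merged W' = W + (alpha/r) * B @ A.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W, A, B, alpha, r):
    scale = alpha / r
    delta = matmul(B, A)  # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (out=2, in=2)
A = [[0.1, 0.2]]              # r=1 down-projection
B = [[1.0], [2.0]]            # r=1 up-projection
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices can be discarded for export, which is why vLLM needs no LoRA configuration in this setup.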