@gshennvm

@gshennvm gshennvm commented Jun 4, 2025

Adds nm5 (nemotron5) sharding. It supports 32K context in my small-script testing.

There is no need to shard the mamba layers; activation checkpointing is enough. This PR shards only the MLPs.
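To make the "shard only the MLPs" idea concrete, here is a minimal sketch of how such a plan could be assembled. It is not the code from this PR: the helper `build_mlp_only_plan`, the module paths (`backbone.layers.<i>.mixer.up_proj` / `down_proj`), and the `"colwise"` / `"rowwise"` labels are all assumptions chosen for illustration. The point is only that mamba (and any other non-MLP) blocks are skipped, while each MLP gets a column-wise split on its first projection and a row-wise split on its second.

```python
from typing import Dict, List


def build_mlp_only_plan(
    block_types: List[str],
    prefix: str = "backbone.layers",  # hypothetical module path, not from the PR
) -> Dict[str, str]:
    """Return a tensor-parallel plan that touches only MLP blocks.

    block_types lists the kind of each layer in order, e.g.
    ["mamba", "mlp", ...]. Every name and label here is illustrative.
    """
    plan: Dict[str, str] = {}
    for index, kind in enumerate(block_types):
        if kind != "mlp":
            # Non-MLP blocks (mamba, attention, ...) are left unsharded;
            # activation checkpointing is assumed to cover their memory cost.
            continue
        # Split the up projection by output columns and the down projection
        # by input rows, so one reduction per MLP recovers the full output.
        plan[f"{prefix}.{index}.mixer.up_proj"] = "colwise"
        plan[f"{prefix}.{index}.mixer.down_proj"] = "rowwise"
    return plan


plan = build_mlp_only_plan(["mamba", "mlp", "attention", "mlp"])
for name, style in plan.items():
    print(name, style)
```

In a real setup the string labels would be swapped for the parallel-style objects of whatever tensor-parallel API is in use; the plan above is only a name-to-style mapping.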

@gshennvm gshennvm changed the title Geshen/nm5 add nemotron5 sharding Jun 4, 2025
@gshennvm gshennvm self-assigned this Jun 4, 2025
@gshennvm gshennvm changed the title add nemotron5 sharding feat: add nemotron5 sharding Jun 4, 2025
@gshennvm gshennvm requested a review from terrykong June 4, 2025 20:36
@gshennvm gshennvm added the CI:L0 Run doctests and unit tests label Jun 4, 2025
@gshennvm gshennvm added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 4, 2025
@terrykong

@gshennvm do you mind sharing the command and the wandb plots for posterity?

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Jul 15, 2025
@terrykong terrykong added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 26, 2025
@terrykong terrykong force-pushed the geshen/nm5 branch 3 times, most recently from c809cdb to a95b9d6 Compare July 31, 2025 05:55
@terrykong terrykong enabled auto-merge August 4, 2025 06:14
terrykong previously approved these changes Aug 4, 2025
@terrykong terrykong added this pull request to the merge queue Aug 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 4, 2025
@terrykong terrykong added research Tag for research team's issues and removed Run CICD labels Aug 7, 2025
@terrykong

I was putting out other fires and will come back to this PR soon. For context, the issue now is that we create a dummy mamba model to run the unit tests, but the mamba model class unfortunately cannot even be imported unless mamba is installed.

This needs some thought.
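One common way to handle an optional heavy dependency in unit tests is to probe for it without importing it and skip the affected tests when it is missing. The sketch below shows that pattern only; it is not what this PR ended up doing (the later commits add the packages as dependencies instead). The helper `has_module` and the flag `MAMBA_AVAILABLE` are hypothetical names, and the import names `mamba_ssm` and `causal_conv1d` are assumed to correspond to the mamba-ssm and causal-conv1d packages.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumed import names for the mamba-ssm and causal-conv1d packages.
MAMBA_AVAILABLE = has_module("mamba_ssm") and has_module("causal_conv1d")

# In a pytest suite, the flag could gate the tests that build the dummy model:
#
#   @pytest.mark.skipif(not MAMBA_AVAILABLE, reason="mamba-ssm not installed")
#   def test_parallelize_dummy_mamba_model(): ...

print("mamba available:", MAMBA_AVAILABLE)
```

The trade-off is that skipped tests give no coverage on machines without the packages, which is one reason installing the dependencies in CI can be preferable.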

- Add mamba-ssm and causal-conv1d dependencies to automodel and vllm extras
- Configure git sources for mamba-ssm and causal-conv1d packages
- Add no-build-isolation for mamba-ssm and causal-conv1d
- Implement _parallelize_nm5_h function for NemotronHForCausalLM parallelization
- Update related unit tests for new parallelization functionality

Signed-off-by: Terry Kong <[email protected]>

fix stuff

Signed-off-by: Terry Kong <[email protected]>

gerald's fix for 32k

Signed-off-by: Terry Kong <[email protected]>

fix the tests

Signed-off-by: Terry Kong <[email protected]>
terrykong previously approved these changes Aug 11, 2025
@terrykong terrykong enabled auto-merge August 11, 2025 07:18
@terrykong terrykong added this pull request to the merge queue Aug 11, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 11, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 12, 2025
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
@terrykong terrykong enabled auto-merge August 12, 2025 01:17
@terrykong terrykong added this pull request to the merge queue Aug 12, 2025
Merged via the queue into main with commit 223bfa8 Aug 12, 2025
19 checks passed
@terrykong terrykong deleted the geshen/nm5 branch August 12, 2025 03:35
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
Signed-off-by: Terry Kong <[email protected]>
Co-authored-by: Terry Kong <[email protected]>
Signed-off-by: Qidong Su <[email protected]>
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
commit b246e55
Author: Youngeun Kwon <[email protected]>
Date:   Mon Aug 25 15:05:48 2025 -0700

    update the script

    Signed-off-by: Youngeun Kwon <[email protected]>

commit 5315a6b
Author: Youngeun Kwon <[email protected]>
Date:   Mon Aug 25 13:59:16 2025 -0700

    script update

    Signed-off-by: Youngeun Kwon <[email protected]>

commit 4437402
Author: Youngeun Kwon <[email protected]>
Date:   Tue Jul 15 17:42:23 2025 -0700

    local

    Signed-off-by: Youngeun Kwon <[email protected]>

    wip

    Signed-off-by: Youngeun Kwon <[email protected]>

    add script

    Signed-off-by: Youngeun Kwon <[email protected]>

    update script

    Signed-off-by: Youngeun Kwon <[email protected]>

    update script

    Signed-off-by: Youngeun Kwon <[email protected]>

    interactive

    Signed-off-by: Youngeun Kwon <[email protected]>

commit b721703
Author: Charlie Truong <[email protected]>
Date:   Mon Aug 18 11:22:54 2025 -0500

    build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936)

    Signed-off-by: Charlie Truong <[email protected]>

commit 70b9666
Author: Charlie Truong <[email protected]>
Date:   Sun Aug 17 21:17:58 2025 -0500

    build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897)

    Signed-off-by: Charlie Truong <[email protected]>

commit df31c1b
Author: pjin-nvidia <[email protected]>
Date:   Thu Aug 14 18:34:50 2025 -0700

    feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918)

    Signed-off-by: Peter Jin <[email protected]>

commit 83c6bfc
Author: yuki <[email protected]>
Date:   Thu Aug 14 21:48:55 2025 +0800

    refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

    Signed-off-by: Yuki Huang <[email protected]>

commit 9f7825e
Author: Rayen <[email protected]>
Date:   Thu Aug 14 12:38:27 2025 +0800

    feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879)

    Signed-off-by: ruit <[email protected]>

commit e1f56c4
Author: Terry Kong <[email protected]>
Date:   Tue Aug 12 13:09:37 2025 -0700

    feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896)

    Signed-off-by: Terry Kong <[email protected]>

commit 223bfa8
Author: Gerald Shen <[email protected]>
Date:   Mon Aug 11 18:19:52 2025 -0700

    feat: add nemotron5 sharding (NVIDIA-NeMo#481)

    Signed-off-by: Terry Kong <[email protected]>
    Co-authored-by: Terry Kong <[email protected]>

commit 18b9e2c
Author: Terry Kong <[email protected]>
Date:   Mon Aug 11 15:08:52 2025 -0700

    test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880)

    Signed-off-by: Terry Kong <[email protected]>

commit 8fd8c96
Author: guyueh1 <[email protected]>
Date:   Mon Aug 11 10:46:29 2025 -0700

    feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865)

    Signed-off-by: Guyue Huang <[email protected]>

commit 2b87def
Author: Qidong Su <[email protected]>
Date:   Fri Aug 8 18:54:20 2025 -0400

    fix: OOM in deepscaler1.5b with sequence length = 16/24k  (NVIDIA-NeMo#875)

    Signed-off-by: Qidong Su <[email protected]>

commit fecf71e
Author: Rayen <[email protected]>
Date:   Sat Aug 9 06:42:07 2025 +0800

    fix: remove tie weight check (NVIDIA-NeMo#700)

    Signed-off-by: ruit <[email protected]>

commit d45ff3f
Author: Terry Kong <[email protected]>
Date:   Fri Aug 8 10:07:02 2025 -0700

    test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866)

    Signed-off-by: Terry Kong <[email protected]>

commit d73c942
Author: Anna Shors <[email protected]>
Date:   Fri Aug 8 09:27:15 2025 -0700

    feat: qwen3 export to HF (NVIDIA-NeMo#873)

    Signed-off-by: Abdalgader Abubaker <[email protected]>
    Signed-off-by: Anna Shors <[email protected]>
    Co-authored-by: Abdalgader Abubaker <[email protected]>

commit e924d33
Author: Shang Wang <[email protected]>
Date:   Fri Aug 8 12:15:34 2025 -0400

    docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837)

    Signed-off-by: Shang Wang <[email protected]>

commit bbbb3d6
Author: yuki <[email protected]>
Date:   Fri Aug 8 23:26:15 2025 +0800

    fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861)

    Signed-off-by: Yuki Huang <[email protected]>

commit 88a399e
Author: yuki <[email protected]>
Date:   Fri Aug 8 14:04:08 2025 +0800

    chore: remove old fsdp1 unit test (NVIDIA-NeMo#871)

    Signed-off-by: Yuki Huang <[email protected]>

commit b8a89a9
Author: yuki <[email protected]>
Date:   Fri Aug 8 13:56:19 2025 +0800

    feat: support non-colocated in mcore (NVIDIA-NeMo#613)

    Signed-off-by: Yuki Huang <[email protected]>

commit 5910abb
Author: Anna Shors <[email protected]>
Date:   Thu Aug 7 13:11:43 2025 -0700

    feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798)

    Signed-off-by: ashors1 <[email protected]>

commit 0988a7d
Author: Felipe Vieira Frujeri <[email protected]>
Date:   Wed Aug 6 22:01:32 2025 -0700

    fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633)

    Signed-off-by: Felipe Vieira Frujeri <[email protected]>

commit 233cc07
Author: Parth Chadha <[email protected]>
Date:   Wed Aug 6 15:14:22 2025 -0700

    fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857)

    Signed-off-by: Parth Chadha <[email protected]>

commit 0557402
Author: Terry Kong <[email protected]>
Date:   Wed Aug 6 14:44:29 2025 -0700

    chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840)

    Signed-off-by: Terry Kong <[email protected]>

commit 03472a0
Author: Terry Kong <[email protected]>
Date:   Wed Aug 6 14:43:55 2025 -0700

    feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799)

    Signed-off-by: Terry Kong <[email protected]>

commit 9af0a52
Author: Anna Shors <[email protected]>
Date:   Wed Aug 6 12:35:51 2025 -0700

    fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844)

    Signed-off-by: ashors1 <[email protected]>

commit b6269f7
Author: Yubo Gao <[email protected]>
Date:   Tue Aug 5 16:55:02 2025 -0400

    feat: track policy training compute throughput (NVIDIA-NeMo#632)

    Signed-off-by: Yubo Gao <[email protected]>

commit b74c5d0
Author: Wei Du <[email protected]>
Date:   Tue Aug 5 15:05:13 2025 -0500

    feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734)

    Signed-off-by: Wei Du <[email protected]>
    Signed-off-by: Terry Kong <[email protected]>
    Co-authored-by: Terry Kong <[email protected]>

commit c784dd9
Author: Zhiyu Li <[email protected]>
Date:   Tue Aug 5 10:47:30 2025 -0700

    feat: add data shuffle and random seed option (NVIDIA-NeMo#334)

    Signed-off-by: Zhiyu Li <[email protected]>
    Signed-off-by: Zhiyu Li <[email protected]>

commit c249efc
Author: Abdalgader Abubaker <[email protected]>
Date:   Tue Aug 5 21:33:28 2025 +0400

    docs: fix checkpointing command for megatron->hf export  (NVIDIA-NeMo#823)

    Signed-off-by: abdalgader-a <[email protected]>

Signed-off-by: Youngeun Kwon <[email protected]>
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
Signed-off-by: Terry Kong <[email protected]>
Co-authored-by: Terry Kong <[email protected]>
Signed-off-by: Julien Veron Vialard <[email protected]>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Terry Kong <[email protected]>
Co-authored-by: Terry Kong <[email protected]>

Labels

CI:L0 Run doctests and unit tests documentation Improvements or additions to documentation research Tag for research team's issues
