fix: load HF model only on rank 0 #544

Merged

parthchadha merged 19 commits into main from pchadha/large-model-state-dict-load

Jul 2, 2025

Contributor

parthchadha commented Jun 24, 2025 (edited)

What does this PR do?

Loads the Hugging Face model only on rank 0, then uses the FSDP2 set_model_state_dict API to load the weights onto the other ranks after the model has been parallelized.
Note that this PR still leads to GPU OOM for a 70B model on 1 node with DTensor; this will be fixed in a separate PR.
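
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `state_dict_for_rank` and `load_into_sharded_model` and the `load_full` callable are hypothetical, and the sketch assumes a PyTorch version whose `torch.distributed.checkpoint.state_dict.StateDictOptions` supports `broadcast_from_rank0`, an initialized process group, and a model that has already been sharded.

```python
from typing import Any, Callable, Dict


def state_dict_for_rank(
    rank: int, load_full: Callable[[], Dict[str, Any]]
) -> Dict[str, Any]:
    """Return the full checkpoint on rank 0 and an empty dict elsewhere.

    `load_full` is a hypothetical stand-in for however rank 0 reads the
    Hugging Face weights into host memory.
    """
    # Only rank 0 pays the host-memory cost of materializing the weights.
    return load_full() if rank == 0 else {}


def load_into_sharded_model(model: Any, full_state_dict: Dict[str, Any]) -> None:
    """Broadcast rank 0's full state dict into an already-parallelized model.

    Assumes torch.distributed is initialized and `model` was sharded
    before this call.
    """
    # Imported lazily so the rank-gating helper above has no torch dependency.
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        set_model_state_dict,
    )

    options = StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)
    # Every rank must make this call; non-zero ranks pass an empty dict and
    # receive their shards from rank 0.
    set_model_state_dict(model, full_state_dict, options=options)
```

In a worker, each rank would call `state_dict_for_rank(rank, load_full)` and pass the result to `load_into_sharded_model`; the call is collective, so skipping it on any rank would hang the others.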

Issues

closes #279

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

parthchadha added 2 commits

June 18, 2025 23:15


Use set_model_state_dict and load model on rank 0

9c04fe4

Signed-off-by: Parth Chadha <pchadha@nvidia.com>


Merge remote-tracking branch 'origin/main' into pchadha/large-model-state-dict-load

parthchadha marked this pull request as ready for review

June 24, 2025 17:26

parthchadha requested review from SahilJain314 and terrykong

June 24, 2025 17:26

parthchadha added the CI:L0 label

parthchadha had a problem deploying to nemo-ci with GitHub Actions (Error)

June 24, 2025 17:27

terrykong reviewed

nemo_rl/models/policy/dtensor_policy_worker.py (outdated, resolved)

parthchadha force-pushed the pchadha/large-model-state-dict-load branch from aae142f to ebaeb99 Compare

June 24, 2025 17:42

parthchadha added CI:L0 and removed CI:L0 labels

parthchadha had a problem deploying to nemo-ci with GitHub Actions (Error)

June 24, 2025 17:43

terrykong reviewed

nemo_rl/models/policy/dtensor_policy_worker.py (resolved)


Fix use of model_config and remove duplicate args

fcec2db

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha force-pushed the pchadha/large-model-state-dict-load branch from ebaeb99 to fcec2db Compare

June 24, 2025 18:11

parthchadha added CI:L0 and removed CI:L0 labels

parthchadha temporarily deployed to nemo-ci with GitHub Actions (Inactive)

June 24, 2025 18:14

terrykong previously approved these changes

parthchadha enabled auto-merge

June 24, 2025 18:42


Merge remote-tracking branch 'origin/main' into pchadha/large-model-state-dict-load

9b7bd0f

parthchadha added this pull request to the merge queue

github-merge-queue bot removed this pull request from the merge queue due to failed status checks

parthchadha added this pull request to the merge queue

github-merge-queue bot removed this pull request from the merge queue due to failed status checks


Disable nccl shm to fix #564

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha dismissed terrykong’s stale review via

June 26, 2025 18:19


Merge remote-tracking branch 'origin/main' into pchadha/large-model-state-dict-load

dadea4a

parthchadha added CI:L0 and removed CI:L0 labels

parthchadha added the CI:L0 label

parthchadha temporarily deployed to nemo-ci with GitHub Actions (Inactive)

July 1, 2025 16:50

parthchadha enabled auto-merge

July 1, 2025 18:27

SahilJain314 previously approved these changes

parthchadha added this pull request to the merge queue

github-merge-queue bot removed this pull request from the merge queue due to failed status checks

parthchadha added this pull request to the merge queue

github-merge-queue bot removed this pull request from the merge queue due to failed status checks


Check if generation exists in config before accessing it

381707b

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha dismissed SahilJain314’s stale review via

381707b

July 2, 2025 05:44


Merge remote-tracking branch 'origin/main' into pchadha/large-model-state-dict-load

8e89bfb

parthchadha enabled auto-merge

July 2, 2025 05:47

SahilJain314 previously approved these changes

parthchadha added this pull request to the merge queue

github-merge-queue bot removed this pull request from the merge queue due to failed status checks


Update eval.yaml with colocated

140876e

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha dismissed SahilJain314’s stale review via

140876e

July 2, 2025 14:03

parthchadha enabled auto-merge

July 2, 2025 14:04


Merge branch 'main' into pchadha/large-model-state-dict-load

b226eef

SahilJain314 approved these changes

parthchadha added this pull request to the merge queue

Merged via the queue into main with commit be05b13

13 of 14 checks passed

parthchadha deleted the pchadha/large-model-state-dict-load branch

July 2, 2025 19:59

xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request


fix: load HF model only on rank 0 (NVIDIA-NeMo#544)

283074a

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

therealnaveenkamal pushed a commit to therealnaveenkamal/RL that referenced this pull request


fix: load HF model only on rank 0 (NVIDIA-NeMo#544)

09331d8

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha mentioned this pull request

FSDP2+TP2 demo script does not work #621

Closed

YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request


fix: load HF model only on rank 0 (NVIDIA-NeMo#544)

c17e4cb

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request


fix: load HF model only on rank 0 (NVIDIA-NeMo#544)

eba5e76

Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>

KiddoZhu pushed a commit that referenced this pull request


fix: load HF model only on rank 0 (#544)

bab3f22

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

Labels

CI:L0