fix: load HF model only on rank 0 #544

Merged
parthchadha merged 19 commits into main from pchadha/large-model-state-dict-load
Jul 2, 2025

parthchadha (Contributor) commented Jun 24, 2025

What does this PR do?

Loads the HF model only on rank 0, then uses the FSDP2 set_model_state_dict API to load the weights onto the other ranks after the model has been parallelized.
Note: with this PR, a 70B model on 1 node with DTensor still runs out of GPU memory. That will be fixed in a separate PR.
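
A minimal sketch of the pattern described above, not the code from this PR. The helper names `load_full_state_dict_on_rank0` and `broadcast_into_sharded_model` are hypothetical. The sketch assumes PyTorch's `torch.distributed.checkpoint.state_dict.set_model_state_dict` is called with `StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)`; the PR description does not say which options are used.

```python
from typing import Any, Callable, Dict


def load_full_state_dict_on_rank0(
    rank: int, load_fn: Callable[[], Dict[str, Any]]
) -> Dict[str, Any]:
    """Run the expensive loader on rank 0 only; other ranks get an empty dict."""
    return load_fn() if rank == 0 else {}


def broadcast_into_sharded_model(model: Any, full_state_dict: Dict[str, Any]) -> None:
    """Fill an already-parallelized model from rank 0's full state dict.

    This is a collective: call it on every rank, after the default process
    group is initialized and the model has been sharded.
    """
    # Imported lazily so the helper above stays usable without PyTorch installed.
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        set_model_state_dict,
    )

    # Assumed options: rank 0 holds full (unsharded) tensors and the other
    # ranks receive their shards via broadcast.
    options = StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)
    set_model_state_dict(model, model_state_dict=full_state_dict, options=options)
```

On rank 0, `load_fn` might wrap something like `AutoModelForCausalLM.from_pretrained(...).state_dict()`. Every other rank passes the empty dict it received, so only one full copy of the checkpoint is read into host memory.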

Issues

closes #279

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@parthchadha parthchadha marked this pull request as ready for review June 24, 2025 17:26
@parthchadha parthchadha added the CI:L0 Run doctests and unit tests label Jun 24, 2025
@parthchadha parthchadha force-pushed the pchadha/large-model-state-dict-load branch from aae142f to ebaeb99 Compare June 24, 2025 17:42
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 24, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha force-pushed the pchadha/large-model-state-dict-load branch from ebaeb99 to fcec2db Compare June 24, 2025 18:11
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 24, 2025
terrykong previously approved these changes Jun 24, 2025
@parthchadha parthchadha enabled auto-merge June 24, 2025 18:42
@parthchadha parthchadha added this pull request to the merge queue Jun 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 24, 2025
@parthchadha parthchadha added this pull request to the merge queue Jun 25, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 25, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 26, 2025
@parthchadha parthchadha added the CI:L0 Run doctests and unit tests label Jul 1, 2025
@parthchadha parthchadha enabled auto-merge July 1, 2025 18:27
SahilJain314 previously approved these changes Jul 1, 2025
@parthchadha parthchadha added this pull request to the merge queue Jul 1, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 1, 2025
@parthchadha parthchadha added this pull request to the merge queue Jul 1, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 2, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha enabled auto-merge July 2, 2025 05:47
SahilJain314 previously approved these changes Jul 2, 2025
@parthchadha parthchadha added this pull request to the merge queue Jul 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 2, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
@parthchadha parthchadha added this pull request to the merge queue Jul 2, 2025
Merged via the queue into main with commit be05b13 Jul 2, 2025
13 of 14 checks passed
@parthchadha parthchadha deleted the pchadha/large-model-state-dict-load branch July 2, 2025 19:59
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jul 2, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
therealnaveenkamal pushed a commit to therealnaveenkamal/RL that referenced this pull request Jul 7, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jul 14, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Labels

CI:L0 Run doctests and unit tests

Development

Successfully merging this pull request may close these issues.

CPU OOM affecting 70b

3 participants