feat: save checkpoint before timeout to avoid 4-hour runtime limit #734

wedu-nvidia · 2025-07-24T00:12:27Z

What does this PR do ?

Since the server automatically stops after 4 hours, it is recommended to save a checkpoint before that limit is reached. For example, set the timeout to 3 hours and 45 minutes to ensure the checkpoint is saved in time.
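
As a rough illustration of the idea, the sketch below tracks elapsed wall-clock time and signals when a checkpoint should be saved. The class name `SaveDeadline`, the `HH:MM:SS` budget format, and the safety margin based on the slowest observed step are all assumptions made for this example; they are not taken from this PR's implementation.

```python
import time
from typing import Callable, Optional


def parse_budget(budget: str) -> float:
    """Convert an 'HH:MM:SS' string into seconds (the format is an assumption)."""
    hours, minutes, seconds = (int(part) for part in budget.split(":"))
    return hours * 3600 + minutes * 60 + seconds


class SaveDeadline:
    """Hypothetical helper: reports when a run is about to exceed its time budget."""

    def __init__(self, budget: str, clock: Callable[[], float] = time.monotonic):
        self.budget_s = parse_budget(budget)
        self.clock = clock
        self.start = clock()
        self.last_mark: Optional[float] = None
        self.longest_iteration_s = 0.0

    def mark_iteration(self) -> None:
        """Call once per training step to track the slowest step seen so far."""
        now = self.clock()
        if self.last_mark is not None:
            self.longest_iteration_s = max(
                self.longest_iteration_s, now - self.last_mark
            )
        self.last_mark = now

    def should_save(self) -> bool:
        """True once another step might not finish inside the budget."""
        elapsed = self.clock() - self.start
        return elapsed + self.longest_iteration_s >= self.budget_s
```

With the 3 hours and 45 minutes from the description, `SaveDeadline("03:45:00")` would start returning `True` from `should_save()` once the elapsed time plus the slowest step reaches 13,500 seconds, leaving the rest of the 4-hour window for the save itself.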

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

terrykong

Thanks for contributing. This would definitely be a valuable feature to have. I've left some comments.

nemo_rl/algorithms/grpo.py

nemo_rl/utils/timer.py

nemo_rl/algorithms/grpo.py

nemo_rl/utils/timer.py

nemo_rl/algorithms/grpo.py

wedu-nvidia · 2025-07-28T22:59:28Z

@terrykong I revised the PR based on your suggestions. Let me know if you have more comments.

nemo_rl/algorithms/dpo.py

nemo_rl/utils/timer.py

terrykong · 2025-07-29T23:01:58Z

@wedu-nvidia could you address the DCO failure and run the pre-commit hooks? See https://github.com/NVIDIA-NeMo/RL/blob/main/CONTRIBUTING.md

wedu-nvidia · 2025-07-30T21:06:20Z

Hi @terrykong, all DCO and pre-commit issues have been resolved, and the commits are now properly signed.

Please help approve the pending workflows and review the change request when convenient — thanks!

terrykong · 2025-07-30T23:29:14Z

@wedu-nvidia it looks like there are still some failures, this time with pyrefly:

ERROR `float` is not assignable to attribute `previous_iteration_time` with type `None` [bad-assignment]
   --> /home/runner/work/RL/RL/nemo_rl/utils/timer.py:311:40
    |
311 |         self.previous_iteration_time = time.time()
    |                                        ^^^^^^^^^^^
    |
ERROR `-` is not supported between `float` and `None` [bad-argument-type]
   --> /home/runner/work/RL/RL/nemo_rl/utils/timer.py:318:24
    |
318 |         elapsed_time = current_time - self.previous_iteration_time
    |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
  Argument `None` is not assignable to parameter `value` with type `float` in function `float.__sub__`
ERROR `float` is not assignable to attribute `previous_iteration_time` with type `None` [bad-assignment]
   --> /home/runner/work/RL/RL/nemo_rl/utils/timer.py:319:40
    |
319 |         self.previous_iteration_time = current_time
    |                                        ^^^^^^^^^^^^
    |
 INFO errors shown: 3, errors ignored: 24, modules: 82, transitive dependencies: 4,273, lines: 2,048,943, time: 4.08s, peak memory: physical 786.5 MiB
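
All three errors point at the same attribute: the checker treats `previous_iteration_time` as having type `None`, presumably because it is initialized to `None` without an annotation. A minimal sketch of one way to satisfy the checker, assuming the attribute is meant to hold either `None` or a timestamp, is to annotate it as `Optional[float]` and guard the subtraction. The class `IterationTimer` and its method names are hypothetical stand-ins, not the code in `timer.py`.

```python
import time
from typing import Optional


class IterationTimer:
    """Hypothetical stand-in for the class flagged by the type checker."""

    def __init__(self) -> None:
        # Explicit annotation: the attribute may hold None or a float timestamp.
        self.previous_iteration_time: Optional[float] = None
        self.iteration_times: list[float] = []

    def start_iterations(self) -> None:
        self.previous_iteration_time = time.time()

    def mark_iteration(self) -> None:
        current_time = time.time()
        # Narrow Optional[float] to float before subtracting.
        if self.previous_iteration_time is None:
            raise RuntimeError("start_iterations() must be called first")
        elapsed_time = current_time - self.previous_iteration_time
        self.previous_iteration_time = current_time
        self.iteration_times.append(elapsed_time)
```

The explicit annotation addresses the two `bad-assignment` errors, and the `None` check addresses the `bad-argument-type` error on the subtraction.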

terrykong · 2025-08-04T16:05:43Z

This one looks to have a legitimate unit test failure:

FAILED unit/algorithms/test_sft.py::test_exit_on_max_steps - KeyError: 'save_...

Once you've resolved it, we can retry.

wedu-nvidia · 2025-08-04T18:03:10Z

@terrykong I added another parameter; hopefully all checks pass now.

wedu-nvidia · 2025-08-04T19:12:32Z

@terrykong The previous error seems to be resolved. I then saw another error, so I added
checkpoint_must_save_by: NotRequired[str | None]
to the following config:


class CheckpointingConfig(TypedDict):
    """Configuration for checkpoint management.

    Attributes:
    enabled (bool): Whether checkpointing is enabled.
    checkpoint_dir (PathLike): Directory where checkpoints will be saved.
    metric_name (str): Name of the metric to use for determining best checkpoints.
    higher_is_better (bool): Whether higher values of the metric indicate better performance.
    keep_top_k (Optional[int]): Number of best checkpoints to keep. If None, all checkpoints are kept.
    """

    enabled: bool
    checkpoint_dir: PathLike
    metric_name: str
    higher_is_better: bool
    save_period: int
    keep_top_k: NotRequired[int]
    checkpoint_must_save_by: NotRequired[str | None]
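
Because the new key is declared `NotRequired`, configs that omit it remain valid, but code that indexes the dict directly would raise a `KeyError` for them. A minimal sketch of reading the key defensively follows; the trimmed-down `ExampleCheckpointingConfig`, the helper `get_save_deadline`, and the placeholder value `"03:45:00"` are illustrative assumptions, not part of the repository.

```python
from typing import Optional, TypedDict

try:
    from typing import NotRequired  # Python 3.11+
except ImportError:
    from typing_extensions import NotRequired  # older interpreters


class ExampleCheckpointingConfig(TypedDict):
    """Trimmed-down, hypothetical version of the config shown above."""

    enabled: bool
    save_period: int
    checkpoint_must_save_by: NotRequired[Optional[str]]


def get_save_deadline(config: ExampleCheckpointingConfig) -> Optional[str]:
    """Return the configured deadline, or None when the key is absent or null."""
    return config.get("checkpoint_must_save_by")


# A config written before the key existed still works.
old_style: ExampleCheckpointingConfig = {"enabled": True, "save_period": 10}

# A config that opts in; the value format here is only a placeholder.
new_style: ExampleCheckpointingConfig = {
    "enabled": True,
    "save_period": 10,
    "checkpoint_must_save_by": "03:45:00",
}
```

Using `.get()` keeps older configs and test fixtures working without requiring every one of them to add the new key.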

wedu-nvidia · 2025-08-05T03:23:27Z

@terrykong Can you help add it to the merge queue again? Thanks so much.

wedu-nvidia · 2025-08-05T15:17:32Z

@terrykong can you put it into the merge queue?

wedu-nvidia · 2025-08-05T19:57:24Z

@terrykong Why did I not see the conflict?

wedu-nvidia changed the title ~~save checking point before timeout to deal with 4 hour limit~~ feat: save checkpoint before timeout to avoid 4-hour runtime limit Jul 24, 2025

terrykong requested changes Jul 28, 2025

nemo_rl/algorithms/grpo.py (outdated)

nemo_rl/utils/timer.py

nemo_rl/algorithms/grpo.py

nemo_rl/utils/timer.py (outdated)

nemo_rl/algorithms/grpo.py (outdated)

github-actions bot added the community-request label Jul 28, 2025

terrykong reviewed Jul 29, 2025

nemo_rl/algorithms/dpo.py (outdated)

terrykong reviewed Jul 29, 2025

nemo_rl/utils/timer.py (outdated)

wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from d106c44 to b3c7f82 Compare July 30, 2025 15:18

github-actions bot added the documentation and CI labels Jul 30, 2025

wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from b3c7f82 to 33597d1 Compare July 30, 2025 15:21

github-actions bot removed the documentation and CI labels Jul 30, 2025

wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from 33597d1 to b3c7f82 Compare July 30, 2025 15:26

github-actions bot added the documentation and CI labels Jul 30, 2025

wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch 2 times, most recently from 9cfa5a7 to b242f32 Compare July 30, 2025 20:48

github-actions bot removed the documentation and CI labels Jul 30, 2025

wedu-nvidia added 5 commits July 30, 2025 13:57

save checking point before timeout to deal with 4 hour cluster runnin…

b772f4a

…g time lmit Signed-off-by: Wei Du <[email protected]>

remove unused space

730c313

Signed-off-by: Wei Du <[email protected]>

make timeout optionl

6fc99bf

Signed-off-by: Wei Du <[email protected]>

update timeout for all algorithms and configs and add unit tests as well

5d92547

Signed-off-by: Wei Du <[email protected]>

put checkpoint_must_save_by under checkpointing

94d1cc4

Signed-off-by: Wei Du <[email protected]>

wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from 831926d to e51cad2 Compare July 30, 2025 21:01

style: fix formatting via pre-commit

421d34c

Signed-off-by: Wei Du <[email protected]>

wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from e51cad2 to 421d34c Compare July 30, 2025 21:03

fix unit test error

514b8ed

Signed-off-by: Wei Du <[email protected]>

wedu-nvidia dismissed terrykong’s stale review via 514b8ed August 4, 2025 18:01

terrykong enabled auto-merge August 4, 2025 18:10

terrykong previously approved these changes Aug 4, 2025

terrykong added this pull request to the merge queue Aug 4, 2025

fix unit test error

3c55472

Signed-off-by: Wei Du <[email protected]>

auto-merge was automatically disabled August 4, 2025 19:10
Head branch was pushed to by a user without write access

wedu-nvidia dismissed terrykong’s stale review via 3c55472 August 4, 2025 19:10

fix unit test error

51aaeaa

Signed-off-by: Wei Du <[email protected]>

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 4, 2025

terrykong approved these changes Aug 5, 2025

terrykong enabled auto-merge August 5, 2025 07:24

Merge branch 'main' into wedu/timeout-save-checkpoint

9120602

terrykong added this pull request to the merge queue Aug 5, 2025

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Aug 5, 2025

Merge branch 'NVIDIA-NeMo:main' into wedu/timeout-save-checkpoint

c676918

terrykong enabled auto-merge August 5, 2025 20:03

terrykong added this pull request to the merge queue Aug 5, 2025

Merged via the queue into NVIDIA-NeMo:main with commit b74c5d0 Aug 6, 2025
19 checks passed
