fix: stop jobs after timeout and add warning for validation #1069
terrykong merged 9 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Wei Du <wedu@nvidia.com>
terrykong
left a comment
Thanks @wedu-nvidia. Can you add this as well to https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/algorithms/rm.py#L600?
@terrykong Thanks, I added the timeout feature for RM as well.
@terrykong I pushed another commit to fix the unit test bug. Can you put it into the merge queue again? Thanks.
Walkthrough
Adds a timeout-based checkpointing key and TimeoutChecker integration; enforces validation-period assertions when no val dataloader is provided; introduces early return on timeout in the training loops (RM, SFT, DPO); overhauls GRPO to epoch-driven training with expanded save state, logging, and Megatron train_iters propagation; tests and an example config are updated.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Trainer as RM/SFT/DPO Trainer
participant Timeout as TimeoutChecker
participant CKPT as Checkpoint Manager
Note over Trainer: Per-step loop
Trainer->>Timeout: mark_iteration()
Trainer->>Timeout: check_save()
alt Timeout-triggered save
Timeout-->>Trainer: should_save_by_timeout = true
Trainer->>CKPT: save_checkpoint(timeout-based)
CKPT-->>Trainer: saved
Note over Trainer: Early return (exit training)
else Step-periodic save
Trainer->>CKPT: save_checkpoint(step-based)
CKPT-->>Trainer: saved
end
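The per-step flow in the diagram above can be sketched in Python. The method names start_iterations, mark_iteration, and check_save come from the review's code graph of nemo_rl/utils/timer.py; the class body below is a simplified assumption, not the actual implementation.

```python
import time


class TimeoutChecker:
    """Minimal stand-in for nemo_rl.utils.timer.TimeoutChecker.
    The method names are real; this body is an illustrative assumption."""

    def __init__(self, timeout_s, fit_last_save_time=True):
        self.timeout_s = timeout_s
        self.fit_last_save_time = fit_last_save_time
        self._start = None
        self._last_mark = None
        self._iter_times = []

    def start_iterations(self):
        self._start = self._last_mark = time.monotonic()

    def mark_iteration(self):
        now = time.monotonic()
        self._iter_times.append(now - self._last_mark)
        self._last_mark = now

    def check_save(self):
        # Save early if the next iteration would likely overrun the timeout,
        # using the average iteration time observed so far.
        elapsed = time.monotonic() - self._start
        avg_iter = sum(self._iter_times) / max(len(self._iter_times), 1)
        return elapsed + avg_iter >= self.timeout_s


def train(steps, timeout_s, save_checkpoint):
    """Sketch of the per-step loop: mark each iteration, and on a
    timeout-triggered save, checkpoint and return early."""
    checker = TimeoutChecker(timeout_s)
    checker.start_iterations()
    for step in range(steps):
        # ... one training step would run here ...
        checker.mark_iteration()
        if checker.check_save():
            save_checkpoint(step)  # timeout-based save
            return step            # early return: stop the job gracefully
    return steps
```

With a zero timeout the loop saves on the first step and exits; with a generous timeout it runs to completion with no timeout-based save.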
sequenceDiagram
autonumber
participant GRPO as GRPO Trainer
participant Data as Dataloader
participant Policy as Policy Model
participant Ref as Reference Model
participant CKPT as Checkpoint Manager
participant Log as Logger
Note over GRPO: while current_epoch < max_num_epochs and total_steps < max_num_steps
GRPO->>Data: next batch (epoch loop)
GRPO->>Policy: prepare_for_lp_inference
GRPO->>Policy: generate (sync/async)
GRPO->>GRPO: compute rewards/advantages
GRPO->>Policy: prepare_for_training
par Forward/Backward
GRPO->>Policy: train step
GRPO->>Ref: fprop logprobs (as needed)
end
GRPO->>Log: metrics (loss, reward, tokens, FLOPS, etc.)
GRPO->>CKPT: maybe save (period/timeout)
alt End of epoch
GRPO->>GRPO: increment current_epoch, reset current_step
else Continue
GRPO->>GRPO: increment current_step and total_steps
end
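The epoch/step bookkeeping from the GRPO diagram above can be sketched as follows. All names here are illustrative of the control flow described in the walkthrough, not the actual GRPO implementation.

```python
def run_grpo_loop(num_batches_per_epoch, max_num_epochs, max_num_steps):
    """Epoch-driven loop: advance current_step within an epoch, roll it over
    at epoch boundaries, and stop on either the epoch or total-step limit."""
    current_epoch, current_step, total_steps = 0, 0, 0
    while current_epoch < max_num_epochs and total_steps < max_num_steps:
        # ... generate, compute rewards/advantages, train step, log, save ...
        current_step += 1
        total_steps += 1
        if current_step == num_batches_per_epoch:
            # End of epoch: increment current_epoch, reset current_step
            current_epoch += 1
            current_step = 0
    return current_epoch, total_steps
```

For example, with 2 batches per epoch and a 3-epoch cap the loop runs 6 total steps; with a 7-step cap it stops mid-epoch.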
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Pre-merge checks (2 passed, 1 warning)
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Signed-off-by: Wei Du <wedu@nvidia.com>
@terrykong I’ve resolved the conflict. Could you please add this to the merge queue? Others are actively updating and may touch the same files. Thanks!
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/configs/rm.yaml (1 hunk)
- nemo_rl/algorithms/dpo.py (2 hunks)
- nemo_rl/algorithms/rm.py (5 hunks)
- nemo_rl/algorithms/sft.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/configs/rm.yaml
- nemo_rl/algorithms/dpo.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_rl/algorithms/rm.py (2)
tests/unit/utils/test_timer.py (3)
timer (26-27), TestTimeoutChecker (193-235), test_double_save_prevented (204-208)
nemo_rl/utils/timer.py (4)
TimeoutChecker (264-321), start_iterations (310-311), mark_iteration (313-321), check_save (284-308)
🔇 Additional comments (3)
nemo_rl/algorithms/sft.py (1)
584-585: LGTM! Timeout-based early exit is properly implemented. The early return after timeout-based checkpoint saving ensures that training stops gracefully when the timeout is reached, which aligns with the PR objective to stop jobs after timeout.
nemo_rl/algorithms/rm.py (2)
432-436: LGTM! TimeoutChecker integration follows best practices. The timeout checker is properly initialized with the configuration value and fit_last_save_time enabled, which intelligently considers average iteration time when determining if a timeout is approaching.
633-634: LGTM! Timeout-based early termination is correctly implemented. The early return after timeout ensures the training loop exits gracefully after saving a checkpoint, consistent with the implementation in the SFT module.
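The linked tests mention test_double_save_prevented, which suggests the checker guards against firing twice. A minimal sketch of that guard, assuming a simple latch (this is not the real class, only an illustration of the intent):

```python
import time


class OneShotSaveGuard:
    """Illustrative guard ensuring a timeout-triggered save fires at most
    once, mirroring the intent of test_double_save_prevented."""

    def __init__(self, timeout_s):
        self.deadline = time.monotonic() + timeout_s
        self._already_saved = False

    def check_save(self):
        if self._already_saved:
            return False  # prevent a double save
        if time.monotonic() >= self.deadline:
            self._already_saved = True  # latch: fire only once
            return True
        return False
```

Once the deadline has passed, the first check_save call returns True and every subsequent call returns False.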
@terrykong I see auto-merge was enabled, so why is it still stuck here?
@wedu-nvidia there was a test failure: |
Signed-off-by: Wei Du <wedu@nvidia.com>
@terrykong Thanks for the info, I fixed the bug. |
@terrykong It may have failed due to network issues. Can you please take a look when you have time?
What does this PR do?
Stops job training after the timeout is reached, and adds a warning message for validation when val_period > 0 and val_dataloader is None.
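The validation guard described above can be sketched as a small helper. The function name and return convention here are hypothetical; only the condition (val_period > 0 with no val dataloader) comes from the PR description.

```python
import warnings


def check_val_config(val_period, val_dataloader):
    """Illustrative guard: warn when validation is requested via val_period
    but no validation dataloader was provided, and report whether
    validation can actually run."""
    if val_period > 0 and val_dataloader is None:
        warnings.warn(
            "val_period > 0 but no validation dataloader was provided; "
            "validation will be skipped."
        )
        return False
    return True
```

Calling it with a positive val_period and a missing dataloader emits the warning and returns False; with val_period == 0 no warning is raised.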
Issues
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Tests