
Conversation

@zhandaz
Contributor

@zhandaz zhandaz commented Aug 18, 2025

What does this PR do?

Fix the temperature-related accuracy issues in both the DTensor path and the mcore path.

As reported in #887 and by some other researchers, GRPO training (with some Qwen 2.5 7B variants, such as AceMath) does not work. The failing runs use a command like the following:

uv run python -u NeMo-Skills/nemo_skills/training/nemo_rl/start_grpo.py \
    ++policy.model_name=models/final_hf_checkpoint_1220 \
    ++cluster.gpus_per_node=8 \
    ++cluster.num_nodes=${NUM_ACTOR_NODES} \
    +data.train_data_path=data/deepseek_r1/rl_data/acemath_rl/shuffled_acemath_rl_920k.jsonl \
    +data.val_data_path=data/deepseek_r1/rl_data/acemath_rl/shuffled_acemath_rl_920k.jsonl \
    ++checkpointing.checkpoint_dir=results/${TASK_NAME}_${EXP_SUFFIX}/checkpoints \
    ++logger.log_dir=results/${TASK_NAME}_${EXP_SUFFIX}/training-logs \
    ++logger.wandb_enabled=True \
    ++logger.wandb.project=${WANDB_PROJECT} \
    ++logger.wandb.name=${WANDB_NAME} \
    ++logger.wandb.id=${WANDB_NAME} \
    ++policy.train_global_batch_size=1024 \
    ++policy.train_micro_batch_size=1 \
    ++policy.max_total_sequence_length=8192 \
    ++policy.logprob_batch_size=4 \
    ++policy.sequence_packing.enabled=True \
    ++policy.sequence_packing.train_mb_tokens=8192 \
    ++policy.sequence_packing.logprob_mb_tokens=8192 \
    ++policy.sequence_packing.algorithm="modified_first_fit_decreasing" \
    ++policy.sequence_packing.sequence_length_round=64 \
    ++policy.generation.vllm_cfg.tensor_parallel_size=4 \
    ++policy.generation.vllm_cfg.gpu_memory_utilization=0.8 \
    ++policy.generation.vllm_cfg.pipeline_parallel_size=1 \
    ++policy.generation.temperature=0.6 \
    ++policy.optimizer.kwargs.lr=1e-06 \
    ++loss_fn.reference_policy_kl_penalty=0.001 \
    ++policy.dtensor_cfg.enabled=True \
    ++policy.dtensor_cfg.tensor_parallel_size=2 \
    ++policy.dtensor_cfg.sequence_parallel=False \
    ++policy.dtensor_cfg.activation_checkpointing=True \
    ++data.prompt.prompt_config=qwen/math-cot \
    ++data.prompt.prompt_template=qwen-instruct \
    ++grpo.num_prompts_per_step=128 \
    ++grpo.num_generations_per_prompt=8

1. Reproduce the errors

As we can see in the images below and in the wandb run, the reward drops sharply after around 50 steps, and the token_mult_prob_error (which measures the mismatch between the token probabilities computed by the training framework and by the inference engine) is large.

[Image: reward and token_mult_prob_error curves showing the regression]

2. Fix the temperature error in the DTensor Policy Worker

We found the likely cause: PR #660 made the temperature scaling conditional on the vLLM engine version (since the vLLM V1 engine does not return the final logprobs with the sampling parameters applied). That change is actually wrong. We should follow the pattern in #316 and always scale the logits by the temperature. Otherwise training may become off-policy: the inference and training behaviors do not match.

By removing the `if` guard in

if not is_vllm_v1_engine_enabled():
    logits.div_(self.cfg["generation"]["temperature"])

so that the logits are always divided by the temperature,

we can see that the reward is back to normal and the token_mult_prob_error is much more reasonable (though it still has one large spike above 2).
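To illustrate the mismatch, here is a minimal, self-contained PyTorch sketch (not the repo's code; shapes and values are made up for illustration): when the sampler draws tokens from temperature-scaled logits but the trainer evaluates unscaled logits, the per-token probability ratio between the two deviates from 1.

```python
import torch

torch.manual_seed(0)
temperature = 0.6
logits = torch.randn(4, 100)  # [num_tokens, vocab_size], stand-in for model output

# Distribution the sampler actually draws from at generation time:
sampling_logprobs = torch.log_softmax(logits / temperature, dim=-1)

# Distribution the trainer computes if the temperature division is skipped:
raw_logprobs = torch.log_softmax(logits, dim=-1)

# Sample tokens the way generation would, then compare the two policies:
tokens = torch.multinomial(sampling_logprobs.exp(), num_samples=1)
ratio = (raw_logprobs.gather(-1, tokens) - sampling_logprobs.gather(-1, tokens)).exp()
print(ratio.squeeze())  # deviates from 1.0 whenever temperature != 1.0
```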

[Image: reward and token_mult_prob_error curves after the DTensor fix]

3. Fix the returned logprobs in vLLM V1

We know that the logprobs returned by vLLM V1 are not correct for post-training, because they are computed from the raw logits (before any sampling parameters are applied). Therefore, the token_mult_prob_error is actually an inaccurate indicator. Convergence should not be affected, since use_importance_sampling_correction and use_on_policy_kl_approximation are False, and the vLLM-returned logprobs are not involved in the actual training (they are only used for metrics calculation).
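For reference, here is a hedged sketch of how a multiplicative token-probability error of this kind can be computed from trainer and generation logprobs (illustrative only, not necessarily the exact NeMo-RL formula; the function name and masking convention are assumptions):

```python
import torch

def mult_prob_error(train_logprobs: torch.Tensor,
                    gen_logprobs: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    # exp of the masked mean absolute logprob gap; 1.0 means the trainer and
    # the generation engine assign identical probabilities to every token.
    gap = (train_logprobs - gen_logprobs).abs() * mask
    return torch.exp(gap.sum() / mask.sum().clamp_min(1))
```

With raw (pre-temperature) logprobs coming back from vLLM V1, this metric is inflated even when training itself is on-policy.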

To confirm this, we apply a file patch to vLLM so that it returns the logprobs after temperature scaling. The reward curve remains almost the same as in case 2, and the token_mult_prob_error stays around 1 (as expected).
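Concretely, the patch amounts to the one-line change in vLLM's sampler that is quoted later in this thread (shown here as a sketch; the surrounding class is elided):

```python
# Before: returned logprobs are computed from the raw logits,
# ignoring the sampling parameters.
raw_logprobs = self.compute_logprobs(logits)

# After: apply the temperature before computing the returned logprobs.
raw_logprobs = self.compute_logprobs(
    self.apply_temperature(logits.to(torch.float32), sampling_metadata.temperature)
)
```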

[Image: reward and token_mult_prob_error curves with the vLLM logprob patch]

Related

Fixes for top-p and top-k are not supported yet. Patching vLLM to support top-p and top-k would not be elegant, as it involves patching multiple files and many code blocks. There is more discussion in #773 and #69.

Issues

Fixes #902, #887

@zhandaz zhandaz requested a review from parthchadha August 18, 2025 14:47
@zhandaz zhandaz self-assigned this Aug 18, 2025
@zhandaz zhandaz requested a review from wangshangsam August 18, 2025 14:48
@zhandaz zhandaz changed the title from "fix: fix temperature related issues in the DTensor path" to "fix: fix temperature-related issues in the DTensor path" Aug 18, 2025
@zhandaz zhandaz force-pushed the zhanda/debug-accuracy branch from 51133a8 to 761c82f on August 18, 2025 14:51
parthchadha
parthchadha previously approved these changes Aug 18, 2025
@wangshangsam wangshangsam linked an issue Aug 18, 2025 that may be closed by this pull request
wangshangsam
wangshangsam previously approved these changes Aug 18, 2025
@zhandaz zhandaz dismissed stale reviews from wangshangsam and parthchadha via 68aa949 August 19, 2025 18:26
@zhandaz zhandaz force-pushed the zhanda/debug-accuracy branch from 68aa949 to 300d1e2 on August 19, 2025 18:26
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: e9523c2 (PR #935 from zhanda/debug-accuracy)

❌ Submodules that need attention:

NeMo: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/NeMo/commits/5c42641e344a487c7ca5b253a7483f0af8ef40e6/
CURRENT (PR #935 from zhanda/debug-accuracy): https://github.com/NVIDIA/NeMo/commits/aaefedd1d13f4ccd5cd06a19e06f1df33589a235/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@zhandaz zhandaz force-pushed the zhanda/debug-accuracy branch from 03b3b43 to 67d3db3 on August 19, 2025 20:20
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and CI (Relating to CI) labels Aug 19, 2025
@zhandaz zhandaz force-pushed the zhanda/debug-accuracy branch from 67d3db3 to 4932ede on August 19, 2025 20:30
@github-actions github-actions bot removed the documentation (Improvements or additions to documentation) and CI (Relating to CI) labels Aug 19, 2025
parthchadha
parthchadha previously approved these changes Aug 19, 2025
@zhandaz zhandaz requested a review from wangshangsam August 19, 2025 21:40
@zhandaz
Contributor Author

zhandaz commented Aug 20, 2025

I have rerun the experiments after merging main. There is a small spike, but it should be fine.

If there are no further comments, this should be ready to merge.

[Image: reward curve after merging main]

wangshangsam
wangshangsam previously approved these changes Aug 20, 2025
Contributor

@wangshangsam wangshangsam left a comment


LGTM!

I don't have further comments, but I just have a quick question: does it mean that, once vLLM 0.11 is released, we no longer need to patch `raw_logprobs = self.compute_logprobs(logits)` into `raw_logprobs = self.compute_logprobs(self.apply_temperature(logits.to(torch.float32), sampling_metadata.temperature))` at all?

@zhandaz
Contributor Author

zhandaz commented Aug 20, 2025

> I don't have further comments, but I just have a quick question: does it mean that, once vLLM 0.11 is released, we no longer need to patch `raw_logprobs = self.compute_logprobs(logits)` into `raw_logprobs = self.compute_logprobs(self.apply_temperature(logits.to(torch.float32), sampling_metadata.temperature))` at all?

Yes! All of this logic can be greatly simplified once vLLM can return the final logprobs. Hopefully that is available in vLLM 0.11.0.

@terrykong terrykong added this pull request to the merge queue Aug 20, 2025
@github-actions

ℹ️ File Synchronization Check

Check based on commit: 8769667 (PR #935 from zhanda/debug-accuracy)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@zhandaz zhandaz force-pushed the zhanda/debug-accuracy branch from 0dade0c to 45e571b on August 25, 2025 20:45
@zhandaz zhandaz changed the title from "fix: fix temperature-related issues in the DTensor path" to "fix: fix temperature-related issues" Aug 25, 2025
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from github-actions bot Aug 25, 2025
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from github-actions bot Aug 25, 2025
@github-actions

ℹ️ File Synchronization Check

Check based on commit: 1dd2d46 (PR #935 from zhanda/debug-accuracy)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@zhandaz zhandaz added and then removed the CI:L1 (Run doctests, unit tests, and functional tests) label Aug 25, 2025

@zhandaz
Contributor Author

zhandaz commented Aug 26, 2025

The reward looks similar for the mcore path after this patch. However, the token_mult_prob_error issue is not mitigated much in this case. I am testing with deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.

[Image: mcore-path reward and token_mult_prob_error curves]

@github-actions

ℹ️ File Consistency Check

Check based on commit: 589d0d8 (PR #935 from zhanda/debug-accuracy)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@terrykong terrykong added this pull request to the merge queue Aug 26, 2025
Merged via the queue into main with commit 2aea5ad Aug 27, 2025
37 of 38 checks passed
@terrykong terrykong deleted the zhanda/debug-accuracy branch August 27, 2025 00:09
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 28, 2025
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Shang Wang <[email protected]>
Signed-off-by: Qidong Su <[email protected]>
skirdey-inflection pushed a commit to skirdey-inflection/RL that referenced this pull request Aug 30, 2025
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Shang Wang <[email protected]>
Signed-off-by: Stanislav Kirdey <[email protected]>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Shang Wang <[email protected]>
Signed-off-by: Qidong Su <[email protected]>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Zhanda <[email protected]>
Signed-off-by: Zhanda Zhu <[email protected]>
Co-authored-by: Shang Wang <[email protected]>