Fix completion_mask alignment and temperature scaling in Megatron GRPO trainer #8427

Merged: hjh0119 merged 2 commits into modelscope:main from hjh0119:fix-mg-grpo-0325 on Mar 26, 2026
Conversation

hjh0119 (Collaborator) commented Mar 25, 2026

Summary

  1. Fix completion_mask misalignment with per_token_logps
  2. Apply temperature scaling to training-side logits to ensure consistent logps computation with vLLM's processed_logprobs for rollout importance sampling
  3. Refactor logps computation: extract compute_per_token_logps to MegatronRLHFTrainer for reusability, and return per_token_logps instead of raw logits from forward_step
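The interplay of points 1 and 2 above can be illustrated with a small sketch. This is a pure-Python stand-in for the real `compute_per_token_logps` (which operates on Megatron logits tensors); the function name is borrowed from the PR, but the shapes and signature here are simplified assumptions for illustration only:

```python
import math

def per_token_logps(logits, labels, temperature=1.0):
    """Toy stand-in for compute_per_token_logps: scale logits by 1/temperature,
    take a log-softmax over the vocab, then gather the logp of each label token.
    logits: list of per-position vocab-score lists; labels: next-token ids."""
    out = []
    for scores, label in zip(logits, labels):
        scaled = [s / temperature for s in scores]          # temperature scaling
        log_z = math.log(sum(math.exp(s) for s in scaled))  # log partition
        out.append(scaled[label] - log_z)                   # log-softmax at label
    return out

# With temperature != 1 the logps change, which is why the training-side logits
# must be scaled the same way vLLM scales them when producing processed_logprobs;
# otherwise the importance-sampling ratios are computed from mismatched quantities.
lp1 = per_token_logps([[1.0, 2.0]], [1], temperature=1.0)
lp2 = per_token_logps([[1.0, 2.0]], [1], temperature=2.0)
assert lp1[0] != lp2[0]
```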

gemini-code-assist (Contributor) commented
Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors and improves the log probability computation within the Megatron GRPO trainer. The changes address a critical completion_mask misalignment issue, introduce proper temperature scaling for logits to ensure consistent logps calculation, and modularize the logps computation logic for better reusability. These updates are crucial for accurate reinforcement learning from human feedback (RLHF) training, especially when interacting with external inference systems like vLLM.

Highlights

  • Completion Mask Alignment: Fixed completion_mask misalignment by introducing rolled_labels to ensure correct correspondence with per_token_logps in both padding-free and non-padding-free modes.
  • Temperature Scaling for Logits: Applied temperature scaling to training-side logits before logps computation, ensuring consistency with vLLM's processed log probabilities for rollout importance sampling.
  • Logps Computation Refactoring: Extracted the compute_per_token_logps method to MegatronRLHFTrainer for reusability and clarity, and modified forward_step to directly return per_token_logps.
  • vLLM Integration: Enabled vLLM integration in the example script by adding configuration parameters such as use_vllm, vllm_mode, and vllm_gpu_memory_utilization.
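The vLLM switches named in the last highlight would appear in the example script roughly as below. This is an illustrative fragment only: the parameter names (`use_vllm`, `vllm_mode`, `vllm_gpu_memory_utilization`) come from the highlight, but the command name, the remaining flags, and all values are assumptions, not the actual contents of the PR's script:

```shell
# Hypothetical excerpt from the example script; values are illustrative.
megatron rlhf \
    --rlhf_type grpo \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.5
```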


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces VLLM support and updates training parameters in opsd.sh. The main changes involve a significant refactoring of how per-token log probabilities and entropy are computed within the GRPO trainer. A new compute_per_token_logps method is added to rlhf_mixin.py to handle model forwarding, temperature scaling, and logps computation, replacing the previous model_forward method in grpo_trainer.py. Additionally, grpo_trainer.py includes a fix for label alignment using torch.roll and streamlines the forward_step and loss_func methods. A suggestion was made to use a shallow copy (.copy()) instead of deepcopy for performance optimization when passing inputs to compute_per_token_logps.

Comment on lines +942 to +955:

```python
ref_per_token_logps_packed = self.compute_per_token_logps(
    ref_model, iter([deepcopy(inputs)]), temperature=self.temperature)
if self.template.padding_free:
    # In padding_free mode, logps are in rmpad format [1, total_tokens]
    # Pad to batch format [batch_size, max_seq_len]
    ref_per_token_logps, _ = pad_logps_back_to_batch(
        logps_rmpad=ref_per_token_logps_packed,
        logits_to_keep=max_seq_len,
        batch_size=batch_size,
        seq_lengths=seq_lengths)
else:
    # In non-padding_free mode, logps are already in batch format [batch_size, seq_len]
    ref_per_token_logps = ref_per_token_logps_packed
batch['ref_per_token_logps'] = ref_per_token_logps

old_per_token_logps_packed = self.compute_per_token_logps(
    self.unwrapped_models[0], iter([deepcopy(inputs)]), temperature=self.temperature)
```

Severity: medium

Using deepcopy can be computationally expensive, especially when dealing with large tensors. Since compute_per_token_logps modifies the dictionary it receives by popping keys, a copy is necessary. However, a shallow copy using .copy() should be sufficient if the model's forward pass does not modify tensors in-place. This would be more performant.

If you can ensure that the forward pass is free of in-place tensor modifications, consider using a shallow copy for both calls to compute_per_token_logps.

Suggested change (replace `deepcopy(inputs)` with `inputs.copy()`):

```python
ref_per_token_logps_packed = self.compute_per_token_logps(
    ref_model, iter([inputs.copy()]), temperature=self.temperature)
if self.template.padding_free:
    # In padding_free mode, logps are in rmpad format [1, total_tokens]
    # Pad to batch format [batch_size, max_seq_len]
    ref_per_token_logps, _ = pad_logps_back_to_batch(
        logps_rmpad=ref_per_token_logps_packed,
        logits_to_keep=max_seq_len,
        batch_size=batch_size,
        seq_lengths=seq_lengths)
else:
    # In non-padding_free mode, logps are already in batch format [batch_size, seq_len]
    ref_per_token_logps = ref_per_token_logps_packed
batch['ref_per_token_logps'] = ref_per_token_logps

old_per_token_logps_packed = self.compute_per_token_logps(
    self.unwrapped_models[0], iter([inputs.copy()]), temperature=self.temperature)
```
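The reviewer's point about shallow vs. deep copies can be demonstrated without tensors. A `dict.copy()` duplicates only the dictionary itself, so a callee that pops keys cannot disturb the caller's dict, while the (potentially large) values are still shared rather than duplicated. This toy sketch uses plain lists in place of tensors, and `consume` is a hypothetical stand-in for the key-popping behaviour attributed to `compute_per_token_logps`:

```python
# Sketch: why a shallow dict copy is enough when the callee only pops keys.
inputs = {"input_ids": [1, 2, 3], "labels": [2, 3, 4]}

def consume(batch):
    # Mimics a function that pops keys off the dict it receives.
    batch.pop("labels")
    return batch

shallow = inputs.copy()  # copies the dict, not the values
consume(shallow)

assert "labels" in inputs                           # caller's dict is untouched
assert "labels" not in shallow                      # only the copy lost the key
assert inputs["input_ids"] is shallow["input_ids"]  # values shared: no data copy
```

A `deepcopy` would additionally clone every value, which for large tensors means extra memory traffic; that is only needed if the callee mutates the values in place, not when it merely removes keys.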

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces VLLM support to the opsd.sh example script and refactors the computation of per-token log probabilities and entropy within the GRPO trainer. The model_forward method was removed from grpo_trainer.py and its functionality, along with temperature scaling, was moved into a new compute_per_token_logps method in rlhf_mixin.py. The grpo_trainer.py now uses this new method and also introduces rolled_labels for completion mask calculations. However, a critical issue was identified where the labels tensor is not consistently shifted before being passed to compute_logps_and_entropy_from_logits in both grpo_trainer.py and rlhf_mixin.py, which could lead to misaligned log probabilities.

Comment on lines +1060 to +1061:

```python
per_token_logps_packed, per_token_entropy_packed = compute_logps_and_entropy_from_logits(
    logits_packed, labels, compute_entropy=self.compute_entropy)
```

Severity: critical

The labels tensor passed to compute_logps_and_entropy_from_logits appears to be unshifted. For autoregressive models, the labels should be shifted left by one position to align with the logits for next-token prediction (i.e., logits[..., i, :] predicts labels[..., i+1]). Using unshifted labels will result in misaligned log probabilities.

Please consider shifting the labels before this call.

Suggested change:

```diff
 per_token_logps_packed, per_token_entropy_packed = compute_logps_and_entropy_from_logits(
-    logits_packed, labels, compute_entropy=self.compute_entropy)
+    logits_packed, torch.roll(labels, shifts=-1, dims=-1), compute_entropy=self.compute_entropy)
```


```python
if temperature != 1.0:
    output_tensor.div_(temperature)
per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, labels)
```

Severity: critical

Similar to the issue in grpo_trainer.py, the labels passed to compute_logps_and_entropy_from_logits here are unshifted. This will lead to misaligned log probabilities, as output_tensor (logits) at position i is for predicting token i+1. The labels should be shifted left by one.

Suggested change:

```diff
-per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, labels)
+per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, torch.roll(labels, shifts=-1, dims=-1))
```

hjh0119 (Collaborator, Author) commented Mar 26, 2026

/gemini review

gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the computation of per-token log probabilities and entropy within the GRPO trainer, introducing a new compute_per_token_logps method in rlhf_mixin.py to centralize this logic, including temperature scaling. The grpo_trainer.py now utilizes this new method and removes its own model_forward implementation. Additionally, the PR updates an example script to enable and configure VLLM. The review comments suggest improving code safety and clarity by changing in-place tensor division operations to out-of-place ones in both grpo_trainer.py and rlhf_mixin.py.

```python
if is_pp_last_stage and output_tensor is not None:
    logits_packed = output_tensor
    if self.temperature != 1.0:
        logits_packed.div_(self.temperature)
```

Severity: medium

For clarity and to avoid potential side effects, it's generally safer to perform tensor operations out-of-place, especially on tensors that are part of the computation graph. While the current in-place modification of logits_packed seems safe as output_tensor is not used elsewhere, using an out-of-place operation would make the code more robust to future changes.

Consider changing this to an out-of-place division to improve code safety, unless the in-place operation is a deliberate memory optimization.

Suggested change:

```diff
-logits_packed.div_(self.temperature)
+logits_packed = logits_packed / self.temperature
```
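The side effect the reviewer is guarding against is aliasing: an in-place operation mutates every reference to the same object, while an out-of-place one leaves other references untouched. The same idea holds for `tensor.div_` vs. `tensor / t`; this sketch illustrates it with plain Python lists so it stays framework-free:

```python
# Sketch: in-place ops are visible through every alias; out-of-place ops are not.
logits = [2.0, 4.0, 6.0]
alias = logits  # a second reference to the same object

# In-place (analogous to tensor.div_): the alias sees the mutation.
for i in range(len(logits)):
    logits[i] /= 2.0
assert alias == [1.0, 2.0, 3.0]

# Out-of-place (analogous to tensor / t): a new object, original untouched.
scaled = [x / 2.0 for x in logits]
assert alias == [1.0, 2.0, 3.0]
assert scaled == [0.5, 1.0, 1.5]
```

For autograd tensors there is an additional wrinkle: in-place ops can overwrite values the backward pass still needs, which is another reason out-of-place division is the safer default unless the in-place form is a deliberate memory optimization.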

```python
    return None

if temperature != 1.0:
    output_tensor.div_(temperature)
```

Severity: medium

Similar to the change in grpo_trainer.py, consider using an out-of-place division here for better code safety and clarity. While the in-place operation is currently safe as output_tensor is not reused, an out-of-place operation is more robust against future modifications.

Suggested change:

```diff
-output_tensor.div_(temperature)
+output_tensor = output_tensor / temperature
```

hjh0119 merged commit 0b1ecca into modelscope:main on Mar 26, 2026
2 of 3 checks passed
hjh0119 deleted the fix-mg-grpo-0325 branch on March 26, 2026 08:24