Fix completion_mask alignment and temperature scaling in Megatron GRPO trainer #8427

Merged: hjh0119 merged 2 commits into modelscope:main from hjh0119:fix-mg-grpo-0325 on Mar 26, 2026
Conversation

hjh0119 (Collaborator) commented Mar 25, 2026

Summary

  1. Fix completion_mask misalignment with per_token_logps
  2. Apply temperature scaling to training-side logits to ensure consistent logps computation with vLLM's processed_logprobs for rollout importance sampling
  3. Refactor logps computation: extract compute_per_token_logps to MegatronRLHFTrainer for reusability, and return per_token_logps instead of raw logits from forward_step
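The interplay of points 1 and 2 above can be illustrated with a small sketch. This is a pure-Python stand-in for the real `compute_per_token_logps` (which operates on Megatron logits tensors); the function name is borrowed from the PR, but the shapes and signature here are simplified assumptions for illustration only:

```python
import math

def per_token_logps(logits, labels, temperature=1.0):
    """Toy stand-in for compute_per_token_logps: scale logits by 1/temperature,
    take a log-softmax over the vocab, then gather the logp of each label token.
    logits: list of per-position vocab-score lists; labels: next-token ids."""
    out = []
    for scores, label in zip(logits, labels):
        scaled = [s / temperature for s in scores]          # temperature scaling
        log_z = math.log(sum(math.exp(s) for s in scaled))  # log partition
        out.append(scaled[label] - log_z)                   # log-softmax at label
    return out

# With temperature != 1 the logps change, which is why the training-side logits
# must be scaled the same way vLLM scales them when producing processed_logprobs;
# otherwise the importance-sampling ratios are computed from mismatched quantities.
lp1 = per_token_logps([[1.0, 2.0]], [1], temperature=1.0)
lp2 = per_token_logps([[1.0, 2.0]], [1], temperature=2.0)
assert lp1[0] != lp2[0]
```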

gemini-code-assist (Contributor) commented
Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors and improves the log probability computation within the Megatron GRPO trainer. The changes address a critical completion_mask misalignment issue, introduce proper temperature scaling for logits to ensure consistent logps calculation, and modularize the logps computation logic for better reusability. These updates are crucial for accurate reinforcement learning from human feedback (RLHF) training, especially when interacting with external inference systems like vLLM.

Highlights

  • Completion Mask Alignment: Fixed completion_mask misalignment by introducing rolled_labels to ensure correct correspondence with per_token_logps in both padding-free and non-padding-free modes.
  • Temperature Scaling for Logits: Applied temperature scaling to training-side logits before logps computation, ensuring consistency with vLLM's processed log probabilities for rollout importance sampling.
  • Logps Computation Refactoring: Extracted the compute_per_token_logps method to MegatronRLHFTrainer for reusability and clarity, and modified forward_step to directly return per_token_logps.
  • vLLM Integration: Enabled vLLM integration in the example script by adding configuration parameters such as use_vllm, vllm_mode, and vllm_gpu_memory_utilization.
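The vLLM switches named in the last highlight would appear in the example script roughly as below. This is an illustrative fragment only: the parameter names (`use_vllm`, `vllm_mode`, `vllm_gpu_memory_utilization`) come from the highlight, but the command name, the remaining flags, and all values are assumptions, not the actual contents of the PR's script:

```shell
# Hypothetical excerpt from the example script; values are illustrative.
megatron rlhf \
    --rlhf_type grpo \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.5
```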


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces VLLM support and updates training parameters in opsd.sh. The main changes involve a significant refactoring of how per-token log probabilities and entropy are computed within the GRPO trainer. A new compute_per_token_logps method is added to rlhf_mixin.py to handle model forwarding, temperature scaling, and logps computation, replacing the previous model_forward method in grpo_trainer.py. Additionally, grpo_trainer.py includes a fix for label alignment using torch.roll and streamlines the forward_step and loss_func methods. A suggestion was made to use a shallow copy (.copy()) instead of deepcopy for performance optimization when passing inputs to compute_per_token_logps.

Comment on lines +942 to +955:

```python
ref_per_token_logps_packed = self.compute_per_token_logps(
    ref_model, iter([deepcopy(inputs)]), temperature=self.temperature)
if self.template.padding_free:
    # In padding_free mode, logps are in rmpad format [1, total_tokens]
    # Pad to batch format [batch_size, max_seq_len]
    ref_per_token_logps, _ = pad_logps_back_to_batch(
        logps_rmpad=ref_per_token_logps_packed,
        logits_to_keep=max_seq_len,
        batch_size=batch_size,
        seq_lengths=seq_lengths)
else:
    # In non-padding_free mode, logps are already in batch format [batch_size, seq_len]
    ref_per_token_logps = ref_per_token_logps_packed
batch['ref_per_token_logps'] = ref_per_token_logps

old_per_token_logps_packed = self.compute_per_token_logps(
    self.unwrapped_models[0], iter([deepcopy(inputs)]), temperature=self.temperature)
```

Severity: medium

Using deepcopy can be computationally expensive, especially when dealing with large tensors. Since compute_per_token_logps modifies the dictionary it receives by popping keys, a copy is necessary. However, a shallow copy using .copy() should be sufficient if the model's forward pass does not modify tensors in-place. This would be more performant.

If you can ensure that the forward pass is free of in-place tensor modifications, consider using a shallow copy for both calls to compute_per_token_logps.

Suggested change (replace `deepcopy(inputs)` with `inputs.copy()`):

```python
ref_per_token_logps_packed = self.compute_per_token_logps(
    ref_model, iter([inputs.copy()]), temperature=self.temperature)
if self.template.padding_free:
    # In padding_free mode, logps are in rmpad format [1, total_tokens]
    # Pad to batch format [batch_size, max_seq_len]
    ref_per_token_logps, _ = pad_logps_back_to_batch(
        logps_rmpad=ref_per_token_logps_packed,
        logits_to_keep=max_seq_len,
        batch_size=batch_size,
        seq_lengths=seq_lengths)
else:
    # In non-padding_free mode, logps are already in batch format [batch_size, seq_len]
    ref_per_token_logps = ref_per_token_logps_packed
batch['ref_per_token_logps'] = ref_per_token_logps

old_per_token_logps_packed = self.compute_per_token_logps(
    self.unwrapped_models[0], iter([inputs.copy()]), temperature=self.temperature)
```
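The reviewer's point about shallow vs. deep copies can be demonstrated without tensors. A `dict.copy()` duplicates only the dictionary itself, so a callee that pops keys cannot disturb the caller's dict, while the (potentially large) values are still shared rather than duplicated. This toy sketch uses plain lists in place of tensors, and `consume` is a hypothetical stand-in for the key-popping behaviour attributed to `compute_per_token_logps`:

```python
# Sketch: why a shallow dict copy is enough when the callee only pops keys.
inputs = {"input_ids": [1, 2, 3], "labels": [2, 3, 4]}

def consume(batch):
    # Mimics a function that pops keys off the dict it receives.
    batch.pop("labels")
    return batch

shallow = inputs.copy()  # copies the dict, not the values
consume(shallow)

assert "labels" in inputs                           # caller's dict is untouched
assert "labels" not in shallow                      # only the copy lost the key
assert inputs["input_ids"] is shallow["input_ids"]  # values shared: no data copy
```

A `deepcopy` would additionally clone every value, which for large tensors means extra memory traffic; that is only needed if the callee mutates the values in place, not when it merely removes keys.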

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces VLLM support to the opsd.sh example script and refactors the computation of per-token log probabilities and entropy within the GRPO trainer. The model_forward method was removed from grpo_trainer.py and its functionality, along with temperature scaling, was moved into a new compute_per_token_logps method in rlhf_mixin.py. The grpo_trainer.py now uses this new method and also introduces rolled_labels for completion mask calculations. However, a critical issue was identified where the labels tensor is not consistently shifted before being passed to compute_logps_and_entropy_from_logits in both grpo_trainer.py and rlhf_mixin.py, which could lead to misaligned log probabilities.

Comment on lines +1060 to +1061:

```python
per_token_logps_packed, per_token_entropy_packed = compute_logps_and_entropy_from_logits(
    logits_packed, labels, compute_entropy=self.compute_entropy)
```

Severity: critical

The labels tensor passed to compute_logps_and_entropy_from_logits appears to be unshifted. For autoregressive models, the labels should be shifted left by one position to align with the logits for next-token prediction (i.e., logits[..., i, :] predicts labels[..., i+1]). Using unshifted labels will result in misaligned log probabilities.

Please consider shifting the labels before this call.

Suggested change:

```diff
 per_token_logps_packed, per_token_entropy_packed = compute_logps_and_entropy_from_logits(
-    logits_packed, labels, compute_entropy=self.compute_entropy)
+    logits_packed, torch.roll(labels, shifts=-1, dims=-1), compute_entropy=self.compute_entropy)
```


```python
if temperature != 1.0:
    output_tensor.div_(temperature)
per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, labels)
```

Severity: critical

Similar to the issue in grpo_trainer.py, the labels passed to compute_logps_and_entropy_from_logits here are unshifted. This will lead to misaligned log probabilities, as output_tensor (logits) at position i is for predicting token i+1. The labels should be shifted left by one.

Suggested change:

```diff
-per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, labels)
+per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, torch.roll(labels, shifts=-1, dims=-1))
```

hjh0119 (Collaborator, Author) commented Mar 26, 2026

/gemini review

gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the computation of per-token log probabilities and entropy within the GRPO trainer, introducing a new compute_per_token_logps method in rlhf_mixin.py to centralize this logic, including temperature scaling. The grpo_trainer.py now utilizes this new method and removes its own model_forward implementation. Additionally, the PR updates an example script to enable and configure VLLM. The review comments suggest improving code safety and clarity by changing in-place tensor division operations to out-of-place ones in both grpo_trainer.py and rlhf_mixin.py.

```python
if is_pp_last_stage and output_tensor is not None:
    logits_packed = output_tensor
    if self.temperature != 1.0:
        logits_packed.div_(self.temperature)
```

Severity: medium

For clarity and to avoid potential side effects, it's generally safer to perform tensor operations out-of-place, especially on tensors that are part of the computation graph. While the current in-place modification of logits_packed seems safe as output_tensor is not used elsewhere, using an out-of-place operation would make the code more robust to future changes.

Consider changing this to an out-of-place division to improve code safety, unless the in-place operation is a deliberate memory optimization.

Suggested change:

```diff
-logits_packed.div_(self.temperature)
+logits_packed = logits_packed / self.temperature
```
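The side effect the reviewer is guarding against is aliasing: an in-place operation mutates every reference to the same object, while an out-of-place one leaves other references untouched. The same idea holds for `tensor.div_` vs. `tensor / t`; this sketch illustrates it with plain Python lists so it stays framework-free:

```python
# Sketch: in-place ops are visible through every alias; out-of-place ops are not.
logits = [2.0, 4.0, 6.0]
alias = logits  # a second reference to the same object

# In-place (analogous to tensor.div_): the alias sees the mutation.
for i in range(len(logits)):
    logits[i] /= 2.0
assert alias == [1.0, 2.0, 3.0]

# Out-of-place (analogous to tensor / t): a new object, original untouched.
scaled = [x / 2.0 for x in logits]
assert alias == [1.0, 2.0, 3.0]
assert scaled == [0.5, 1.0, 1.5]
```

For autograd tensors there is an additional wrinkle: in-place ops can overwrite values the backward pass still needs, which is another reason out-of-place division is the safer default unless the in-place form is a deliberate memory optimization.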

```python
    return None

if temperature != 1.0:
    output_tensor.div_(temperature)
```

Severity: medium

Similar to the change in grpo_trainer.py, consider using an out-of-place division here for better code safety and clarity. While the in-place operation is currently safe as output_tensor is not reused, an out-of-place operation is more robust against future modifications.

Suggested change:

```diff
-output_tensor.div_(temperature)
+output_tensor = output_tensor / temperature
```

hjh0119 merged commit 0b1ecca into modelscope:main on Mar 26, 2026
2 of 3 checks passed
hjh0119 deleted the fix-mg-grpo-0325 branch on March 26, 2026 08:24