Fix completion_mask alignment and temperature scaling in Megatron GRPO trainer #8427
hjh0119 merged 2 commits into modelscope:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors and improves the log-probability computation within the Megatron GRPO trainer. The changes address a critical misalignment between completion_mask and per_token_logps, and move temperature scaling into a centralized compute_per_token_logps method.
Summary of Changes

This pull request significantly enhances the Megatron GRPO trainer by addressing critical issues related to log-probability calculation and data alignment. It corrects the alignment between completion_mask and the per-token log probabilities, and centralizes temperature scaling in the new compute_per_token_logps method.
Code Review
This pull request introduces VLLM support and updates training parameters in opsd.sh. The main changes involve a significant refactoring of how per-token log probabilities and entropy are computed within the GRPO trainer. A new compute_per_token_logps method is added to rlhf_mixin.py to handle model forwarding, temperature scaling, and logps computation, replacing the previous model_forward method in grpo_trainer.py. Additionally, grpo_trainer.py includes a fix for label alignment using torch.roll and streamlines the forward_step and loss_func methods. A suggestion was made to use a shallow copy (.copy()) instead of deepcopy for performance optimization when passing inputs to compute_per_token_logps.
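To ground the discussion, here is a minimal sketch of what such a centralized compute_per_token_logps helper could look like. It is an illustration only: it assumes an HF-style model whose output exposes .logits and the usual -100 ignore-index for prompt tokens, neither of which is spelled out in this PR.

```python
import torch

def compute_per_token_logps(model, data_iterator, temperature=1.0):
    """Sketch only: forward the model, apply temperature scaling, and
    reduce logits to per-token log-probabilities of the next token."""
    inputs = next(data_iterator)
    labels = inputs.pop('labels')            # [batch, seq]
    with torch.no_grad():
        logits = model(**inputs).logits      # [batch, seq, vocab]
    if temperature != 1.0:
        logits = logits / temperature        # scale before the softmax
    # Logits at position i predict the token at position i + 1, so the
    # labels are shifted left to line up with the logits.
    shifted = torch.roll(labels, shifts=-1, dims=-1)
    logps = torch.log_softmax(logits, dim=-1)
    # Clamp -100 placeholders so gather stays in range; those positions
    # are zeroed out later by the completion mask.
    index = shifted.clamp(min=0).unsqueeze(-1)
    return torch.gather(logps, dim=-1, index=index).squeeze(-1)
```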
```diff
+ref_per_token_logps_packed = self.compute_per_token_logps(
+    ref_model, iter([deepcopy(inputs)]), temperature=self.temperature)
 if self.template.padding_free:
     # In padding_free mode, logps are in rmpad format [1, total_tokens]
     # Pad to batch format [batch_size, max_seq_len]
     ref_per_token_logps, _ = pad_logps_back_to_batch(
-        logps_rmpad=ref_per_token_logps_raw,
+        logps_rmpad=ref_per_token_logps_packed,
         logits_to_keep=max_seq_len,
         batch_size=batch_size,
         seq_lengths=seq_lengths)
 else:
     # In non-padding_free mode, logps are already in batch format [batch_size, seq_len]
-    ref_per_token_logps = ref_per_token_logps_raw
+    ref_per_token_logps = ref_per_token_logps_packed
 batch['ref_per_token_logps'] = ref_per_token_logps
```

```diff
-old_per_token_logps_raw = self.model_forward(
-    self.unwrapped_models[0], iter([deepcopy(inputs)]), no_grad=True, per_token=True)['logps']
+old_per_token_logps_packed = self.compute_per_token_logps(
+    self.unwrapped_models[0], iter([deepcopy(inputs)]), temperature=self.temperature)
```
Using deepcopy can be computationally expensive, especially when dealing with large tensors. Since compute_per_token_logps modifies the dictionary it receives by popping keys, a copy is necessary. However, a shallow copy using .copy() should be sufficient if the model's forward pass does not modify tensors in-place. This would be more performant.
If you can ensure that the forward pass is free of in-place tensor modifications, consider using a shallow copy for both calls to compute_per_token_logps.
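To make the trade-off concrete, a small self-contained illustration (the input dict here is hypothetical) of why a shallow copy protects the caller from the popped keys without duplicating any tensor data:

```python
import torch
from copy import deepcopy

inputs = {'input_ids': torch.tensor([[1, 2, 3]]),
          'labels': torch.tensor([[2, 3, 4]])}

shallow = inputs.copy()
shallow.pop('labels')                                # key removed from the copy only
assert 'labels' in inputs                            # caller's dict is untouched
assert shallow['input_ids'] is inputs['input_ids']   # tensors are shared, not cloned

deep = deepcopy(inputs)
assert deep['input_ids'] is not inputs['input_ids']  # deepcopy clones every tensor
```

The shallow copy is safe only as long as the forward pass never mutates the shared tensors in place, which is exactly the condition the comment above calls out.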
Suggested change:

```diff
 ref_per_token_logps_packed = self.compute_per_token_logps(
-    ref_model, iter([deepcopy(inputs)]), temperature=self.temperature)
+    ref_model, iter([inputs.copy()]), temperature=self.temperature)
 if self.template.padding_free:
     # In padding_free mode, logps are in rmpad format [1, total_tokens]
     # Pad to batch format [batch_size, max_seq_len]
     ref_per_token_logps, _ = pad_logps_back_to_batch(
         logps_rmpad=ref_per_token_logps_packed,
         logits_to_keep=max_seq_len,
         batch_size=batch_size,
         seq_lengths=seq_lengths)
 else:
     # In non-padding_free mode, logps are already in batch format [batch_size, seq_len]
     ref_per_token_logps = ref_per_token_logps_packed
 batch['ref_per_token_logps'] = ref_per_token_logps
 old_per_token_logps_packed = self.compute_per_token_logps(
-    self.unwrapped_models[0], iter([deepcopy(inputs)]), temperature=self.temperature)
+    self.unwrapped_models[0], iter([inputs.copy()]), temperature=self.temperature)
```
Code Review
This pull request introduces VLLM support to the opsd.sh example script and refactors the computation of per-token log probabilities and entropy within the GRPO trainer. The model_forward method was removed from grpo_trainer.py and its functionality, along with temperature scaling, was moved into a new compute_per_token_logps method in rlhf_mixin.py. The grpo_trainer.py now uses this new method and also introduces rolled_labels for completion mask calculations. However, a critical issue was identified where the labels tensor is not consistently shifted before being passed to compute_logps_and_entropy_from_logits in both grpo_trainer.py and rlhf_mixin.py, which could lead to misaligned log probabilities.
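As a rough sketch of the idea (the -100 ignore-index and variable names here are assumptions, not the trainer's actual code): rolling the labels left aligns them with the logits, and the completion mask can then be derived from the rolled labels so that both index the same positions.

```python
import torch

labels = torch.tensor([[-100, -100, 5, 6, 7]])          # -100 marks prompt positions
rolled_labels = torch.roll(labels, shifts=-1, dims=-1)  # [[-100, 5, 6, 7, -100]]
# Position i now holds the token that logits[..., i, :] predict, so a mask
# built from rolled_labels lines up with the per-token logps.
completion_mask = (rolled_labels != -100).float()       # [[0., 1., 1., 1., 0.]]
```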
```python
per_token_logps_packed, per_token_entropy_packed = compute_logps_and_entropy_from_logits(
    logits_packed, labels, compute_entropy=self.compute_entropy)
```
The labels tensor passed to compute_logps_and_entropy_from_logits appears to be unshifted. For autoregressive models, the labels should be shifted left by one position to align with the logits for next-token prediction (i.e., logits[..., i, :] predicts labels[..., i+1]). Using unshifted labels will result in misaligned log probabilities.
Please consider shifting the labels before this call.
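A toy example (token ids are made up) showing the alignment torch.roll produces, and the wrap-around the mask has to cover:

```python
import torch

labels = torch.tensor([[10, 11, 12, 13]])
shifted = torch.roll(labels, shifts=-1, dims=-1)
print(shifted)  # tensor([[11, 12, 13, 10]])
# Position i of `shifted` now holds the target for logits[..., i, :].
# Note that torch.roll wraps the first label around to the last position,
# so that final position must be excluded by the loss/completion mask.
```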
Suggested change:

```diff
 per_token_logps_packed, per_token_entropy_packed = compute_logps_and_entropy_from_logits(
-    logits_packed, labels, compute_entropy=self.compute_entropy)
+    logits_packed, torch.roll(labels, shifts=-1, dims=-1), compute_entropy=self.compute_entropy)
```
```python
if temperature != 1.0:
    output_tensor.div_(temperature)
per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, labels)
```
Similar to the issue in grpo_trainer.py, the labels passed to compute_logps_and_entropy_from_logits here are unshifted. This will lead to misaligned log probabilities, as output_tensor (logits) at position i is for predicting token i+1. The labels should be shifted left by one.
Suggested change:

```diff
-per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, labels)
+per_token_logps, _ = compute_logps_and_entropy_from_logits(output_tensor, torch.roll(labels, shifts=-1, dims=-1))
```
/gemini review
Code Review
This pull request refactors the computation of per-token log probabilities and entropy within the GRPO trainer, introducing a new compute_per_token_logps method in rlhf_mixin.py to centralize this logic, including temperature scaling. The grpo_trainer.py now utilizes this new method and removes its own model_forward implementation. Additionally, the PR updates an example script to enable and configure VLLM. The review comments suggest improving code safety and clarity by changing in-place tensor division operations to out-of-place ones in both grpo_trainer.py and rlhf_mixin.py.
```python
if is_pp_last_stage and output_tensor is not None:
    logits_packed = output_tensor
    if self.temperature != 1.0:
        logits_packed.div_(self.temperature)
```
For clarity and to avoid potential side effects, it's generally safer to perform tensor operations out-of-place, especially on tensors that are part of the computation graph. While the current in-place modification of logits_packed seems safe as output_tensor is not used elsewhere, using an out-of-place operation would make the code more robust to future changes.
Consider changing this to an out-of-place division to improve code safety, unless the in-place operation is a deliberate memory optimization.
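A self-contained toy snippet (not the trainer's actual tensors) of the aliasing hazard the comment is pointing at:

```python
import torch

logits = torch.ones(2, 3)
alias = logits                 # a second reference to the same storage

logits.div_(2.0)               # in-place: every holder of the tensor sees 0.5
print(alias[0, 0].item())      # 0.5

logits = logits / 2.0          # out-of-place: only rebinds the local name
print(alias[0, 0].item())      # still 0.5; alias is unaffected
```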
Suggested change:

```diff
-logits_packed.div_(self.temperature)
+logits_packed = logits_packed / self.temperature
```
```python
if temperature != 1.0:
    output_tensor.div_(temperature)
```
Similar to the change in grpo_trainer.py, consider using an out-of-place division here for better code safety and clarity. While the in-place operation is currently safe as output_tensor is not reused, an out-of-place operation is more robust against future modifications.
Suggested change:

```diff
-output_tensor.div_(temperature)
+output_tensor = output_tensor / temperature
```
Summary
- Fix completion_mask misalignment with per_token_logps