Conversation
Contributor
Codecov ReportAttention: Patch coverage is 📢 Thoughts on this report? Let us know! |
winglian
reviewed
May 7, 2025
winglian
reviewed
May 7, 2025
winglian
reviewed
May 7, 2025
salmanmohammadi
left a comment
Contributor
There was a problem hiding this comment.
Really really nice. A few nits, and we need to sync the GRPOTrainer changes with upstream, but this looks good to me.
Collaborator
Author
|
Note: GRPO + SP + Liger results in exploding losses. I suggest we merge this now (once tests pass) and follow up with a fix for this case. |
… instability; lint
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR implements support for sequence parallelism (SP) for the GRPO trainer. This includes a custom sampler, similar to the one implemented in TRL, but with logic for replicating samples across processes in the same SP group.
To make this happen, we need to override a lot more code from the GRPO trainer (e.g., the
__init__function). This could be greatly improved by refactoring the trainer upstream to be a lot more modular and therefore more easily extensible; most of the code we're taking from the superclass is unchanged.We were also able to fix the issue in the
pad_to_sequence_len: falsecase where, for specific input lengths, thering_flash_attnbatch ring attention function experienced extremely large orinfgradient norms, by callingtorch.compileon the function 🤔Motivation and Context
GRPO for sufficiently advanced tasks will likely require longer sequence lengths than we can fit on on a single GPU. Hence, let's add SP support.
Follow-ups:
How has this been tested?
pad_to_seq_len: falsevs.pad_to_seq_len: true,sample_packing: falsevs.sample_packing: true, etc.)Screenshots (if appropriate)
Example GRPO + SP training run:
Note that the config differed a fair bit from the one in our blog post, so they're not directly comparable.
Types of changes
Social Handles (Optional)