feat: add gradient accumulation support #2646
Conversation
Pull request overview
This PR adds gradient accumulation support to the training pipeline, enabling simulation of larger batch sizes without increasing GPU memory usage. The implementation leverages Accelerator's built-in gradient accumulation features.
Key changes:
- Added `gradient_accumulation_steps` configuration parameter (default: 1) to control how many batches to accumulate before performing an optimizer step
- Updated training loop to properly handle gradient synchronization, skipping evaluation/checkpointing during accumulation steps
- Modified effective batch size calculations throughout the codebase to account for gradient accumulation
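The accumulation control flow described above can be sketched in plain Python. This is a minimal stdlib simulation of the logic, not lerobot's actual training loop; the function and variable names are illustrative:

```python
# Minimal simulation of gradient-accumulation control flow (illustrative only,
# not lerobot's actual code). An optimizer step fires only when gradients sync,
# i.e. every `accum_steps` batches; eval/checkpointing is skipped otherwise.
def simulate_training(total_batches: int, accum_steps: int):
    optimizer_steps = 0
    eval_points = []
    for batch_idx in range(total_batches):
        sync_gradients = (batch_idx + 1) % accum_steps == 0
        if sync_gradients:
            optimizer_steps += 1           # optimizer.step() + zero_grad()
            eval_points.append(batch_idx)  # eval/checkpoint allowed here
        # during accumulation-only steps, just backward() runs
    return optimizer_steps, eval_points

steps, evals = simulate_training(total_batches=8, accum_steps=4)
print(steps, evals)  # 2 optimizer steps, after batches 3 and 7
```

With 8 batches and 4 accumulation steps, only 2 optimizer steps occur, which is why evaluation and checkpointing must key off the sync boundary rather than the raw batch index.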
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/lerobot/configs/train.py | Adds gradient_accumulation_steps configuration parameter with documentation |
| src/lerobot/scripts/lerobot_train.py | Updates training loop and update_policy() to use Accelerator's accumulation context, removes unused lock parameter, and adds gradient sync checks |
| src/lerobot/utils/logging_utils.py | Updates MetricsTracker.step() to calculate effective batch size including gradient accumulation |
| tests/training/test_update_policy.py | Adds comprehensive tests for gradient sync behavior and mathematical equivalence |
| tests/utils/test_logging_utils.py | Adds test for MetricsTracker with gradient accumulation |
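The effective batch size bookkeeping mentioned for `MetricsTracker.step()` reduces to a simple product. A hypothetical helper (not the actual `MetricsTracker` code) shows the arithmetic:

```python
# Effective batch size with gradient accumulation (illustrative helper, not
# the real MetricsTracker implementation): per-device batch size, times the
# number of processes, times the number of accumulation steps.
def effective_batch_size(batch_size: int, num_processes: int, accum_steps: int) -> int:
    return batch_size * num_processes * accum_steps

print(effective_batch_size(8, 1, 4))  # 32
```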
Remove lock parameter and related changes that are outside the scope of gradient accumulation feature. This keeps the branch focused on its primary topic and reduces unnecessary diff from main.
Hey @irisTa56, that's a very useful feature! I did this locally to train bigger models. Could you maybe resolve the conflicts with main, so we can kick off a review?

@jadechoghari
Thanks @irisTa56! The tests in this PR seem unnecessary since we're using the accelerate API, and testing this functionality is the responsibility of accelerate itself. Removing the accelerate-related tests would make the PR cleaner and shorter 😄
Thanks for the feedback! I agree and have removed the tests. |
What this does
Adds gradient accumulation support to the training script.
This allows simulating larger batch sizes without increasing GPU memory usage, which is useful when training large models or when memory is limited.
- Added `gradient_accumulation_steps` configuration parameter to `TrainPipelineConfig` (default: 1)
- Updated `update_policy()` to use the `accelerator.accumulate()` context manager
- Updated `MetricsTracker.step()` to account for gradient accumulation (each `step` counted as an optimizer step)

How it was tested
Added tests in `tests/training/test_update_policy.py`:
- `test_update_policy_sync_gradients`: Verifies gradient sync behavior
- `test_update_policy_gradient_accumulation_equivalence`: Validates mathematical equivalence
- `test_metrics_tracker_step_with_accelerator` in `tests/utils/test_logging_utils.py`

P.S. I have removed those tests since we're using the Accelerate API, and testing it is out of the scope of this PR. But they can be seen at d5df208 and d365eb8.
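The equivalence property the removed test checked can also be verified by hand: summing per-microbatch mean gradients, each divided by the number of accumulation steps, equals the full-batch mean gradient when microbatches are equal-sized. A stdlib sketch for a one-parameter linear model with MSE loss (purely illustrative; the removed test exercised the real policy and optimizer):

```python
import math

# Hand-check of gradient-accumulation equivalence for MSE on y_hat = w * x.
# dL/dw for one sample is 2 * (w*x - y) * x; the full-batch gradient is
# the mean over samples.
def grad(w, xs, ys):
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad(w, xs, ys)

# Accumulate over 2 equal-sized microbatches: each microbatch gradient is
# divided by the number of accumulation steps, then summed.
accum = grad(w, xs[:2], ys[:2]) / 2 + grad(w, xs[2:], ys[2:]) / 2

assert math.isclose(full, accum)  # identical up to floating-point error
```

This loss-scaling convention (dividing each microbatch loss, and hence its gradient, by the accumulation step count) is what Accelerate applies inside `accelerator.accumulate()`.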
How to checkout & try? (for the reviewer)
```shell
# Effective batch size: 8 × 1 × 4 = 32
lerobot-train \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --policy.type=act \
  --policy.repo_id=foo/my_policy \
  --policy.push_to_hub=false \
  --batch_size=8 \
  --gradient_accumulation_steps=4 \
  --log_freq=1 \
  --steps=50
```