Fix OOM in CI by reducing batch size in GRPO/RLOO VLM tests#5767

Merged
albertvillanova merged 8 commits into
mainfrom
pfix-5750-per_device_train_batch_size
May 21, 2026

@albertvillanova albertvillanova commented May 13, 2026

Member

Fix OOM in CI by reducing batch size in GRPO/RLOO VLM tests.

This PR updates the test configurations for GRPO and RLOO trainers to further reduce memory usage during VLM (Vision-Language Model) training. The primary change is lowering the per_device_train_batch_size from 3 to 1 in various test cases, with updated comments to clarify that this is to avoid out-of-memory (OOM) errors due to the memory-intensive nature of VLM training.

Partial fix for:

Aligned with:

Related to:

Motivation

CI was OOMing because -n auto runs 4 xdist workers sharing one 14.74 GiB GPU. When test_train_vlm[tiny-Gemma4ForConditionalGeneration] ran in one worker (~9.17 GiB), a concurrent worker attempting to allocate 2.95 GiB found only 2.87 GiB free. Two fixes address this at different levels:

Changes

Test configuration updates for memory optimization:

  • Reduced per_device_train_batch_size from 3 to 1 in all relevant test cases within tests/test_grpo_trainer.py to prevent OOM errors during VLM training.
  • Made the same batch size reduction in all relevant test cases within tests/test_rloo_trainer.py for consistency and to address VLM memory constraints.

These changes ensure that the test suite can run reliably on machines with limited memory resources when training VLMs.


Note

Low Risk
Test-only config tweaks to lower GPU memory usage in CI; low functional risk beyond potentially reducing VLM training coverage in tests.

Overview
Reduces GPU memory pressure in GRPO and RLOO vision-language training tests by lowering per_device_train_batch_size and num_generations from 3 to 2, with updated comments clarifying this is CI-only to avoid OOM.

Updates the GRPO multimodal tools test to reflect the new num_generations=2 behavior by adjusting the mocked generate batch shapes and the expected tools/call_frequency assertion (from 2/3 to 1/2).

Reviewed by Cursor Bugbot for commit 6184460. Bugbot is set up for automated code reviews on this repo. Configure here.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f25ebbc. Configure here.

Comment thread tests/test_grpo_trainer.py Outdated
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova

Member Author

@qgallouedec

Member

is this pr still needed?
Technically it would work, but I feel like num_generations=2 is ideally not something we want, because with a group of 2 rewards the standardized advantages are (r1-μ)/σ = sign(r1-r2)/√2 ≈ ±0.707 for one completion and ∓0.707 for the other

so no matter how big or small the actual reward gap is, the advantages are always exactly +0.707 and -0.707. The magnitude of the reward difference gets completely erased. You're left with just "which completion was better," i.e. a pairwise sign, not a group-relative advantage.
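The collapse described above can be checked numerically. This is a minimal sketch, assuming advantages are standardized with the sample standard deviation (ddof=1) and that the two rewards differ; the small epsilon that a real trainer may add to the denominator is left out, and the function name is hypothetical.

```python
import statistics


def standardized_advantages(rewards):
    """Group-relative advantages: (r - mean) / sample std (ddof=1).

    Assumes at least two rewards that are not all equal, so the std is non-zero.
    """
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)  # sample standard deviation (divides by n - 1)
    return [(r - mean) / std for r in rewards]


# Three very different reward gaps, each with a group of exactly 2 completions.
for pair in ([0.0, 1.0], [0.49, 0.51], [-100.0, 100.0]):
    print(pair, [round(a, 3) for a in standardized_advantages(pair)])
# Each pair prints [-0.707, 0.707]: only the ordering of the rewards survives.

# With 3 completions the relative spacing of the rewards is preserved.
print([round(a, 3) for a in standardized_advantages([0.0, 0.1, 1.0])])
```

With a group of 3, the advantages are no longer pinned to fixed values, which is the reason larger groups give a more informative signal.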

@albertvillanova

albertvillanova commented May 17, 2026

Member Author

@qgallouedec,

Yes, this PR is still needed. The other OOM mitigations (reducing the tiny Gemma4 model footprint and clearing chained exception tracebacks) reduce accumulated memory between test reruns; they don't reduce the peak memory during a single VLM test run. Reducing per_device_train_batch_size and num_generations directly lowers peak GPU memory in both the generation phase and the training phase. These are independent, complementary fixes.

On num_generations=2, you're right that with only 2 completions per prompt, standardized advantages collapse to ±1/√2 ≈ ±0.707, losing reward magnitude information. This is a valid concern for production training, where num_generations should be larger (typically ≥ 4–8) to get meaningful advantage estimates. In this context, however, num_generations=2 is purely a CI test parameter: the goal is to verify that the code runs end-to-end without OOM errors, not to validate training quality or convergence.

It is also worth noting that num_generations=2 is already used across other VLM tests in our CI, e.g.:

num_generations=2,

I'm adding a note to make this distinction explicit, clarifying that num_generations=2 is a CI-only concession and production training should use more generations.
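The actual test diff is not shown in this thread. As a rough sketch only, a CI-only configuration with such a note might look like the following. The dictionary and helper names are hypothetical, and the divisibility check is an assumption about how the trainer groups completions, not a quote of its validation code.

```python
# Hypothetical keyword arguments for a CI-only VLM test config; in the real
# tests these would presumably be passed to the trainer's config class.
ci_vlm_test_kwargs = {
    # Reduced from 3 to lower peak GPU memory when several test workers share one GPU.
    "per_device_train_batch_size": 2,
    # CI-only concession: production training should use more generations.
    "num_generations": 2,
    "report_to": "none",
}


def completions_divide_evenly(kwargs, num_processes=1, steps_per_generation=1):
    """Assumed constraint: the generation batch must split into whole groups
    of num_generations completions per prompt."""
    generation_batch = (
        kwargs["per_device_train_batch_size"] * num_processes * steps_per_generation
    )
    return generation_batch % kwargs["num_generations"] == 0


print(completions_divide_evenly(ci_vlm_test_kwargs))
```

Under that assumption, a batch size of 1 on a single process would not hold a full group of 2 completions, which may be why both values end up at 2.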

@qgallouedec

Member

ok sounds good

@albertvillanova albertvillanova merged commit bbb3976 into main May 21, 2026
13 checks passed
@albertvillanova albertvillanova deleted the pfix-5750-per_device_train_batch_size branch May 21, 2026 04:59

3 participants