Fix OOM in CI by reducing batch size in VLM SFT tests #5687
Merged
Member
It seems to work!
qgallouedec approved these changes Apr 30, 2026
qgallouedec pushed a commit that referenced this pull request May 3, 2026
Fix OOM in CI by reducing batch size in VLM SFT tests.
Partial fix for:
Motivation
VLM training tests in `test_sft_trainer.py` were running with the default `per_device_train_batch_size=8`. For Gemma3, with `vocab_size=262208` (production-scale, never reduced for tiny models) and `mm_tokens_per_image=256`, each training step computes logits of shape `[8, 279, 262208]`.

PyTorch needs several float32 copies of this tensor for log-softmax and its gradient, pushing peak GPU memory to ~9 GiB per worker. With 4 parallel pytest-xdist workers this caused CUDA out-of-memory errors for other concurrent tests.
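The arithmetic behind these figures can be sketched as follows. This is a rough estimate, not a measurement: the function names are hypothetical, and the factor of 4 live copies is an assumption chosen to stand in for "several float32 copies". The shape values come from the description above, and a batch size of 1 is included for comparison.

```python
# Back-of-the-envelope estimate of logits memory; not a measurement.
GIB = 1024 ** 3


def logits_bytes(batch_size, seq_len=279, vocab_size=262208, bytes_per_elem=4):
    """Size in bytes of one float32 logits tensor of shape [batch, seq, vocab]."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


def estimated_peak_gib(batch_size, copies=4):
    """Assumed peak: `copies` live tensors of logits size (copies=4 is a guess)."""
    return copies * logits_bytes(batch_size) / GIB


for bs in (8, 1):
    print(
        f"batch_size={bs}: one copy {logits_bytes(bs) / GIB:.2f} GiB, "
        f"~{estimated_peak_gib(bs):.1f} GiB with 4 copies"
    )
```

Under these assumptions the estimate is about 8.7 GiB for a batch size of 8 and about 1.1 GiB for a batch size of 1. The logits tensor scales linearly with batch size, so the batch dimension is the cheapest lever to pull when the vocabulary size is fixed.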
Solution
Set `per_device_train_batch_size=1` in `test_train_vlm`, `test_train_vlm_multi_image`, and `test_train_vlm_prompt_completion`, following the pattern already used in `test_train_vlm_gemma_3n`. This drops peak GPU memory to ~1.1 GiB per worker for Gemma3, leaving ample headroom for parallel execution.

Note
Low Risk
Test-only change that lowers batch size to avoid CI OOMs; no production code paths are modified.
Overview
Reduces GPU memory pressure in vision-language SFT integration tests by explicitly setting `per_device_train_batch_size=1` in `test_train_vlm`, `test_train_vlm_multi_image`, and `test_train_vlm_prompt_completion`.

This prevents CI CUDA OOMs during parallel test execution while keeping the VLM-specific `max_length=None` behavior unchanged.

Reviewed by Cursor Bugbot for commit 0480d77. Bugbot is set up for automated code reviews on this repo.