
Enable chunked NLL loss with VLM in SFT #5684

Merged
qgallouedec merged 30 commits into main from chunked-nll-vlm on May 8, 2026
@qgallouedec commented Apr 29, 2026

Member

Requires #5676


Note

Medium Risk
Expands the chunked_nll training path to VLM and MoE wrappers by patching model forward, which can subtly affect loss/gradient behavior across many model families and transformers versions.

Overview
Enables loss_type='chunked_nll' for vision-language models by extending _patch_chunked_ce_lm_head to handle VLM config (text_config), run the multimodal wrapper (base_model/model) so vision token injection occurs, and compute MoE auxiliary loss using the correct config fields.
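
To make the idea behind a chunked NLL concrete, here is a framework-free sketch. It is not the code from this PR. It assumes the loss is a mean token NLL over non-ignored labels, and that "chunked" means projecting only `chunk_size` hidden states through the LM head at a time. The function names, the toy values, and the `-100` ignore index (PyTorch's cross-entropy default) are illustrative assumptions. In plain Python there is no real memory saving; the sketch only shows the accumulation arithmetic that keeps the chunked result equal to the one-pass result.

```python
import math

IGNORE_INDEX = -100  # assumed label value for tokens excluded from the loss


def token_nll(hidden_vec, lm_head, label):
    """NLL of one token: project to vocab logits, then take -log_softmax[label]."""
    logits = [sum(h * w for h, w in zip(hidden_vec, row)) for row in lm_head]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[label]


def full_nll(hidden, lm_head, labels):
    """Reference: mean NLL over all non-ignored tokens in one pass."""
    losses = [
        token_nll(h, lm_head, y)
        for h, y in zip(hidden, labels)
        if y != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


def chunked_nll(hidden, lm_head, labels, chunk_size):
    """Same loss, visiting at most `chunk_size` tokens per step."""
    total, count = 0.0, 0
    for start in range(0, len(hidden), chunk_size):
        chunk_h = hidden[start:start + chunk_size]
        chunk_y = labels[start:start + chunk_size]
        for h, y in zip(chunk_h, chunk_y):
            if y != IGNORE_INDEX:
                total += token_nll(h, lm_head, y)
                count += 1
    # Divide once by the global count of valid tokens. Averaging per-chunk
    # means instead would be wrong when chunks hold different numbers of
    # valid tokens.
    return total / count


hidden = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.7], [0.3, 0.3]]
lm_head = [[0.2, 0.1], [-0.3, 0.5], [0.6, -0.4]]  # toy 3-token vocabulary
labels = [0, IGNORE_INDEX, 2, 1, 0]

reference = full_nll(hidden, lm_head, labels)
chunked = chunked_nll(hidden, lm_head, labels, chunk_size=2)
```

The design point worth noting is the single division at the end: summing per-token losses and counting valid tokens across chunks is what makes the result independent of `chunk_size`.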

Updates SFTTrainer to apply the patched chunked-loss forward for VLMs (removing the prior VLM restriction) and relaxes SFTConfig docs/help text to reflect that chunked_nll is now only incompatible with use_liger_kernel.
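
The compatibility rule described above can be sketched as a small check. This is a hypothetical illustration, not TRL's actual validation code: `ToyConfig` and `check_loss_compat` are made-up names, and only the `loss_type` and `use_liger_kernel` fields and the `'chunked_nll'` value come from the description.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in holding only the two fields discussed here."""
    loss_type: str = "nll"
    use_liger_kernel: bool = False


def check_loss_compat(config):
    """Reject the one combination the description calls incompatible.

    Whether the model is a VLM plays no part in the check, mirroring the
    removal of the earlier VLM restriction.
    """
    if config.loss_type == "chunked_nll" and config.use_liger_kernel:
        raise ValueError(
            "loss_type='chunked_nll' cannot be combined with use_liger_kernel"
        )


# Accepted: chunked NLL without the Liger kernel.
check_loss_compat(ToyConfig(loss_type="chunked_nll"))
```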

Adds/expands tests to cover chunked NLL training on multiple VLM families, plus forward/backward equivalence tests for patched chunked CE on VLMs (including a VLM MoE aux-loss case), and tightens the PEFT chunked-NLL test to assert base weights stay frozen while adapter params update.
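
The tightened PEFT assertion (base weights frozen, adapter params updated) boils down to comparing parameter snapshots taken before and after training. The following is a minimal sketch of that comparison on plain lists, not the test from this PR; the helper names are hypothetical, and matching adapters by the substring `"lora"` is an assumption about parameter naming.

```python
def changed_params(before, after):
    """Return the names of parameters whose values differ between snapshots."""
    return {
        name
        for name, old in before.items()
        if any(a != b for a, b in zip(old, after[name]))
    }


def peft_update_violations(before, after, is_adapter):
    """List every parameter that breaks the rule: adapters move, base stays."""
    changed = changed_params(before, after)
    violations = []
    for name in before:
        if is_adapter(name) and name not in changed:
            violations.append(f"adapter param did not update: {name}")
        elif not is_adapter(name) and name in changed:
            violations.append(f"base param changed: {name}")
    return violations


# Toy snapshots: the base weight is untouched, the adapter weight moved.
before = {"layer.weight": [1.0, 2.0], "layer.lora_A.weight": [0.0, 0.1]}
after = {"layer.weight": [1.0, 2.0], "layer.lora_A.weight": [0.05, 0.1]}

violations = peft_update_violations(before, after, lambda n: "lora" in n)
```

Returning a list rather than failing on the first mismatch makes it easier to see every offending layer in one run.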

Reviewed by Cursor Bugbot for commit ec0cad7.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread tests/test_sft_trainer.py
Comment thread tests/test_sft_trainer.py
Comment thread trl/trainer/sft_trainer.py
@qgallouedec

Member Author

@codex review

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Keep them coming!


Base automatically changed from chunked_nll_peft to main May 5, 2026 17:07
Comment thread trl/trainer/sft_trainer.py

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 00cb84b.

Comment thread tests/test_sft_trainer.py Outdated
Comment thread tests/test_sft_trainer.py
# the model itself. We should investigate this further, but for now we just skip these params.
# fmt: off
if (
model_id == "trl-internal-testing/tiny-Gemma3ForConditionalGeneration" and "model.vision_tower.vision_model.head" in n or

Member

nit: can we refactor this a bit? Any reason they didn't change?

Member Author

I'm not sure; it's been an open question for a long time, but it's never been urgent enough for me to set aside time to investigate. My hunch is that the gradients reaching the vision tower are too weak for the weights to be updated, either because of the structure of the tiny model or because of the initialization values.

Member Author

For the refactor, I'd recommend keeping things as they are, mostly because it's consistent with TestDPOTrainer.test_train_vlm and TestSFTTrainer.test_train_vlm, and it explicitly shows which layers are problematic.
Although I agree it's not pretty.
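
For reference, a table-driven variant of the skip condition could look roughly like the sketch below. It is only an illustration of the trade-off discussed in this thread, not a proposed change: `SKIPPED_PARAM_FRAGMENTS` and `should_skip` are hypothetical names, and only the Gemma3 entry is taken from the snippet under review (the remaining conditions of the real test are not reproduced).

```python
# Parameter-name fragments known not to update during the tiny-model test,
# keyed by model id. Only this one entry is visible in the reviewed snippet.
SKIPPED_PARAM_FRAGMENTS = {
    "trl-internal-testing/tiny-Gemma3ForConditionalGeneration": (
        "model.vision_tower.vision_model.head",
    ),
}


def should_skip(model_id, param_name):
    """True when `param_name` contains a skip fragment registered for `model_id`."""
    fragments = SKIPPED_PARAM_FRAGMENTS.get(model_id, ())
    # Substring match, like the `"..." in n` check in the original condition.
    return any(fragment in param_name for fragment in fragments)


gemma3 = "trl-internal-testing/tiny-Gemma3ForConditionalGeneration"
skip_head = should_skip(gemma3, "model.vision_tower.vision_model.head.probe")
skip_other = should_skip(gemma3, "model.language_model.layers.0.mlp.weight")
```

A mapping like this is shorter, but it hides the per-model conditions behind a lookup, which is the readability cost the explicit `if` chain avoids.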

@qgallouedec merged commit b05330a into main on May 8, 2026 (13 checks passed)
@qgallouedec deleted the chunked-nll-vlm branch on May 8, 2026 14:33
3 participants