
Conversation

@garrett361
Contributor

This PR adds the ability to perform padding-free SFT, which reduces memory costs and increases throughput when per_device_train_batch_size > 1. The model itself must support padding-free training to properly use this feature (Llama and Bamba models support it, for instance; see huggingface/transformers#35861 for some typical throughput improvements).

Pass --padding-free True to use it.

@vwxyzjn
Contributor

vwxyzjn commented Jun 30, 2025

Is this packing?

@garrett361
Contributor Author

Yeah, I think it's the same, @vwxyzjn.

Basically, rather than padding out the uneven examples that are batched together, you just concatenate them and add extra metadata saying where the example boundaries are.
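To make the idea concrete, here is a minimal sketch of a padding-free collator (not the PR's actual code; the field names `input_ids`, `position_ids`, and `cu_seqlens` follow common flash-attention conventions but are illustrative here):

```python
def padding_free_collate(examples):
    """Collate by concatenation instead of padding.

    Each example is a dict with an "input_ids" list. Rather than padding
    every sequence to the batch max length, concatenate them into a single
    row and record where the example boundaries are.
    """
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for ex in examples:
        ids = ex["input_ids"]
        input_ids.extend(ids)
        # Positions restart at 0 for each example, so position-dependent
        # embeddings (e.g. RoPE) behave as if the examples were unpacked.
        position_ids.extend(range(len(ids)))
        # Cumulative sequence lengths let attention kernels prevent
        # examples from attending across boundaries.
        cu_seqlens.append(cu_seqlens[-1] + len(ids))
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "cu_seqlens": cu_seqlens,
    }
```

No pad tokens are ever materialized, which is where the memory and throughput savings come from.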

@garrett361
Contributor Author

Same idea as here

@garrett361
Contributor Author

FYI I'm double checking a possible issue with this at the moment; will report back after I have more info.

@garrett361 garrett361 force-pushed the padding-free-squashing-1 branch from 214853f to 447b05b Compare July 3, 2025 18:43
@garrett361 garrett361 force-pushed the padding-free-squashing-1 branch from 447b05b to ebeefe1 Compare July 3, 2025 18:44
@garrett361
Contributor Author

I'm double checking a possible issue

Okay, I was debugging some bad-looking padding-free SFT training curves, but it turned out to be an issue with the mamba kernels, which was recently fixed here. So, not an issue with this commit.

@garrett361
Contributor Author

Some plots of padding-free training. All curves use 8xA100s and have --per_device_train_batch_size 2. The only difference between any pair of curves on a plot is whether --padding_free is true or false.

First, tuluv3 training on a BambaForCausalLM model. Loss curves are nearly identical (they will never be precisely the same due to numerics), and the throughput of the --padding_free true run is ~40% higher.

[Screenshot (2025-07-03): Bamba loss and throughput curves]

And then the same experiment with LlamaForCausalLM. Similar results: loss curves are similar and --padding_free true throughput is ~40% higher.

[Screenshot (2025-07-03): Llama loss and throughput curves]

CC @vwxyzjn @hamishivi

Collaborator

@hamishivi hamishivi left a comment


Thanks! So it seems like if the concatenated sequence goes over the maximum supported length of the model, there might be issues? I guess in the ppo/grpo packing implementations, our code instead tries to fit samples into a given max length, and uses multiple microbatches if it has to.

Generally happy to merge this in since it's optional; it also might be nice to call it packing, since I think that's a more common term that people will understand.
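For contrast, the "fit samples into a given max length" strategy mentioned above can be sketched as a greedy bin-packer over example lengths. This is purely illustrative, not the repo's ppo/grpo code; a real implementation would also handle single examples longer than max_len (e.g. by splitting them across microbatches):

```python
def greedy_pack(lengths, max_len):
    """Greedily group example lengths into bins whose totals stay within max_len."""
    bins, current, total = [], [], 0
    for n in lengths:
        # Start a new bin whenever the next example would overflow this one.
        if current and total + n > max_len:
            bins.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    if current:
        bins.append(current)
    return bins
```

With `greedy_pack([3, 4, 2, 5], max_len=6)` the examples land in three bins, `[[3], [4, 2], [5]]`, so no bin ever exceeds the model's max length.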


sync_each_batch: bool = False
"""Optionally sync grads every batch when using grad accumulation. Can significantly reduce memory costs."""
padding_free: bool = field(
Collaborator


minor nit: could we call this packing instead of padding_free? To make it clearer to end-users what the feature is.

Suggested change
padding_free: bool = field(
packing: bool = field(

Contributor Author


yep, that is fine

Contributor Author


done

model.gradient_checkpointing_enable()

# DataLoaders creation:
if args.padding_free:
Collaborator


Suggested change
if args.padding_free:
if args.packing:

Contributor Author


done

@garrett361
Contributor Author

garrett361 commented Jul 8, 2025

So it seems like if the concatenated sequence goes over the maximum supported length of the model, there might be issues?

Maybe? I'm not sure what the max supported length affects, apart from RoPE embeddings, or similar mechanisms like ALiBi or fixed positional embeddings. For the models I've used padding-free/packing with (Llama, Bamba), there's no problem, because the explicitly position-dependent bits adjust accordingly. Certainly this only works for some model classes, which is why the collator raises a warning.

I guess in the ppo/grpo packing implementations, our code instead tries to fit samples into a given max length, and does use multiple microbatches if it has to.

I'm only very familiar with the SFT finetune.py script at the moment, so can't yet speak to how you might use this for ppo/grpo packing.
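The "position-dependent bits adjust accordingly" point above can be illustrated with a toy rotary-embedding calculation (a hypothetical `rope_angles` helper, not the transformers implementation): the rotary angles depend only on the position ids, so restarting positions per packed example reproduces the unpacked angles exactly.

```python
def rope_angles(position_ids, dim=4, base=10000.0):
    """Toy rotary angles per (position, frequency) pair."""
    # Standard RoPE inverse-frequency schedule: base^(-2i/dim).
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    # Each token's angles are fully determined by its position id.
    return [[p * f for f in freqs] for p in position_ids]

# Two examples of lengths 3 and 2 packed with restarted position_ids
# yield the same angles as the two examples processed separately.
packed = rope_angles([0, 1, 2, 0, 1])
separate = rope_angles([0, 1, 2]) + rope_angles([0, 1])
```

Here `packed == separate`, which is why the loss curves above match the padded runs so closely.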

@hamishivi hamishivi merged commit e75f1f2 into allenai:main Jul 8, 2025
3 checks passed
@fabianlim fabianlim mentioned this pull request Jul 10, 2025