Enable users to use their own loss functions + deal with prefetching for grad accum #34198
Conversation
ArthurZucker left a comment
LGTM, IMO a regression test on the grad norms could be fairly nice!
src/transformers/trainer.py (Outdated)
```python
self.state.num_input_tokens_seen += (
    torch.sum(
        self.accelerator.gather(
            torch.tensor(
                inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64
            )
        )
    )
    .cpu()
    .item()
)
```
let's make this more readable!
Clean, did this one 🫠
you can split in 3-4 lines 🎐
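One way the chained expression above could be broken up, reusing the same names from the snippet; this is just a readability sketch, not necessarily the code that was merged:

```python
# Count tokens locally, gather the counts across processes, then sum once.
input_tokens = inputs[main_input_name].numel()
input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).sum().cpu().item()
```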
src/transformers/trainer.py (Outdated)
```python
if (self.label_smoother is not None or self.compute_loss is not None) and "labels" in inputs:
    labels = inputs.pop("labels")
```
mmmm if people don't pass a loss, we won't use the model's default?
We will; it stays in inputs and gets passed to the model's forward().
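A simplified sketch of the control flow being discussed, using the names from the outdated snippet above; when neither a label smoother nor a user loss is set, `labels` stays in `inputs` and the model's default loss is used:

```python
# Pop labels only when we intend to compute the loss ourselves.
if (self.label_smoother is not None or self.compute_loss is not None) and "labels" in inputs:
    labels = inputs.pop("labels")
else:
    labels = None

outputs = model(**inputs)
# If labels were not popped, outputs.loss is the model's own (default) loss.
```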
src/transformers/trainer.py (Outdated)
```python
# For now we don't support object detection
try:
    num_items_in_batch = sum(
        [data_batch["labels"][..., 1:].ne(-100).sum().item() for data_batch in batch_samples]
    )
```
I already quickly discussed this with Zach, so this is a more general question for the other reviewers:
Would this line work for all the different task types we support? Specifically, can we always skip the first item in the sequence, i.e. is the `[..., 1:]` part valid?
For causal autoregressive models it works, but it won't work for other ones.
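An illustrative check (not from the PR) of why the slice is valid for causal LMs: the loss is computed on shifted labels, so the first label position never has a prediction target and can be dropped when counting:

```python
import torch

# -100 marks positions that are ignored by the loss.
labels = torch.tensor([[-100, -100, 17, 42, 7],
                       [-100, 9, 13, -100, -100]])
num_items_in_batch = labels[..., 1:].ne(-100).sum().item()
print(num_items_in_batch)  # 5 label tokens actually contribute to the loss
```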
danielhanchen left a comment
Just a denominator change in the test case
ArthurZucker left a comment
Feel free to merge!
…for grad accum (huggingface#34198) * bookmark * Bookmark * Bookmark * Actually implement * Pass in kwarg explicitly * Adjust for if we do or don't have labels * Bookmark fix for od * bookmark * Fin * closer * Negate accelerate grad accum div * Fixup not training long enough * Add in compute_loss to take full model output * Document * compute_loss -> compute_loss_fn * Add a test * Refactor * Refactor * Uncomment tests * Update tests/trainer/test_trainer.py Co-authored-by: Daniel Han <danielhanchen@gmail.com> --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>
- The basic issue is that between the version of transformers pinned by requirements.txt (4.39.1) and the current release (4.57.1), a new argument called `compute_loss_func` was added to `Trainer.__init__` (added in 4.54.1). This new argument broke things because T5Trainer in trainer.py uses positional args instead of keyword args, so all of the positional args are now effectively off by one.
- The fix was to switch from positional args to keyword args to prevent the off-by-one issue.
- This fix is backwards compatible with 4.39.1.
- This issue is also mentioned in jkallini#1.
- huggingface/transformers@6ba31a8
- huggingface/transformers#34198

```
File /opt/homebrew/Cellar/jupyterlab/4.4.5/libexec/lib/python3.13/site-packages/transformers/trainer.py:647, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, model_init, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics)
    645 self.compute_metrics = compute_metrics
    646 self.preprocess_logits_for_metrics = preprocess_logits_for_metrics
--> 647 self.optimizer, self.lr_scheduler = optimizers
    648 self.optimizer_cls_and_kwargs = optimizer_cls_and_kwargs
    649 if self.optimizer_cls_and_kwargs is not None and self.optimizer is not None:

TypeError: cannot unpack non-iterable NoneType object
```
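A minimal sketch of the kind of fix described above, assuming an ordinary Trainer construction; every variable below (model, training_args, and so on) is a placeholder rather than code from that repository:

```python
from transformers import Trainer

# Construct the Trainer with keyword arguments so that a newly inserted
# parameter such as compute_loss_func cannot silently shift the meaning of
# later arguments the way positional arguments can.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
```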

What does this PR do?
In conjunction with #34191, this PR solves the other half of what's needed:
- Letting users pass in their own loss function, used in `compute_loss`
- Prefetching `gradient_accumulation_steps` worth of data each complete step and marking how many samples were seen (`num_items_in_batch`), which can be passed to a loss function if it takes in `num_items_seen` (name TBD)

A bit of feedback is needed so we can coordinate:

- Should this count be named `num_items_in_batch` and then passed through to the loss functions as such? Or is there a better name we can think of? (A sketch of such a loss function is shown after this list.)

Fixes huggingface/trl#2175
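As mentioned above, here is a hedged sketch of what a user-supplied loss function accepting that count might look like; the exact signature and the final argument name were still being decided in this PR, so treat everything here as illustrative rather than the settled API:

```python
import torch.nn.functional as F

def my_causal_lm_loss(outputs, labels, num_items_in_batch=None):
    # Shift logits/labels the same way the causal-LM snippets above do.
    logits = outputs.logits
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Sum the per-token loss, then normalize by the number of items seen across
    # the whole gradient-accumulation window when the Trainer provides it.
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    if num_items_in_batch is not None:
        loss = loss / num_items_in_batch
    return loss

# Hypothetical usage; the argument ended up being called compute_loss_func:
# trainer = Trainer(model=model, args=args, compute_loss_func=my_causal_lm_loss, ...)
```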
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik @ArthurZucker