
Add changes to support FSDP#598

Merged
regisss merged 20 commits into huggingface:main from vivekgoe:fsdp
Jan 23, 2024

Conversation

@vivekgoe
Collaborator

What does this PR do?

This PR adds changes to support FSDP. BERT-Base is enabled with FSDP as a toy example.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vivekgoe vivekgoe marked this pull request as ready for review January 3, 2024 09:36
@vivekgoe vivekgoe requested a review from regisss as a code owner January 3, 2024 09:36
Collaborator

@regisss regisss left a comment


I left a few comments. For the Transformers-related changes, it seems you based them on a newer version (main?). Let's just stick to v4.34.1 for now as newer releases are not supported 🙂

Can you also format the code as follows?

pip install --upgrade ruff
make style

Comment thread examples/question-answering/gaudi_config.json Outdated
Comment thread optimum/habana/accelerate/utils/dataclasses.py Outdated
Comment thread optimum/habana/accelerate/accelerator.py Outdated
Comment thread optimum/habana/accelerate/accelerator.py Outdated
Comment thread optimum/habana/peft/layer.py
Comment thread optimum/habana/peft/layer.py Outdated
Comment thread optimum/habana/transformers/trainer.py Outdated
Comment thread tests/test_fsdp_examples.py
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
from optimum.habana.utils import set_seed

from optimum.habana.peft.layer import GaudiLoraLayerLinearforward
Collaborator


Should we do that every time we use FSDP and LoRA? Or is it necessary for this example only?
If it's always necessary, it's better to do it in optimum.habana.transformers.modeling_utils.py.

Collaborator Author


We need to do this every time we use LoRA with torch_compile enabled (see the detailed explanation I added in response to one of your other comments).
I can move it to optimum.habana.transformers.modeling_utils.py; it's okay to do it unconditionally, right?
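For reference, applying such a forward override unconditionally at import time is plain attribute reassignment on the layer class. Below is a minimal self-contained sketch of the monkey-patching pattern being discussed, using toy stand-ins rather than the real peft / optimum-habana classes (the names `LoraLinear` and `gaudi_lora_linear_forward` are illustrative only):

```python
# Sketch of the monkey-patching pattern discussed above. The classes
# and function names here are toy stand-ins, not the actual peft or
# optimum-habana APIs.

class LoraLinear:
    """Toy stand-in for a LoRA linear layer."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x  # original forward: identity


def gaudi_lora_linear_forward(self, x):
    # Replacement forward, e.g. one rewritten to be torch.compile-friendly.
    return x * self.scale


# Applying the patch once, unconditionally, at import time (as proposed
# for optimum/habana/transformers/modeling_utils.py): every instance of
# the class, existing or future, now uses the replacement forward.
LoraLinear.forward = gaudi_lora_linear_forward

layer = LoraLinear(scale=2)
print(layer.forward(3))  # prints 6: the patched forward runs
```

Because the assignment rewrites the method on the class itself, the override takes effect globally, which is why the question of whether it can safely run unconditionally (e.g. impacting LoRA inference in the text-generation example) matters.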

Collaborator


I think it's okay yes. Maybe LoRA inference in the text-generation example could be impacted, I'll check that.

Collaborator Author


@regisss My understanding was that LoRA is used only for finetuning. But please check and resolve this conversation when you get a chance. Thanks.

@vivekgoe vivekgoe requested a review from libinta January 5, 2024 09:39
@vivekgoe vivekgoe added the synapse1.14 and run-test (Run CI for PRs from external contributors) labels Jan 8, 2024
Comment thread examples/question-answering/gaudi_config.json Outdated
Comment thread optimum/habana/transformers/trainer.py
@vivekgoe vivekgoe requested a review from mandy-li as a code owner January 12, 2024 10:19
@vivekgoe vivekgoe requested a review from a user January 12, 2024 10:19
@vivekgoe vivekgoe added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 12, 2024
@vivekgoe vivekgoe added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 22, 2024
@vivekgoe vivekgoe added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 23, 2024
@vivekgoe
Collaborator Author

@regisss any suggestions on where in the code I should add a warning that FSDP is an experimental feature not ready for use yet? How can I add test_fsdp_examples.py to the list of tests you run as part of your regular testing?
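One common pattern for such a notice is to emit it from the trainer/accelerator setup path whenever an FSDP config is detected. A hypothetical sketch using Python's standard logging module (the helper name and call site are assumptions; the actual placement was deferred to a follow-up PR):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_warn_fsdp_experimental(fsdp_config):
    """Warn if FSDP is enabled. Hypothetical helper: the real
    optimum-habana code may place and name this differently."""
    if fsdp_config:
        logger.warning(
            "FSDP support is experimental and not yet ready for "
            "production use; behavior and APIs may change."
        )
        return True
    return False


# A trainer would call this during initialization:
maybe_warn_fsdp_experimental({"fsdp": "full_shard"})  # warns, returns True
maybe_warn_fsdp_experimental(None)                    # silent, returns False
```

Gating the warning on the presence of an FSDP config keeps the non-FSDP paths quiet while still flagging the experimental feature for anyone who opts in.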

@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 23, 2024
Collaborator

@regisss regisss left a comment


Merging now since @libinta told me that 1.14 is getting released.

I'll add the warning and the test in another PR.

@regisss regisss merged commit e238bca into huggingface:main Jan 23, 2024
@regisss
Collaborator

regisss commented Jan 24, 2024

@vivekgoe When running the FSDP test on Gaudi2 with 1.13, I get errors like

ValueError: Inconsistent compute device and `device_id` on rank 4: hpu:0 vs hpu

And on Gaudi1 with 1.14 I get

RuntimeError: offset_ 0 is not matching with the offset of new tensor 13835057943613014016

I launched the test with

pytest tests/test_fsdp_examples.py -v -s

Is there something I'm doing wrong?

jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024

Labels

run-test Run CI for PRs from external contributors

3 participants