
Add changes to support FSDP#598

Merged
regisss merged 20 commits into huggingface:main from vivekgoe:fsdp
Jan 23, 2024

Conversation

@vivekgoe
Collaborator

What does this PR do?

This PR adds changes to support FSDP. BERT-Base is enabled with FSDP as a toy example.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vivekgoe vivekgoe marked this pull request as ready for review January 3, 2024 09:36
@vivekgoe vivekgoe requested a review from regisss as a code owner January 3, 2024 09:36
Collaborator

@regisss regisss left a comment


I left a few comments. For the Transformers-related changes, it seems you based them on a newer version (main?). Let's just stick to v4.34.1 for now as newer releases are not supported 🙂

Can you also format the code as follows?

pip install --upgrade ruff
make style

Comment thread examples/question-answering/gaudi_config.json Outdated
Comment thread optimum/habana/accelerate/utils/dataclasses.py Outdated
Comment thread optimum/habana/accelerate/accelerator.py Outdated
Comment thread optimum/habana/accelerate/accelerator.py Outdated
Comment thread optimum/habana/peft/layer.py
Comment thread optimum/habana/peft/layer.py Outdated
Comment thread optimum/habana/transformers/trainer.py Outdated
Comment thread tests/test_fsdp_examples.py
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
from optimum.habana.utils import set_seed

from optimum.habana.peft.layer import GaudiLoraLayerLinearforward
Collaborator


Should we do that every time we use FSDP and LoRA? Or is it necessary for this example only?
If it's always necessary, it's better to do it in optimum.habana.transformers.modeling_utils.py.

Collaborator Author


We need to do this every time we use LoRA with torch_compile enabled (see the detailed explanation I added in response to one of your other comments).
I can move it to optimum.habana.transformers.modeling_utils.py; it's okay to do it unconditionally, right?
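For reference, applying such a forward override unconditionally at import time is plain attribute reassignment on the layer class. Below is a minimal self-contained sketch of the monkey-patching pattern being discussed, using toy stand-ins rather than the real peft / optimum-habana classes (the names `LoraLinear` and `gaudi_lora_linear_forward` are illustrative only):

```python
# Sketch of the monkey-patching pattern discussed above. The classes
# and function names here are toy stand-ins, not the actual peft or
# optimum-habana APIs.

class LoraLinear:
    """Toy stand-in for a LoRA linear layer."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x  # original forward: identity


def gaudi_lora_linear_forward(self, x):
    # Replacement forward, e.g. one rewritten to be torch.compile-friendly.
    return x * self.scale


# Applying the patch once, unconditionally, at import time (as proposed
# for optimum/habana/transformers/modeling_utils.py): every instance of
# the class, existing or future, now uses the replacement forward.
LoraLinear.forward = gaudi_lora_linear_forward

layer = LoraLinear(scale=2)
print(layer.forward(3))  # prints 6: the patched forward runs
```

Because the assignment rewrites the method on the class itself, the override takes effect globally, which is why the question of whether it can safely run unconditionally (e.g. impacting LoRA inference in the text-generation example) matters.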

Collaborator


I think it's okay yes. Maybe LoRA inference in the text-generation example could be impacted, I'll check that.

Collaborator Author


@regisss My understanding was that LoRA is used only for finetuning. But please check and resolve this conversation when you get a chance. Thanks.

@vivekgoe vivekgoe requested a review from libinta January 5, 2024 09:39
@vivekgoe vivekgoe added the synapse1.14 and run-test (Run CI for PRs from external contributors) labels Jan 8, 2024
Comment thread examples/question-answering/gaudi_config.json Outdated
Comment thread optimum/habana/transformers/trainer.py
@vivekgoe vivekgoe requested a review from mandy-li as a code owner January 12, 2024 10:19
@vivekgoe vivekgoe requested a review from a user January 12, 2024 10:19
@vivekgoe vivekgoe added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 12, 2024
@vivekgoe vivekgoe added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 22, 2024
@vivekgoe vivekgoe added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 23, 2024
@vivekgoe
Collaborator Author

@regisss any suggestions on where in the code I should add a warning that FSDP is an experimental feature not ready for use yet? How can I add test_fsdp_examples.py to the list of tests you run as part of your regular testing?
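One common pattern for such a notice is to emit it from the trainer/accelerator setup path whenever an FSDP config is detected. A hypothetical sketch using Python's standard logging module (the helper name and call site are assumptions; the actual placement was deferred to a follow-up PR):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_warn_fsdp_experimental(fsdp_config):
    """Warn if FSDP is enabled. Hypothetical helper: the real
    optimum-habana code may place and name this differently."""
    if fsdp_config:
        logger.warning(
            "FSDP support is experimental and not yet ready for "
            "production use; behavior and APIs may change."
        )
        return True
    return False


# A trainer would call this during initialization:
maybe_warn_fsdp_experimental({"fsdp": "full_shard"})  # warns, returns True
maybe_warn_fsdp_experimental(None)                    # silent, returns False
```

Gating the warning on the presence of an FSDP config keeps the non-FSDP paths quiet while still flagging the experimental feature for anyone who opts in.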

@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 23, 2024
Collaborator

@regisss regisss left a comment


Merging now since @libinta told me that 1.14 is getting released.

I'll add the warning and the test in another PR.

@regisss regisss merged commit e238bca into huggingface:main Jan 23, 2024
@regisss
Collaborator

regisss commented Jan 24, 2024

@vivekgoe When running the FSDP test on Gaudi2 with 1.13, I get errors like

ValueError: Inconsistent compute device and `device_id` on rank 4: hpu:0 vs hpu

And on Gaudi1 with 1.14 I get

RuntimeError: offset_ 0 is not matching with the offset of new tensor 13835057943613014016

I launched the test with

pytest tests/test_fsdp_examples.py -v -s

Is there something I'm doing wrong?

jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024

Labels

run-test Run CI for PRs from external contributors

3 participants