
Remove FSDP wrapping from sub-models. #34452

Merged
ArthurZucker merged 6 commits into huggingface:main from eljandoubi:fix_fsdp_auto_wrap_policy
Nov 15, 2024

Conversation

@eljandoubi
Contributor

What does this PR do?

Fixes #34113

Who can review?

Library:

Member

@SunMarc SunMarc left a comment


Thanks for fixing the issue @eljandoubi! Do you think there is a simpler way to handle this edge case @muellerzr?

Comment on lines 2263 to 2286
Member

@SunMarc SunMarc Oct 28, 2024


You can use the unwrap_model function in transformers instead. Also, why do we need to set recursive to True? Also, please leave a comment above, as this specific path exists only to make it work with auto_find_batch_size.

Contributor Author

@eljandoubi eljandoubi Oct 28, 2024


unwrap_model does not provide access to the recursive argument. Auto-wrap policies wrap submodules with FSDP, and unwrap_model is unable to remove them. You can test this on the toy example from the PyTorch FSDP tutorial for rank=0 and world_size=1, then experiment with the line I provided in a notebook.

# Net and rank come from the PyTorch FSDP tutorial (rank=0, world_size=1)
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from transformers.modeling_utils import unwrap_model

my_auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=20000
)
torch.cuda.set_device(rank)
model = Net().to(rank)
print(model)
fsdp_model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
print(fsdp_model)
unwrapped = unwrap_model(fsdp_model)
print(unwrapped)  # nested FSDP wrappers around submodules remain

vs. the following, where you need to re-instantiate model and fsdp_model:

from accelerate.utils import extract_model_from_parallel

model = Net().to(rank)
fsdp_model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)

extract_model = extract_model_from_parallel(fsdp_model, recursive=True)
print(extract_model)  # FSDP wrappers on submodules are removed too
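The difference can also be illustrated without GPUs. The sketch below is hypothetical: the Wrapper and Layer classes and both unwrap helpers are illustrative stand-ins, not transformers or accelerate code. It shows why one-level unwrapping leaves behind the wrappers that an auto-wrap policy put on submodules:

```python
class Wrapper:
    """Stand-in for an FSDP wrapper: holds the wrapped module in .module."""
    def __init__(self, module):
        self.module = module

class Layer:
    """Stand-in for a plain submodule."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def unwrap_top(obj):
    # One-level unwrapping: only peels wrappers at the root,
    # i.e. follows .module until a non-wrapper is reached.
    while isinstance(obj, Wrapper):
        obj = obj.module
    return obj

def unwrap_recursive(obj):
    # Recursive unwrapping: also replaces wrapped children in place.
    obj = unwrap_top(obj)
    obj.children = [unwrap_recursive(c) for c in obj.children]
    return obj

def count_wrappers(obj):
    inner = obj.module if isinstance(obj, Wrapper) else obj
    n = 1 if isinstance(obj, Wrapper) else 0
    return n + sum(count_wrappers(c) for c in inner.children)

# An auto-wrap policy wraps submodules too: root and one child are wrapped.
model = Wrapper(Layer("root", [Wrapper(Layer("block")), Layer("head")]))

print(count_wrappers(unwrap_top(model)))        # 1: nested wrapper survives
print(count_wrappers(unwrap_recursive(model)))  # 0: all wrappers removed
```

This is the gap recursive=True closes: only the recursive variant walks into the children and strips their wrappers as well.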

Copy link
Member

@SunMarc SunMarc Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm talking about this function in transformers. It uses extract_model_from_parallel under the hood so it should be comparable.

Contributor Author


Ah, I see.

@eljandoubi
Contributor Author

@SunMarc @muellerzr Did you get a different result than I did?

Contributor

@muellerzr muellerzr left a comment


Thanks for the fix, can you add a test in tests/test_trainer.py for this?
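For illustration, a test of this kind might assert that no FSDP wrapper survives anywhere in the module tree after unwrapping. The sketch below is hypothetical: it uses a stand-in FakeFSDP class so it runs without GPUs, and it is not the test actually added to tests/test_trainer.py in this PR:

```python
import unittest

class FakeFSDP:
    """Stand-in for torch's FSDP wrapper so the check runs without GPUs."""
    def __init__(self, module):
        self.module = module

class FakeModule:
    """Stand-in for a plain nn.Module with child modules."""
    def __init__(self, children=()):
        self.children = list(children)

def iter_modules(m):
    # Walk the (fake) module tree, yielding every node.
    yield m
    inner = m.module if isinstance(m, FakeFSDP) else m
    for child in inner.children:
        yield from iter_modules(child)

def recursive_unwrap(m):
    # Mirrors extract_model_from_parallel(..., recursive=True):
    # peel wrappers at the root, then in every child.
    while isinstance(m, FakeFSDP):
        m = m.module
    m.children = [recursive_unwrap(c) for c in m.children]
    return m

class TestFsdpUnwrap(unittest.TestCase):
    def test_no_nested_fsdp_after_unwrap(self):
        # An auto-wrap policy wraps submodules as well as the root.
        wrapped = FakeFSDP(FakeModule([FakeFSDP(FakeModule())]))
        plain = recursive_unwrap(wrapped)
        self.assertFalse(
            any(isinstance(m, FakeFSDP) for m in iter_modules(plain))
        )

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.TestLoader().loadTestsFromTestCase(TestFsdpUnwrap)
)
```

The real test would instead build a small model through the Trainer with an FSDP config and auto_find_batch_size enabled, then make the same "no wrapper left anywhere" assertion on the unwrapped model.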

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr muellerzr requested a review from LysandreJik October 31, 2024 13:55
Member

@SunMarc SunMarc left a comment


Thanks! Left a suggestion for unwrap_model.

@eljandoubi
Contributor Author

@SunMarc I migrated to unwrap_model.

Member

@LysandreJik LysandreJik left a comment


Let's merge it if you're both ok with it @SunMarc @muellerzr

@ArthurZucker ArthurZucker removed their request for review November 5, 2024 12:41
@SunMarc
Member

SunMarc commented Nov 5, 2024

Please rebase this PR on main in order to pass the CI @eljandoubi !

@eljandoubi eljandoubi force-pushed the fix_fsdp_auto_wrap_policy branch from 693ba36 to 0df20d6 on November 5, 2024 20:07
@eljandoubi
Contributor Author

@SunMarc @LysandreJik @muellerzr Is there any update on the pull request?

@ArthurZucker
Collaborator

We were on a company-wide offsite! Merging as they all approved 🤗

@ArthurZucker ArthurZucker merged commit 8d50fda into huggingface:main Nov 15, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Remove FSDP wrapping from sub-models.

* solve conflict trainer.py

* make fixup

* add unit test for fsdp_auto_wrap_policy when using auto_find_batch_size

* put back extract_model_from_parallel

* use transformers unwrap_model


Development

Successfully merging this pull request may close these issues.

fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP is not working with the Trainer

6 participants