Wraps sharded model for proper access to its state_dict in FSDP strategy #16558
for more information, see https://pre-commit.ci
Thanks for looking into this!
This is only an issue in master, not 1.9, correct? In that case, this doesn't need a CHANGELOG entry.
Can you write a test?
No, I faced this in
@carmocca I wrote tests for this. I noticed that the problem actually occurs only for layers that are not wrapped, e.g. small layers. So I parametrized the tests with different wrapping policies and slightly changed My build has one failed test, but it does not seem to be my fault. What should I do? And do you have any suggestions about modifying
@awaelchli I implemented what we discussed. For now, FSDP always aggregates the full state dict on rank zero. CPU offload depends on
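For readers following along, the behavior described above (full state dict gathered on rank zero, with optional CPU offload) can be sketched with PyTorch's public FSDP API. This is a minimal sketch, not the code from this PR: the helper names `full_state_dict_config` and `gather_full_state_dict` are hypothetical, and `offload_to_cpu` is exposed as a plain parameter because the comment does not say what the offload setting depends on.

```python
from torch import nn
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)


def full_state_dict_config(offload_to_cpu: bool) -> FullStateDictConfig:
    # Hypothetical helper. rank0_only=True means only rank zero
    # materializes the full (unsharded) state dict.
    return FullStateDictConfig(offload_to_cpu=offload_to_cpu, rank0_only=True)


def gather_full_state_dict(fsdp_model: nn.Module, offload_to_cpu: bool = True) -> dict:
    # Hypothetical helper. Assumes `fsdp_model` is already wrapped in FSDP
    # and that the process group is initialized. Gathering is a collective
    # operation, so every rank must call this function.
    config = full_state_dict_config(offload_to_cpu)
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, config):
        return fsdp_model.state_dict()
```

With `rank0_only=True`, the returned dict is populated only on rank zero, so saving the checkpoint should be guarded by a rank check.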
I rethought the logic for offloading to the CPU. It's not good to reuse this variable, as it's intended for a completely different purpose. We need to figure out why this doesn't work on CI, because it works on my setup (2xA100).
I'm looking into it!
What does this PR do?
Fixes #16526 by following the approach of the previously deleted DDPFullyShardedStrategy.
Does your PR introduce any breaking changes?
No, it doesn't.