
Initialising modules with different weights with DDP is dangerously buried in the doc notes #6013

Closed · epignatelli opened this issue Feb 16, 2021 · 6 comments · Fixed by #6032
Labels: bug (Something isn't working) · docs (Documentation related) · help wanted (Open to be worked on)

@epignatelli commented Feb 16, 2021

🐛 Bug

More a discussion than a real bug.

The docs for DDP say:

> Make sure to set the random seed before the instantiation of a Trainer() so that each model initializes with the same weights.

In most common settings, initialising the modules with different parameters depending on the GPU is considered bad practice.
To me (and this is a personal opinion) it does not make sense in theory either.

The reason I posted this as a bug and not as a docs issue is that it is really dangerous if you don't know about it.
A note in the docs does not help much either; I don't assume a user reads the entire documentation before using a library.
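For reference, the workaround the current docs describe would look roughly like this (a sketch; MyModel is a placeholder for any LightningModule, and the Trainer flags are the Lightning 1.2-era ones):

```python
import pytorch_lightning as pl

from my_project import MyModel  # placeholder: any LightningModule


def main():
    # Seed every process identically *before* creating the Trainer,
    # so each DDP worker draws the same random initial weights.
    pl.seed_everything(42)

    model = MyModel()
    trainer = pl.Trainer(gpus=2, accelerator="ddp")
    trainer.fit(model)


if __name__ == "__main__":
    main()
```

Forget the seed_everything() call and each process silently trains a different model.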

@epignatelli added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Feb 16, 2021
@edenlightning added the docs (Documentation related) label on Feb 16, 2021
@awaelchli (Contributor) commented
So a solution would be to broadcast the weights from process 0 to all the others before training starts. Is this correct? Setting the seed would then be optional.
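Roughly like the following, say (a sketch using plain torch.distributed, assuming the process group is already initialised; broadcast_initial_state is a hypothetical helper, not an existing Lightning API):

```python
import torch
import torch.distributed as dist


def broadcast_initial_state(model: torch.nn.Module, src: int = 0) -> None:
    """Overwrite every rank's parameters and buffers with rank `src`'s values.

    Assumes dist.init_process_group(...) has already been called.
    """
    for tensor in list(model.parameters()) + list(model.buffers()):
        # broadcast() is in-place: rank `src` sends, every other rank receives.
        dist.broadcast(tensor.data, src=src)
```

All processes would then start from identical, yet still random, weights regardless of how each one was seeded.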

@epignatelli (Author) commented
I am not sure what the solution might be, as my knowledge of Lightning is limited.
At a high level, I'd guess the seed is the minimal information PyTorch needs to coordinate the initialisations.

@epignatelli changed the title from "Initialising modules with different weights is dangerously buried in the doc notes" to "Initialising modules with different weights with DDP is dangerously buried in the doc notes" on Feb 16, 2021
@awaelchli (Contributor) commented
> At a high level, I'd guess the seed is the minimal information PyTorch needs to coordinate the initialisations.

We can't know the seed unless the user sets it, right?
So I believe my suggestion is the only way to allow random initial conditions while keeping the same state across processes.

@awaelchli (Contributor) commented
@justusschock @tchaton @SeanNaren any thoughts/concerns here?

@justusschock (Member) commented Feb 17, 2021

@awaelchli You're right, but luckily there is nothing left for us to do besides removing this line from the docs. Since your refactor bases it on plain torch.nn.parallel.DistributedDataParallel, it does this automatically (see here, here, and here).
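For anyone reading along, the construction-time sync is easy to check with a standalone script (a sketch; the launch command and script name are illustrative):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with e.g. `torchrun --nproc_per_node=2 check_sync.py`.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# No shared seed, so each rank draws different random weights here...
model = torch.nn.Linear(8, 8).to(local_rank)

# ...but the DDP constructor broadcasts rank 0's parameters and buffers
# to all other ranks, so every replica ends up identical.
ddp_model = DDP(model, device_ids=[local_rank])
print(dist.get_rank(), ddp_model.module.weight.flatten()[0].item())
```

Every rank prints the same value, even though no seed was ever set.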

@awaelchli (Contributor) commented
Oh, you're right, that's awesome. It makes sense that PyTorch DDP takes care of this. I sent a PR to remove the outdated info.
