fix running without ema #6

feiloo · 2024-07-04T15:03:13Z

afaict the model has to be unwrapped from accelerate's distributed wrapper so that it doesn't wait for the other gpus forever.
this happens when not using ema.

also, since validation only runs on the main process and gpu, the other gpus time out (nccl) waiting for it.
synchronizing within the validation function is an okay-ish liveness check and prevents timeouts, or having to increase them to long durations.
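A minimal sketch of the two ideas above, assuming an `accelerate.Accelerator`-like object is passed in. It is not the code from this PR: `validate`, `val_batches` and `eval_step` are hypothetical names, and only `is_main_process`, `unwrap_model` and `wait_for_everyone` are real `Accelerator` members. Every rank walks the same loop, only the main process does the work on the unwrapped model, and the per-batch barrier keeps the waiting ranks from sitting in one long collective.

```python
def validate(accelerator, model, val_batches, eval_step):
    """Run validation on the main process while keeping all ranks in step.

    `eval_step(model, batch)` is a hypothetical callable returning a loss value.
    """
    # Unwrap once so the forward pass does not go through the distributed
    # wrapper, which may expect every rank to take part.
    raw_model = accelerator.unwrap_model(model)
    losses = []
    for batch in val_batches:
        if accelerator.is_main_process:
            losses.append(eval_step(raw_model, batch))
        # All ranks meet here after every batch, so no rank waits longer
        # than one validation step at a time.
        accelerator.wait_for_everyone()
    return losses
```

The cost is one barrier per validation batch and a slightly busier validation function, which is the trade-off discussed in the replies.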

i have done some rough and manual testing with updated dependencies (newer pytorch, accelerate, ...) on a single node with multiple gpus.

this should work with the original conda deps too, afaict


flukeskywalker · 2024-07-05T21:57:07Z

Thanks @feiloo!

I was aware of the nccl sync issue after porting the code to this simplified repo, but I decided that increasing the timeout, or reducing val iters so that validation doesn't take too long, is a simpler solution for the purposes of this repo. IIRC the default timeout is 30 mins, so if the val takes longer than that, nccl will throw an error.
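For reference, a hedged sketch of the "increase the timeout" option using accelerate's `InitProcessGroupKwargs` handler. The helper names, the per-iteration cost and the safety margin are illustrative assumptions, not values from this repo.

```python
from datetime import timedelta


def estimate_timeout(val_iters: int, seconds_per_iter: float, margin: float = 1.5) -> timedelta:
    """Hypothetical helper: expected validation time, padded by a safety margin."""
    return timedelta(seconds=val_iters * seconds_per_iter * margin)


def build_accelerator(timeout: timedelta):
    """Create an Accelerator whose process-group timeout is `timeout`."""
    # Imported lazily so the helper above can be used without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    pg_kwargs = InitProcessGroupKwargs(timeout=timeout)
    return Accelerator(kwargs_handlers=[pg_kwargs])
```

Usage would look like `accelerator = build_accelerator(estimate_timeout(val_iters=1000, seconds_per_iter=2.0))`. The downside is that a genuinely hung job also takes that long to fail.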

As you demonstrate, taking care of this complicates what should be a rather simple validation function. So I prefer not to do that. But perhaps it would be good to discuss this in the README to avoid surprises for users.

Thanks for catching the bug when EMA is off. I think this script wasn't fully tested in that setting since all our experiments used EMA.

feiloo · 2024-07-08T19:28:46Z

feel free to rewrite and adapt however you see fit and close the pr.

the nccl default timeout is 10 minutes, other backends are 30.
it was/is a bit more costly to have a script fail late.
i think my fix is the more correct one, but i'll leave the choice up to you, ofc.

in a few days, i will reduce the pr to the ema fix only for you, if you haven't done so by then.
Anyways, we appreciate your code.

feiloo added 2 commits July 3, 2024 18:37

skip ema.to() when ema is None, in train.py
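A minimal sketch of the kind of guard this commit message describes, assuming the training script holds `ema = None` when EMA is disabled. `FakeEMA` and `move_ema_to_device` are hypothetical stand-ins for illustration, not names from `train.py`.

```python
class FakeEMA:
    """Hypothetical stand-in for an EMA wrapper; it only records its device."""

    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


def move_ema_to_device(ema, device):
    # With EMA disabled, ema is None and calling ema.to() would raise
    # AttributeError, so the move is skipped in that case.
    if ema is not None:
        ema.to(device)
    return ema
```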

33a61af

reduce changes to ema fix only

18da715

feiloo changed the title ~~fix hanging and timeout from missing model synchronization~~ fix running without ema Jul 10, 2024
