Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix running without ema #6

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

fix running without ema #6

wants to merge 3 commits into from

Conversation

feiloo
Copy link

@feiloo feiloo commented Jul 4, 2024

afaict the model has to be unwrapped from accelerates distributed wrapping to not wait for the other gpus forever.
this happens when not using ema.

also since validation only runs on the main process and gpu, the other gpus timeout (nccl) waiting for it.
synchronizing within the validation function is a okay-ish liveliness-check and prevents timeouts or having to increase them to long durations.

i have done some rough and manual testing with updated dependencies (newer pytorch, accelerate, ...) on a single node with multiple gpus.

this should work with the original conda deps too, afaict

feiloo added 2 commits July 3, 2024 18:37
afaict the `model` has to be unwrapped from accelerates distributed wrapping, to not wait for the other gpus forever.
this happens when not using ema.

also since validation only runs on the main process and gpu, the other gpus timeout (ncc) waiting for it.
synchronizing within the validation function is a okay-ish liveliness-check and prevents timeouts or having to increase them.
@flukeskywalker
Copy link
Contributor

Thanks @feiloo!

I was aware of the nccl sync issue after porting the code to this simplified repo, but I decided that increasing the timeout or reducing val iters to not spend more too long on validation is a simpler solution for the purposes of this repo. IIRC the default timeout is 30 mins so if the val takes longer than that nccl will throw an error.

As you demonstrate, taking care of this complicates what should be a rather simple validation function. So I prefer not to do that. But perhaps it would be good to discuss this in the README to avoid surprises for users.

Thanks for catching the bug when EMA is off. I think this script wasn't fully tested in that setting since all our experiments used EMA.

@feiloo
Copy link
Author

feiloo commented Jul 8, 2024

feel free to rewrite and adapt however you seem fit and close the pr.

the nccl default timeout is 10 minutes, other backends are 30.
it was/is a bit more costly to have a script fail late.
i think my fix is the more correct one, but i'll leave the choice up to you, ofc.

in a few days, i will adapt the pr to the em fix only for you, if you haven't by then.
Anyways, we appreciate your code.

@feiloo feiloo changed the title fix hanging and timeout from missing model synchronization fix running without ema Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants