Cannot replicate training results with seed_everything and deterministic flag = True with DDP #3335
Comments
Thanks for the report. We're also seeing something similar in our CI; unfortunately we don't have a lead yet. I did not exactly understand how you came to the conclusion that it is related to metrics. Did you try several runs with and without the metric? I don't think it is metrics, but rather just a ddp issue.
@awaelchli Thanks for your follow-up. No, I do not think this issue has anything to do with the metrics I was adding; it is more likely related to DDP.
@junwen-austin I may have fixed this problem with the linked PR. But to be sure it closes this issue, could you let me know the trainer flags you used? Did you use …?
@awaelchli Thanks for letting me know the progress. I used ddp as the backend, not ddp_spawn. Thanks
Hmm, that's unfortunate, because my fix only applies to ddp_spawn.
@awaelchli it is rather a tricky one. I have two versions of the Lightning model, v1 and v2. The only difference between them is that I added an additional metric (a confusion matrix in this case) in v2, and I noticed the training/validation/test results are slightly off, in both cases with ddp as the backend, the same seed for seed_everything, and the deterministic flag set to True. Since the additional confusion-matrix code has nothing to do with randomization, I expected to get exactly the same results.
But the confusion matrix is not used for training, right? It is just there for visualization? Sorry if I misunderstood.
Sorry about any miscommunication on my part. Thanks |
@awaelchli able to repro? any clue? |
Unfortunately not. I tried to reproduce by adding the confusion matrix to the validation epoch end as described, but this gives me the same val and train loss, as expected with the seed set. @junwen-austin I think I need example code from you, or I won't know where to look for the problem.
@awaelchli Thanks for looking into it. Sure, the pseudo code looks like this: in v1, there is no confusion matrix at all or any calculation related to it. In v2, at the end of the validation/test epoch, I additionally calculate the confusion matrix based on the sync'ed predictions and targets.
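Since the original snippet was not included in the thread, here is a minimal sketch of that kind of epoch-end confusion matrix in plain PyTorch. The function name and the use of `torch.bincount` are assumptions for illustration, not the reporter's actual code:

```python
import torch

def confusion_matrix(preds: torch.Tensor, targets: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Build a num_classes x num_classes matrix: rows are true classes, columns are predictions."""
    # Encode each (target, pred) pair as a single index, then count occurrences.
    idx = targets * num_classes + preds
    counts = torch.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

# Example: 3 samples, 2 classes
preds = torch.tensor([0, 1, 1])
targets = torch.tensor([0, 1, 0])
print(confusion_matrix(preds, targets, num_classes=2))
# → tensor([[1, 1],
#           [0, 1]])
```

Note that a computation like this only reads the predictions; it draws no random numbers, which is why the reporter expected it to leave the seeded results untouched.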
I had it almost like this, but without the ddp sync. Can you share the whole runnable script? At this point I can only make guesses.
@awaelchli I have the following helper function in the LightningModule:
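The helper itself did not survive in the thread. A hedged sketch of the kind of DDP sync helper this likely refers to, using `torch.distributed.all_gather` (the name `gather_across_processes` and the fallback behavior are assumptions):

```python
import torch
import torch.distributed as dist

def gather_across_processes(tensor: torch.Tensor) -> torch.Tensor:
    """Collect a tensor from every DDP rank and concatenate along dim 0.

    Falls back to returning the input unchanged when no process group
    is initialized (e.g. single-process runs).
    """
    if not (dist.is_available() and dist.is_initialized()):
        return tensor
    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return torch.cat(gathered, dim=0)

# Single-process usage: the tensor passes through unchanged.
preds = torch.tensor([0, 1, 1, 0])
print(gather_across_processes(preds))
```

Under DDP, each rank would contribute its local predictions, and the epoch-end metric would be computed on the concatenated result.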
Still no luck reproducing this. Since you cannot share all the code, I tried to combine what you showed here with an existing example, but no luck. Maybe if you adapt our MNIST example with your confusion-matrix use case, we'll get a reproducible script. I tried, but I failed.
Thanks, I will do that and let you know.
@awaelchli My apologies, I cannot replicate the issue seen in my work project. Let's close the issue for now; if I spot a similar one, I will document it properly instead of trying to recall it from the top of my head. Thanks.
🐛 Bug
I noticed this when I was adding more metrics calculation to the LightningModule, for example, adding a confusion matrix at the end of the validation/test epoch. Before and after I added these functions (which do not appear to depend on any random seed), the training results are not exactly the same.
However, once I had added these functions and re-ran, I did get the same training results across runs.
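For context, `seed_everything` seeds all the RNGs a training run touches, which is why identical results are expected here. A rough sketch of the idea (not Lightning's exact implementation):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Sketch of what pytorch_lightning.seed_everything does:
    # seed every RNG a training run can draw from.
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)
a = torch.rand(3)
seed_everything(42)
b = torch.rand(3)
assert torch.equal(a, b)  # re-seeding reproduces the same draws
```

With `deterministic=True` the Trainer additionally forces deterministic cuDNN kernels, so any remaining run-to-run difference, as reported here, points at something outside the seeded RNGs (e.g. DDP communication order).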
To Reproduce
Code sample
Expected behavior
The training results should be identical even when deterministic functions (such as metric computations) are added.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually). You can get the script and run it with:
- How you installed PyTorch (conda, pip, source): pip

Additional context