Trainer test cannot load from checkpoint when training on multiple GPUs #5144
Comments
@wjaskowski I believe I found the fix for it. Thanks for including a reproducible script, it helped a lot.
I tried the branch but the problem still seems to be there:
This is surprising. With your exact script I can get the error on master within 2-3 trials, but on the bugfix branch (bugfix/ddp-ckpt) I ran it probably 20+ times and it never occurred.
I will carefully give it a try once again when I regain access to a machine with more than one GPU.
@wjaskowski any update?
Feel free to reopen if needed!
I have the same issue...
Same issue in 2023
🐛 Bug
Trainer.test() looks for epoch=X-v0.ckpt when only epoch=X.ckpt exists, thus the result is:
To Reproduce
Execute several times on a machine with more than one GPU:
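The reporter's original script is not shown here. The following is a minimal, hypothetical sketch of the kind of reproduction intended, assuming the PyTorch Lightning 1.1-era API from the environment report (gpus/accelerator="ddp" Trainer flags, trainer.test() loading the best checkpoint by default); the BoringModel and RandomDataset names are illustrative, not the author's code.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, just enough to drive a training loop."""
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    """Trivial module: one linear layer, sum of outputs as the loss."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self(batch).sum())

    def test_step(self, batch, batch_idx):
        self.log("test_loss", self(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = pl.Trainer(
        gpus=2,
        accelerator="ddp",  # multi-GPU DDP training is what triggers the bug
        max_epochs=1,
    )
    trainer.fit(
        model,
        DataLoader(RandomDataset(), batch_size=32),
        DataLoader(RandomDataset(), batch_size=32),
    )
    # test() without an explicit ckpt_path loads the "best" checkpoint; on
    # multiple GPUs it may look for epoch=0-v0.ckpt while only epoch=0.ckpt
    # exists on disk, raising the error reported above.
    trainer.test(test_dataloaders=DataLoader(RandomDataset(), batch_size=32))
```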
Expected behavior
No exception.
Environment
- CUDA:
  - GPU:
    - GeForce GTX TITAN X
    - GeForce GTX TITAN X
    - GeForce GTX TITAN X
    - GeForce GTX TITAN X
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.4
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1
  - pytorch-lightning: 1.1.0 [also 1.0.8]
  - tqdm: 4.54.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020