DataLoader worker is killed in Docker #2559
🐛 Bug
Under some very specific circumstances, when I use a classification metric during training inside a Docker image, my DataLoader workers are unexpectedly killed. I am not even sure whether this is a bug in torchmetrics, since very specific conditions are required to reproduce it. However, because the problem appeared after updating torchmetrics to 1.4.0, I suspect the cause is somehow related to the recent changes (there is no issue with torchmetrics 1.2 or 1.3). After a longer investigation I still have no idea what the root cause is, but since I can reproduce it reliably, I decided to report it.
To Reproduce
To reproduce the issue, the following code snippet has to be run in a Docker container with a CUDA device available (CUDA drivers and the NVIDIA Container Toolkit installed):
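The original snippet is not reproduced here; the following is only a minimal sketch of what such a script could look like, based on the description in this issue (a torchmetrics classification metric updated during training, `train_dl`/`valid_dl` with `num_workers > 0`, and a CUDA device). The model, data shapes, metric choice (`MulticlassAccuracy`), and exact `num_workers` values are assumptions, not the original code.

```python
# Minimal sketch of a reproduction script; model, data shapes and num_workers are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchmetrics.classification import MulticlassAccuracy

NUM_CLASSES = 10
device = torch.device("cuda")

# Random tensors stand in for real data; num_workers > 0 is the condition that matters here.
train_ds = TensorDataset(torch.randn(1024, 32), torch.randint(0, NUM_CLASSES, (1024,)))
valid_ds = TensorDataset(torch.randn(256, 32), torch.randint(0, NUM_CLASSES, (256,)))
train_dl = DataLoader(train_ds, batch_size=64, num_workers=2)
valid_dl = DataLoader(valid_ds, batch_size=64, num_workers=2)

model = torch.nn.Linear(32, NUM_CLASSES).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
metric = MulticlassAccuracy(num_classes=NUM_CLASSES).to(device)

for epoch in range(2):
    model.train()
    for x, y in train_dl:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        metric.update(logits, y)  # classification metric used during training
    model.eval()
    with torch.no_grad():
        for x, y in valid_dl:
            x, y = x.to(device), y.to(device)
            metric.update(model(x), y)
    print("Epoch end", metric.compute())
    metric.reset()
```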
Example command to launch this script in Docker, assuming the above script is saved in an `error.py` file (a hedged sketch of such a command is given below):
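As an illustration only, a sketch of such a launch command; the mount path, the use of `--gpus all` (which relies on the NVIDIA Container Toolkit mentioned above), and the `pip install` step are assumptions.

```bash
# Hedged sketch only: mount path, --gpus all, and the pip install step are assumptions.
docker run --rm --gpus all \
    -v "$(pwd)/error.py:/workspace/error.py" \
    pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
    bash -c "pip install torchmetrics==1.4.0 && python /workspace/error.py"
```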
When tested with the official image `pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime`, the output shows the DataLoader workers being killed. The same output has been observed for the image `pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime`. When tested with a custom image based on PyTorch 1.12.1+cu113 and Python 3.10 (the official PyTorch 1.12.1 image ships Python 3.7, while torchmetrics requires >=3.8), a CUDA error is additionally reported before the DataLoader workers die.
This error does not appear when:
- `num_workers` in `train_dl` or `valid_dl` is 0 (for sure);
- only certain combinations of `num_workers` are configured (e.g., 1 and 3; no pattern has been observed).

Expected behavior
There should be no errors and "Epoch end" should be printed when running this code.
Environment