Torchvision decode_jpeg memory leak #4378
Comments
@fmassa Is general
@NicolasHug @fmassa Also having this issue. Tried loading images in a loop using
Same problem:
Thanks all for the reports. I took a look at this today. I can reproduce the leak. I do see the memory usage going up constantly with
I thought the leak might come from the fact that we don't free the nvjpeg handle (we literally leak it for convenience; see vision/torchvision/csrc/io/image/cuda/decode_jpeg_cuda.cpp, lines 28 to 30 at 9ae0169),
but that's not the case: putting the handle back within the function and properly destroying it didn't help. I don't see the leak anymore when commenting out the decode call, though. I don't know whether that's actually a bug in nvjpeg, or if there's something else going on. Either way, I don't understand it. nvjpeg allows passing custom device memory allocators, so perhaps there is something to do there. Cheers
Update: this still leaks 🥲

```cpp
int dev_malloc(void **p, size_t s) {
  *p = c10::cuda::CUDACachingAllocator::raw_alloc(s);
  return 0;
}

int dev_free(void *p) {
  c10::cuda::CUDACachingAllocator::raw_delete(p);
  return 0;
}

...

nvjpegDevAllocator_t dev_allocator = {&dev_malloc, &dev_free};
nvjpegStatus_t status = nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, &dev_allocator,
                                       NULL, NVJPEG_FLAGS_DEFAULT, &nvjpeg_handle);
```
I had a chance to look at this more: this is an nvjpeg bug. Unfortunately I'm not sure we can do much about it. It was fixed with CUDA 11.6, but I'm still observing the leak with 11.0 - 11.5. A temporary fix for Linux users is to download the 11.6 nvjpeg.so (e.g. from here) and to tell the loader to use it.
Hello @NicolasHug, thanks for the answer! I reinstalled CUDA and now have this version, but the problem does not disappear. Should I rebuild torch with CUDA 11.6 from source?
What does ldd show?
@NicolasHug Mine is showing /site-packages/torchvision/../torchvision.libs/libnvjpeg.90286a3c.so.11. How do I fix this to use the system CUDA?
@rydenisbak Did you figure this out? |
@Scass0807 if the path is coming from |
@NicolasHug
@NicolasHug I added a symlink libnvjpeg.90286a3c.so.11 -> /usr/local/cuda-11.6/lib64/libnvjpeg.so.11. Now there is only one nvjpeg, but the memory leak persists. I wonder if this is because, even though I am using 11.6, my driver version is 495.23, which is technically for 11.5. I am using GCP Compute Engine, and unfortunately they do not yet support 511.
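For anyone debugging which libnvjpeg actually ends up in the process, here is a minimal sketch (Linux-only; `test.jpg` is a placeholder path) that triggers one GPU decode and then lists the nvjpeg libraries mapped into the current process, independently of what ldd reports statically:

```python
import torch
import torchvision

# Force the CUDA decode path so torchvision's image extension and libnvjpeg load.
data = torchvision.io.read_file("test.jpg")  # any JPEG on disk
torchvision.io.decode_jpeg(data, device="cuda")

# /proc/self/maps lists every shared object currently mapped into this process.
with open("/proc/self/maps") as f:
    libs = {line.split()[-1] for line in f if "nvjpeg" in line}
print("\n".join(sorted(libs)))
```

If more than one nvjpeg entry shows up, two copies are loaded and the symlink may not have taken effect for the extension.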
Hi @NicolasHug, would you mind telling me where you got this information? I could not find it in the CUDA 11.6 release notes. And I cannot reproduce this memory leak with CUDA 10.2 (docker pull nvcr.io/nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04). It would be great to have some more information.
I basically tried all versions I could find from https://pkgs.org/search/?q=libnvjpeg-devel
@NicolasHug Should installing 11.6 and using the nvjpeg it ships with work? Do I have to install the RPM? I'm on Ubuntu. Based on the ldd results from above, I'm not sure there's anything else I can do.
It seems that there is a small multithreading confusion here: the nvjpeg_handle_creation_flag should be global, not local.
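As a rough illustration of that point (a Python analogue with hypothetical names, not the actual torchvision C++ code): the guard protecting handle creation has to be shared by all callers; if it were created per call, each call could end up creating its own handle.

```python
import threading

# Module-level (global) guard and handle, shared by every thread.
# A guard created inside the function would be new on every call and
# would not prevent repeated handle creation.
_creation_lock = threading.Lock()
_nvjpeg_handle = None

def get_nvjpeg_handle():
    """Create the handle exactly once, no matter how many threads call this."""
    global _nvjpeg_handle
    with _creation_lock:
        if _nvjpeg_handle is None:
            _nvjpeg_handle = object()  # stands in for the real nvjpegCreateEx(...) call
    return _nvjpeg_handle
```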
It does not work with the CUDA 11.7 libnvjpeg either, and the same behavior is observed when using numpy.frombuffer. Now I have to decode JPEGs on the CPU like a peasant :'(
Also seeing this issue on CUDA 11.6 (running in a Docker container).
I just checked whether this was fixed in the PyTorch nightly with CUDA 11.6, but I'm still experiencing the memory leak.
Same ^
Yes, there are still leaks, even on CUDA 11.6.
Memory leaks on torchvision-0.14.0+cu117 (torchvision-0.14.0%2Bcu117-cp37-cp37m-win_amd64.whl). Easy to reproduce:

```python
for i in range(10000):
    torchvision.io.decode_jpeg(torch.frombuffer(jpeg_bytes, dtype=torch.uint8), device='cuda')
```

The memory leak didn't happen when using pynvjpeg 0.0.13, which seems to be built with CUDA 10.2:

```python
nj = NvJpeg()
nj.decode(jpeg_bytes)
```
Has anyone solved this problem? I also tried pynvjpeg; it is slower than torchvision.io.decode_jpeg, and eventually an error message like this pops up: what(): memory allocator error, aborted (core dumped).
It seems that this problem has been solved; my environment is as follows. Finally, after waiting for over a year :)
🐛 Describe the bug
nvJPEG leaks memory and fails with OOM after ~1-2k images.
Probably related to the first response to #3848, which shows exactly the message you get after the OOM.
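A minimal sketch of how the growth can be observed (not part of the original report; `test.jpg` is a placeholder, and torch.cuda.mem_get_info needs a reasonably recent PyTorch). Because the leaked memory may not be tracked by PyTorch's caching allocator, watching the driver-level free/total numbers is the safer check:

```python
import torch
import torchvision

data = torchvision.io.read_file("test.jpg")  # any JPEG on disk

for i in range(2000):
    img = torchvision.io.decode_jpeg(data, device="cuda")
    if i % 100 == 0:
        free, total = torch.cuda.mem_get_info()
        print(f"iter {i}: free {free / 2**20:.0f} MiB / total {total / 2**20:.0f} MiB")
```

With the leak present, the free number keeps dropping until decoding eventually fails with an out-of-memory error.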
Versions
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Arch Linux (x86_64)
GCC version: (GCC) 11.1.0
Clang version: 12.0.1
CMake version: version 3.21.1
Libc version: glibc-2.33
Python version: 3.8.7 (default, Jan 19 2021, 18:48:37) [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-5.13.8-arch1-1-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce GTX 1080
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.2.2
/usr/lib/libcudnn_adv_infer.so.8.2.2
/usr/lib/libcudnn_adv_train.so.8.2.2
/usr/lib/libcudnn_cnn_infer.so.8.2.2
/usr/lib/libcudnn_cnn_train.so.8.2.2
/usr/lib/libcudnn_ops_infer.so.8.2.2
/usr/lib/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] adabelief-pytorch==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.4.5
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect