
nvjpeg decoder #3504

Closed · totaam opened this issue Mar 27, 2022 · 16 comments
Labels: encoding, enhancement (New feature or request)

Comments

@totaam (Collaborator) commented Mar 27, 2022

Same as #2984, but for client-side decoding this time.

This will allow us to achieve the best possible single-picture latency.

Once implemented, it would make sense to tell the server to prefer jpeg over webp in all(?) cases.

totaam added the enhancement and encoding labels Mar 27, 2022
totaam added a commit that referenced this issue Mar 28, 2022
commented out for now until I can figure out why the decoding doesn't populate the pixel buffer
totaam added a commit that referenced this issue Mar 31, 2022
@totaam (Collaborator, Author) commented Apr 1, 2022

The inspiration for the GPU code above did not come from nvidia's own documentation or sample code, which fails to mention that the buffer must be a GPU buffer... and the API doesn't fail if you use a CPU buffer either.
Some good code samples:

Still TODO: move GPU device selection and keep the context alive, as creating it is very expensive.
(and perhaps re-create it when the window is moved to another monitor?)
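For illustration, a minimal sketch of what caching the CUDA context could look like with pycuda; the cache layout and the get_cuda_context() helper are assumptions for this sketch, not xpra's actual code:

import pycuda.driver as cuda

_context_cache = {}

def get_cuda_context(device_index=0):
    # creating a context is expensive, so keep one alive per device
    ctx = _context_cache.get(device_index)
    if ctx is None:
        cuda.init()
        ctx = cuda.Device(device_index).make_context()
        # make_context() also makes the new context current:
        # pop it so that callers can push/pop it around each decode
        ctx.pop()
        _context_cache[device_index] = ctx
    return ctx

Callers would then push() the cached context before decoding and pop() it afterwards, instead of paying for a new context on every picture.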

@totaam (Collaborator, Author) commented Apr 2, 2022

Testing with:

XPRA_OPENGL_DRAW_REFRESH=0 XPRA_OPENGL_NVJPEG=1 xpra attach tcp://127.0.0.1:10000/ \
    --no-speaker --no-mmap --opengl=force -d nvjpeg,opengl --quality=10

A glxgears 4K screen update:

23,158 glXMakeCurrent: xid=0x5c00033, context=<OpenGL.raw.GLX._types.LP_struct___GLXcontextRec object at 0x7f5eeceb1ac0>
23,159 decompress_with_device(RGB, 141256 bytes, None)
23,159 got image info: 3840x2123 YUV420
23,181 nvjpegDecode took 21ms
23,182 copy_buffer(<pycuda._driver.DeviceAllocation object at 0x7f5eecf14a60>, 24456960)
23,186 RegisteredBuffer(1, pycuda._driver.graphics_map_flags.WRITE_DISCARD)=<pycuda._driver.RegisteredBuffer object at 0x7f5eecec9db0>
23,187 copying from <pycuda._driver.DeviceAllocation object at 0x7f5eecf14a60> to mapping=<pycuda._driver.RegisteredMapping object at 0x7f5eecf14c40> at 0x7f5daa400000
23,193 paint_nvjpeg(GLXWindowContext(0x5c00033)) img=ImageWrapper(RGB:(0, 0, 3840, 2123, 24):PACKED), updating fbo
23,194 set_alignment(3840, 11520, 'RGB') GL_UNPACK_ROW_LENGTH=0, GL_UNPACK_ALIGNMENT=8
23,195 present_fbo: adding (0, 0, 3840, 2123) to pending paint list (size=0), flush=0, paint_screen=True
23,195 do_present_fbo: painting ((0, 0, 3840, 2123),)
23,195 Presenting FBO on screen
23,196 1.do_gl_show(GLDrawingArea(10, (3840, 2123), None)) swapping buffers now
23,196 gl_show after  77ms took  0ms,  1 updates
23,196 GLDrawingArea(10, (3840, 2123), None).gl_frame_terminator()
23,197 <OpenGL.platform.baseplatform.glBindFramebuffer object at 0x7f5f1c086850>(GL_FRAMEBUFFER (GL_FRAMEBUFFER), 2)
23,197 GLDrawingArea(10, (3840, 2123), None).do_present_fbo() done
23,198 glXMakeCurrent: NULL for xid=0x5c00033

That's 40ms to decompress and update the screen at 4K, more than half of which (21ms) is spent decoding. The next biggest item is the RGB buffer copying (10ms).
Because of the jitter, we can only reach ~20fps.
(the nvjpeg encoder compresses those same frames in ~30-35ms)

Would using YUV420P subsampling speed things up by lowering the amount of data to copy by half, or slow things down by making us download 3 separate buffers? (my guess: not worth the added complexity)

We can't really make this more parallel because we need an X11 context to access OpenGL and copy the GPU buffer. I don't think that this can be split into two threads (one for decoding and one for copying) because pycuda stores contexts per-thread.

When handling multiple windows, the cuda context object could be cached and re-used, see #3209

jpega support should be added. glTexImage2D doesn't let us load just the alpha channel, so the easiest workaround might just be to replace memcpy_dtod with a CUDA kernel that merges RGB and A.
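For illustration only, a rough pycuda sketch of such a merge kernel; the kernel name, the packed RGB / RGBA layouts and the launch parameters are assumptions, not the actual xpra code:

import numpy
import pycuda.autoinit   # creates a context for this standalone sketch; the real code manages its own
from pycuda.compiler import SourceModule

_mod = SourceModule("""
__global__ void merge_rgb_alpha(const unsigned char *rgb,
                                const unsigned char *alpha,
                                unsigned char *rgba,
                                const int npixels)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npixels)
        return;
    // interleave packed RGB with the separate alpha plane into packed RGBA
    rgba[i*4+0] = rgb[i*3+0];
    rgba[i*4+1] = rgb[i*3+1];
    rgba[i*4+2] = rgb[i*3+2];
    rgba[i*4+3] = alpha[i];
}
""")
_merge_rgb_alpha = _mod.get_function("merge_rgb_alpha")

def merge_on_gpu(rgb_buf, alpha_buf, rgba_buf, npixels, stream=None):
    # one thread per pixel, rounded up to whole blocks of 256 threads
    block = 256
    grid = (npixels + block - 1) // block
    _merge_rgb_alpha(rgb_buf, alpha_buf, rgba_buf, numpy.int32(npixels),
                     block=(block, 1, 1), grid=(grid, 1), stream=stream)

(The alpha plane here would be the Y plane decoded from the second jpeg of the jpega pair.)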

totaam added a commit that referenced this issue Apr 5, 2022
enable nvjpeg decoder by default now that it works well
@totaam (Collaborator, Author) commented Apr 5, 2022

This is now enabled by default and works quite well!
jpega is handled by the decoder itself: it merges an RGB image with a Y (alpha) plane to re-create the RGBA data that the window backings can paint with. This costs an extra memory copy, but the complexity of using shaders just isn't worth it IMO.


Possible future enhancements:

  • paint grayscale directly (single plane) - meh
  • decode to YUV420P - meh
  • cuda context cleanup: #3515
  • would it be possible to decode asynchronously and go back to the main loop whilst nvjpegDecode does its thing? (using the piecemeal API instead) - unlikely and the code would be hellish

@totaam (Collaborator, Author) commented Apr 11, 2022

It still leaks GPU memory...
With and without the OpenGL buffer passing.
This happens somewhere in nvjpegDecode: the memory leak is gone when that one call is commented out.
The memory is freed when the window is closed with the opengl backend, so clearly it is somehow attached not just to our process, but to the CUDA or GLX context.
With the plain GTK backend, the memory is only freed when the client process exits!?
So who owns this buffer and where is it?


Things I have tried:

  • NVJPEG_OUTPUT_UNCHANGED so it doesn't have to do any CSC step itself (thinking that this might be where the leaked buffer was)
  • decode using decoupled phases - but this fails miserably quite early with xpra.codecs.nvjpeg.decoder.NVJPEG_Exception: nvjpegStateAttachPinnedBuffer failed: INVALID_PARAMETER - the idea was that using more API calls would perhaps identify the one that is leaking
  • keeping the same PBO object for the duration of the OpenGL window instead of doing a glGenBuffers + glDeleteBuffers for every paint with nvjpeg.

Things I have not tried:

  • different driver versions, since some were known to have a memory leak; that's more difficult on Fedora with signed modules: you can no longer just run the NVidia binary installer.
  • cooking up a plain C PoC
  • skip using a PBO and load the pixel data directly into a texture?

Other notes:

  1. cudaGraphicsGLRegisterBuffer()
  2. cudaGraphicsMapResources()
  3. cudaGraphicsResourceGetMappedPointer()
  4. nvjpegDecodeJpegHost()
  5. nvjpegDecodeJpegTransferToDevice()
  6. nvjpegDecodeJpegDevice()
  7. (optional)cuda_kernel()
  8. cudaGraphicsUnmapResources()
  9. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer)
  10. glTexImage2D()
  11. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
  12. glGenerateMipmap()
    This workflow is almost identical to what I have implemented for OpenGL rendering.
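For reference, a hedged pycuda.gl / PyOpenGL sketch of steps 1-3 and 8-11 above, roughly matching the RegisteredBuffer / memcpy_dtod calls visible in the log earlier; the function name is made up, and it assumes a current GL context, an active CUDA context, a texture already bound to GL_TEXTURE_2D and a packed RGB buffer with no row padding:

from OpenGL.GL import (glGenBuffers, glBindBuffer, glBufferData, glTexImage2D,
                       GL_PIXEL_UNPACK_BUFFER, GL_STREAM_DRAW,
                       GL_TEXTURE_2D, GL_RGB, GL_UNSIGNED_BYTE)
from pycuda.driver import memcpy_dtod
from pycuda.gl import RegisteredBuffer, graphics_map_flags

def upload_via_pbo(cuda_buffer, width, height):
    size = width * height * 3   # packed RGB, assuming no row padding
    # 1. create a PBO and register it with CUDA
    pbo = glGenBuffers(1)
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo)
    glBufferData(GL_PIXEL_UNPACK_BUFFER, size, None, GL_STREAM_DRAW)
    rb = RegisteredBuffer(int(pbo), graphics_map_flags.WRITE_DISCARD)
    # 2. + 3. map it and copy the decoded pixels device-to-device
    mapping = rb.map()
    ptr, _ = mapping.device_ptr_and_size()
    memcpy_dtod(ptr, cuda_buffer, size)
    # 8. unmap before OpenGL reads from the buffer
    mapping.unmap()
    rb.unregister()
    # 10. + 11. load the texture from the bound PBO (None = offset 0)
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0,
                 GL_RGB, GL_UNSIGNED_BYTE, None)
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
    return pbo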

totaam reopened this Apr 11, 2022
totaam added a commit that referenced this issue Apr 11, 2022
the ImageWrapper just returns the CUDA buffer and the caller is responsible for downloading it or copying on the GPU
@totaam (Collaborator, Author) commented Apr 11, 2022

The memory is freed when the window is closed with the opengl backend

That's because we free the cuda context for this window by calling self.free_cuda_context. (the memory leak disappears as soon as we call this method... and the performance falls off a cliff, as we then re-create a context for every paint packet!)
So the buffer is still attached to that context somehow, but only when we call nvjpegDecode...
It really looks like an nvjpeg bug where they lock the buffer and forget to unlock it - perhaps calling nvjpegStateAttachDeviceBuffer without nvjpegBufferDeviceDestroy?
I'm also not sure why the cairo backend doesn't free the memory since it also calls free_cuda_context.

@totaam (Collaborator, Author) commented Jul 13, 2022

Perhaps try with nvjpeg-python, though only temporarily, as I really don't like its dependencies and interface.

@totaam (Collaborator, Author) commented Jul 20, 2022

Perhaps the leak comes from:

with self.assign_cuda_context(True):

Or:
from pycuda.driver import memcpy_dtod #pylint: disable=no-name-in-module
from pycuda.gl import RegisteredBuffer, graphics_map_flags

paint_nvjpeg is already called from the UI thread:

self.idle_add(self.with_gl_context, paint_nvjpeg)

But perhaps we need to re-arrange the allocation / context / import?

@totaam (Collaborator, Author) commented Aug 7, 2022

Could also be a vbo leak?
mcfletch/pyopengl@8c82e72

@totaam (Collaborator, Author) commented Aug 15, 2022

Or perhaps we need to free the resources / stream whilst the cuda context is active: inducer/pycuda#370 (comment)
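A minimal sketch of that pattern with pycuda, assuming a cached context; the wrapper name and the decode callback are hypothetical:

import pycuda.driver as cuda

def decode_with_context(ctx, decode_fn):
    ctx.push()
    stream = cuda.Stream()
    try:
        result = decode_fn(stream)
        stream.synchronize()
        return result
    finally:
        # release the stream (and any temporary buffers) while the
        # context is still current, then pop it
        del stream
        ctx.pop()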

@totaam (Collaborator, Author) commented Aug 21, 2022

Testing with b9c24fa and a patched pycuda (2be7037), I do see the new context warnings if I somehow exit the context without freeing the stream (i.e. from an exception).
Unfortunately, even when synchronizing properly, the leak is still present.

I also tried moving the pbo allocation.

totaam added a commit that referenced this issue Sep 22, 2022
until the GPU memory leak is fixed
@totaam (Collaborator, Author) commented Oct 6, 2022

Found memory allocation related issues in the release notes:

  • nvjpeg 11.5.1 : Fixed the issue in which nvcuvid() released uncompressed frames causing a memory leak.
  • nvjpeg 11.8.0 : Fallback to cudaMalloc/cudaFree when stream ordered allocators are unavailable.

Perhaps we can just try again.

@totaam (Collaborator, Author) commented Nov 28, 2022

As per the commit above, this no longer leaks memory with the latest drivers (at least on MS Windows) and is now enabled by default if the driver version is 522.6 or later.

If we find that the fix is also included in older versions of the drivers, we can change the number used in the version check.
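For anyone needing a similar gate elsewhere, a hedged sketch using pynvml (nvidia-ml-py) to compare the driver version against that 522.6 threshold; this is just an illustration, not xpra's actual check:

from pynvml import nvmlInit, nvmlShutdown, nvmlSystemGetDriverVersion

def driver_at_least(minimum=(522, 6)):
    nvmlInit()
    try:
        version = nvmlSystemGetDriverVersion()
        if isinstance(version, bytes):
            version = version.decode("latin1")
        # ie: "522.06" -> (522, 6)
        parts = tuple(int(x) for x in version.split(".")[:2])
        return parts >= minimum
    finally:
        nvmlShutdown()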

@dev0x13 commented Jan 20, 2023

Hi @totaam. May I ask why you rely on the driver version instead of the CUDA version for this check? I'm trying to employ nvJPEG in other software, and I need a robust environment check to prevent this severe memory leak in deployments too. This is the only reported issue I've managed to find about this bug in nvJPEG, and the PyTorch maintainers seem to rely on the CUDA version for their check. I'm just trying to figure out whether the driver or CUDA is the primary factor in whether the memory leak is present.
Thank you in advance!

@totaam (Collaborator, Author) commented Jan 20, 2023

@dev0x13 libnvjpeg.so is updated with the drivers - not with cuda.
That said, the bug could still be in cuda - I don't know. Someone should ask nvidia.

@dev0x13 commented Jan 20, 2023

@totaam That makes sense, thank you!

@totaam (Collaborator, Author) commented Mar 12, 2023

FWIW: The leak has also been fixed with the Linux drivers - 525.89.02 and potentially earlier.
