
nvjpeg decoder #3504

Closed · totaam opened this issue Mar 27, 2022 · 16 comments
Labels: encoding, enhancement (New feature or request)

Comments

@totaam (Collaborator) commented Mar 27, 2022

Same as #2984, but for client-side decoding this time.

This will allow us to achieve the best possible single-picture latency.

Once implemented, it would make sense to tell the server to prefer jpeg over webp in all(?) cases.

totaam added the enhancement and encoding labels Mar 27, 2022
totaam added a commit that referenced this issue Mar 28, 2022
commented out for now until I can figure out why the decoding doesn't populate the pixel buffer
totaam added a commit that referenced this issue Mar 31, 2022
@totaam (Collaborator, Author) commented Apr 1, 2022

The inspiration for the GPU code above did not come from nvidia's own documentation or sample code, which fails to mention that the buffer must be a GPU buffer... and the API doesn't fail if you use a CPU buffer either.
Some good code samples:

Still TODO: move GPU device selection and keep the context alive, as creating it is very expensive.
(and perhaps re-create it when the window is moved to another monitor?)
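For illustration, a minimal sketch of what caching the CUDA context could look like with pycuda; the cache layout and the get_cuda_context() helper are assumptions for this sketch, not xpra's actual code:

import pycuda.driver as cuda

_context_cache = {}

def get_cuda_context(device_index=0):
    # creating a context is expensive, so keep one alive per device
    ctx = _context_cache.get(device_index)
    if ctx is None:
        cuda.init()
        ctx = cuda.Device(device_index).make_context()
        # make_context() also makes the new context current:
        # pop it so that callers can push/pop it around each decode
        ctx.pop()
        _context_cache[device_index] = ctx
    return ctx

Callers would then push() the cached context before decoding and pop() it afterwards, instead of paying for a new context on every picture.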

@totaam (Collaborator, Author) commented Apr 2, 2022

Testing with:

XPRA_OPENGL_DRAW_REFRESH=0 XPRA_OPENGL_NVJPEG=1 xpra attach tcp://127.0.0.1:10000/ \
    --no-speaker --no-mmap --opengl=force -d nvjpeg,opengl --quality=10

A glxgears 4K screen update:

23,158 glXMakeCurrent: xid=0x5c00033, context=<OpenGL.raw.GLX._types.LP_struct___GLXcontextRec object at 0x7f5eeceb1ac0>
23,159 decompress_with_device(RGB, 141256 bytes, None)
23,159 got image info: 3840x2123 YUV420
23,181 nvjpegDecode took 21ms
23,182 copy_buffer(<pycuda._driver.DeviceAllocation object at 0x7f5eecf14a60>, 24456960)
23,186 RegisteredBuffer(1, pycuda._driver.graphics_map_flags.WRITE_DISCARD)=<pycuda._driver.RegisteredBuffer object at 0x7f5eecec9db0>
23,187 copying from <pycuda._driver.DeviceAllocation object at 0x7f5eecf14a60> to mapping=<pycuda._driver.RegisteredMapping object at 0x7f5eecf14c40> at 0x7f5daa400000
23,193 paint_nvjpeg(GLXWindowContext(0x5c00033)) img=ImageWrapper(RGB:(0, 0, 3840, 2123, 24):PACKED), updating fbo
23,194 set_alignment(3840, 11520, 'RGB') GL_UNPACK_ROW_LENGTH=0, GL_UNPACK_ALIGNMENT=8
23,195 present_fbo: adding (0, 0, 3840, 2123) to pending paint list (size=0), flush=0, paint_screen=True
23,195 do_present_fbo: painting ((0, 0, 3840, 2123),)
23,195 Presenting FBO on screen
23,196 1.do_gl_show(GLDrawingArea(10, (3840, 2123), None)) swapping buffers now
23,196 gl_show after  77ms took  0ms,  1 updates
23,196 GLDrawingArea(10, (3840, 2123), None).gl_frame_terminator()
23,197 <OpenGL.platform.baseplatform.glBindFramebuffer object at 0x7f5f1c086850>(GL_FRAMEBUFFER (GL_FRAMEBUFFER), 2)
23,197 GLDrawingArea(10, (3840, 2123), None).do_present_fbo() done
23,198 glXMakeCurrent: NULL for xid=0x5c00033

That's 40ms to decompress and update the screen at 4K, more than half of which (21ms) is spent decoding. The next biggest item is the RGB buffer copying (10ms).
Because of the jitter, we can only reach ~20fps.
(the nvjpeg encoder compresses those same frames in ~30-35ms)

Would using YUV420P subsampling speed things up by lowering the amount of data to copy by half, or slow things down by making us download 3 separate buffers? (my guess: not worth the added complexity)

We can't really make this more parallel because we need an X11 context to access OpenGL and copy the GPU buffer. I don't think that this can be split into two threads (one for decoding and one for copying) because pycuda stores contexts per-thread.

When handling multiple windows, the cuda context object could be cached and re-used, see #3209

jpega support should be added. glTexImage2D doesn't let us load just the alpha channel, so the easiest workaround might just be to replace memcpy_dtod with a CUDA kernel that merges RGB and A.
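For illustration only, a rough pycuda sketch of such a merge kernel; the kernel name, the packed RGB / RGBA layouts and the launch parameters are assumptions, not the actual xpra code:

import numpy
import pycuda.autoinit   # creates a context for this standalone sketch; the real code manages its own
from pycuda.compiler import SourceModule

_mod = SourceModule("""
__global__ void merge_rgb_alpha(const unsigned char *rgb,
                                const unsigned char *alpha,
                                unsigned char *rgba,
                                const int npixels)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npixels)
        return;
    // interleave packed RGB with the separate alpha plane into packed RGBA
    rgba[i*4+0] = rgb[i*3+0];
    rgba[i*4+1] = rgb[i*3+1];
    rgba[i*4+2] = rgb[i*3+2];
    rgba[i*4+3] = alpha[i];
}
""")
_merge_rgb_alpha = _mod.get_function("merge_rgb_alpha")

def merge_on_gpu(rgb_buf, alpha_buf, rgba_buf, npixels, stream=None):
    # one thread per pixel, rounded up to whole blocks of 256 threads
    block = 256
    grid = (npixels + block - 1) // block
    _merge_rgb_alpha(rgb_buf, alpha_buf, rgba_buf, numpy.int32(npixels),
                     block=(block, 1, 1), grid=(grid, 1), stream=stream)

(The alpha plane here would be the Y plane decoded from the second jpeg of the jpega pair.)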

totaam added a commit that referenced this issue Apr 5, 2022
enable nvjpeg decoder by default now that it works well
@totaam (Collaborator, Author) commented Apr 5, 2022

This is now enabled by default and works quite well!
jpega is handled by the decoder itself: it merges an RGB image with a Y (alpha) plane to re-create the RGBA data that the window backings can paint with. This costs an extra memory copy, but the complexity of using shaders just isn't worth it IMO.


Possible future enhancements:

  • paint grayscale directly (single plane) - meh
  • decode to YUV420P - meh
  • cuda context cleanup: #3515
  • would it be possible to decode asynchronously and go back to the main loop whilst nvjpegDecode does its thing? (using the piecemeal API instead) - unlikely and the code would be hellish

@totaam (Collaborator, Author) commented Apr 11, 2022

It still leaks GPU memory...
With and without the OpenGL buffer passing.
This happens somewhere in nvjpegDecode: the memory leak is gone when that one call is commented out.
The memory is freed when the window is closed with the opengl backend, so clearly it is somehow attached not just to our process, but to the CUDA or GLX context.
With the plain GTK backend, the memory is only freed when the client process exits!?
So who owns this buffer and where is it?


Things I have tried:

  • NVJPEG_OUTPUT_UNCHANGED so it doesn't have to do any CSC step itself (thinking that this might be where the leaked buffer was)
  • decode using decoupled phases - but this fails miserably quite early with xpra.codecs.nvjpeg.decoder.NVJPEG_Exception: nvjpegStateAttachPinnedBuffer failed: INVALID_PARAMETER - the idea was that using more API calls would perhaps identify the one that is leaking
  • keeping the same PBO object for the duration of the OpenGL window instead of doing a glGenBuffers + glDeleteBuffers for every paint with nvjpeg.

Things I have not tried:

  • different driver versions, since some were known to have a memory leak; that's more difficult on Fedora with signed modules: you can no longer just run the NVidia binary installer.
  • cooking up a plain C PoC
  • skip using a PBO and load the pixel data directly into a texture?

Other notes:

  1. cudaGraphicsGLRegisterBuffer()
  2. cudaGraphicsMapResources()
  3. cudaGraphicsResourceGetMappedPointer()
  4. nvjpegDecodeJpegHost()
  5. nvjpegDecodeJpegTransferToDevice()
  6. nvjpegDecodeJpegDevice()
  7. (optional)cuda_kernel()
  8. cudaGraphicsUnmapResources()
  9. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer)
  10. glTexImage2D()
  11. glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
  12. glGenerateMipmap()
    This workflow is almost identical to what I have implemented for OpenGL rendering.
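For reference, a hedged pycuda.gl / PyOpenGL sketch of steps 1-3 and 8-11 above, roughly matching the RegisteredBuffer / memcpy_dtod calls visible in the log earlier; the function name is made up, and it assumes a current GL context, an active CUDA context, a texture already bound to GL_TEXTURE_2D and a packed RGB buffer with no row padding:

from OpenGL.GL import (glGenBuffers, glBindBuffer, glBufferData, glTexImage2D,
                       GL_PIXEL_UNPACK_BUFFER, GL_STREAM_DRAW,
                       GL_TEXTURE_2D, GL_RGB, GL_UNSIGNED_BYTE)
from pycuda.driver import memcpy_dtod
from pycuda.gl import RegisteredBuffer, graphics_map_flags

def upload_via_pbo(cuda_buffer, width, height):
    size = width * height * 3   # packed RGB, assuming no row padding
    # 1. create a PBO and register it with CUDA
    pbo = glGenBuffers(1)
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo)
    glBufferData(GL_PIXEL_UNPACK_BUFFER, size, None, GL_STREAM_DRAW)
    rb = RegisteredBuffer(int(pbo), graphics_map_flags.WRITE_DISCARD)
    # 2. + 3. map it and copy the decoded pixels device-to-device
    mapping = rb.map()
    ptr, _ = mapping.device_ptr_and_size()
    memcpy_dtod(ptr, cuda_buffer, size)
    # 8. unmap before OpenGL reads from the buffer
    mapping.unmap()
    rb.unregister()
    # 10. + 11. load the texture from the bound PBO (None = offset 0)
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0,
                 GL_RGB, GL_UNSIGNED_BYTE, None)
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
    return pbo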

totaam reopened this Apr 11, 2022
totaam added a commit that referenced this issue Apr 11, 2022
the ImageWrapper just returns the CUDA buffer and the caller is responsible for downloading it or copying on the GPU
@totaam (Collaborator, Author) commented Apr 11, 2022

The memory is freed when the window is closed with the opengl backend

That's because we free the cuda context for this window by calling self.free_cuda_context. (the memory leak disappears as soon as we call this method... and the performance falls off a cliff, as we then re-create a context for every paint packet!)
So the buffer is still attached to that context somehow, but only when we call nvjpegDecode...
It really looks like an nvjpeg bug where they lock the buffer and forget to unlock it - perhaps calling nvjpegStateAttachDeviceBuffer without nvjpegBufferDeviceDestroy?
I'm also not sure why the cairo backend doesn't free the memory since it also calls free_cuda_context.

@totaam (Collaborator, Author) commented Jul 13, 2022

Perhaps try with nvjpeg-python, though only temporarily, as I really don't like its dependencies and interface.

@totaam (Collaborator, Author) commented Jul 20, 2022

Perhaps the leak comes from:

with self.assign_cuda_context(True):

Or:
from pycuda.driver import memcpy_dtod #pylint: disable=no-name-in-module
from pycuda.gl import RegisteredBuffer, graphics_map_flags

paint_nvjpeg is already called from the UI thread:

self.idle_add(self.with_gl_context, paint_nvjpeg)

But perhaps we need to re-arrange the allocation / context / import?

@totaam (Collaborator, Author) commented Aug 7, 2022

Could also be a vbo leak?
mcfletch/pyopengl@8c82e72

@totaam (Collaborator, Author) commented Aug 15, 2022

Or perhaps we need to free the resources / stream whilst the cuda context is active: inducer/pycuda#370 (comment)
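A minimal sketch of that pattern with pycuda, assuming a cached context; the wrapper name and the decode callback are hypothetical:

import pycuda.driver as cuda

def decode_with_context(ctx, decode_fn):
    ctx.push()
    stream = cuda.Stream()
    try:
        result = decode_fn(stream)
        stream.synchronize()
        return result
    finally:
        # release the stream (and any temporary buffers) while the
        # context is still current, then pop it
        del stream
        ctx.pop()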

@totaam (Collaborator, Author) commented Aug 21, 2022

Testing with b9c24fa and a patched pycuda (2be7037), I do see the new context warnings if I somehow exit the context without freeing the stream (i.e. from an exception).
Unfortunately, even when synchronizing properly, the leak is still present.

I also tried moving the pbo allocation.

totaam added a commit that referenced this issue Sep 22, 2022
until the GPU memory leak is fixed
@totaam (Collaborator, Author) commented Oct 6, 2022

Found memory allocation related issues in the release notes:

  • nvjpeg 11.5.1 : Fixed the issue in which nvcuvid() released uncompressed frames causing a memory leak.
  • nvjpeg 11.8.0 : Fallback to cudaMalloc/cudaFree when stream ordered allocators are unavailable.

Perhaps we can just try again.

@totaam (Collaborator, Author) commented Nov 28, 2022

As per the commit above, this no longer leaks memory with the latest drivers (at least on MS Windows) and is now enabled by default if the driver version is 522.6 or later.

If we find that the fix is also included in older versions of the drivers, we can change the number used in the version check.
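For anyone needing a similar gate elsewhere, a hedged sketch using pynvml (nvidia-ml-py) to compare the driver version against that 522.6 threshold; this is just an illustration, not xpra's actual check:

from pynvml import nvmlInit, nvmlShutdown, nvmlSystemGetDriverVersion

def driver_at_least(minimum=(522, 6)):
    nvmlInit()
    try:
        version = nvmlSystemGetDriverVersion()
        if isinstance(version, bytes):
            version = version.decode("latin1")
        # ie: "522.06" -> (522, 6)
        parts = tuple(int(x) for x in version.split(".")[:2])
        return parts >= minimum
    finally:
        nvmlShutdown()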

@dev0x13 commented Jan 20, 2023

Hi @totaam. May I ask why you rely on the driver version instead of the CUDA version for this check? I'm trying to employ nvJPEG in other software, and I need a robust environment check to prevent this severe memory leak in deployments too. This is the only reported issue I've managed to find about this bug in nvJPEG, and the PyTorch maintainers seem to rely on the CUDA version for their check. I'm just trying to figure out whether the driver or CUDA is the primary factor in whether the memory leak is present.
Thank you in advance!

@totaam (Collaborator, Author) commented Jan 20, 2023

@dev0x13 libnvjpeg.so is updated with the drivers - not with cuda.
That said, the bug could still be in cuda - I don't know. Someone should ask nvidia.

@dev0x13 commented Jan 20, 2023

@totaam That makes sense, thank you!

@totaam (Collaborator, Author) commented Mar 12, 2023

FWIW: The leak has also been fixed with the Linux drivers - 525.89.02 and potentially earlier.
