[Feature] Optimizations for JPEG input on NVIDIA GPU #19749
yhyang201 merged 2 commits into sgl-project:main
Conversation
/tag-and-rerun-ci

Hi maintainers, could you help me understand the CI failures? I'd like to address them to move this PR forward.
/rerun-failed-ci

CI might be flaky; please rerun until all checks pass.

/rerun-failed-ci
Hi @yhyang201 @yuan-luo, I investigated the CI report and found some information.

registered/vlm/test_vision_openai_server_a.py:
openai.InternalServerError: Error code: 500 - {'object': 'error', 'message': "Internal server error: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.", 'type': 'InternalServerError', 'param': None, 'code': 500}

registered/lora/test_multi_lora_backend.py (reproduced stably):
KeyError: '/loky-7341-yz54xt5a'

xpu/test_intel_xpu_backend.py (reproduced stably):
AttributeError: module 'torch.xpu' has no attribute 'graph_pool_handle'

Some other tests report the error below, but I don't know what it means:
Error: Unhandled error: HttpError: <!DOCTYPE html>
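For reference, the 500 error above is the standard failure mode when a GPU-resident tensor is handed to numpy. A minimal sketch (assuming only that the decoded image is a torch tensor, possibly on CUDA):

```python
import torch

# Minimal reproduction sketch: numpy cannot read GPU memory, so a
# CUDA tensor must be copied back to the host before conversion.
img = torch.zeros(3, 48, 64, dtype=torch.uint8)
if torch.cuda.is_available():
    img = img.cuda()
    # img.numpy() here would raise: "can't convert cuda:0 device type
    # tensor to numpy. Use Tensor.cpu() to copy the tensor to host
    # memory first."

arr = img.cpu().numpy()  # copy to host memory, then convert
print(arr.shape)  # (3, 48, 64)
```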
For the failing unit tests of MiniCPM-o-2_6 and MiniCPM-V-4, we have several solutions:
- if discard_alpha_channel and img.mode != "RGB":
+ if (
+     discard_alpha_channel
+     and img.mode != "RGB"
This may also need a small adjustment.

It seems we should still check not isinstance(img, torch.Tensor) first?

Sorry, maybe I lost the commit... fixed now.
By the way, in encode_server.py, we cannot easily figure out the model kind (unless we search the name from self.server_args).
So gpu_image_decode is disabled by default there.
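A minimal sketch of the guard ordering under discussion (the function name is hypothetical; the real code inlines this condition):

```python
import torch
from PIL import Image

def needs_rgb_convert(img, discard_alpha_channel: bool) -> bool:
    # GPU-decoded images arrive as torch tensors, which have no .mode
    # attribute, so the isinstance guard must run before img.mode is read.
    return (
        not isinstance(img, torch.Tensor)
        and discard_alpha_channel
        and img.mode != "RGB"
    )

print(needs_rgb_convert(torch.zeros(3, 2, 2), True))       # False
print(needs_rgb_convert(Image.new("RGBA", (2, 2)), True))  # True
```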
You might consider option 3. Some processors may only accept PIL images, so one possible approach is to add a switch to disable GPU image decoding for those models. For example (just a quick idea, not very well thought through, just for reference):

```python
# base_processor.py
class BaseMultimodalProcessor(ABC):
    gpu_image_decode = True  # Enable GPU decoding by default
    ...

    @staticmethod
    def _load_single_item(data, modality, ..., gpu_image_decode=True):
        if modality == Modality.IMAGE:
            img, _ = load_image(data, use_gpu=gpu_image_decode)
        ...
```

Then incompatible models could simply turn it off:

```python
# minicpm.py
class MiniCPMMultimodalProcessor(BaseMultimodalProcessor):
    gpu_image_decode = False  # MiniCPM HF processor does not support tensor inputs
```

Just a quick thought for reference.
Good idea, let's try this.
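The class-attribute switch proposed above can be exercised in isolation. A runnable sketch (class and method names mirror the proposal but are simplified stand-ins, not the real SGLang classes):

```python
from abc import ABC

class BaseMultimodalProcessor(ABC):
    # Class-level switch: subclasses flip it off when their HF processor
    # cannot accept decoded GPU tensors (e.g. PIL-only processors).
    gpu_image_decode = True

    def decode_path(self) -> str:
        # Stand-in for the real image-loading logic; just reports which
        # decode path the switch would select.
        return "nvjpeg-gpu" if self.gpu_image_decode else "pil-cpu"

class MiniCPMMultimodalProcessor(BaseMultimodalProcessor):
    gpu_image_decode = False  # PIL-only processor: force CPU decode

print(BaseMultimodalProcessor().decode_path())     # nvjpeg-gpu
print(MiniCPMMultimodalProcessor().decode_path())  # pil-cpu
```

Because the switch is a plain class attribute, a model opts out with a single line and no changes to the shared loading code.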
@yhyang201 I managed to add the switch; could you review it at your convenience?

/rerun-failed-ci
It seems like this change may affect InternVL2.5 and KimiVL.

The code for these two models is now fixed.

/rerun-failed-ci again
The GPQA test results have been updated in the description. Could we move this PR forward?

Let me see what exactly is wrong with the CI.

I'll rebase and see if the CI passes.
max_dynamic_patch: Optional[int] = None

image_extension_names = (".png", ".jpg", ".jpeg", ".webp", ".gif")
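As a side note on the extension tuple above: only JPEG inputs can take the nvJPEG fast path, so a dispatch helper might look like the sketch below (the helper name and the fallback labels are hypothetical, not from the PR):

```python
image_extension_names = (".png", ".jpg", ".jpeg", ".webp", ".gif")
jpeg_extension_names = (".jpg", ".jpeg")

def pick_decode_path(path: str) -> str:
    # Only JPEG files can be decoded by nvJPEG on the GPU; every other
    # supported image format falls back to PIL decoding on the CPU.
    lower = path.lower()
    if lower.endswith(jpeg_extension_names):
        return "nvjpeg-gpu"
    if lower.endswith(image_extension_names):
        return "pil-cpu"
    return "unsupported"

print(pick_decode_path("photo.JPG"))  # nvjpeg-gpu
print(pick_decode_path("icon.png"))   # pil-cpu
```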
We need an mm_utils.py in this folder after this PR.
cc @yhyang201
Great work. Do we have e2e comparison results, BTW?

Thank you for your attention, @mickqian! I have added the result of a simple E2E test with Qwen3VL-8B to the description.
@wili-65535 I'm thinking we might need performance statistics from e2e benchmarks for this PR; you could check
v0.2: fix CI error
v2.0: add gpu_image_decode
v2.1: fix in encode_server.py
v2.2: fix more models
We used a tool to conduct a latency test on Qwen3-VL-8B-Instruct (tp=1) with a single request while progressively increasing the number of images. Each request contains N images of the same resolution (N increasing from 1 to 32), a text input of 256 tokens, and an output length of 32 tokens. The timeout for each individual request is 300 seconds. Tests are conducted independently at three resolutions: 720p, 1080p, and 1440×2560, with the server restarted whenever switching resolutions. This setup shows how the response time of a single request changes as the number of images increases. For full experimental details, please refer to: main:
This PR:
This PR reduces TTFT by about 3–5% overall, with the most noticeable improvement (~5%) at 720p and smaller gains at higher resolutions.

All CI checks have passed; should we go ahead and merge?
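For reference, one request in the latency sweep described above could be assembled like this hypothetical sketch (model name, prompt, and payload shape follow the OpenAI chat-completions convention; none of it is taken from the benchmark tool itself):

```python
import base64

def build_request(image_bytes: bytes, n_images: int, text: str) -> dict:
    # N copies of the same JPEG plus one text part, with max_tokens
    # matching the 32-token output length used in the sweep.
    b64 = base64.b64encode(image_bytes).decode()
    content = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for _ in range(n_images)
    ]
    content.append({"type": "text", "text": text})
    return {
        "model": "Qwen3-VL-8B-Instruct",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 32,
    }

req = build_request(b"\xff\xd8\xff", 4, "Describe these images.")
print(len(req["messages"][0]["content"]))  # 5 (4 image parts + 1 text part)
```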
Motivation
Modifications
Use torch.ops.image.decode_jpegs_cuda, converting CPU bytes directly to torch GPU tensors using the nvJPEG hardware decoder.

Accuracy Tests
lm_eval shows no drop between the main branch and this PR; both get similar scores. Command:

lmms_eval shows no drop between the main branch and this PR; both get similar scores. Command:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci