[Feature] Optimizations for JPEG input on NVIDIA GPU #19749
yhyang201 merged 2 commits into sgl-project:main
Conversation
/tag-and-rerun-ci

Hi maintainers, could you help me understand the CI failures? I'd like to address them to move this PR forward.
/rerun-failed-ci

CI might be flaky; please rerun until all checks pass.

/rerun-failed-ci
Hi @yhyang201 @yuan-luo, I investigated the CI report and found some information.

registered/vlm/test_vision_openai_server_a.py:
openai.InternalServerError: Error code: 500 - {'object': 'error', 'message': "Internal server error: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.", 'type': 'InternalServerError', 'param': None, 'code': 500}

registered/lora/test_multi_lora_backend.py (reproduced stably):
KeyError: '/loky-7341-yz54xt5a'

xpu/test_intel_xpu_backend.py (reproduced stably):
AttributeError: module 'torch.xpu' has no attribute 'graph_pool_handle'

Some other tests report the error below, but I don't know what it means:
Error: Unhandled error: HttpError: <!DOCTYPE html>
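For reference, the 500 error above is the standard failure mode when a GPU-resident tensor is handed to numpy. A minimal sketch (assuming only that the decoded image is a torch tensor, possibly on CUDA):

```python
import torch

# Minimal reproduction sketch: numpy cannot read GPU memory, so a
# CUDA tensor must be copied back to the host before conversion.
img = torch.zeros(3, 48, 64, dtype=torch.uint8)
if torch.cuda.is_available():
    img = img.cuda()
    # img.numpy() here would raise: "can't convert cuda:0 device type
    # tensor to numpy. Use Tensor.cpu() to copy the tensor to host
    # memory first."

arr = img.cpu().numpy()  # copy to host memory, then convert
print(arr.shape)  # (3, 48, 64)
```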
For the failing unit tests of MiniCPM-o-2_6 and MiniCPM-V-4, we have several solutions:
- if discard_alpha_channel and img.mode != "RGB":
+ if (
+     discard_alpha_channel
+     and img.mode != "RGB"
This may also need a small adjustment.

It seems we should still check not isinstance(img, torch.Tensor) first?

Sorry, maybe I lost the commit... fixed now.
By the way, in encode_server.py, we cannot easily figure out the model kind (unless we search the name from self.server_args).
So gpu_image_decode is disabled by default there.
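A minimal sketch of the guard ordering under discussion (the function name is hypothetical; the real code inlines this condition):

```python
import torch
from PIL import Image

def needs_rgb_convert(img, discard_alpha_channel: bool) -> bool:
    # GPU-decoded images arrive as torch tensors, which have no .mode
    # attribute, so the isinstance guard must run before img.mode is read.
    return (
        not isinstance(img, torch.Tensor)
        and discard_alpha_channel
        and img.mode != "RGB"
    )

print(needs_rgb_convert(torch.zeros(3, 2, 2), True))       # False
print(needs_rgb_convert(Image.new("RGBA", (2, 2)), True))  # True
```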
You might consider option 3. Some processors may only accept PIL images, so one possible approach is to add a switch to disable GPU image decoding for those models. For example (just a quick idea, not very well thought through, just for reference):

```python
# base_processor.py
class BaseMultimodalProcessor(ABC):
    gpu_image_decode = True  # Enable GPU decoding by default
    ...

    @staticmethod
    def _load_single_item(data, modality, ..., gpu_image_decode=True):
        if modality == Modality.IMAGE:
            img, _ = load_image(data, use_gpu=gpu_image_decode)
        ...
```

Then incompatible models could simply turn it off:

```python
# minicpm.py
class MiniCPMMultimodalProcessor(BaseMultimodalProcessor):
    gpu_image_decode = False  # MiniCPM HF processor does not support tensor inputs
```

Just a quick thought for reference.
Good idea, let's try this.
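The class-attribute switch proposed above can be exercised in isolation. A runnable sketch (class and method names mirror the proposal but are simplified stand-ins, not the real SGLang classes):

```python
from abc import ABC

class BaseMultimodalProcessor(ABC):
    # Class-level switch: subclasses flip it off when their HF processor
    # cannot accept decoded GPU tensors (e.g. PIL-only processors).
    gpu_image_decode = True

    def decode_path(self) -> str:
        # Stand-in for the real image-loading logic; just reports which
        # decode path the switch would select.
        return "nvjpeg-gpu" if self.gpu_image_decode else "pil-cpu"

class MiniCPMMultimodalProcessor(BaseMultimodalProcessor):
    gpu_image_decode = False  # PIL-only processor: force CPU decode

print(BaseMultimodalProcessor().decode_path())     # nvjpeg-gpu
print(MiniCPMMultimodalProcessor().decode_path())  # pil-cpu
```

Because the switch is a plain class attribute, a model opts out with a single line and no changes to the shared loading code.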
@yhyang201 I managed to add the switch; could you review it at your convenience?

/rerun-failed-ci
It seems like this change may affect InternVL2.5 and KimiVL.

The code for these two models is now fixed.

/rerun-failed-ci again
The GPQA test results have been updated in the description. Could we move this PR forward?

Let me see what exactly is wrong with the CI.

I'll rebase and see if the CI passes.
max_dynamic_patch: Optional[int] = None

image_extension_names = (".png", ".jpg", ".jpeg", ".webp", ".gif")
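As a side note on the extension tuple above: only JPEG inputs can take the nvJPEG fast path, so a dispatch helper might look like the sketch below (the helper name and the fallback labels are hypothetical, not from the PR):

```python
image_extension_names = (".png", ".jpg", ".jpeg", ".webp", ".gif")
jpeg_extension_names = (".jpg", ".jpeg")

def pick_decode_path(path: str) -> str:
    # Only JPEG files can be decoded by nvJPEG on the GPU; every other
    # supported image format falls back to PIL decoding on the CPU.
    lower = path.lower()
    if lower.endswith(jpeg_extension_names):
        return "nvjpeg-gpu"
    if lower.endswith(image_extension_names):
        return "pil-cpu"
    return "unsupported"

print(pick_decode_path("photo.JPG"))  # nvjpeg-gpu
print(pick_decode_path("icon.png"))   # pil-cpu
```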
We need an mm_utils.py in this folder after this PR.
cc @yhyang201
Great work. Do we have e2e comparison results, BTW?

Thank you for your attention, @mickqian! I have added the result of a simple E2E test with Qwen3VL-8B to the description.
@wili-65535 I'm thinking we might need performance statistics from e2e benchmarks for this PR; you could check
v0.2: fix CI error
v2.0: add gpu_image_decode
v2.1: fix in encode_server.py
v2.2: fix more models
We used a tool to conduct a latency test on Qwen3-VL-8B-Instruct (tp=1) with a single request while progressively increasing the number of images. Each request contains N images of the same resolution (N increasing from 1 to 32), a text input of 256 tokens, and an output length of 32 tokens. The timeout for each individual request is 300 seconds. Tests are conducted independently at three resolutions: 720p, 1080p, and 1440×2560, with the server restarted whenever switching resolutions. This setup shows how the response time of a single request changes as the number of images increases. For full experimental details, please refer to: main:
This PR:
This PR reduces TTFT by about 3–5% overall, with the most noticeable improvement (~5%) at 720p and smaller gains at higher resolutions.

All CI checks have passed; should we go ahead and merge?
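For reference, one request in the latency sweep described above could be assembled like this hypothetical sketch (model name, prompt, and payload shape follow the OpenAI chat-completions convention; none of it is taken from the benchmark tool itself):

```python
import base64

def build_request(image_bytes: bytes, n_images: int, text: str) -> dict:
    # N copies of the same JPEG plus one text part, with max_tokens
    # matching the 32-token output length used in the sweep.
    b64 = base64.b64encode(image_bytes).decode()
    content = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for _ in range(n_images)
    ]
    content.append({"type": "text", "text": text})
    return {
        "model": "Qwen3-VL-8B-Instruct",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 32,
    }

req = build_request(b"\xff\xd8\xff", 4, "Describe these images.")
print(len(req["messages"][0]["content"]))  # 5 (4 image parts + 1 text part)
```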
Motivation
Modifications
Use torch.ops.image.decode_jpegs_cuda, converting CPU bytes directly to torch GPU tensors using the nvJPEG hardware decoder.

Accuracy Tests
lm_eval shows no drop between the main branch and this PR; both get similar scores. Command:

lmms_eval shows no drop between the main branch and this PR; both get similar scores. Command:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci