
[VLM] Fix CUDA IPC OOM #16118

Merged
BBuf merged 1 commit into sgl-project:main from antgroup:fix_cuda_ipc_oom
Jan 7, 2026

Conversation

@yuan-luo
Collaborator

Motivation

Close #16106.

The memory-pool-based CUDA IPC feature has a fallback mechanism: if a slice cannot be allocated from the pool, it falls back to using a legacy tensor. In that case, the current implementation forgets to perform the D2H (device-to-host) copy, so device memory leaks and OOM occurs very frequently. The fix aligns the fallback branch with the legacy path. Repro commands (without this PR, the server OOMs):

$ SGLANG_MM_FEATURE_CACHE_MB=1 SGLANG_USE_CUDA_IPC_TRANSPORT=1 SGLANG_VLM_CACHE_SIZE_MB=512 python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-8B-Instruct/ --host 0.0.0.0 --port 8188 --trust-remote-code --tp-size 2 --enable-cache-report --log-level info --max-running-requests 48 --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend flashinfer --mm-attention-backend fa3 --log-level debug --log-requests --log-requests-level 1

$ python3 -m sglang.bench_serving --backend sglang-oai-chat --dataset-name image --num-prompts 100 --apply-chat-template --random-output-len 100 --random-input-len 100 --image-resolution 1120x700 --image-format jpeg --image-count 7 --image-content random --random-range-ratio 1 --port 8188 --max-concurrency 100
benchmark_args=Namespace(backend='sglang-oai-chat', base_url=None, host='0.0.0.0', port=8188, dataset_name='image', dataset_path='', model=None, served_model_name=None, tokenizer=None, num_prompts=100, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=100, random_output_len=100, random_range_ratio=1.0, image_count=7, image_resolution='1120x700', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=100, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)
Namespace(backend='sglang-oai-chat', base_url=None, host='0.0.0.0', port=8188, dataset_name='image', dataset_path='', model='/home/admin/Qwen3-VL-8B-Instruct/', served_model_name=None, tokenizer=None, num_prompts=100, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=100, random_output_len=100, random_range_ratio=1.0, image_count=7, image_resolution='1120x700', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=100, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)

#Input tokens: 551726
#Output tokens: 10000

Created 100 random jpeg images with average 5556953 bytes per request
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:06<00:00,  1.26s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 100       
Successful requests:                     100       
Benchmark duration (s):                  126.29    
Total input tokens:                      551726    
Total input text tokens:                 11326     
Total input vision tokens:               540400    
Total generated tokens:                  10000     
Total generated tokens (retokenized):    9111      
Request throughput (req/s):              0.79      
Input token throughput (tok/s):          4368.57   
Output token throughput (tok/s):         79.18     
Peak output token throughput (tok/s):    2959.00   
Peak concurrent requests:                100       
Total token throughput (tok/s):          4447.75   
Concurrency:                             85.04     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   107397.30 
Median E2E Latency (ms):                 96080.01  
---------------Time to First Token----------------
Mean TTFT (ms):                          92413.64  
Median TTFT (ms):                        93481.04  
P99 TTFT (ms):                           123124.15 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          151.35    
Median TPOT (ms):                        143.49    
P99 TPOT (ms):                           647.20    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           166.07    
Median ITL (ms):                         14.64     
P95 ITL (ms):                            18.56     
P99 ITL (ms):                            965.85    
Max ITL (ms):                            62506.78  
==================================================
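The fix described above can be sketched as follows. This is a hypothetical, simplified model (names such as DevicePool, DeviceTensor, send_features, and d2h_copy are illustrative and are not SGLang's actual API): the IPC transport first tries to place the multimodal features in a shared device pool, and when the slice allocation fails it must fall back to the legacy path, which copies the tensor device-to-host and then releases the device buffer. The bug was that the fallback branch skipped that copy-and-release step, leaking device memory on every fallback.

```python
# Illustrative sketch of the fallback fix, not the SGLang implementation.

class DevicePool:
    """Shared device memory pool used by the CUDA IPC transport (mock)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_alloc(self, size):
        if self.used + size > self.capacity:
            return None  # slice cannot be allocated -> caller must fall back
        self.used += size
        return ("pool_slice", size)


class DeviceTensor:
    """Stand-in for a GPU tensor; tracks total live device bytes."""

    live_bytes = 0

    def __init__(self, size):
        self.size = size
        DeviceTensor.live_bytes += size

    def d2h_copy(self):
        # Copy device-to-host, then release the device buffer.
        host = ("host_tensor", self.size)
        self.free()
        return host

    def free(self):
        if self.size:
            DeviceTensor.live_bytes -= self.size
            self.size = 0


def send_features(pool, tensor):
    slice_ = pool.try_alloc(tensor.size)
    if slice_ is not None:
        # Fast path: features live in the IPC pool; the device copy is
        # no longer needed.
        tensor.free()
        return slice_
    # Fallback: align with the legacy path -- copy D2H and release the
    # device buffer. Skipping this step is what caused the leak/OOM.
    return tensor.d2h_copy()


pool = DevicePool(capacity=100)
send_features(pool, DeviceTensor(80))  # fits in the pool
send_features(pool, DeviceTensor(80))  # pool full -> fallback with D2H copy
print(DeviceTensor.live_bytes)  # 0: no device memory leaked on fallback
```

Without the `d2h_copy()` in the fallback branch, `live_bytes` would grow by 80 on every fallback, which mirrors the steadily rising device usage that triggered the OOM under the benchmark above.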

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor


@yudian0504
Contributor

Is the del pixel_values in scheduler_batch.py necessary?

@yuan-luo
Collaborator Author

> Is the del pixel_values in scheduler_batch.py necessary?

Yes, del pixel_values is highly recommended. Updated.

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 6, 2026
@yuan-luo
Collaborator Author

yuan-luo commented Jan 6, 2026

/rerun-failed-ci

1 similar comment
@yuan-luo
Collaborator Author

yuan-luo commented Jan 7, 2026

/rerun-failed-ci

@BBuf
Collaborator

BBuf commented Jan 7, 2026

@BBuf BBuf merged commit 5384674 into sgl-project:main Jan 7, 2026
158 of 165 checks passed
@yuan-luo yuan-luo deleted the fix_cuda_ipc_oom branch January 7, 2026 05:40
michaelzhang-ai pushed a commit to michaelzhang-ai/sglang that referenced this pull request Jan 7, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
dsingal0 pushed a commit to dsingal0/sglang that referenced this pull request Feb 1, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>


Successfully merging this pull request may close these issues.

[Bug][VLM] duplicated memory usage in rank 0 when enable SGLANG_USE_CUDA_IPC_TRANSPORT

4 participants