
[VLM] Fix CUDA IPC OOM #16118

Merged
BBuf merged 1 commit into sgl-project:main from antgroup:fix_cuda_ipc_oom
Jan 7, 2026

Conversation

@yuan-luo
Collaborator

Motivation

Close #16106.

The memory-pool-based CUDA IPC feature has a fallback mechanism: if a slice cannot be allocated from the pool, it falls back to using a legacy tensor. In that case, the current implementation forgets to perform the D2H (device-to-host) copy, so device memory leaks and OOM occurs very frequently. The fix aligns the fallback branch with the legacy path. Repro commands (without this PR, the server OOMs):

$ SGLANG_MM_FEATURE_CACHE_MB=1 SGLANG_USE_CUDA_IPC_TRANSPORT=1 SGLANG_VLM_CACHE_SIZE_MB=512 python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-8B-Instruct/ --host 0.0.0.0 --port 8188 --trust-remote-code --tp-size 2 --enable-cache-report --log-level info --max-running-requests 48 --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend flashinfer --mm-attention-backend fa3 --log-level debug --log-requests --log-requests-level 1

$ python3 -m sglang.bench_serving --backend sglang-oai-chat --dataset-name image --num-prompts 100 --apply-chat-template --random-output-len 100 --random-input-len 100 --image-resolution 1120x700 --image-format jpeg --image-count 7 --image-content random --random-range-ratio 1 --port 8188 --max-concurrency 100
benchmark_args=Namespace(backend='sglang-oai-chat', base_url=None, host='0.0.0.0', port=8188, dataset_name='image', dataset_path='', model=None, served_model_name=None, tokenizer=None, num_prompts=100, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=100, random_output_len=100, random_range_ratio=1.0, image_count=7, image_resolution='1120x700', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=100, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)
Namespace(backend='sglang-oai-chat', base_url=None, host='0.0.0.0', port=8188, dataset_name='image', dataset_path='', model='/home/admin/Qwen3-VL-8B-Instruct/', served_model_name=None, tokenizer=None, num_prompts=100, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=100, random_output_len=100, random_range_ratio=1.0, image_count=7, image_resolution='1120x700', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=100, output_file=None, output_details=False, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=True, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)

#Input tokens: 551726
#Output tokens: 10000

Created 100 random jpeg images with average 5556953 bytes per request
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:06<00:00,  1.26s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 100       
Successful requests:                     100       
Benchmark duration (s):                  126.29    
Total input tokens:                      551726    
Total input text tokens:                 11326     
Total input vision tokens:               540400    
Total generated tokens:                  10000     
Total generated tokens (retokenized):    9111      
Request throughput (req/s):              0.79      
Input token throughput (tok/s):          4368.57   
Output token throughput (tok/s):         79.18     
Peak output token throughput (tok/s):    2959.00   
Peak concurrent requests:                100       
Total token throughput (tok/s):          4447.75   
Concurrency:                             85.04     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   107397.30 
Median E2E Latency (ms):                 96080.01  
---------------Time to First Token----------------
Mean TTFT (ms):                          92413.64  
Median TTFT (ms):                        93481.04  
P99 TTFT (ms):                           123124.15 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          151.35    
Median TPOT (ms):                        143.49    
P99 TPOT (ms):                           647.20    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           166.07    
Median ITL (ms):                         14.64     
P95 ITL (ms):                            18.56     
P99 ITL (ms):                            965.85    
Max ITL (ms):                            62506.78  
==================================================
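The fix described above can be sketched as follows. This is a hypothetical, simplified model (names such as DevicePool, DeviceTensor, send_features, and d2h_copy are illustrative and are not SGLang's actual API): the IPC transport first tries to place the multimodal features in a shared device pool, and when the slice allocation fails it must fall back to the legacy path, which copies the tensor device-to-host and then releases the device buffer. The bug was that the fallback branch skipped that copy-and-release step, leaking device memory on every fallback.

```python
# Illustrative sketch of the fallback fix, not the SGLang implementation.

class DevicePool:
    """Shared device memory pool used by the CUDA IPC transport (mock)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_alloc(self, size):
        if self.used + size > self.capacity:
            return None  # slice cannot be allocated -> caller must fall back
        self.used += size
        return ("pool_slice", size)


class DeviceTensor:
    """Stand-in for a GPU tensor; tracks total live device bytes."""

    live_bytes = 0

    def __init__(self, size):
        self.size = size
        DeviceTensor.live_bytes += size

    def d2h_copy(self):
        # Copy device-to-host, then release the device buffer.
        host = ("host_tensor", self.size)
        self.free()
        return host

    def free(self):
        if self.size:
            DeviceTensor.live_bytes -= self.size
            self.size = 0


def send_features(pool, tensor):
    slice_ = pool.try_alloc(tensor.size)
    if slice_ is not None:
        # Fast path: features live in the IPC pool; the device copy is
        # no longer needed.
        tensor.free()
        return slice_
    # Fallback: align with the legacy path -- copy D2H and release the
    # device buffer. Skipping this step is what caused the leak/OOM.
    return tensor.d2h_copy()


pool = DevicePool(capacity=100)
send_features(pool, DeviceTensor(80))  # fits in the pool
send_features(pool, DeviceTensor(80))  # pool full -> fallback with D2H copy
print(DeviceTensor.live_bytes)  # 0: no device memory leaked on fallback
```

Without the `d2h_copy()` in the fallback branch, `live_bytes` would grow by 80 on every fallback, which mirrors the steadily rising device usage that triggered the OOM under the benchmark above.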

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor


@yudian0504
Contributor

Is the del pixel_values in scheduler_batch.py necessary?

@yuan-luo
Collaborator Author

> Is the del pixel_values in scheduler_batch.py necessary?

Yes, del pixel_values is highly recommended. Updated.

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 6, 2026
@yuan-luo
Collaborator Author

yuan-luo commented Jan 6, 2026

/rerun-failed-ci

1 similar comment
@yuan-luo
Collaborator Author

yuan-luo commented Jan 7, 2026

/rerun-failed-ci

@BBuf
Collaborator

BBuf commented Jan 7, 2026

@BBuf BBuf merged commit 5384674 into sgl-project:main Jan 7, 2026
158 of 165 checks passed
@yuan-luo yuan-luo deleted the fix_cuda_ipc_oom branch January 7, 2026 05:40
michaelzhang-ai pushed a commit to michaelzhang-ai/sglang that referenced this pull request Jan 7, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
dsingal0 pushed a commit to dsingal0/sglang that referenced this pull request Feb 1, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>


Successfully merging this pull request may close these issues.

[Bug][VLM] duplicated memory usage in rank 0 when enable SGLANG_USE_CUDA_IPC_TRANSPORT

4 participants