
[Feature][Perf] Support Selective CPU Weight Offloading #34535

Merged
vllm-bot merged 4 commits into vllm-project:main from wzhao18:wzhao/cpu-offload-moe-only
Feb 14, 2026

Conversation

@wzhao18
Contributor

@wzhao18 wzhao18 commented Feb 13, 2026

Purpose

This PR adds support for selectively offloading parameters to the CPU based on name matching. One use case is offloading only the expert weights of MoE models, which is useful in low-concurrency settings. The feature is enabled by passing the `--cpu-offload-params` argument.

Test Plan

Tested offloading Kimi K2 NVFP4 on one GB300.

VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY=1 python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Kimi-K2-Thinking-NVFP4 \
    --trust-remote-code \
    --cpu-offload-gb 350 \
    --load-format dummy \
    --cpu-offload-params w13_weight w2_weight

Benchmarking single-user throughput:

vllm bench serve \
  --backend vllm \
  --endpoint /v1/completions \
  --num-prompts 5 \
  --dataset-name random \
  --input-len 100 \
  --output-len 100 \
  --max-concurrency 1 \
  --trust-remote-code \
  --num-warmups 2

Test Result

Before: 15 tok/s

============ Serving Benchmark Result ============
Successful requests:                     5         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  32.17     
Total input tokens:                      500       
Total generated tokens:                  500       
Request throughput (req/s):              0.16      
Output token throughput (tok/s):         15.54     
Peak output token throughput (tok/s):    17.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          31.08     
---------------Time to First Token----------------
Mean TTFT (ms):                          302.71    
Median TTFT (ms):                        326.77    
P99 TTFT (ms):                           327.42    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.94     
Median TPOT (ms):                        61.93     
P99 TPOT (ms):                           61.95     
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.94     
Median ITL (ms):                         61.90     
P99 ITL (ms):                            62.72     
==================================================

Offloading MoE weights only: 31 tok/s

============ Serving Benchmark Result ============
Successful requests:                     5         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  15.81     
Total input tokens:                      500       
Total generated tokens:                  500       
Request throughput (req/s):              0.32      
Output token throughput (tok/s):         31.62     
Peak output token throughput (tok/s):    34.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          63.24     
---------------Time to First Token----------------
Mean TTFT (ms):                          244.12    
Median TTFT (ms):                        272.78    
P99 TTFT (ms):                           273.59    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.48     
Median TPOT (ms):                        29.48     
P99 TPOT (ms):                           29.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.48     
Median ITL (ms):                         29.47     
P99 ITL (ms):                            29.85     
==================================================
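The per-token latencies line up with the headline throughput gain; a quick sanity check on the numbers reported above:

```python
# Decode speed is roughly the inverse of TPOT, so the TPOT ratio should
# match the ~2x output-throughput improvement reported above.
tpot_before_ms = 61.94  # offloading all weights
tpot_after_ms = 29.48   # offloading MoE expert weights only

speedup = tpot_before_ms / tpot_after_ms
print(f"{speedup:.2f}x")  # ≈ 2.10x, consistent with 15.54 -> 31.62 tok/s
```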

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@wzhao18 wzhao18 requested a review from heheda12345 as a code owner February 13, 2026 19:44
@mergify mergify bot added the v1 label Feb 13, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 force-pushed the wzhao/cpu-offload-moe-only branch from 9683388 to fe927de on February 13, 2026 19:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature for selectively offloading model parameters to the CPU, which can significantly improve performance in memory-constrained scenarios, as demonstrated by the provided benchmarks. The implementation is clear and follows existing patterns in the codebase. The changes to the configuration and model loading logic are well-integrated. The parameter name matching logic, while a bit subtle, appears correct and robust for its intended purpose. Overall, this is a solid contribution that enhances the flexibility and performance of vLLM.

Member

@mgoin mgoin left a comment


Nice UX, LGTM!

@mgoin mgoin added the ready and nvidia labels Feb 13, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 13, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@vllm-bot vllm-bot merged commit b37b679 into vllm-project:main Feb 14, 2026
59 of 62 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 14, 2026
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Feb 14, 2026
…#34535)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
wzhao18 added a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
…#34535)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…#34535)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
@ehfd
Contributor

ehfd commented Mar 24, 2026

@wzhao18 Is it possible to use regex syntax, e.g. `.ffn_.*_exps.` (as supported in llama.cpp)?
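For comparison with the substring matching this PR implements, a regex-based matcher as asked about here could look like the following hypothetical sketch (the helper name and behavior are illustrative, not vLLM API):

```python
# Hypothetical regex-based matcher, contrasted with the PR's substring
# matching. Patterns follow llama.cpp-style regexes from the question above.
import re

def matches_regex(param_name: str, patterns: list[str]) -> bool:
    """Return True if any regex pattern is found within the parameter name."""
    return any(re.search(p, param_name) for p in patterns)

pattern = r"\.ffn_.*_exps\."
print(matches_regex("blk.7.ffn_down_exps.weight", [pattern]))  # True
print(matches_regex("blk.7.attn_q.weight", [pattern]))         # False
```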

