[Feature] Fix DeepSeek-V4 retract req offloading #24675

Open

yunkchen wants to merge 4 commits into sgl-project:main from yunkchen:support_dsv4_kv_cpu_offload

Conversation

@yunkchen
Contributor

yunkchen commented May 8, 2026

Motivation

Implements get_cpu_copy / load_cpu_copy across the full DSv4 KV-pool hierarchy so that retract->resume under SWA pressure no longer crashes the decode scheduler with a NotImplementedError.

Modifications

Rebase of closed PR #24042.
Related to #23639 and #23602.

Accuracy Tests

gsm8k, 500 samples; DeepSeek-V4-Flash-FP8 with PD-Disaggregation:

| Run | Retract | Score |
| --- | --- | --- |
| before-fix baseline | none | 0.976 (488/500) |
| after-fix sanity | none | 0.976 (488/500) |
| after-fix forced retract | SGLANG_TEST_RETRACT=1 & INTERVAL=40 (retract_events=843) | 0.980 (490/500) |

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Implementation details:

* dsv4/index_buf_accessor.GetKAndS - new per-loc gather kernel
  mirroring SetKAndS (Triton implementation, ~5x faster than a torch
  fallback); see the gather sketch after this list.
* DeepSeekV4SingleKVPool / HiSparseC4DevicePool / DeepSeekV4IndexerPool
  / DeepSeekV4TokenToKVPool - round-trip the (k_nope_fp8, k_rope_bf16,
  scale) tuple per layer; the aggregator derives sub-pool indices from
  full-pool indices using the same translations the decode write path
  uses (full_to_swa_index_mapping for swa, (full+1) % ratio == 0 masks
  for c4 / c128 / c4_indexer, and the swa_loc -> state_loc translation
  for the per-layer compress and indexer state pools); the index split
  is sketched below.
* CompressStatePool - row-indexed save/restore; duplicate state_locs
  are tolerated because each row is independent (sketched below).
* test/srt/test_swa_pd_offload_retract.py - round-trip test per pool
  plus a release_req source check.
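
For orientation, here is a minimal per-loc gather kernel in the shape the first bullet describes; the layout, names, and block size are illustrative assumptions, not the actual GetKAndS in dsv4/index_buf_accessor:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gather_rows_kernel(src_ptr, dst_ptr, locs_ptr,
                        ROW_SIZE: tl.constexpr, BLOCK: tl.constexpr):
    # One program per output row: read this row's source location,
    # then copy ROW_SIZE elements in BLOCK-sized chunks.
    row = tl.program_id(0)
    loc = tl.load(locs_ptr + row)
    offs = tl.arange(0, BLOCK)
    for start in range(0, ROW_SIZE, BLOCK):
        mask = start + offs < ROW_SIZE
        vals = tl.load(src_ptr + loc * ROW_SIZE + start + offs, mask=mask)
        tl.store(dst_ptr + row * ROW_SIZE + start + offs, vals, mask=mask)


def gather_rows(src: torch.Tensor, locs: torch.Tensor) -> torch.Tensor:
    # src: contiguous (num_slots, row_size) pool buffer; locs: (n,) slot indices.
    dst = torch.empty((locs.numel(), src.shape[1]),
                      dtype=src.dtype, device=src.device)
    _gather_rows_kernel[(locs.numel(),)](
        src, dst, locs, ROW_SIZE=src.shape[1], BLOCK=1024)
    return dst
```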
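
And a hedged sketch of the sub-pool index derivation the aggregator bullet refers to; full_to_swa_index_mapping and ratio follow the wording above, the helper itself is illustrative, and the pool-specific swa_loc -> state_loc translation is omitted:

```python
import torch


def split_full_indices(full_indices: torch.Tensor,
                       full_to_swa_index_mapping: torch.Tensor,
                       ratio: int):
    # SWA sub-pool: the same lookup table the decode write path uses.
    swa_locs = full_to_swa_index_mapping[full_indices]
    # c4 / c128 / c4_indexer sub-pools: only every ratio-th token owns
    # a row there, selected with the (full + 1) % ratio == 0 mask.
    owns_row = (full_indices + 1) % ratio == 0
    compressed_full = full_indices[owns_row]
    return swa_locs, compressed_full
```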
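
The CompressStatePool save/restore reduces to plain row indexing, which is why duplicate state_locs are harmless: reads repeat a row, writes store identical rows. A sketch with made-up names:

```python
import torch


def save_state_rows(state: torch.Tensor, state_locs: torch.Tensor) -> torch.Tensor:
    # Duplicate locs just read the same independent row twice.
    return state[state_locs].cpu()


def load_state_rows(state: torch.Tensor, state_locs: torch.Tensor,
                    saved: torch.Tensor) -> None:
    # Duplicate locs write identical rows, so the result is unchanged.
    state[state_locs] = saved.to(state.device)
```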

The aggregator does not call torch.cuda.synchronize(): D->H copies
target pageable host memory (cudaMemcpyAsync to a pageable destination
is synchronous w.r.t. the host), and pageable->device copies stage
through a pinned bounce buffer before .to() returns, so the caller's
del self.kv_cache_cpu is safe.
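
A minimal repro of that synchronization argument (illustrative tensors, not the pool code):

```python
import torch

src = torch.randn(4096, device="cuda")

# D->H into pageable memory: even with non_blocking=True, CUDA makes
# cudaMemcpyAsync to a pageable destination synchronous w.r.t. the
# host, so `host` is fully populated once .to() returns.
host = src.to("cpu", non_blocking=True)
assert torch.equal(host, src.cpu())

# Pageable H->D: the pageable source is staged through a pinned bounce
# buffer before .to() returns, so dropping the host copy right after
# (as release_req does via del self.kv_cache_cpu) is safe.
dev = host.to("cuda", non_blocking=True)
del host
assert torch.equal(dev.cpu(), src.cpu())  # .cpu() synchronizes before reading
```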

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gemini-code-assist bot left a comment

Code Review

This pull request adds CPU offloading support for DeepSeek-V4 memory pools, specifically implementing get_cpu_copy and load_cpu_copy for KV caches, indexers, and compression states. The implementation includes new Triton kernels for data retrieval and a set of round-trip unit tests to verify correctness. I have no feedback to provide as there were no review comments.

@yunkchen yunkchen marked this pull request as draft May 8, 2026 12:02
@yunkchen yunkchen marked this pull request as ready for review May 9, 2026 02:52
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@hzh0425 hzh0425 self-assigned this May 11, 2026
@hzh0425 hzh0425 changed the title [Feature] support SWA offloading for DeepSeek-V4 [Feature] Fix DeepSeek-V4 retract req offloading May 11, 2026