[Feature] Fix DeepSeek-V4 retract req offloading #24675

Open

yunkchen wants to merge 4 commits into sgl-project:main from yunkchen:support_dsv4_kv_cpu_offload

Conversation

@yunkchen
Contributor

yunkchen commented May 8, 2026

Motivation

Implements get_cpu_copy / load_cpu_copy across the full DSv4 KV-pool hierarchy so that retract->resume under SWA pressure no longer crashes the decode scheduler with a NotImplementedError.

Modifications

Rebase of closed PR #24042.
Related to #23639 and #23602.

Accuracy Tests

gsm8k, 500 samples; DeepSeek-V4-Flash-FP8 with PD-Disaggregation:

| Run | Retract | Score |
| --- | --- | --- |
| before-fix baseline | none | 0.976 (488/500) |
| after-fix sanity | none | 0.976 (488/500) |
| after-fix forced retract | SGLANG_TEST_RETRACT=1 & INTERVAL=40 (retract_events=843) | 0.980 (490/500) |

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Implementation details:

* dsv4/index_buf_accessor.GetKAndS - new per-loc gather kernel
  mirroring SetKAndS (Triton implementation, ~5x faster than a torch
  fallback); see the gather sketch after this list.
* DeepSeekV4SingleKVPool / HiSparseC4DevicePool / DeepSeekV4IndexerPool
  / DeepSeekV4TokenToKVPool - round-trip the (k_nope_fp8, k_rope_bf16,
  scale) tuple per layer; the aggregator derives sub-pool indices from
  full-pool indices using the same translations the decode write path
  uses (full_to_swa_index_mapping for swa, (full+1) % ratio == 0 masks
  for c4 / c128 / c4_indexer, and the swa_loc -> state_loc translation
  for the per-layer compress and indexer state pools); the index split
  is sketched below.
* CompressStatePool - row-indexed save/restore; duplicate state_locs
  are tolerated because each row is independent (sketched below).
* test/srt/test_swa_pd_offload_retract.py - round-trip test per pool
  plus a release_req source check.
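
For orientation, here is a minimal per-loc gather kernel in the shape the first bullet describes; the layout, names, and block size are illustrative assumptions, not the actual GetKAndS in dsv4/index_buf_accessor:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gather_rows_kernel(src_ptr, dst_ptr, locs_ptr,
                        ROW_SIZE: tl.constexpr, BLOCK: tl.constexpr):
    # One program per output row: read this row's source location,
    # then copy ROW_SIZE elements in BLOCK-sized chunks.
    row = tl.program_id(0)
    loc = tl.load(locs_ptr + row)
    offs = tl.arange(0, BLOCK)
    for start in range(0, ROW_SIZE, BLOCK):
        mask = start + offs < ROW_SIZE
        vals = tl.load(src_ptr + loc * ROW_SIZE + start + offs, mask=mask)
        tl.store(dst_ptr + row * ROW_SIZE + start + offs, vals, mask=mask)


def gather_rows(src: torch.Tensor, locs: torch.Tensor) -> torch.Tensor:
    # src: contiguous (num_slots, row_size) pool buffer; locs: (n,) slot indices.
    dst = torch.empty((locs.numel(), src.shape[1]),
                      dtype=src.dtype, device=src.device)
    _gather_rows_kernel[(locs.numel(),)](
        src, dst, locs, ROW_SIZE=src.shape[1], BLOCK=1024)
    return dst
```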
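
And a hedged sketch of the sub-pool index derivation the aggregator bullet refers to; full_to_swa_index_mapping and ratio follow the wording above, the helper itself is illustrative, and the pool-specific swa_loc -> state_loc translation is omitted:

```python
import torch


def split_full_indices(full_indices: torch.Tensor,
                       full_to_swa_index_mapping: torch.Tensor,
                       ratio: int):
    # SWA sub-pool: the same lookup table the decode write path uses.
    swa_locs = full_to_swa_index_mapping[full_indices]
    # c4 / c128 / c4_indexer sub-pools: only every ratio-th token owns
    # a row there, selected with the (full + 1) % ratio == 0 mask.
    owns_row = (full_indices + 1) % ratio == 0
    compressed_full = full_indices[owns_row]
    return swa_locs, compressed_full
```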
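
The CompressStatePool save/restore reduces to plain row indexing, which is why duplicate state_locs are harmless: reads repeat a row, writes store identical rows. A sketch with made-up names:

```python
import torch


def save_state_rows(state: torch.Tensor, state_locs: torch.Tensor) -> torch.Tensor:
    # Duplicate locs just read the same independent row twice.
    return state[state_locs].cpu()


def load_state_rows(state: torch.Tensor, state_locs: torch.Tensor,
                    saved: torch.Tensor) -> None:
    # Duplicate locs write identical rows, so the result is unchanged.
    state[state_locs] = saved.to(state.device)
```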

The aggregator does not call torch.cuda.synchronize(): D->H copies
target pageable host memory (cudaMemcpyAsync to a pageable destination
is synchronous w.r.t. the host), and pageable->device copies stage
through a pinned bounce buffer before .to() returns, so the caller's
del self.kv_cache_cpu is safe.
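
A minimal repro of that synchronization argument (illustrative tensors, not the pool code):

```python
import torch

src = torch.randn(4096, device="cuda")

# D->H into pageable memory: even with non_blocking=True, CUDA makes
# cudaMemcpyAsync to a pageable destination synchronous w.r.t. the
# host, so `host` is fully populated once .to() returns.
host = src.to("cpu", non_blocking=True)
assert torch.equal(host, src.cpu())

# Pageable H->D: the pageable source is staged through a pinned bounce
# buffer before .to() returns, so dropping the host copy right after
# (as release_req does via del self.kv_cache_cpu) is safe.
dev = host.to("cuda", non_blocking=True)
del host
assert torch.equal(dev.cpu(), src.cpu())  # .cpu() synchronizes before reading
```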

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gemini-code-assist bot left a comment

Code Review

This pull request adds CPU offloading support for DeepSeek-V4 memory pools, specifically implementing get_cpu_copy and load_cpu_copy for KV caches, indexers, and compression states. The implementation includes new Triton kernels for data retrieval and a set of round-trip unit tests to verify correctness. I have no feedback to provide as there were no review comments.

@yunkchen yunkchen marked this pull request as draft May 8, 2026 12:02
@yunkchen yunkchen marked this pull request as ready for review May 9, 2026 02:52
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@hzh0425 hzh0425 self-assigned this May 11, 2026
@hzh0425 hzh0425 changed the title [Feature] support SWA offloading for DeepSeek-V4 [Feature] Fix DeepSeek-V4 retract req offloading May 11, 2026