[Feature] Fix DeepSeek-V4 retract req offloading #24675
Open
yunkchen wants to merge 4 commits into
Implements get_cpu_copy / load_cpu_copy across the full DSv4 KV-pool hierarchy so that retract->resume under SWA pressure no longer kills the decode scheduler with NotImplementedError.

* dsv4/index_buf_accessor.GetKAndS - new per-loc gather kernel mirroring SetKAndS (Triton implementation, ~5x faster than a torch fallback).
* DeepSeekV4SingleKVPool / HiSparseC4DevicePool / DeepSeekV4IndexerPool / DeepSeekV4TokenToKVPool - round-trip the (k_nope_fp8, k_rope_bf16, scale) tuple per layer; the aggregator derives sub-pool indices from full-pool indices using the same translations the decode write path uses:
  * full_to_swa_index_mapping for swa,
  * (full+1) % ratio == 0 masks for c4 / c128 / c4_indexer,
  * swa_loc -> state_loc translation for the per-layer compress and indexer state pools.
* CompressStatePool - row-indexed save/restore. Duplicate state_locs are tolerated because each row is independent.
* test/srt/test_swa_pd_offload_retract.py - round-trip per pool + release_req source check.

The aggregator does not call torch.cuda.synchronize(): D->H copies target pageable host memory (cudaMemcpyAsync to pageable is synchronous w.r.t. host), and pageable->device copies stage through a pinned bounce buffer before .to() returns, so the caller's del self.kv_cache_cpu is safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
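The index translations described above can be sketched in plain Python. This is an illustration only, not code from the PR: lists stand in for index tensors, `derive_sub_pool_indices` and the sample `full_to_swa` table are hypothetical, and reading the `(full+1) % ratio == 0` mask as "keep the slot that closes each group of `ratio` tokens" is an assumption.

```python
# Illustrative sketch only: plain lists stand in for index tensors.
# The function name and the sample mapping are hypothetical.

def derive_sub_pool_indices(full_locs, full_to_swa, ratio):
    """Translate full-pool slots into SWA slots and a compressed-pool mask."""
    # SWA slots come from a lookup table indexed by full-pool slot.
    swa_locs = [full_to_swa[loc] for loc in full_locs]
    # Assumed reading of the mask: only slots where (full + 1) is a
    # multiple of `ratio` carry an entry in the compressed pool.
    mask = [(loc + 1) % ratio == 0 for loc in full_locs]
    kept_full_locs = [loc for loc, keep in zip(full_locs, mask) if keep]
    return swa_locs, mask, kept_full_locs


# Hypothetical mapping: 8 full-pool slots folded onto 4 SWA slots.
full_to_swa = [0, 1, 2, 3, 0, 1, 2, 3]
swa, mask, kept = derive_sub_pool_indices([2, 3, 4, 7], full_to_swa, ratio=4)
print(swa, mask, kept)
```

Using the same translation on the save path and the restore path is what keeps the two consistent: a slot that the write path never populated is also never copied out or back in.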
Code Review
This pull request adds CPU offloading support for DeepSeek-V4 memory pools, specifically implementing get_cpu_copy and load_cpu_copy for KV caches, indexers, and compression states. The implementation includes new Triton kernels for data retrieval and a set of round-trip unit tests to verify correctness. I have no feedback to provide as there were no review comments.
Motivation
Implements get_cpu_copy / load_cpu_copy across the full DSv4 KV-pool hierarchy so that retract->resume under SWA pressure no longer kills the decode scheduler with NotImplementedError.
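A minimal sketch of the save/restore contract, assuming a row-indexed pool: `ToyStatePool` is a hypothetical stand-in rather than any class from this PR, nested lists replace device tensors, and the method signatures are assumed for illustration.

```python
# Illustrative sketch only: ToyStatePool is hypothetical and uses nested
# lists in place of device tensors. Signatures are assumptions.

class ToyStatePool:
    def __init__(self, num_rows, row_width):
        self.rows = [[0.0] * row_width for _ in range(num_rows)]

    def get_cpu_copy(self, locs):
        # Copy each requested row. Duplicate locs yield identical copies.
        return [list(self.rows[loc]) for loc in locs]

    def load_cpu_copy(self, cpu_copy, locs):
        if len(cpu_copy) != len(locs):
            raise ValueError("cpu_copy and locs must have the same length")
        # Rows are independent, so writing a duplicate loc twice is harmless.
        for row, loc in zip(cpu_copy, locs):
            self.rows[loc] = list(row)


pool = ToyStatePool(num_rows=4, row_width=2)
pool.rows[1] = [1.0, 2.0]
pool.rows[3] = [3.0, 4.0]

locs = [1, 3, 1]                 # duplicate loc on purpose
saved = pool.get_cpu_copy(locs)  # taken when the request is retracted

# Simulate the slots being reused by other requests in the meantime.
pool.rows[1] = [9.0, 9.0]
pool.rows[3] = [9.0, 9.0]

pool.load_cpu_copy(saved, locs)  # applied when the request resumes
```

After the restore, rows 1 and 3 hold their saved values again and untouched rows are unchanged, which is the property the round-trip tests check per pool.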
Modifications
Rebase of closed PR #24042
Related to #23639, #23602
Accuracy Tests
gsm8k 500 samples
DeepSeek-V4-Flash-FP8 with PD-Disaggregation
SGLANG_TEST_RETRACT=1 & INTERVAL=40, retract_events=843
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci