Fix DeepSeek V4 expert distribution recording #25961

Closed
xutizhou wants to merge 1 commit into sgl-project:main from xutizhou:fix/v4-eplb-record-support

xutizhou (Collaborator) commented May 21, 2026

Summary

  • Add DeepSeek V4 decoder-layer context for the expert distribution recorder so recorded expert counts are attributed to the correct layer.
  • Keep the same piecewise CUDA graph behavior as DeepSeek V2/V3 by only entering the recorder layer context when piecewise CUDA graph is disabled.
  • This enables DeepSeek V4 record dumps to be used for static EPLB expert location initialization.
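
The conditional layer context described in the summary can be sketched as follows. This is a minimal, self-contained illustration and not the PR's actual diff: `ToyRecorder`, `with_current_layer`, `on_select_experts`, `forward_layers`, and the `piecewise_cuda_graph_enabled` flag are hypothetical names chosen for the example.

```python
from collections import defaultdict
from contextlib import contextmanager, nullcontext


class ToyRecorder:
    """Hypothetical stand-in for an expert distribution recorder."""

    def __init__(self):
        self.current_layer = None
        self.counts = defaultdict(int)  # (layer_idx, expert_id) -> count

    @contextmanager
    def with_current_layer(self, layer_idx):
        # Attribute everything recorded inside the block to layer_idx.
        previous = self.current_layer
        self.current_layer = layer_idx
        try:
            yield
        finally:
            self.current_layer = previous

    def on_select_experts(self, topk_ids):
        if self.current_layer is None:
            return  # no layer context, so there is nothing to attribute to
        for expert_id in topk_ids:
            self.counts[(self.current_layer, expert_id)] += 1


def forward_layers(layer_topk_ids, recorder, piecewise_cuda_graph_enabled):
    """Toy layer loop: enter the recorder context only without piecewise graphs."""
    for layer_idx, topk_ids in enumerate(layer_topk_ids):
        ctx = (
            nullcontext()
            if piecewise_cuda_graph_enabled
            else recorder.with_current_layer(layer_idx)
        )
        with ctx:
            recorder.on_select_experts(topk_ids)


recorder = ToyRecorder()
forward_layers([[0, 1], [1, 1]], recorder, piecewise_cuda_graph_enabled=False)

skipped = ToyRecorder()
forward_layers([[0, 1], [1, 1]], skipped, piecewise_cuda_graph_enabled=True)
```

In this toy version, `recorder` ends up with per-layer counts, while `skipped` records nothing because the layer context is never entered.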

Verification

  • PYTHONPATH=$PWD/python python -m py_compile python/sglang/srt/models/deepseek_v4.py
  • Pre-commit hooks from git commit --amend passed.
  • DeepSeek-V4-Flash-FP8 TP4 DeepEP record smoke:
    • start record -> generate -> stop record -> dump record all returned successfully.
    • Dumped logical_count shape: (16, 43, 256), total count: 2400.
    • Nonzero recorded layers: 40 / 43, matching V4's early dense/non-MoE layers followed by MoE layers.
  • DeepSeek-V4-Flash-FP8 TP4 DeepEP record workload smoke:
    • bench_serving random workload completed while recording: 16 requests, random input len 512, random output len 16, max concurrency 4.
    • Dumped logical_count shape: (32, 43, 256), total count: 189600.
    • Nonzero recorded layers: 40 / 43.
  • DeepSeek-V4-Flash-FP8 TP4 DeepEP static EPLB smoke:
    • Launched with --ep-dispatch-algorithm static --init-expert-location <record_dump>.
    • Server logged init_expert_location from init_by_eplb using ServerArgs.init_expert_location on all TP ranks.
    • /generate returned HTTP 200 after static EPLB initialization.
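
The dump checks reported above (shape, total count, nonzero layers) could be reproduced with a helper along these lines. It is a sketch under assumptions: it expects the `logical_count` data as nested Python lists shaped (records, layers, experts), for example obtained by calling `.tolist()` on a loaded tensor, and `summarize_logical_count` is a hypothetical name. The toy input is synthetic, not data from the PR.

```python
def summarize_logical_count(logical_count):
    """Summarize nested lists shaped (records, layers, experts)."""
    num_records = len(logical_count)
    num_layers = len(logical_count[0])
    num_experts = len(logical_count[0][0])

    # Accumulate the number of recorded selections per layer across records.
    layer_sums = [0] * num_layers
    for record in logical_count:
        for layer_idx, expert_counts in enumerate(record):
            layer_sums[layer_idx] += sum(expert_counts)

    return {
        "shape": (num_records, num_layers, num_experts),
        "total": sum(layer_sums),
        "nonzero_layers": sum(1 for s in layer_sums if s > 0),
        "layer_sums": layer_sums,
    }


# Synthetic example: 2 records, 3 layers, 4 experts; layer 0 never records.
toy = [
    [[0, 0, 0, 0], [1, 0, 2, 0], [0, 3, 0, 0]],
    [[0, 0, 0, 0], [0, 1, 0, 0], [2, 0, 0, 1]],
]
summary = summarize_logical_count(toy)
```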

CI States

Latest PR Test (Base): ❌ Run #26263830117
Latest PR Test (Extra): ❌ Run #26263830039

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request updates the forward pass of the DeepseekV4 model to support expert distribution recording. It introduces a conditional context manager within the layer iteration loop that activates the expert distribution recorder when piecewise CUDA graphs are disabled. I have no feedback to provide as there were no review comments.

leihuang-sketch (Contributor) commented May 21, 2026

PR #25948 already exists

xutizhou (Collaborator, Author) commented

Additional EPLB functional smoke using the PR #19290 MMLU prompt workload shape, adapted to DeepSeek-V4-Flash-FP8 TP4 single-node:

  • Record phase:

    • Started V4 TP4 DeepEP server with --ep-num-redundant-experts 4 --expert-distribution-recorder-mode stat --expert-distribution-recorder-buffer-size 64.
    • Sent 64 prompts from /lustre/raplab/client/xutingz/workspace/bench/waterfill/mmlu_record_2k.json with max_new_tokens=1, concurrency 4.
    • Result: 64/64 requests succeeded.
    • Dump file: /tmp/dsv4_mmlu_eplb_pr/expert_distribution_recorder_1779356669.2558024.pt.
    • Dump validation: logical_count shape (64, 43, 256), sum 5441520, nonzero layers 40 / 43, first three layers zero as expected for V4 early dense/non-MoE layers.
  • Static EPLB phase:

    • Restarted V4 TP4 DeepEP server with --ep-dispatch-algorithm static --init-expert-location /tmp/dsv4_mmlu_eplb_pr/expert_distribution_recorder_1779356669.2558024.pt.
    • /server_info confirmed ep_dispatch_algorithm=static, ep_num_redundant_experts=4, and the expected init_expert_location path.
    • Sent 64 prompts from /lustre/raplab/client/xutingz/workspace/bench/waterfill/mmlu_bench_2k.json with max_new_tokens=1, concurrency 4.
    • Result: 64/64 requests succeeded.

xutizhou (Collaborator, Author) commented May 21, 2026

Pure EPLB verification on DeepSeek V4 Flash FP8, current PR commit bfee2a34020d8e85adb334b0e5412358a30ee01b.

Additional audit result:

  • V4 has num_hash_layers=3; these are HashTopK MoE layers, not dense layers like V3's first 3. They should be recorded.
  • This PR now records both paths: the outer V4 layer context sets the current layer, and HashTopK reports topk_ids after EPLB logical-to-physical remap and padding mask, matching normal TopK recorder semantics.
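
To illustrate what "reports topk_ids after EPLB logical-to-physical remap and padding mask" could look like, here is a small sketch. The `-1` padding sentinel, the dictionary-based `logical_to_physical` map, and the function names are all assumptions made for the example; the real implementation operates on tensors and may differ.

```python
PADDING_ID = -1  # assumed sentinel for masked (padded) token slots


def remap_and_mask(topk_ids, logical_to_physical, num_valid_tokens):
    """Map logical expert ids to physical ids and mask padded token rows."""
    out = []
    for token_idx, row in enumerate(topk_ids):
        if token_idx >= num_valid_tokens:
            out.append([PADDING_ID] * len(row))
        else:
            out.append([logical_to_physical[e] for e in row])
    return out


def count_physical(topk_ids, num_physical_experts):
    """Count selections per physical expert, ignoring padded slots."""
    counts = [0] * num_physical_experts
    for row in topk_ids:
        for expert_id in row:
            if expert_id != PADDING_ID:
                counts[expert_id] += 1
    return counts


logical_to_physical = {0: 2, 1: 0, 2: 1}
topk_ids = [[0, 1], [2, 0], [1, 1]]  # the third row is padding
reported = remap_and_mask(topk_ids, logical_to_physical, num_valid_tokens=2)
counts = count_physical(reported, num_physical_experts=3)
```

Reporting the masked ids means padded rows contribute nothing to the counts, which is the property the comment above describes as matching the normal TopK recorder.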

Corrected record validation:

  • MMLU record set: 2000/2000 requests succeeded, 0 errors.
  • Dumped logical_count shape: (2000, 43, 256).
  • Total recorded selections: 121,900,614.
  • Non-zero MoE layers: 43/43; first 3 layers nonzero: [True, True, True].
  • First 8 layer sums: all 2,834,898.
  • Record file: /lustre/raplab/client/xutingz/workspace/bench/waterfill/dsv4_pr25961_record_2000_hash_topk_20260522_095540/expert_distribution_recorder_1779415690.286911.pt.
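
As a quick consistency check on the figures above: the thread reports the per-layer sum only for the first 8 layers, so treating all 43 layers as having the same sum is an assumption. Under that assumption, the product matches the reported total exactly.

```python
num_layers = 43              # second dimension of the reported logical_count shape
per_layer_sum = 2_834_898    # reported for the first 8 layers; assumed for all 43
reported_total = 121_900_614

implied_total = num_layers * per_layer_sum
matches = implied_total == reported_total
```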

Serving throughput on an MMLU bench set disjoint from the record set: batch of 1000 prompts, concurrency 256, max_tokens=1, 4 warmup + 8 measured rounds, DeepEP normal, CUDA graph disabled. DeepEP Waterfill was not enabled in either run:

  • No EPLB / trivial expert location: 51,811 tok/s.
  • Explicit EPLB via --init-expert-location ... --enable-eplb --eplb-rebalance-num-iterations 1000000: 56,346 tok/s, +8.75%.
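
The reported gain follows directly from the two throughput figures; the variable names below are illustrative only.

```python
baseline_tok_s = 51_811  # no EPLB / trivial expert location
eplb_tok_s = 56_346      # explicit EPLB run

gain_pct = (eplb_tok_s - baseline_tok_s) / baseline_tok_s * 100
gain_rounded = round(gain_pct, 2)
```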

Both runs completed all measured rounds with 0 request errors. Logs also show enable_deepep_waterfill=False for the EPLB run and enable_eplb=False, init_expert_location='trivial' for the no-EPLB run.

Run dirs:

  • Record: /lustre/raplab/client/xutingz/workspace/bench/waterfill/dsv4_pr25961_record_2000_hash_topk_20260522_095540
  • No EPLB: /lustre/raplab/client/xutingz/workspace/bench/waterfill/v4_ab_pr25961_hashfix_pure_no_eplb_1000c256_20260522_100858
  • EPLB: /lustre/raplab/client/xutingz/workspace/bench/waterfill/v4_ab_pr25961_hashfix_pure_enable_eplb_1000c256_20260522_101410

ch-wan (Collaborator) commented May 24, 2026

I have merged #25948, and added you as a co-author. Thanks.

@ch-wan ch-wan closed this May 24, 2026

3 participants