This repository was archived by the owner on Apr 20, 2026. It is now read-only.
Qwen3.5 disagg support #185
Merged
Changes from all commits (11 commits, all by YAMY1234):

- `2cbbcc0` [WIP] qwen3.5 disagg support
- `e98bf91` temp deep ep config
- `a0feada` update agg configs
- `10b7ed5` clean up and fix
- `f9d6028` add some comment for deepep
- `84abf86` resolve code-rabbit comments
- `e6e5efc` remove content len for deepep config
- `96be4cc` naming and fixing
- `432220d` add experimental folder
- `80eb08d` combine kNumMaxTopK
- `1567392` fix rebuild_deepep
New file (+47 lines), the DeepEP rebuild script:

```bash
#!/bin/bash
set -eux

echo "=== Rebuilding DeepEP with kNumMaxTopK=16 for Qwen3.5 (topk=10) ==="

DEEPEP_SRC="/sgl-workspace/DeepEP"

if [ ! -d "$DEEPEP_SRC" ]; then
    echo "ERROR: DeepEP source not found at $DEEPEP_SRC (mount via extra_mount)"
    exit 1
fi

cd "$DEEPEP_SRC"

# Find NVSHMEM
NVSHMEM_DIR=$(find /usr/local -name "nvshmem" -type d 2>/dev/null | head -1)
if [ -z "${NVSHMEM_DIR:-}" ]; then
    echo "ERROR: NVSHMEM installation not found under /usr/local" >&2
    exit 1
fi
echo "NVSHMEM_DIR=$NVSHMEM_DIR"

# Fix missing nvshmem symlinks (container has .so.3 but not .so)
NVSHMEM_LIB="$NVSHMEM_DIR/lib"
if [ ! -f "$NVSHMEM_LIB/libnvshmem_host.so" ] && [ -f "$NVSHMEM_LIB/libnvshmem_host.so.3" ]; then
    echo "Creating missing nvshmem symlinks..."
    ln -sf libnvshmem_host.so.3 "$NVSHMEM_LIB/libnvshmem_host.so"
fi

# Apply kNumMaxTopK=16 patch (Qwen3.5 uses topk=10; the default kNumMaxTopK=8 is insufficient)
# Note: the source has both kNumMaxTopK (uppercase K) and kNumMaxTopk (lowercase k) as separate variables
sed -i 's/kNumMaxTopK[[:space:]]*=[[:space:]]*[0-9][0-9]*/kNumMaxTopK = 16/g' csrc/kernels/internode_ll.cu
sed -i 's/kNumMaxTopk[[:space:]]*=[[:space:]]*[0-9][0-9]*/kNumMaxTopk = 16/g' csrc/kernels/internode_ll.cu

# Verify the patch was applied
grep -q "kNumMaxTop. = 16" csrc/kernels/internode_ll.cu && echo "Patch verified: kNumMaxTopK/k=16" || {
    echo "ERROR: kNumMaxTopK patch failed to apply!"; exit 1;
}

# Build with full output so we can debug failures
# set -e will auto-exit on failure
TORCH_CUDA_ARCH_LIST="10.0" \
NVSHMEM_DIR="$NVSHMEM_DIR" \
pip install -e . --no-build-isolation 2>&1

echo "=== DeepEP rebuild complete ==="
python3 -c "import deep_ep; print('deep_ep imported successfully')"
```
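The script's patch step relies on two details: the two differently cased identifiers are distinct variables that must both be bumped, and the `grep "kNumMaxTop. = 16"` check uses `.` to match either casing. The intent can be sketched in Python's `re` module; the sample source lines below are hypothetical stand-ins, not the actual contents of `internode_ll.cu`:

```python
import re

# Hypothetical lines mimicking the two declarations in internode_ll.cu;
# the real file may format them differently.
src = """\
constexpr int kNumMaxTopK = 8;
constexpr int kNumMaxTopk = 9;
"""

# Equivalent of the two sed substitutions: each casing is a separate
# identifier, so each needs its own (case-sensitive) replacement.
patched = re.sub(r"kNumMaxTopK\s*=\s*[0-9]+", "kNumMaxTopK = 16", src)
patched = re.sub(r"kNumMaxTopk\s*=\s*[0-9]+", "kNumMaxTopk = 16", patched)

# Equivalent of the grep verification: '.' matches either 'K' or 'k',
# so one pattern confirms both substitutions landed.
matches = re.findall(r"kNumMaxTop. = 16", patched)
assert len(matches) == 2
print(patched)
```

Because the replacement is a plain regex and not a parser, it would also rewrite any other assignment to those names in the file; here that is the desired behavior.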
New file (+126 lines), the DEP4 prefill + DEP4 decode config (nesting inferred from the flattened diff view):

```yaml
# Qwen3.5-397B-A17B-FP8 Disaggregated 1P1D: DEP4 Prefill + DEP4 Decode
# Both sides use Data Expert Parallel (DP4 + TP4 + EP4) with dp-attention
# Homogeneous TP layout to avoid KV/Mamba state slice transfer overhead

name: "qwen3.5-1p1d-dep4-dep4"

model:
  path: "qwen3.5-fp8"
  container: "dev"  # docker://lmsysorg/sglang:dev
  precision: "fp8"

resources:
  gpu_type: "gb200"
  gpus_per_node: 4
  prefill_nodes: 1
  decode_nodes: 1
  prefill_workers: 1
  decode_workers: 1

backend:
  prefill_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
    PYTHONUNBUFFERED: "1"
    NCCL_MNNVL_ENABLE: "1"
    NCCL_CUMEM_ENABLE: "1"
    MC_FORCE_MNNVL: "1"
    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"

  decode_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
    PYTHONUNBUFFERED: "1"
    NCCL_MNNVL_ENABLE: "1"
    NCCL_CUMEM_ENABLE: "1"
    MC_FORCE_MNNVL: "1"
    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
    SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"

sglang_config:
  prefill:
    served-model-name: "Qwen/Qwen3.5-397B-A17B-FP8"
    model-path: "/model/"

    attention-backend: "trtllm_mha"
    quantization: "fp8"
    kv-cache-dtype: "fp8_e4m3"
    moe-runner-backend: "flashinfer_trtllm"

    # DEP4: DP4 + TP4 + EP4 with dp-attention (same layout as decode)
    tensor-parallel-size: 4
    data-parallel-size: 4
    expert-parallel-size: 4
    enable-dp-attention: true
    enable-dp-lm-head: true
    moe-dense-tp-size: 1

    mamba-scheduler-strategy: "no_buffer"
    mamba-track-interval: 2048
    mamba-ssm-dtype: "bfloat16"

    disaggregation-mode: "prefill"
    disable-radix-cache: true
    disaggregation-decode-tp: 4
    disaggregation-decode-dp: 4

    mem-fraction-static: 0.80
    chunked-prefill-size: 16384
    context-length: 2020
    load-balance-method: "round_robin"
    watchdog-timeout: 1000000
    disable-cuda-graph: true

  decode:
    served-model-name: "Qwen/Qwen3.5-397B-A17B-FP8"
    model-path: "/model/"

    attention-backend: "trtllm_mha"
    quantization: "fp8"
    kv-cache-dtype: "fp8_e4m3"
    moe-runner-backend: "flashinfer_trtllm"

    # DEP4: DP4 + TP4 + EP4 with dp-attention
    tensor-parallel-size: 4
    data-parallel-size: 4
    expert-parallel-size: 4
    enable-dp-attention: true
    enable-dp-lm-head: true
    moe-dense-tp-size: 1

    mamba-scheduler-strategy: "no_buffer"
    mamba-track-interval: 2048
    mamba-ssm-dtype: "bfloat16"

    disaggregation-mode: "decode"
    disable-radix-cache: true

    mem-fraction-static: 0.80
    chunked-prefill-size: 16384
    context-length: 2020
    cuda-graph-max-bs: 1024
    watchdog-timeout: 1000000

    decode-log-interval: 1
    stream-interval: 50

benchmark:
  type: "sa-bench"
  isl: 1000
  osl: 1000
  concurrencies: "1x2x4x8x16x32x64x128x256x512x1024"
  req_rate: "inf"
```
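The benchmark section sweeps a fixed set of concurrency levels encoded as an `x`-separated string. Assuming the sa-bench harness interprets the field the obvious way (its actual parser is not shown in this PR), a minimal sketch:

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Parse an 'x'-separated concurrency sweep spec into integers.

    Hypothetical helper mirroring how the benchmark harness presumably
    reads the `concurrencies` field; not code from this PR.
    """
    return [int(tok) for tok in spec.split("x")]

# The sweep from the DEP4 config above: powers of two from 1 to 1024
levels = parse_concurrencies("1x2x4x8x16x32x64x128x256x512x1024")
print(levels)
```

With `req_rate: "inf"`, each level would be driven as fast as the server accepts requests, so the sweep traces out a throughput/latency curve across batch pressure.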
New file (+115 lines), the TEP4 prefill + TEP4 decode config (nesting inferred from the flattened diff view):

```yaml
# Qwen3.5-397B-A17B-FP8 Disaggregated 1P1D: TEP4 Prefill + TEP4 Decode
# Both sides use Tensor Expert Parallel (TP4 + EP4), no dp-attention

name: "qwen3.5-1p1d-tep4-tep4"

model:
  path: "qwen3.5-fp8"
  container: "dev"  # docker://lmsysorg/sglang:dev
  precision: "fp8"

resources:
  gpu_type: "gb200"
  gpus_per_node: 4
  prefill_nodes: 1
  decode_nodes: 1
  prefill_workers: 1
  decode_workers: 1

backend:
  prefill_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
    PYTHONUNBUFFERED: "1"
    NCCL_MNNVL_ENABLE: "1"
    NCCL_CUMEM_ENABLE: "1"
    MC_FORCE_MNNVL: "1"
    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"

  decode_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
    PYTHONUNBUFFERED: "1"
    NCCL_MNNVL_ENABLE: "1"
    NCCL_CUMEM_ENABLE: "1"
    MC_FORCE_MNNVL: "1"
    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
    SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"

sglang_config:
  prefill:
    served-model-name: "Qwen/Qwen3.5-397B-A17B-FP8"
    model-path: "/model/"

    attention-backend: "trtllm_mha"
    quantization: "fp8"
    kv-cache-dtype: "fp8_e4m3"

    # TEP4: TP4 + EP4, standard TP attention (no dp-attention)
    tensor-parallel-size: 4
    expert-parallel-size: 4
    moe-dense-tp-size: 1

    mamba-scheduler-strategy: "no_buffer"
    mamba-track-interval: 2048
    mamba-ssm-dtype: "bfloat16"

    disaggregation-mode: "prefill"
    disable-radix-cache: true
    disaggregation-decode-tp: 4
    disaggregation-decode-dp: 1

    mem-fraction-static: 0.75
    chunked-prefill-size: 16384
    context-length: 2020

    load-balance-method: "round_robin"
    watchdog-timeout: 1000000
    disable-cuda-graph: true

  decode:
    served-model-name: "Qwen/Qwen3.5-397B-A17B-FP8"
    model-path: "/model/"

    attention-backend: "trtllm_mha"
    quantization: "fp8"
    kv-cache-dtype: "fp8_e4m3"

    # TEP4: TP4 + EP4, standard TP attention (no dp-attention)
    tensor-parallel-size: 4
    expert-parallel-size: 4
    moe-dense-tp-size: 1

    mamba-scheduler-strategy: "no_buffer"
    mamba-track-interval: 2048
    mamba-ssm-dtype: "bfloat16"

    disaggregation-mode: "decode"
    disable-radix-cache: true

    mem-fraction-static: 0.70
    chunked-prefill-size: 16384
    context-length: 2020
    watchdog-timeout: 1000000

benchmark:
  type: "sa-bench"
  isl: 1000
  osl: 1000
  concurrencies: "8x32x128x256x512x1024"
  req_rate: "inf"
```
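Both configs place one worker on a single GB200 node with 4 GPUs, so the parallel sizes have to line up with the hardware. A small consistency sketch, under the assumption (not stated in the PR) that a worker's GPU count equals `tensor-parallel-size` and that expert/data parallel sizes must divide it:

```python
def check_layout(nodes: int, gpus_per_node: int, tp: int, ep: int, dp: int = 1) -> None:
    """Sanity-check a worker's parallel layout against its GPU budget.

    Hypothetical helper encoding assumed SGLang-style constraints;
    not code from this PR.
    """
    gpus = nodes * gpus_per_node
    assert gpus == tp, f"worker has {gpus} GPUs but tensor-parallel-size={tp}"
    assert tp % ep == 0, "expert-parallel-size must divide tensor-parallel-size"
    assert tp % dp == 0, "data-parallel-size must divide tensor-parallel-size"

# DEP4 config: DP4 + TP4 + EP4 with dp-attention on one 4-GPU node
check_layout(nodes=1, gpus_per_node=4, tp=4, ep=4, dp=4)
# TEP4 config: TP4 + EP4, standard TP attention
check_layout(nodes=1, gpus_per_node=4, tp=4, ep=4)
print("both layouts consistent")
```

This also makes the trade-off between the two files concrete: DEP4 spends the same 4 GPUs on replicated (DP) attention with per-rank experts, while TEP4 shards attention across all 4 ranks, which is why only the DEP4 file sets `disaggregation-decode-dp: 4`.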