diff --git a/docs/advanced_features/separate_reasoning.ipynb b/docs/advanced_features/separate_reasoning.ipynb
index 0c20c5a08bd2..56a28f03ceae 100644
--- a/docs/advanced_features/separate_reasoning.ipynb
+++ b/docs/advanced_features/separate_reasoning.ipynb
@@ -13,7 +13,7 @@
     "| Model | Reasoning tags | Parser | Notes |\n",
     "|---------|-----------------------------|------------------|-------|\n",
     "| [DeepSeek‑R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |\n",
-    "| [DeepSeek‑V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Supports `thinking` parameter |\n",
+    "| [DeepSeek‑V3 series](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Including [DeepSeek‑V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp). Supports `thinking` parameter |\n",
     "| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |\n",
     "| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |\n",
     "| [Kimi models](https://huggingface.co/moonshotai/models) | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |\n",
@@ -26,7 +26,7 @@
     "- Both are handled by the same `deepseek-r1` parser\n",
     "\n",
     "**DeepSeek-V3 Family:**\n",
-    "- DeepSeek-V3.1: Hybrid model supporting both thinking and non-thinking modes; use the `deepseek-v3` parser and the `thinking` parameter (NOTE: not `enable_thinking`)\n",
+    "- DeepSeek-V3.1/V3.2: Hybrid models supporting both thinking and non-thinking modes; use the `deepseek-v3` parser and the `thinking` parameter (NOTE: not `enable_thinking`)\n",
     "\n",
     "**Qwen3 Family:**\n",
     "- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat 
templates\n", diff --git a/docs/basic_usage/deepseek.md b/docs/basic_usage/deepseek.md index 96f43ab0a7a9..39d3f4ab6fbc 100644 --- a/docs/basic_usage/deepseek.md +++ b/docs/basic_usage/deepseek.md @@ -170,7 +170,7 @@ python3 -m sglang.launch_server \ - The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes. - FlashAttention3, FlashMLA, and Triton backend fully supports MTP usage. For FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding,`--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends are still under development. - To enable DeepSeek MTP for large batch sizes (>32), there are some parameters should be changed (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)): - - Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP. For larger batch sizes, you should increase this value beyond the default value. + - Adjust `--max-running-requests` to a larger number. The default value is `48` for MTP. For larger batch sizes, you should increase this value beyond the default value. - Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it. 
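The MTP notes above can be combined into a single invocation. The following sketch is illustrative only: the model path, speculative settings, and batch-size list are placeholder assumptions, and the real values should come from a `bench_speculative.py` sweep for your batch sizes.

```shell
# Illustrative sketch only: enable MTP (minimum speculative configuration),
# raise --max-running-requests above the MTP default, and capture additional
# CUDA graph batch sizes. All numeric values here are placeholders, not
# tuned recommendations.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --max-running-requests 128 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Note that `--cuda-graph-bs` takes a list, so the captured sizes should cover the batch sizes you expect `--max-running-requests` to allow.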
diff --git a/docs/basic_usage/deepseek_v32.md b/docs/basic_usage/deepseek_v32.md
new file mode 100644
index 000000000000..bac87498b31a
--- /dev/null
+++ b/docs/basic_usage/deepseek_v32.md
@@ -0,0 +1,150 @@
+# DeepSeek V3.2 Usage
+
+[DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves efficiency improvements in long-context scenarios.
+
+For reporting issues or tracking upcoming features, please refer to this [Roadmap](https://github.com/sgl-project/sglang/issues/11060).
+
+## Installation
+
+### Docker
+
+```bash
+# H200/B200
+docker pull lmsysorg/sglang:latest
+
+# MI350/MI355
+docker pull lmsysorg/sglang:dsv32-rocm
+
+# NPUs
+docker pull lmsysorg/sglang:dsv32-a2
+docker pull lmsysorg/sglang:dsv32-a3
+```
+
+### Build From Source
+
+```bash
+# Install SGLang
+git clone https://github.com/sgl-project/sglang
+cd sglang
+pip3 install pip --upgrade
+pip3 install -e "python[all]"
+
+# Install flash_mla
+git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
+cd flash-mla
+git submodule update --init --recursive
+pip install -v .
+```
+
+## Launch DeepSeek V3.2 with SGLang
+
+To serve DeepSeek-V3.2-Exp on 8xH200/B200 GPUs:
+
+```bash
+# Launch with TP + DP
+python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
+
+# Launch with EP + DP
+python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention
+```
+
+### Configuration Tips
+- **DP Attention**: For the DeepSeek V3.2 model, the kernels are customized for the `dp_size=8` use case, so launching with `--dp 8 --enable-dp-attention` (as in the commands above) is recommended.
+- **Choices of Attention Kernels**: The attention backend is automatically set to the `nsa` backend for the DeepSeek V3.2 model. This backend implements different kernels for sparse prefilling/decoding, which can be selected with the `--nsa-prefill-backend` and `--nsa-decode-backend` server arguments. The available NSA prefill/decode kernels are:
+  - `flashmla_sparse`: the `flash_mla_sparse_fwd` kernel from the `flash_mla` library. Runs on both Hopper and Blackwell GPUs.
+  - `flashmla_kv`: the `flash_mla_with_kvcache` kernel from the `flash_mla` library. Runs on both Hopper and Blackwell GPUs.
+  - `fa3`: the `flash_attn_with_kvcache` kernel from the `flash_attn` library. Runs only on Hopper GPUs.
+  - `tilelang`: a `tilelang` implementation that can run on GPU, HPU and NPU.
+  - `alter`: the Alter kernel on AMD GPUs. Can only be used as the decode kernel.
+- Based on performance benchmarks, the default configurations on H200 and B200 are:
+  - H200: `flashmla_sparse` prefill attention, `fa3` decode attention, `bf16` KV cache dtype.
+  - B200: `flashmla_kv` prefill attention, `flashmla_kv` decode attention, `fp8_e4m3` KV cache dtype.
+  - Currently we don't enable `prefill=flashmla_sparse` with `decode=flashmla_kv` because of the latency introduced by KV cache quantization operations. In the future we may switch to this setting once the attention/quantization kernels are optimized.
+
+### Multi-token Prediction
+SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, decoding speed can be improved significantly at small batch sizes. See [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
+
+Example usage:
+```bash
+python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
+```
+- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be found with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve a speedup for larger batch sizes.
+- The default value of `--max-running-requests` is `48` for MTP. For larger batch sizes, increase this value beyond the default.
+
+
+## Function Calling and Reasoning Parser
+Function calling and the reasoning parser work the same way as for DeepSeek V3.1. Please refer to the [Reasoning Parser](https://docs.sglang.ai/advanced_features/separate_reasoning.html) and [Tool Parser](https://docs.sglang.ai/advanced_features/tool_parser.html) documents.
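As a concrete illustration of the shared V3.1/V3.2 usage, the request below enables thinking through `chat_template_kwargs` and asks the server to return the reasoning separately. The endpoint, port, and payload values are assumptions for this sketch: they presume a server launched with `--reasoning-parser deepseek-v3`, so adjust them to your deployment.

```shell
# Hypothetical request showing the `thinking` parameter for DeepSeek V3.1/V3.2.
# Assumes a server launched with --reasoning-parser deepseek-v3 on port 30000.
cat > /tmp/dsv32_req.json <<'EOF'
{
  "model": "deepseek-ai/DeepSeek-V3.2-Exp",
  "messages": [{"role": "user", "content": "How many primes are smaller than 20?"}],
  "chat_template_kwargs": {"thinking": true},
  "separate_reasoning": true,
  "max_tokens": 256
}
EOF
# Sanity-check that the payload is well-formed JSON before sending it:
python3 -m json.tool /tmp/dsv32_req.json > /dev/null && echo "payload ok"
# Send it once the server is up; the reply then carries the reasoning in
# `reasoning_content`, separate from `content`:
# curl -s http://127.0.0.1:30000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/dsv32_req.json
```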
+
+## PD Disaggregation
+
+Prefill command:
+```bash
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3.2-Exp \
+    --disaggregation-mode prefill \
+    --host $LOCAL_IP \
+    --port $PORT \
+    --tp 8 \
+    --dp 8 \
+    --enable-dp-attention \
+    --dist-init-addr ${HOST}:${DIST_PORT} \
+    --trust-remote-code \
+    --disaggregation-bootstrap-port 8998 \
+    --mem-fraction-static 0.9
+```
+
+Decode command:
+```bash
+python -m sglang.launch_server \
+    --model-path deepseek-ai/DeepSeek-V3.2-Exp \
+    --disaggregation-mode decode \
+    --host $LOCAL_IP \
+    --port $PORT \
+    --tp 8 \
+    --dp 8 \
+    --enable-dp-attention \
+    --dist-init-addr ${HOST}:${DIST_PORT} \
+    --trust-remote-code \
+    --mem-fraction-static 0.9
+```
+
+Router command:
+```bash
+python -m sglang_router.launch_router --pd-disaggregation \
+    --prefill $PREFILL_ADDR 8998 \
+    --decode $DECODE_ADDR \
+    --host 127.0.0.1 \
+    --port 8000
+```
+
+For more advanced or production-ready deployment methods, such as RBG- or LWS-based deployment, please refer to [references/multi_node_deployment/rbg_pd/deepseekv32_pd.md](../references/multi_node_deployment/rbg_pd/deepseekv32_pd.md). That document also includes startup commands for DeepEP-based EP parallelism.
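Once all three processes are up, a quick smoke test through the router can confirm the disaggregated path end to end. The router address below is an assumption based on the router command above; adjust it to your environment.

```shell
# Smoke test for the PD deployment. ROUTER defaults to the host/port used in
# the router command above (an assumption; override it for your setup).
ROUTER="${ROUTER:-http://127.0.0.1:8000}"
if curl -sf "$ROUTER/health" > /dev/null; then
  echo "router healthy"
  # One short completion exercises the prefill -> decode handoff:
  curl -s "$ROUTER/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-V3.2-Exp", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
else
  echo "router not reachable yet"
fi
```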
+
+
+## Benchmarking Results
+
+### Accuracy Test with `gsm8k`
+A simple accuracy benchmark can be run with the `gsm8k` dataset:
+```bash
+python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
+```
+
+The resulting accuracy is 0.956, which matches our expectation:
+```bash
+Accuracy: 0.956
+Invalid: 0.000
+Latency: 25.109 s
+Output throughput: 5226.235 token/s
+```
+
+
+### Accuracy Test with `gpqa-diamond`
+
+Long-context accuracy can be benchmarked on the GPQA-Diamond dataset with long output tokens and thinking enabled:
+```bash
+python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
+```
+
+The mean accuracy over 8 runs is 0.797, which matches the 79.9 reported in the official tech report:
+```bash
+Repeat: 8, mean: 0.797
+Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793']
+```
diff --git a/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd.md b/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd.md
new file mode 100644
index 000000000000..30ff16ebcba7
--- /dev/null
+++ b/docs/references/multi_node_deployment/rbg_pd/deepseekv32_pd.md
@@ -0,0 +1,570 @@
+# DeepSeek-V3.2-Exp RBG-Based PD Deployment
+
+## 0. Prerequisites
+
+1. Kubernetes >= 1.26
+2. LWS installed on the cluster.
+3. RBG installed on the cluster.
+
+For RBG installation, please refer to: https://github.com/sgl-project/rbg
+
+## 1. Image Preparation
+
+`lmsysorg/sglang:latest`
+
+
+## 2. 
All-in-One Manifest File
+
+*Note: Adjust the `nodeSelector`, model location, and taint toleration sections to match your actual deployment environment.*
+
+rbg-dsv32.yml
+
+```yaml
+apiVersion: workloads.x-k8s.io/v1alpha1
+kind: RoleBasedGroup
+metadata:
+  name: deepseek-rbg-32exp
+  namespace: default
+spec:
+  roles:
+  - name: prefill
+    replicas: 1
+    workload:
+      apiVersion: leaderworkerset.x-k8s.io/v1
+      kind: LeaderWorkerSet
+    restartPolicy: None
+    leaderWorkerSet:
+      size: 1
+      patchLeaderTemplate:
+        metadata:
+          labels:
+            role: leader
+            pd_role: prefill
+        spec:
+          containers:
+          - command:
+            - python3
+            - -m
+            - sglang.launch_server
+            - --model-path
+            - /work/models
+            - --port
+            - "30000"
+            - --host
+            - 0.0.0.0
+            - --disable-radix-cache
+            - --disaggregation-ib-device
+            - mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
+            - --chunked-prefill-size
+            - "131072"
+            - --page-size
+            - "64"
+            # - --enable-eplb
+            - --ep-dispatch-algorithm
+            - dynamic
+            - --eplb-algorithm
+            - deepseek
+            - --enable-dp-lm-head
+            - --enable-dp-attention
+            - --dp-size
+            - "8"
+            - --moe-a2a-backend
+            - deepep
+            - --deepep-mode
+            - normal
+            - --disaggregation-mode
+            - prefill
+            - --mem-fraction-static
+            - "0.8"
+            - --max-prefill-tokens
+            - "32768"
+            - --context-length
+            - "32768"
+            - --tp
+            - "8"
+            - --dist-init-addr
+            - $(LWS_LEADER_ADDRESS):20102
+            - --nnodes
+            - $(LWS_GROUP_SIZE)
+            - --node-rank
+            - $(LWS_WORKER_INDEX)
+            - --trust-remote-code
+            - --ep-num-redundant-experts
+            - "32"
+            - --moe-dense-tp-size
+            - "1"
+            - --max-running-requests
+            - "1024"
+            env:
+            - name: LWS_WORKER_INDEX
+              valueFrom:
+                fieldRef:
+                  fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
+            livenessProbe:
+              failureThreshold: 3000
+              httpGet:
+                path: /health
+                port: 30000
+              initialDelaySeconds: 300
+              periodSeconds: 60
+              successThreshold: 1
+              timeoutSeconds: 10
+            readinessProbe:
+              failureThreshold: 20
+              httpGet:
+                path: /health
+                
port: 30000 + periodSeconds: 30 + successThreshold: 1 + timeoutSeconds: 10 + name: sglang + ports: + - containerPort: 30000 + name: sglang-http + protocol: TCP + + patchWorkerTemplate: {} + template: + metadata: + labels: + inference-framework: sglang + inference-stack.io/monitoring: "enabled" + spec: + containers: + - name: sglang + image: lmsysorg/sglang:latest + env: + - name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK + value: "1" + - name: CUDA_LAUNCH_BLOCKING + value: "0" + - name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT + value: "1000000000" + - name: NVSHMEM_IB_TRAFFIC_CLASS + value: "16" + - name: NVSHMEM_DISABLE_P2P + value: "0" + - name: ENABLE_METRICS + value: "true" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + - name: NVSHMEM_IB_SL + value: "5" + - name: SGLANG_SET_CPU_AFFINITY + value: "true" + - name: SGL_ENABLE_JIT_DEEPGEMM + value: "1" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: "none" + - name: NCCL_IB_TC + value: "136" + - name: NCCL_IB_SL + value: "5" + - name: NCCL_IB_TIMEOUT + value: "22" + - name: NCCL_IB_GID_INDEX + value: "3" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: NCCL_SOCKET_IFNAME + value: bond0 + - name: GLOO_SOCKET_IFNAME + value: bond0 + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME + value: "bond0" + - name: MC_TE_METRIC + value: "false" + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang + name: src + + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + volumes: + - hostPath: + path: /var/run/sys-topology + name: 
topo + - hostPath: + path: /data1/sgl_cache4 + type: DirectoryOrCreate + name: sgl-cache + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /data/DeepSeek-V3.2-Exp + name: model + - hostPath: + path: /dev/infiniband + name: ib + - hostPath: + path: /data/src/sglang + type: DirectoryOrCreate + name: src + + - name: decode + replicas: 1 + workload: + apiVersion: leaderworkerset.x-k8s.io/v1 + kind: LeaderWorkerSet + leaderWorkerSet: + size: 1 + patchLeaderTemplate: + metadata: + labels: + role: leader + pd_role: decode + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --port + - "30000" + - --trust-remote + - --host + - 0.0.0.0 + - --disaggregation-ib-device + - mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 + - --chunked-prefill-size + - "131072" + - --prefill-round-robin-balance + - --eplb-rebalance-layers-per-chunk + - "29" + - --page-size + - "64" + - --enable-dp-attention + - --enable-dp-lm-head + - --dp-size + - "8" + - --moe-a2a-backend + - deepep + - --deepep-mode + - low_latency + - --disaggregation-mode + - decode + - --mem-fraction-static + - "0.8" + - --context-length + - "32768" + - --max-running-requests + - "2048" + - --tp-size + - "8" # Size of Tensor Parallelism + - --cuda-graph-max-bs + - "16" + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + env: + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + livenessProbe: + failureThreshold: 30000 + httpGet: + path: /health + port: 30000 + initialDelaySeconds: 300 + periodSeconds: 60 + successThreshold: 1 + timeoutSeconds: 10 + name: sglang + readinessProbe: + failureThreshold: 20 + httpGet: + path: /health + port: 30000 + periodSeconds: 30 + successThreshold: 1 + 
timeoutSeconds: 10 + patchWorkerTemplate: + spec: + containers: + - command: + - python3 + - -m + - sglang.launch_server + - --model-path + - /work/models + - --crash-dump-folder + - /log + - --chunked-prefill-size + - "262144" + - --prefill-round-robin-balance + - --eplb-rebalance-layers-per-chunk + - "29" + - --page-size + - "64" + - --enable-dp-attention + - --enable-dp-lm-head + - --dp-size + - "32" + - --moe-a2a-backend + - "deepep" + - --deepep-mode + - low_latency + - --disaggregation-mode + - decode + - --mem-fraction-static + - "0.849" + - --context-length + - "32768" + - --disaggregation-ib-device + - mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 + - --max-running-requests + - "4096" + - --cuda-graph-max-bs + - "16" + - --tp-size + - "8" # Size of Tensor Parallelism + - --dist-init-addr + - $(LWS_LEADER_ADDRESS):20102 + - --nnodes + - $(LWS_GROUP_SIZE) + - --node-rank + - $(LWS_WORKER_INDEX) + - --trust-remote-code + - --ep-num-redundant-experts + - "32" + - --moe-dense-tp-size + - "1" + env: + - name: LWS_WORKER_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index'] + name: sglang + template: + metadata: + labels: + inference-framework: sglang-unuse + inference-stack.io/monitoring: "enabled" + spec: + containers: + - image: lmsysorg/sglang:latest + name: sglang + resources: + limits: + nvidia.com/gpu: "8" + securityContext: + capabilities: + add: + - IPC_LOCK + privileged: true + volumeMounts: + - mountPath: /root/.cache + name: sgl-cache + - mountPath: /dev/shm + name: dshm + - mountPath: /work/models + name: model + - mountPath: /dev/infiniband + name: ib + - mountPath: /sgl-workspace/sglang + name: src + env: + - name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK + value: "1" + - name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT + value: "100000000" + - name: NVSHMEM_DISABLE_P2P + value: "0" + - name: NVSHMEM_IB_TRAFFIC_CLASS + value: "16" + - name: NVSHMEM_IB_SL + value: "5" + - name: ENABLE_METRICS + 
value: "true" + - name: CUDA_LAUNCH_BLOCKING + value: "0" + - name: NVSHMEM_IB_GID_INDEX + value: "3" + - name: NCCL_IB_QPS_PER_CONNECTION + value: "8" + - name: NCCL_IB_SPLIT_DATA_ON_QPS + value: "1" + - name: NCCL_NET_PLUGIN + value: "none" + - name: NCCL_IB_TC + value: "136" + - name: NCCL_IB_SL + value: "5" + - name: NCCL_IB_TIMEOUT + value: "22" + - name: NCCL_IB_GID_INDEX + value: "3" + - name: NCCL_MIN_NCHANNELS + value: "4" + - name: NCCL_SOCKET_IFNAME + value: bond0 + - name: GLOO_SOCKET_IFNAME + value: bond0 + - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME + value: "bond0" + - name: NCCL_IB_HCA + value: ^=mlx5_0,mlx5_5,mlx5_6 + - name: MC_TE_METRIC + value: "false" + - name: SGL_ENABLE_JIT_DEEPGEMM + value: "1" + dnsPolicy: ClusterFirstWithHostNet + hostIPC: true + hostNetwork: true + nodeSelector: + pd: "yes" + tolerations: + - key: pd + operator: Exists + volumes: + - hostPath: + path: /var/run/sys-topology + name: topo + - hostPath: + path: /data1/sgl_cache4 + type: DirectoryOrCreate + name: sgl-cache + - hostPath: + path: /data/src/sglang + type: DirectoryOrCreate + name: src + - emptyDir: + medium: Memory + name: dshm + - hostPath: + path: /data/DeepSeek-V3.2-Exp + name: model + - hostPath: + path: /dev/infiniband + name: ib + - name: router + replicas: 1 + dependencies: [ "decode", "prefill" ] + template: + spec: + containers: + - name: scheduler + image: lmsysorg/sglang:latest + command: + - sh + - -c + - > + python3 -m sglang_router.launch_router + --host 0.0.0.0 + --port 8080 + --pd-disaggregation + --policy random + --service-discovery + --service-discovery-namespace ${NAMESPACE} + --service-discovery-port 30000 + --prefill-selector pd_role=prefill + --decode-selector pd_role=decode + --max-payload-size 2147483648 + --worker-startup-timeout-secs 1200 + env: + - name: NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace +--- +apiVersion: v1 +kind: Service +metadata: + labels: + app: deepseek-rbg-32exp + name: 
deepseek-rbg-32exp
+  namespace: default
+spec:
+  ports:
+  - name: http
+    port: 8080
+    protocol: TCP
+    targetPort: 8080
+    nodePort: 30080
+
+  selector:
+    rolebasedgroup.workloads.x-k8s.io/name: deepseek-rbg-32exp
+    rolebasedgroup.workloads.x-k8s.io/role: router
+  type: NodePort
+
+```
+
+```bash
+[root@ecs-001]# kubectl get po -n default
+deepseek-rbg-32exp-decode-main-0      1/1   Running   0   74m
+deepseek-rbg-32exp-decode-0-1         1/1   Running   0   74m
+deepseek-rbg-32exp-router-9c5dbfc57   1/1   Running   0   22m
+deepseek-rbg-32exp-prefill-0          1/1   Running   0   74m
+
+[root@ecs-cbm-x1-pd-cpu-001 main_doc]# kubectl get svc |grep dee
+deepseek-rbg-32exp-decode           ClusterIP   None                             97m
+deepseek-rbg-32exp-router-service   NodePort    172.16.242.169   8000:30800/TCP  22m
+deepseek-rbg-32exp-prefill          ClusterIP   None                             97m
+```
+
+You can now reach the service through NodePort `30800` on any node's IP:
+
+```bash
+[root@ecs-001]# curl -X POST "http://{node_ip}:30800/v1/chat/completions" \
+> -H "Content-Type: application/json" \
+> -H "Authorization: Bearer None" \
+> -d '{
+> "rid":"ccccdd",
+> "model": "dsv32",
+> "messages": [
+> {"role": "system", "content": "0: You are a helpful AI assistant"},
+> {"role": "user", "content": "你是谁?."}
+> ],
+> "max_tokens":221
+> }'
+{"id":"ccccdd","object":"chat.completion","created":1750252498,"model":"qwen2","choices":[{"index":0,"message":{"role":"assistant","content":"\n嗯,用户问了一个很基础的自我介绍问题"你是谁?"。这可能是第一次互动时的常规开场白,也可能是想确认我的身份和功能范围。\n\n用户没有提供任何背景信息,语气简洁中性。这种场景下新用户的可能性较高,需要给出清晰友好的自我介绍,同时突出实用价值来降低陌生感。\n\n考虑到中文用户,应该用简体中文回复。重点要说明三点:身份归属(深度求索)、功能定位(AI助手)、服务范围(学习/工作/生活)。结尾用开放性问题引导对话很关键——既能了解需求,又能避免让用户面对空白输入框时不知所措。\n\n用波浪线结尾可以软化语气,那个笑脸表情😊刚好能中和AI的机械感。不过要控制表情符号数量,避免显得轻浮。\n\n你好呀!我是你的AI助手,由深度求索公司(DeepSeek)开发的语言模型,名字叫 **DeepSeek-V32**。你可以把我当成一个知识丰富、随叫随到的小帮手~😊\n\n我的任务就是陪你聊天、解答问题、","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":14,"total_tokens":235,"completion_tokens":221,"prompt_tokens_details":null}}
+
+```
+
+## FAQ
+
+1. 
The current startup parameters may not be compatible with every RDMA environment; different RDMA/NCCL environment variables may be required depending on your network setup.
+
+2. Please ensure that the SGLang code in the image includes the changes from [PR #10912](https://github.com/sgl-project/sglang/pull/10912).