feat(sa-bench): add sglang DeepSeek-V4 tokenizer (depends on #71) #72
Closed
YAMY1234 wants to merge 19 commits into NVIDIA:main from
Conversation
Auto-detect the container type at runtime: if /sgl-workspace exists (SGLang), use the original install path unchanged; otherwise use a portable /tmp build path with conditional dependency installation for non-SGLang containers.
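The detection above can be sketched as a small helper (a minimal sketch; the function name and the `/tmp` build directory are illustrative, only the `/sgl-workspace` marker comes from the PR):

```python
import os

def resolve_build_paths():
    # Hypothetical helper sketching the runtime container detection:
    # the /sgl-workspace marker identifies an SGLang container.
    if os.path.isdir("/sgl-workspace"):
        # SGLang container: keep the original install path unchanged.
        return {"build_dir": "/sgl-workspace", "install_deps": False}
    # Non-SGLang container: portable /tmp build path, install deps on demand.
    return {"build_dir": "/tmp/sa-bench-build", "install_deps": True}
```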
* Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host
  - Add kimi-k2.5 1k1k and 8k1k disagg GB200 recipes (from NVIDIA#7)
  - Fix vLLM NIXL handshake failures: set VLLM_NIXL_SIDE_CHANNEL_HOST to the node's routable IP in get_process_environment() instead of leaving it as 0.0.0.0/localhost, which caused transfer handshake failures
  - Update test_vllm_get_process_environment to cover the NIXL host env var
* ci: run checks on PRs targeting sa-submission-q2-2026

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
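The NIXL host fix amounts to resolving the node's routable address instead of passing 0.0.0.0. A minimal sketch (the helper names and the fallback are assumptions, not the actual `get_process_environment` implementation):

```python
import socket

def get_routable_ip():
    # Sketch: pick the node's outbound IPv4 address for
    # VLLM_NIXL_SIDE_CHANNEL_HOST instead of 0.0.0.0/localhost.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP connect() selects the outbound interface without sending packets.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        # Fallback (assumption): resolve the hostname locally.
        return socket.gethostbyname(socket.gethostname())
    finally:
        s.close()

def build_process_environment(base_env=None):
    # Illustrative merge of the side-channel host into the process env.
    env = dict(base_env or {})
    env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_routable_ip()
    return env
```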
NVIDIA#24)
* Add Kimi K2.5 disagg STP and MTP recipes for GB200 NVfp4 (ISL8K_OSL1K and ISL1K_OSL1K)
  Add optimized disaggregated inference recipes for the Kimi K2.5 model with NVfp4 precision on GB200 GPUs. Includes both STP and MTP configurations for ISL8K_OSL1K and ISL1K_OSL1K workloads covering concurrency points from 5 to 2253, with Eagle speculative decoding for the MTP variants.
* Update Kimi K2.5 recipes: container, model path, concurrency format, and env cleanup
  - Update container to tensorrtllm-runtime-1.1.0-dev.2.sqsh
  - Point model path to shared /mnt/lustre01/models/kimi-k2.5-nvfp4
  - Update Eagle model mount path for MTP configs
  - Remove HF_HOME (defaults to ~/.cache/huggingface)
  - Fix concurrency separator from space to 'x' for sa-bench compatibility
  - Enable multiple frontends for ctx1dep4_gen1dep32_batch64
* Use generic model path and container aliases for cluster portability
  Replace cluster-specific paths with generic alias names that are resolved via srtslurm.yaml model_paths and containers mappings, per upstream convention.
* Add extra_mount alias resolution and use generic Eagle model path
  Add model_paths alias resolution for extra_mount host paths in config.py, enabling MTP recipes to use the generic name "kimi-k2.5-eagle3" instead of a cluster-specific path for the Eagle speculative decoding model.
* Use HuggingFace model names and full NVCR container paths
  Per review feedback, update model paths to HuggingFace format (nvidia/Kimi-K2.5-NVFP4) and the container to the full NVCR registry path (nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2) so recipes are portable and work without pre-built sqsh files.

Co-authored-by: nlevin-ui <nlevin@nvidia.com>
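The extra_mount alias resolution described above can be sketched as follows (a minimal sketch: the function signature is ours, not the actual config.py code; only the alias-lookup behavior is from the commit message):

```python
def resolve_extra_mounts(extra_mounts, model_paths):
    # Sketch of alias resolution for extra_mount entries: if the host side
    # of a "host:container" mount matches a model_paths alias, substitute
    # the real cluster path; otherwise leave the entry untouched.
    resolved = []
    for mount in extra_mounts:
        host, sep, container = mount.partition(":")
        host = model_paths.get(host, host)
        resolved.append(f"{host}:{container}" if sep else host)
    return resolved
```

For example, with `model_paths = {"kimi-k2.5-eagle3": "/mnt/lustre01/models/kimi-k2.5-eagle3"}`, a recipe can mount `kimi-k2.5-eagle3:/models/eagle3` without hard-coding the cluster path.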
* recipes for MiniMax M2.5 FP4 B200 agg vLLM
* commit for signature
* Add lm-eval benchmark runner for InferenceX evals
  Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX benchmark_lib.sh harness.
…NVIDIA#47)
* fix tokenizer for glm5 (NVIDIA#20)
* add nvidia pre-release url (NVIDIA#22)
Add 66 GLM5 NVFP4 disaggregated recipe configs for GB200 and GB300 on the sa-submission branch; standardize model path and container values across the recipe set for consistency.
Extends PR NVIDIA#71's 'deepseek_v4' custom_tokenizer option with an sglang backend path. Because sglang has no client-side DeepseekV4Tokenizer package equivalent to vllm.tokenizers.deepseek_v4, we vendor sglang's own server-side encoder (encoding_dsv4.py from sgl-project/sglang PR #23600, commit f5d03db) under sa-bench/tokenizers/ so the client renders the exact same DSML prompt the sglang server builds. This lets input_tokens on the client match #new-token on the server.

Routing is backend-aware with no implicit fallback:
  backend=sglang -> SGLangDeepseekV4Tokenizer (vendored)
  backend=vllm   -> vllm.tokenizers.deepseek_v4.DeepseekV4Tokenizer (PR NVIDIA#71)
  else           -> explicit ValueError

Depends on NVIDIA#71 (vllm deepseek_v4 path). Recipes for DeepSeek-V4-Pro are handled separately in NVIDIA#70.
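The routing table above boils down to an explicit dispatch with no fallback branch. A simplified sketch (the real logic lives in `get_tokenizer` inside `backend_request_func.py`; this standalone signature is ours):

```python
def get_deepseek_v4_tokenizer(backend, hf_tokenizer):
    # Illustrative backend-aware dispatch; imports are deferred so each
    # branch only needs its own backend installed.
    if backend == "sglang":
        from tokenizers.sglang_deepseek_v4 import SGLangDeepseekV4Tokenizer
        return SGLangDeepseekV4Tokenizer(hf_tokenizer)
    if backend == "vllm":
        from vllm.tokenizers.deepseek_v4 import DeepseekV4Tokenizer
        return DeepseekV4Tokenizer(hf_tokenizer)
    # No implicit fallback: any other backend is rejected outright.
    raise ValueError(
        f"custom_tokenizer=deepseek_v4 is not supported for backend {backend!r}")
```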
Superseded by a clean PR based on main; see follow-up.
Summary
Extends #71's `deepseek_v4` custom_tokenizer option with an sglang backend path, so `sa-bench` can accurately tokenize DeepSeek-V4-Pro prompts for sglang servers the same way it already does for vllm servers in #71.

Why this is needed. DeepSeek-V4 ships no Hugging Face chat template, so `tokenizer.apply_chat_template()` raises `ValueError`. Sglang's server solves this internally by replacing the HF template path with a hard-coded DSML encoder (`encoding_dsv4.encode_messages`) whenever `arch == "DeepseekV4ForCausalLM"`; see sgl-project/sglang#23600. Without a matching client-side encoder, the `sa-bench` tokenizer silently falls back to raw-text tokenization, so `input_tokens` reported by the client no longer matches the server's `#new-token`. This in turn skews ISL, TPOT, TTFT, and MTP accept-rate accounting.

What we add. There is no client-side `DeepseekV4Tokenizer` package on the sglang side (vllm has `vllm.tokenizers.deepseek_v4`; sglang has none). So we vendor sglang's own server-side encoder (`encoding_dsv4.py` from sgl-project/sglang PR #23600, commit `f5d03db853862c8fb0e805df591bed883a71868b`) under `sa-bench/tokenizers/` and wrap it in an HF-compatible `SGLangDeepseekV4Tokenizer.apply_chat_template()`. The wrapper mirrors exactly what the sglang server does:

- Inserts a default `system` message if missing (matches `serving_chat._resolve_chat_encoding_spec`).
- Uses `thinking_mode="chat"`, `reasoning_effort=None` (matches sglang defaults).
- Calls `encode_messages(...)` to render the raw DSML string.
- Tokenizes with `hf_tokenizer.encode(..., add_special_tokens=False)` (the encoder already adds `<|begin▁of▁sentence|>`).

The vendored file carries an Apache-2.0 header and the upstream commit SHA so it can be dropped when sglang publishes an official client-side package.
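The wrapper's steps can be sketched as a thin class (a minimal sketch: the constructor shape and the exact `encode_messages` keyword signature are assumptions; the real wrapper lives in `sglang_deepseek_v4.py`):

```python
class SGLangDeepseekV4Tokenizer:
    """Sketch of the HF-compatible wrapper around the vendored DSML encoder."""

    def __init__(self, hf_tokenizer, encode_messages):
        self.hf_tokenizer = hf_tokenizer
        self.encode_messages = encode_messages  # vendored sglang encoder

    def apply_chat_template(self, messages, **kwargs):
        # 1. Insert a default system message if the caller omitted one.
        if not messages or messages[0].get("role") != "system":
            messages = [{"role": "system", "content": ""}] + list(messages)
        # 2-3. Render the raw DSML prompt with sglang's server defaults.
        prompt = self.encode_messages(
            messages, thinking_mode="chat", reasoning_effort=None)
        # 4. The encoder already emits <|begin▁of▁sentence|>, so the HF
        #    tokenizer must not add special tokens again.
        return self.hf_tokenizer.encode(prompt, add_special_tokens=False)
```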
Routing is backend-aware, with no implicit fallback. Per reviewer preference, we do not fall back to vllm if an sglang install is present or vice versa; ambiguity is rejected:

- `backend=sglang` -> `SGLangDeepseekV4Tokenizer` (vendored)
- `backend=vllm` -> `vllm.tokenizers.deepseek_v4.DeepseekV4Tokenizer` (PR #71)
- anything else -> explicit `ValueError`

`benchmark_serving.py` passes `args.backend` into `get_tokenizer(...)`.

Files
- `src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py`: new package
- `src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py`: vendored (Apache-2.0), 840 lines, unmodified upstream content
- `src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py`: HF-compatible wrapper
- `backend_request_func.py`: adds a `backend` kwarg to `get_tokenizer`, splits `deepseek_v4` into explicit sglang / vllm branches
- `benchmark_serving.py`: plumbs `args.backend` through

Relationship to other PRs
- Depends on #71 (`aflowers/dsv4-pr67-pr68`): that PR introduces the `deepseek_v4` custom_tokenizer option for vllm; this PR extends it for sglang.

Test plan

- `python -m py_compile` passes on all three new / modified files.
- `from tokenizers.sglang_deepseek_v4 import SGLangDeepseekV4Tokenizer` succeeds with the vendored encoder present.
- `SGLangDeepseekV4Tokenizer(...).apply_chat_template(messages)` token IDs match server-side `tokenizer.encode(encode_messages(messages))` for a representative GSM8K prompt.
- `sa-bench` run against `recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml` (from "feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg)" #70) with this tokenizer, asserting client `input_tokens` == server `#new-token`.
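The final assertion in the test plan reduces to a simple parity check between the client-side and server-side counts (a sketch; the function and argument names here are illustrative, not part of the sa-bench code):

```python
def check_token_parity(client_input_tokens, server_new_tokens):
    # Sketch of the test-plan assertion: the client's reported input_tokens
    # must equal the server's #new-token count, otherwise the run's ISL,
    # TPOT, TTFT, and MTP accept-rate numbers are suspect.
    if client_input_tokens != server_new_tokens:
        raise AssertionError(
            f"client input_tokens={client_input_tokens} != "
            f"server #new-token={server_new_tokens}")
    return True
```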