
feat(sa-bench): add sglang DeepSeek-V4 tokenizer (depends on #71)#72

Closed
YAMY1234 wants to merge 19 commits into NVIDIA:main from YAMY1234:yangminl/dsv4-sglang-tokenizer

Conversation

@YAMY1234
Collaborator

Summary

Extends #71's deepseek_v4 custom_tokenizer option with an sglang backend path, so sa-bench can tokenize DeepSeek-V4-Pro prompts for sglang servers as accurately as it already tokenizes them for vllm servers.

Why this is needed. DeepSeek-V4 ships no Hugging Face chat template, so tokenizer.apply_chat_template() raises ValueError. Sglang's server solves this internally by replacing the HF template path with a hard-coded DSML encoder (encoding_dsv4.encode_messages) whenever arch == "DeepseekV4ForCausalLM" — see sgl-project/sglang#23600. Without a matching client-side encoder, the sa-bench tokenizer silently falls back to raw-text tokenization, so input_tokens reported by the client no longer matches the server's #new-token. This in turn skews ISL, TPOT, TTFT, and MTP accept-rate accounting.
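The failure mode can be sketched as follows. NoTemplateTokenizer and tokenize_prompt are hypothetical stand-ins (neither name comes from sa-bench or transformers); they only illustrate the silent raw-text fallback described above:

```python
# Hypothetical stand-ins: NoTemplateTokenizer mimics an HF tokenizer that
# ships no chat template, and tokenize_prompt mimics a client that falls
# back silently when the template path fails.
class NoTemplateTokenizer:
    def apply_chat_template(self, messages, tokenize=True):
        # transformers raises ValueError when no chat template is set.
        raise ValueError("tokenizer.chat_template is not set")

    def encode(self, text):
        # Toy whitespace "tokenizer", standing in for real BPE encoding.
        return text.split()

def tokenize_prompt(tokenizer, messages):
    try:
        return tokenizer.apply_chat_template(messages, tokenize=True)
    except ValueError:
        # Silent raw-text fallback: this token count diverges from the
        # server's DSML-rendered prompt, skewing ISL/TPOT/TTFT accounting.
        return tokenizer.encode(" ".join(m["content"] for m in messages))

tokens = tokenize_prompt(NoTemplateTokenizer(), [{"role": "user", "content": "hello world"}])
# tokens is ["hello", "world"]: raw text, not the DSML prompt the server counts
```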

What we add. Sglang has no client-side DeepseekV4Tokenizer package (vllm ships vllm.tokenizers.deepseek_v4; sglang has no equivalent), so we vendor sglang's own server-side encoder (encoding_dsv4.py from sgl-project/sglang PR #23600, commit f5d03db853862c8fb0e805df591bed883a71868b) under sa-bench/tokenizers/ and wrap it in an HF-compatible SGLangDeepseekV4Tokenizer.apply_chat_template(). The wrapper mirrors exactly what the sglang server does:

  1. Insert an empty system message if missing (matches serving_chat._resolve_chat_encoding_spec).
  2. Default thinking_mode="chat", reasoning_effort=None (matches sglang defaults).
  3. Call vendored encode_messages(...) to render the raw DSML string.
  4. Encode with hf_tokenizer.encode(..., add_special_tokens=False) (the encoder already adds <|begin▁of▁sentence|>).
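The four steps above can be sketched with toy stand-ins. Here encode_messages is a stub for the vendored encoder and FakeHFTokenizer for a real Hugging Face tokenizer; only the class name and the apply_chat_template flow mirror the actual wrapper:

```python
# Toy stub for the vendored sglang encoder: the real one renders the full
# DSML prompt and emits <|begin_of_sentence|> itself (abbreviated "<bos>").
def encode_messages(messages, thinking_mode="chat", reasoning_effort=None):
    return "<bos>" + "".join(f"<{m['role']}>{m['content']}" for m in messages)

class FakeHFTokenizer:
    def encode(self, text, add_special_tokens=True):
        # Toy encoding: one "token id" per character; a BOS id when asked.
        ids = [ord(c) for c in text]
        return ids if not add_special_tokens else [0] + ids

class SGLangDeepseekV4Tokenizer:
    def __init__(self, hf_tokenizer):
        self.hf_tokenizer = hf_tokenizer

    def apply_chat_template(self, messages, thinking_mode="chat", reasoning_effort=None):
        # Step 1: prepend an empty system message if the caller omitted one.
        if not messages or messages[0]["role"] != "system":
            messages = [{"role": "system", "content": ""}] + list(messages)
        # Steps 2-3: render the raw DSML string with sglang's defaults.
        rendered = encode_messages(messages, thinking_mode=thinking_mode,
                                   reasoning_effort=reasoning_effort)
        # Step 4: encode without special tokens; the encoder already added BOS.
        return self.hf_tokenizer.encode(rendered, add_special_tokens=False)
```

With these stubs, apply_chat_template([{"role": "user", "content": "hi"}]) encodes the string "<bos><system><user>hi", i.e. the system slot is filled and BOS is not doubled.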

The vendored file carries an Apache-2.0 header and the upstream commit SHA so it can be dropped once sglang publishes an official client-side package.

Routing is backend-aware, with no implicit fallback. Per reviewer preference, we never fall back from the sglang path to the vllm path (or vice versa) based on what happens to be installed; ambiguity is rejected:

if backend == "sglang":
    return SGLangDeepseekV4Tokenizer.from_pretrained(...)
if backend in (None, "vllm"):
    return vllm.tokenizers.deepseek_v4.DeepseekV4Tokenizer.from_pretrained(...)
raise ValueError(f"custom_tokenizer='deepseek_v4' does not support backend={backend!r}")

benchmark_serving.py passes args.backend into get_tokenizer(...).
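A hedged sketch of that plumbing follows. The make_* factory names are placeholders (the real code constructs actual tokenizer objects in backend_request_func.py); only get_tokenizer's routing mirrors the snippet above:

```python
# Placeholder factories: stand-ins for constructing the vendored sglang
# wrapper, the vllm DeepseekV4Tokenizer (PR #71), and a plain HF tokenizer.
def make_sglang_deepseek_v4(model_path):
    return ("sglang-dsv4", model_path)

def make_vllm_deepseek_v4(model_path):
    return ("vllm-dsv4", model_path)

def make_default_hf(model_path):
    return ("hf", model_path)

def get_tokenizer(model_path, custom_tokenizer=None, backend=None):
    if custom_tokenizer == "deepseek_v4":
        if backend == "sglang":
            return make_sglang_deepseek_v4(model_path)
        if backend in (None, "vllm"):
            return make_vllm_deepseek_v4(model_path)
        # No implicit cross-backend fallback: reject ambiguity loudly.
        raise ValueError(f"custom_tokenizer='deepseek_v4' does not support backend={backend!r}")
    return make_default_hf(model_path)
```

benchmark_serving.py then only needs to forward args.backend as the backend argument; unknown backends fail fast instead of silently picking a tokenizer.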

Files

  • src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py — new package
  • src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py — vendored (Apache-2.0), 840 lines, unmodified upstream content
  • src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py — HF-compatible wrapper
  • backend_request_func.py — adds backend kwarg to get_tokenizer, splits deepseek_v4 into explicit sglang / vllm branches
  • benchmark_serving.py — plumbs args.backend through

Relationship to other PRs

Depends on #71 (vllm deepseek_v4 path). Recipes for DeepSeek-V4-Pro are handled separately in #70.

Test plan

  • python -m py_compile passes on all three new / modified files.
  • Module import: from tokenizers.sglang_deepseek_v4 import SGLangDeepseekV4Tokenizer succeeds with the vendored encoder present.
  • Offline equivalence check: SGLangDeepseekV4Tokenizer(...).apply_chat_template(messages) token IDs match server-side tokenizer.encode(encode_messages(messages)) for a representative GSM8K prompt.
  • Online smoke: short sa-bench run against recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml (from #70, "feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg)") with this tokenizer, asserting client input_tokens == server #new-token.
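The offline equivalence check boils down to one invariant: the client wrapper must produce exactly the ids the server gets from encoding the rendered DSML string. A self-contained sketch with toy stubs (encode_messages, ToyTokenizer, and ClientWrapper are all illustrative; only the asserted invariant mirrors the real test):

```python
# Toy stub for the vendored DSML renderer.
def encode_messages(messages):
    return "".join(f"<{m['role']}>{m['content']}" for m in messages)

class ToyTokenizer:
    def encode(self, text, add_special_tokens=False):
        # Toy encoding: one "token id" per character.
        return [ord(c) for c in text]

class ClientWrapper:
    # Stand-in for SGLangDeepseekV4Tokenizer: render DSML, then encode.
    def __init__(self, hf):
        self.hf = hf

    def apply_chat_template(self, messages):
        return self.hf.encode(encode_messages(messages), add_special_tokens=False)

hf = ToyTokenizer()
messages = [{"role": "system", "content": ""}, {"role": "user", "content": "2+2?"}]
client_ids = ClientWrapper(hf).apply_chat_template(messages)
# What the server effectively tokenizes: encode(encode_messages(...)).
server_ids = hf.encode(encode_messages(messages), add_special_tokens=False)
assert client_ids == server_ids
```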

Albert Cheng (Engrg-Hardware 1) and others added 19 commits April 2, 2026 14:17
Auto-detect container type at runtime: if /sgl-workspace exists (SGLang),
use original install path unchanged; otherwise use portable /tmp build path
with conditional dependency installation for non-SGLang containers.
* Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host

- Add kimi-k2.5 1k1k and 8k1k disagg GB200 recipes (from NVIDIA#7)
- Fix vLLM NIXL handshake failures: set VLLM_NIXL_SIDE_CHANNEL_HOST to
  node's routable IP in get_process_environment() instead of leaving it
  as 0.0.0.0/localhost which caused transfer handshake failures
- Update test_vllm_get_process_environment to cover NIXL host env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: run checks on PRs targeting sa-submission-q2-2026

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA#24)

* Add Kimi K2.5 disagg STP and MTP recipes for GB200 NVfp4 (ISL8K_OSL1K and ISL1K_OSL1K)

Add optimized disaggregated inference recipes for Kimi K2.5 model with NVfp4
precision on GB200 GPUs. Includes both STP and MTP configurations for
ISL8K_OSL1K and ISL1K_OSL1K workloads covering concurrency points from 5
to 2253, with Eagle speculative decoding for MTP variants.

* Update Kimi K2.5 recipes: container, model path, concurrency format, and env cleanup

- Update container to tensorrtllm-runtime-1.1.0-dev.2.sqsh
- Point model path to shared /mnt/lustre01/models/kimi-k2.5-nvfp4
- Update Eagle model mount path for MTP configs
- Remove HF_HOME (defaults to ~/.cache/huggingface)
- Fix concurrency separator from space to 'x' for sa-bench compatibility
- Enable multiple frontends for ctx1dep4_gen1dep32_batch64

* Use generic model path and container aliases for cluster portability

Replace cluster-specific paths with generic alias names that are resolved
via srtslurm.yaml model_paths and containers mappings, as per upstream convention.

* Add extra_mount alias resolution and use generic Eagle model path

Add model_paths alias resolution for extra_mount host paths in config.py,
enabling MTP recipes to use generic name "kimi-k2.5-eagle3" instead of
cluster-specific path for the Eagle speculative decoding model.

* Use HuggingFace model names and full NVCR container paths

Per review feedback, update model paths to HuggingFace format
(nvidia/Kimi-K2.5-NVFP4) and container to full NVCR registry path
(nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2) so recipes
are portable and work without pre-built sqsh files.

---------

Co-authored-by: nlevin-ui <nlevin@nvidia.com>
* recipes for minimax m2.5 fp4 b200 agg vllm

* commit for signature
* Add lm-eval benchmark runner for InferenceX evals

Adds support for running lm-eval accuracy evaluations as a post-benchmark
step, leveraging the InferenceX benchmark_lib.sh harness.
Add 66 GLM5 NVFP4 disaggregated recipe configs for GB200 and GB300 on the sa-submission branch; standardize model path and container values across the recipe set for consistency.
Extends PR NVIDIA#71's 'deepseek_v4' custom_tokenizer option with a sglang
backend path. Because sglang has no client-side DeepseekV4Tokenizer
package equivalent to vllm.tokenizers.deepseek_v4, we vendor sglang's
own server-side encoder (encoding_dsv4.py from sgl-project/sglang
PR #23600, commit f5d03db) under sa-bench/tokenizers/ so the client
renders the exact same DSML prompt the sglang server builds. This
lets input_tokens on the client match #new-token on the server.

Routing is backend-aware with no implicit fallback:
  backend=sglang  -> SGLangDeepseekV4Tokenizer (vendored)
  backend=vllm    -> vllm.tokenizers.deepseek_v4.DeepseekV4Tokenizer (PR NVIDIA#71)
  else            -> explicit ValueError

Depends on NVIDIA#71 (vllm deepseek_v4 path).
Recipes for DeepSeek-V4-Pro are handled separately in NVIDIA#70.
@YAMY1234
Collaborator Author

Superseded by a clean PR based on main — see follow-up.

@YAMY1234 YAMY1234 closed this Apr 24, 2026
