feat(sa-bench): add sglang DeepSeek-V4 tokenizer (depends on #71) #72
Closed
YAMY1234 wants to merge 19 commits into NVIDIA:main from
Conversation
Auto-detect the container type at runtime: if /sgl-workspace exists (SGLang), use the original install path unchanged; otherwise use a portable /tmp build path with conditional dependency installation for non-SGLang containers.
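The detection above can be sketched as a small helper (a minimal sketch; the function name and the `/tmp` build directory are illustrative, only the `/sgl-workspace` marker comes from the PR):

```python
import os

def resolve_build_paths():
    # Hypothetical helper sketching the runtime container detection:
    # the /sgl-workspace marker identifies an SGLang container.
    if os.path.isdir("/sgl-workspace"):
        # SGLang container: keep the original install path unchanged.
        return {"build_dir": "/sgl-workspace", "install_deps": False}
    # Non-SGLang container: portable /tmp build path, install deps on demand.
    return {"build_dir": "/tmp/sa-bench-build", "install_deps": True}
```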
* Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host
  - Add kimi-k2.5 1k1k and 8k1k disagg GB200 recipes (from NVIDIA#7)
  - Fix vLLM NIXL handshake failures: set VLLM_NIXL_SIDE_CHANNEL_HOST to the node's routable IP in get_process_environment() instead of leaving it as 0.0.0.0/localhost, which caused transfer handshake failures
  - Update test_vllm_get_process_environment to cover the NIXL host env var
* ci: run checks on PRs targeting sa-submission-q2-2026

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
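The NIXL host fix amounts to resolving the node's routable address instead of passing 0.0.0.0. A minimal sketch (the helper names and the fallback are assumptions, not the actual `get_process_environment` implementation):

```python
import socket

def get_routable_ip():
    # Sketch: pick the node's outbound IPv4 address for
    # VLLM_NIXL_SIDE_CHANNEL_HOST instead of 0.0.0.0/localhost.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP connect() selects the outbound interface without sending packets.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        # Fallback (assumption): resolve the hostname locally.
        return socket.gethostbyname(socket.gethostname())
    finally:
        s.close()

def build_process_environment(base_env=None):
    # Illustrative merge of the side-channel host into the process env.
    env = dict(base_env or {})
    env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_routable_ip()
    return env
```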
NVIDIA#24)
* Add Kimi K2.5 disagg STP and MTP recipes for GB200 NVfp4 (ISL8K_OSL1K and ISL1K_OSL1K)
  Add optimized disaggregated inference recipes for the Kimi K2.5 model with NVfp4 precision on GB200 GPUs. Includes both STP and MTP configurations for ISL8K_OSL1K and ISL1K_OSL1K workloads covering concurrency points from 5 to 2253, with Eagle speculative decoding for the MTP variants.
* Update Kimi K2.5 recipes: container, model path, concurrency format, and env cleanup
  - Update container to tensorrtllm-runtime-1.1.0-dev.2.sqsh
  - Point model path to shared /mnt/lustre01/models/kimi-k2.5-nvfp4
  - Update Eagle model mount path for MTP configs
  - Remove HF_HOME (defaults to ~/.cache/huggingface)
  - Fix concurrency separator from space to 'x' for sa-bench compatibility
  - Enable multiple frontends for ctx1dep4_gen1dep32_batch64
* Use generic model path and container aliases for cluster portability
  Replace cluster-specific paths with generic alias names that are resolved via srtslurm.yaml model_paths and containers mappings, per upstream convention.
* Add extra_mount alias resolution and use generic Eagle model path
  Add model_paths alias resolution for extra_mount host paths in config.py, enabling MTP recipes to use the generic name "kimi-k2.5-eagle3" instead of a cluster-specific path for the Eagle speculative decoding model.
* Use HuggingFace model names and full NVCR container paths
  Per review feedback, update model paths to HuggingFace format (nvidia/Kimi-K2.5-NVFP4) and the container to the full NVCR registry path (nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2) so recipes are portable and work without pre-built sqsh files.

Co-authored-by: nlevin-ui <nlevin@nvidia.com>
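The extra_mount alias resolution described above can be sketched as follows (a minimal sketch: the function signature is ours, not the actual config.py code; only the alias-lookup behavior is from the commit message):

```python
def resolve_extra_mounts(extra_mounts, model_paths):
    # Sketch of alias resolution for extra_mount entries: if the host side
    # of a "host:container" mount matches a model_paths alias, substitute
    # the real cluster path; otherwise leave the entry untouched.
    resolved = []
    for mount in extra_mounts:
        host, sep, container = mount.partition(":")
        host = model_paths.get(host, host)
        resolved.append(f"{host}:{container}" if sep else host)
    return resolved
```

For example, with `model_paths = {"kimi-k2.5-eagle3": "/mnt/lustre01/models/kimi-k2.5-eagle3"}`, a recipe can mount `kimi-k2.5-eagle3:/models/eagle3` without hard-coding the cluster path.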
* recipes for MiniMax M2.5 FP4 B200 agg vLLM
* commit for signature
* Add lm-eval benchmark runner for InferenceX evals
  Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX benchmark_lib.sh harness.
…NVIDIA#47)
* fix tokenizer for glm5 (NVIDIA#20)
* add nvidia pre-release url (NVIDIA#22)
Add 66 GLM5 NVFP4 disaggregated recipe configs for GB200 and GB300 on the sa-submission branch; standardize model path and container values across the recipe set for consistency.
Extends PR NVIDIA#71's 'deepseek_v4' custom_tokenizer option with an sglang backend path. Because sglang has no client-side DeepseekV4Tokenizer package equivalent to vllm.tokenizers.deepseek_v4, we vendor sglang's own server-side encoder (encoding_dsv4.py from sgl-project/sglang PR #23600, commit f5d03db) under sa-bench/tokenizers/ so the client renders the exact same DSML prompt the sglang server builds. This lets input_tokens on the client match #new-token on the server.

Routing is backend-aware with no implicit fallback:
  backend=sglang -> SGLangDeepseekV4Tokenizer (vendored)
  backend=vllm   -> vllm.tokenizers.deepseek_v4.DeepseekV4Tokenizer (PR NVIDIA#71)
  else           -> explicit ValueError

Depends on NVIDIA#71 (vllm deepseek_v4 path). Recipes for DeepSeek-V4-Pro are handled separately in NVIDIA#70.
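The routing table above boils down to an explicit dispatch with no fallback branch. A simplified sketch (the real logic lives in `get_tokenizer` inside `backend_request_func.py`; this standalone signature is ours):

```python
def get_deepseek_v4_tokenizer(backend, hf_tokenizer):
    # Illustrative backend-aware dispatch; imports are deferred so each
    # branch only needs its own backend installed.
    if backend == "sglang":
        from tokenizers.sglang_deepseek_v4 import SGLangDeepseekV4Tokenizer
        return SGLangDeepseekV4Tokenizer(hf_tokenizer)
    if backend == "vllm":
        from vllm.tokenizers.deepseek_v4 import DeepseekV4Tokenizer
        return DeepseekV4Tokenizer(hf_tokenizer)
    # No implicit fallback: any other backend is rejected outright.
    raise ValueError(
        f"custom_tokenizer=deepseek_v4 is not supported for backend {backend!r}")
```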
Superseded by a clean PR based on main; see follow-up.
Summary
Extends #71's `deepseek_v4` custom_tokenizer option with an sglang backend path, so `sa-bench` can accurately tokenize DeepSeek-V4-Pro prompts for sglang servers the same way it already does for vllm servers in #71.

Why this is needed. DeepSeek-V4 ships no Hugging Face chat template, so `tokenizer.apply_chat_template()` raises `ValueError`. Sglang's server solves this internally by replacing the HF template path with a hard-coded DSML encoder (`encoding_dsv4.encode_messages`) whenever `arch == "DeepseekV4ForCausalLM"`; see sgl-project/sglang#23600. Without a matching client-side encoder, the `sa-bench` tokenizer silently falls back to raw-text tokenization, so `input_tokens` reported by the client no longer matches the server's `#new-token`. This in turn skews ISL, TPOT, TTFT, and MTP accept-rate accounting.

What we add. There is no client-side `DeepseekV4Tokenizer` package on the sglang side (vllm has `vllm.tokenizers.deepseek_v4`; sglang has none). So we vendor sglang's own server-side encoder (`encoding_dsv4.py` from sgl-project/sglang PR #23600, commit `f5d03db853862c8fb0e805df591bed883a71868b`) under `sa-bench/tokenizers/` and wrap it in an HF-compatible `SGLangDeepseekV4Tokenizer.apply_chat_template()`. The wrapper mirrors exactly what the sglang server does:

- Inserts a default `system` message if missing (matches `serving_chat._resolve_chat_encoding_spec`).
- Uses `thinking_mode="chat"`, `reasoning_effort=None` (matches sglang defaults).
- Calls `encode_messages(...)` to render the raw DSML string.
- Tokenizes with `hf_tokenizer.encode(..., add_special_tokens=False)` (the encoder already adds `<|begin▁of▁sentence|>`).

The vendored file carries an Apache-2.0 header and the upstream commit SHA so it can be dropped when sglang publishes an official client-side package.
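The wrapper's steps can be sketched as a thin class (a minimal sketch: the constructor shape and the exact `encode_messages` keyword signature are assumptions; the real wrapper lives in `sglang_deepseek_v4.py`):

```python
class SGLangDeepseekV4Tokenizer:
    """Sketch of the HF-compatible wrapper around the vendored DSML encoder."""

    def __init__(self, hf_tokenizer, encode_messages):
        self.hf_tokenizer = hf_tokenizer
        self.encode_messages = encode_messages  # vendored sglang encoder

    def apply_chat_template(self, messages, **kwargs):
        # 1. Insert a default system message if the caller omitted one.
        if not messages or messages[0].get("role") != "system":
            messages = [{"role": "system", "content": ""}] + list(messages)
        # 2-3. Render the raw DSML prompt with sglang's server defaults.
        prompt = self.encode_messages(
            messages, thinking_mode="chat", reasoning_effort=None)
        # 4. The encoder already emits <|begin▁of▁sentence|>, so the HF
        #    tokenizer must not add special tokens again.
        return self.hf_tokenizer.encode(prompt, add_special_tokens=False)
```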
Routing is backend-aware, with no implicit fallback. Per reviewer preference, we do not fall back to vllm if an sglang install is present or vice versa; ambiguity is rejected:

- `backend=sglang` -> `SGLangDeepseekV4Tokenizer` (vendored)
- `backend=vllm` -> `vllm.tokenizers.deepseek_v4.DeepseekV4Tokenizer` (PR #71)
- anything else -> explicit `ValueError`

`benchmark_serving.py` passes `args.backend` into `get_tokenizer(...)`.

Files
- `src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py`: new package
- `src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py`: vendored (Apache-2.0), 840 lines, unmodified upstream content
- `src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py`: HF-compatible wrapper
- `backend_request_func.py`: adds a `backend` kwarg to `get_tokenizer`, splits `deepseek_v4` into explicit sglang / vllm branches
- `benchmark_serving.py`: plumbs `args.backend` through

Relationship to other PRs
- Depends on #71 (`aflowers/dsv4-pr67-pr68`): that PR introduces the `deepseek_v4` custom_tokenizer option for vllm; this PR extends it for sglang.

Test plan

- `python -m py_compile` passes on all three new / modified files.
- `from tokenizers.sglang_deepseek_v4 import SGLangDeepseekV4Tokenizer` succeeds with the vendored encoder present.
- `SGLangDeepseekV4Tokenizer(...).apply_chat_template(messages)` token IDs match server-side `tokenizer.encode(encode_messages(messages))` for a representative GSM8K prompt.
- `sa-bench` run against `recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml` (from "feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg)" #70) with this tokenizer, asserting client `input_tokens` == server `#new-token`.
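The final assertion in the test plan reduces to a simple parity check between the client-side and server-side counts (a sketch; the function and argument names here are illustrative, not part of the sa-bench code):

```python
def check_token_parity(client_input_tokens, server_new_tokens):
    # Sketch of the test-plan assertion: the client's reported input_tokens
    # must equal the server's #new-token count, otherwise the run's ISL,
    # TPOT, TTFT, and MTP accept-rate numbers are suspect.
    if client_input_tokens != server_new_tokens:
        raise AssertionError(
            f"client input_tokens={client_input_tokens} != "
            f"server #new-token={server_new_tokens}")
    return True
```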