sa-bench: make SGLangDeepseekV4Tokenizer callable #144
Merged
ishandhanani merged 2 commits into NVIDIA:main on May 9, 2026
Conversation
sa-bench's `calculate_metrics` in benchmark_serving.py:657 counts
generated tokens with `tokenizer(text, add_special_tokens=False).input_ids`.
SGLangDeepseekV4Tokenizer is a wrapper around an HF tokenizer
(`self._hf`) but doesn't implement `__call__`, so a multi-node MTP
sa-bench run fails with:
```
TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
  File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
    num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)
```
VLLMDeepseekV4Tokenizer sidesteps this by returning the HF
PreTrainedTokenizerFast subclass directly; the SGLang wrapper needs
explicit delegation since it must keep the sglang-specific
`apply_chat_template` override.
Add:

- `__call__` → delegate to `self._hf` so token-counting works.
- `__getattr__` → proxy any other attribute (`encode`, `pad_token`, ...) through to the HF tokenizer so callers that expect a full `PreTrainedTokenizerFast` API work without knowing about this wrapper.

`apply_chat_template` is defined below and wins via normal attribute lookup before `__getattr__` runs.
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 8, 2026
… custom_tokenizer

NVIDIA/srt-slurm#144 adds `__call__` / `__getattr__` to SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics (benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count generated tokens for DSv4-Pro multi-node MTP runs without throwing `TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable`.

Until that PR merges, pin gb300-cw's sglang launcher to `ch-wan/srt-slurm @ c901ad38` (the same fix), and restore `custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` in the 6 MTP recipes. `use_chat_template: true` is required by AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw random tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sglang main 2cf1a4ab renamed SGLANG_ENABLE_THINKING → SGLANG_DEFAULT_THINKING and SGLANG_REASONING_EFFORT → SGLANG_DSV4_REASONING_EFFORT. The new names are read first, with the old deprecated names kept as a fallback for backward compatibility.
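A new-name-first read with a deprecated-name fallback can be sketched like this. The helper name and structure are illustrative, not sglang's actual implementation:

```python
import os
from typing import Optional


def get_env_with_fallback(new_name: str, old_name: str,
                          default: Optional[str] = None) -> Optional[str]:
    """Read new_name primarily; fall back to the deprecated old_name."""
    if new_name in os.environ:
        return os.environ[new_name]
    # Deprecated name still honored for backward compatibility.
    return os.environ.get(old_name, default)


thinking = get_env_with_fallback("SGLANG_DEFAULT_THINKING",
                                 "SGLANG_ENABLE_THINKING")
effort = get_env_with_fallback("SGLANG_DSV4_REASONING_EFFORT",
                               "SGLANG_REASONING_EFFORT")
```

Note that a read-side fallback like this is exactly what the renamed variables need: a deprecation warning alone (without propagating the old value to the new name) would silently drop settings from older recipes.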
ishandhanani approved these changes on May 9, 2026
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 10, 2026
NVIDIA/srt-slurm#144 (``sa-bench: make SGLangDeepseekV4Tokenizer callable``) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin that was only there while #144 was in review and pin to the upstream merge commit instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request on May 11, 2026
…g MTP disagg benchmarks (#1297)

* add mtp configs

* Add sbatch_directives to MTP recipes (root-cause fix)

  Without `cpus-per-task: 144` and `mem: 0`, slurm hands out 1 CPU and ~4 MB per task, and the dynamo cold source build (~500 rust crates) is OOM-killed before any worker comes up. Manifests as `Sweep failed (exit code: 137)` ~30 s after orchestrator start. Mirrors the block already present in the working main 8k1k recipes (e.g. disagg-gb300-1p1d-tp4-tp4-2-c1.yaml).

* Change deepgemm flags

* Move MTP recipes up to 8k1k/ with -mtp filename suffix

  Mirrors the convention used elsewhere in the repo: per-config files at the same depth as their non-MTP siblings, distinguished only by the -mtp suffix. CONFIG_FILE references in nvidia-master.yaml updated accordingly.

* fix

* Drop custom_tokenizer from MTP recipes — incompatible with sa-bench

  sa-bench's calculate_metrics calls `tokenizer(text)` to count output tokens, but `sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` doesn't implement `__call__`:

  ```
  TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
    File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
      num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)
  ```

  This is the actual cause of the benchmark-task failures while eval-only tasks succeed (lm-eval doesn't go through this path). Removing custom_tokenizer falls back to AutoTokenizer.from_pretrained(/model). The chat_template is stored in the model's tokenizer_config.json, so `use_chat_template: true` continues to apply via the HF tokenizer (required for MTP correctness per AGENTS.md).

* Pin srt-slurm to fork w/ SGLangDeepseekV4Tokenizer callable + restore custom_tokenizer

  NVIDIA/srt-slurm#144 adds `__call__` / `__getattr__` to SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics (benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count generated tokens for DSv4-Pro multi-node MTP runs without throwing `TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable`. Until that PR merges, pin gb300-cw's sglang launcher to `ch-wan/srt-slurm @ c901ad38` (the same fix), and restore `custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` in the 6 MTP recipes. `use_chat_template: true` is required by AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw random tokens).

* Bump sglang container to nightly-dev-cu13-20260508-2cf1a4ab (latest main)

  Pinned to the multi-arch image produced by sgl-project/sglang Build and Push Development Docker Images run #25574279419 (head_sha 2cf1a4ab, HEAD of sglang main). Replaces the older staging image lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev (May 7). The nightly-dev-cu13 image carries the full sglang main as of 2026-05-08 21:06 UTC, including upstream fixes since the May-7 staging snapshot. Multi-arch manifest covers amd64 + arm64, so it works on the gb300 (Grace) compute nodes.

* Restore base dsv4-fp4-gb300-dynamo-sglang image to staging tag

  The previous commit accidentally bumped the non-MTP base entry's image too. The base 8k1k recipes still pin `container: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev`, and the launcher requires the matrix's `image:` to match the recipe's `container:` (it templates `"${IMAGE}": ${SQUASH_FILE}` into srtslurm.yaml). Mismatching them would break the base sweep. Only the dsv4-fp4-gb300-dynamo-sglang-mtp entry needs the nightly-dev-cu13 bump (paired with the MTP recipe `container:` field).

* Pin MTP recipes to dynamo 81d0555e (matches working base recipes)

  The 6 MTP recipes were imported with dynamo hash 9d3c913d from the upstream srt-slurm fork, but the working non-MTP base recipes already on this branch use 81d0555ee23519cea80a42b4fe824e30368b7300 — paired with the sglang nightly cu13 main builds. The 9d3c913d wheel is incompatible with sglang main 2cf1a4ab: the decode scheduler subprocess (rank 0) is SIGQUIT'd during sgl.Engine() init at dynamo.sglang.init_llm:77, surfacing as "Rank 0 scheduler died during initialization (exit code: -3)" in CI run 25580956722.

* Explicitly disable CAR_V2 in multi-node decode MTP recipes

  The 4 multi-node decode MTP recipes had a comment saying SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 was "intentionally NOT set", but sglang main 2cf1a4ab defaults this on. CAR_V2 is single-node only, and on multi-node decode it silently fails to construct its backing `self.obj`, then segfaults during cuda graph capture:

  ```
  AttributeError: 'CustomAllReduceV2' object has no attribute 'obj'
    at custom_all_reduce_v2.py:97 in capture()
  ```

  The scheduler is SIGQUIT'd, surfacing as "Rank 0 scheduler died during initialization (exit code: -3)" in dynamo's wrapper. Explicitly setting the env to "0" matches the intent of the pre-existing comment. Affects: dep4-dep8, dep4-dep16, 2p1d-dep4-dep8, 4p1d-dep4-dep8. Single-node decode recipes (1p1d-tp4-tp4, 1p6d-dep4-tp4) keep the default since CAR_V2 works in single-node.

* Explicitly disable CAR_V2 in 8k1k base decode recipes too

  Apply the same explicit `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"` to the existing 8k1k base decode recipes that had only the `intentionally NOT set` comment. The MTP fix in 6d28994 proved the comment-only pattern is brittle: sglang main 2cf1a4ab defaults the env on, and CAR_V2 segfaults during cuda graph capture on multi-node decode. Make the disable explicit so a future image bump on the base sweep can't regress the same way. Affects 6 recipes: 1p1d-tp4-tp4-2-c1, 1p1d-dep4-dep16-5-c1024, 4p1d-dep4-dep16-8-c1024, 8p1d-dep4-dep16-12-c4096, 10p1d-dep4-dep16-14-c8192, 12p1d-dep4-dep12-15-c21504.

* Set both old and new sglang thinking/reasoning env vars in MTP recipes

  sglang main 2cf1a4ab moved `SGLANG_ENABLE_THINKING` → `SGLANG_DEFAULT_THINKING` and `SGLANG_REASONING_EFFORT` → `SGLANG_DSV4_REASONING_EFFORT`. The deprecation helper `_print_deprecated_env` (environ.py:642) only emits a warning — it does NOT propagate the value to the new name. So the old env vars were silently ignored: the server defaulted to non-thinking mode with empty reasoning effort, dropping GSM8K accuracy from ~95% to ~40% (eval_results_all from run 25583345967: em_strict=0.4291 for 1p6d-dep4-tp4 conc=64, 0.4056 for 4p1d-dep4-dep8 conc=1024). Set both names in prefill_environment and decode_environment of all six MTP recipes:

  - old names — read by the sa-bench client tokenizer (sa_bench_tokenizers.sglang_deepseek_v4) for prompt-rendering parity with the server.
  - new names — read by the sglang server in 2cf1a4ab+.

* Set tool-call-parser=deepseekv4 to enable DSV4 chat encoding (gsm8k regression fix)

  GSM8K accuracy on the latest sweep dropped from the expected ~95% to ~40% (em_strict=0.4291 for 1p6d-dep4-tp4 conc=64; 0.4056 for 4p1d-dep4-dep8 conc=1024 — run 25583345967 eval_results_all). Inspecting samples_gsm8k_*.jsonl revealed every response was prefixed with junk like "Weapon:" / "Weaponized" / "We黑白颠倒", and the reasoning often answered a different question than what was asked — classic symptom of a malformed chat-template prompt. Root cause in sglang main 2cf1a4ab (entrypoints/openai/serving_chat.py:296):

  ```
  def _resolve_chat_encoding_spec(self) -> Optional[str]:
      if self.tool_call_parser == "deepseekv4":
          return "dsv4"
      if self.tool_call_parser == "deepseekv32":
          return "dsv32"
  ```

  The dsv4 chat-encoding spec — which routes DSV4 prompts through `encoding_dsv4.encode_messages` with thinking-mode and reasoning-effort handling — only activates when `--tool-call-parser deepseekv4` is set. Without it the server falls back to the vanilla HF chat template (`apply_chat_template`), which doesn't know about DSV4's special tokens, `<think>` blocks, or the `thinking_mode` argument. The MTP recipes never set this flag, so ServerArgs reports `tool_call_parser=None` and the model receives a malformed prompt. Add `tool-call-parser: deepseekv4` to both prefill and decode `sglang_config` blocks in all 6 MTP recipes.

* Revert CAR_V2 explicit-disable in non-MTP base 8k1k recipes

  Restore the 6 base recipes to their state on origin/main; the explicit `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"` was added defensively in 9c4c244, but the base sweep is happy on its current staging-dev image and shouldn't be touched in this PR. Reverts files: disagg-gb300-1p1d-tp4-tp4-2-c1.yaml, disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml, disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml, disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml, disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml, disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml.

* Trim verbose comments and drop deprecated env var names in MTP recipes

  - Drop `SGLANG_ENABLE_THINKING` / `SGLANG_REASONING_EFFORT` (deprecated since sglang main 2cf1a4ab); keep only the new names `SGLANG_DEFAULT_THINKING` / `SGLANG_DSV4_REASONING_EFFORT`.
  - Bump the srt-slurm fork pin to 51847632 so the sa-bench client tokenizer reads the new env names (with old names as fallback).
  - Trim multi-line block comments down to one-line tail comments for the CAR_V2 disable and `tool-call-parser: deepseekv4` flag.

* Revert MTP recipes to staging-dev container (gsm8k accuracy fix)

  The bump to `lmsysorg/sglang:nightly-dev-cu13-20260508-2cf1a4ab` introduced an MTP-path accuracy regression: gsm8k em_strict dropped from the expected ~0.93 to ~0.42 (run 25585766931 eval_results_all shows 0.4200 for 4p1d-dep4-dep8 conc=1024). Local repro on the cluster: the failed 5-shot prompt sent through plain sglang chat completion returns the correct answer; through the dynamo+nightly pipeline it returns garbage prefixed with junk tokens. Restore the same staging-dev container the base `dsv4-fp4-gb300-dynamo-sglang` sweep already runs on. Drop the dependent flags that only existed because of the nightly bump:

  - container: nightly-dev-cu13-20260508-2cf1a4ab → sglang-staging:deepseek-v4-grace-blackwell-dev (matches the matrix entry's image)
  - `tool-call-parser: deepseekv4` removed (the chat-encoding-spec routing it gated on doesn't exist in staging-dev; HF chat_template handles DSV4 prompts directly via dynamo's native Rust formatter).
  - Env vars reverted to `SGLANG_ENABLE_THINKING` / `SGLANG_REASONING_EFFORT` (the names staging-dev recognizes).
  - nvidia-master.yaml MTP entry image updated to match.

  The dynamo hash, srt-slurm fork pin, sbatch_directives, and multi-node CAR_V2 disable all stay (still required).

* Bump dynamo hash to 34d55a5 to fix DSV4 chat-template formatter

  Local repro on the cluster (job 2226, slurm-gb300-139-009) confirmed the regression is in dynamo's wrapper, not in sglang main:

  - sglang main (2cf1a4ab) standalone, same failed 5-shot: prompt_tokens=1128, answer=18 (correct).
  - sglang main + dynamo 81d0555e (CI): answer="Weapon:#### 16" (em_strict=0.42).

  The pinned dynamo at 81d0555e ships an older Rust DSV4 prompt formatter whose `render()` always calls `encode_messages(...)` — which hardcodes `reasoning_effort=None` and ignores `chat_template_kwargs` entirely. That produces a prompt the model fails on under MTP. Dynamo PR #9322 (commit 34d55a5, "Deduplicate DeepSeek prompt encoders v3.2 and v4") rewrote `render()` to read `reasoning_effort` and `drop_thinking` from `chat_template_args` and plumb them into `encode_messages_with_options`, fixing the DSV4 prompt rendering. Restore the changes the staging-dev revert had to undo:

  - container: nightly-dev-cu13-20260508-2cf1a4ab
  - tool-call-parser: deepseekv4 (gates the dsv4 chat-encoding spec)
  - SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT
  - dynamo.hash 81d0555e → 34d55a5
  - nvidia-master.yaml MTP entry image

  CAR_V2 disable on multi-node decode and the srt-slurm fork pin remain.

* Bump sglang container to nightly-dev-cu13-20260509-9ee83034

  Latest sglang main build (sgl-project/sglang Actions run 25586829316, head_sha 9ee83034, completed 2026-05-09 00:51 UTC). Pairs with the dynamo bump in 9b06113 (commit 34d55a5, PR #9322 — DSV4 chat-template formatter rewrite). Updated all 6 MTP recipe `container:` fields and the `dsv4-fp4-gb300-dynamo-sglang-mtp` matrix entry's `image:` in nvidia-master.yaml.

* Switch DSV4 MTP recipes to nixl KV transfer backend

  The mooncake backend has a KV-transfer bug that produces wrong gsm8k answers when prompts end on the `<think>` token (id 128821). Empirically: the same input on monolithic sglang gives the correct answer, mooncake-disagg gives a wrong one, nixl-disagg gives the correct one. Bug filed upstream; using nixl as a workaround.

* Revert "Switch DSV4 MTP recipes to nixl KV transfer backend"

  This reverts commit 3275282.

* Bump MTP recipes to sglang nightly with mooncake DSv4 fix

  Picks up sgl-project/sglang#24878 (merged as c7f674e4), which adds the missing dsv4 state_type branch to MooncakeKVManager.maybe_send_extra. Combined with the prior revert of #1297's nixl switch (commit daa6785), the mooncake backend now correctly transfers DSv4's flat heterogeneous state pool for both non-MTP and MTP runs. Validated on GB300 1P+1D: comp_with_think.json (the prompt ending on the literal `<think>` token that previously surfaced the corruption) now returns the correct gsm8k Janet answer (`#### 18`) on mooncake disagg, matching mono and the NIXL control. MTP sa-bench delivers ~136 tok/s output throughput (~1.7x non-MTP), confirming draft acceptance is working.

* gb300-cw: switch srt-slurm pin to NVIDIA/srt-slurm main (#144 merged)

  NVIDIA/srt-slurm#144 (`sa-bench: make SGLangDeepseekV4Tokenizer callable`) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin that was only there while #144 was in review and pin to the upstream merge commit instead.

* gb300-cw: track NVIDIA/srt-slurm main instead of pinning a commit

  Now that #144 is merged, there is no longer a need to pin a specific commit.

* Bump MTP recipes to sglang nightly 20260510-2473659e

  Picks up sgl-project/sglang main commit 2473659e (built via upstream workflow run 25639473178).

* fix: use shared gb300 dsv4 model path

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Summary
Adds `__call__` and `__getattr__` to `SGLangDeepseekV4Tokenizer` so it delegates to the wrapped HF tokenizer. Without `__call__`, sa-bench's `calculate_metrics` (benchmark_serving.py:657, `num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)`) fails on multi-node MTP runs with `TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable`. `VLLMDeepseekV4Tokenizer` already sidesteps this by returning the HF `PreTrainedTokenizerFast` subclass directly; the SGLang wrapper has to keep the sglang-specific `apply_chat_template` override, so explicit delegation is required.

Test plan

- Token counting via `__call__` works (`calculate_metrics`).
- `apply_chat_template` still wins via normal attribute lookup (defined on the class before `__getattr__` runs).

🤖 Generated with Claude Code
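The attribute-lookup precedence relied on here (a method defined on the class wins over `__getattr__` proxying) can be checked with a minimal, self-contained sketch; the class and method names below are illustrative stand-ins, not the PR's code:

```python
class HFStub:
    """Stand-in for the wrapped HF tokenizer."""

    def apply_chat_template(self, messages):
        return "hf-template"

    def encode(self, text):
        return [1, 2, 3]


class Wrapper:
    def __init__(self):
        self._hf = HFStub()

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails,
        # so class-defined methods always take precedence.
        return getattr(self._hf, name)

    def apply_chat_template(self, messages):
        # Defined on the class, so normal lookup finds it first.
        return "wrapper-override"


w = Wrapper()
print(w.apply_chat_template([]))  # wrapper-override (class method wins)
print(w.encode("x"))              # [1, 2, 3] (proxied via __getattr__)
```

This mirrors the test-plan claim: the override shadows the wrapped tokenizer's method, while unshadowed attributes fall through to the proxy.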