
sa-bench: make SGLangDeepseekV4Tokenizer callable#144

Merged
ishandhanani merged 2 commits into NVIDIA:main from
ch-wan:cwan/fix-sglang-dsv4-tokenizer-callable
May 9, 2026

Conversation

ch-wan (Contributor) commented May 8, 2026

Summary

  • Add __call__ and __getattr__ to SGLangDeepseekV4Tokenizer so it delegates to the wrapped HF tokenizer.
  • Without __call__, sa-bench's calculate_metrics (benchmark_serving.py:657, num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)) fails on multi-node MTP runs with TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable.
  • VLLMDeepseekV4Tokenizer already sidesteps this by returning the HF PreTrainedTokenizerFast subclass directly; the SGLang wrapper has to keep the sglang-specific apply_chat_template override, so explicit delegation is required.

Test plan

  • Multi-node DSv4-Pro SGLang MTP recipe completes a sa-bench run end-to-end (token counting at line 657 succeeds).
  • Eval-only path still passes through (lm-eval doesn't go through calculate_metrics).
  • apply_chat_template still wins via normal attribute lookup (defined on the class before __getattr__ runs).

🤖 Generated with Claude Code

sa-bench's `calculate_metrics` in benchmark_serving.py:657 counts
generated tokens with `tokenizer(text, add_special_tokens=False).input_ids`.
SGLangDeepseekV4Tokenizer is a wrapper around an HF tokenizer
(`self._hf`) but doesn't implement `__call__`, so a multi-node MTP
sa-bench run fails with:

  TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
    File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
      num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)

VLLMDeepseekV4Tokenizer sidesteps this by returning the HF
PreTrainedTokenizerFast subclass directly; the SGLang wrapper needs
explicit delegation since it must keep the sglang-specific
`apply_chat_template` override.

Add:
  __call__   -> delegate to self._hf so token-counting works.
  __getattr__ -> proxy any other attribute (encode, pad_token, ...)
                 through to the HF tokenizer so callers that expect a
                 full PreTrainedTokenizerFast API work without knowing
                 about this wrapper.

apply_chat_template is defined below and wins via normal attribute
lookup before __getattr__ runs.
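The delegation pattern the commit describes can be sketched as follows. This is an illustrative mock, not the PR's actual code: `FakeHFTokenizer` and its toy tokenization are invented stand-ins for the real HF `PreTrainedTokenizerFast`, while the wrapper mirrors the described `self._hf` delegation.

```python
class FakeHFTokenizer:
    """Invented stand-in for an HF PreTrainedTokenizerFast."""

    pad_token = "<pad>"

    class _Encoding:
        def __init__(self, ids):
            self.input_ids = ids

    def __call__(self, text, add_special_tokens=True):
        # Toy tokenization: one id per whitespace-separated word.
        return self._Encoding(list(range(len(text.split()))))


class SGLangDeepseekV4Tokenizer:
    def __init__(self, hf_tokenizer):
        self._hf = hf_tokenizer

    def __call__(self, *args, **kwargs):
        # Delegate so sa-bench's `tokenizer(text).input_ids` works.
        return self._hf(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so methods defined on
        # this class (e.g. apply_chat_template) still win.
        return getattr(self._hf, name)

    def apply_chat_template(self, messages, **kwargs):
        # sglang-specific override kept on the wrapper (simplified here).
        return " ".join(m["content"] for m in messages)


tok = SGLangDeepseekV4Tokenizer(FakeHFTokenizer())
num_tokens = len(tok("hello brave new world", add_special_tokens=False).input_ids)
```

Because `__getattr__` fires only after normal attribute lookup fails, `apply_chat_template` (defined on the class) shadows the HF version, while everything else (`pad_token`, `encode`, ...) falls through to the wrapped tokenizer.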
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 8, 2026
… custom_tokenizer

NVIDIA/srt-slurm#144 adds __call__ / __getattr__ to
SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics
(benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count
generated tokens for DSv4-Pro multi-node MTP runs without throwing
``TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable``.

Until that PR merges, pin gb300-cw's sglang launcher to
``ch-wan/srt-slurm @ c901ad38`` (the same fix), and restore
``custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer``
in the 6 MTP recipes. ``use_chat_template: true`` is required by
AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw
random tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sglang main 2cf1a4ab renamed SGLANG_ENABLE_THINKING ->
SGLANG_DEFAULT_THINKING and SGLANG_REASONING_EFFORT ->
SGLANG_DSV4_REASONING_EFFORT. Read the new names primarily, with the
old deprecated names as fallback for backward compatibility.
ishandhanani merged commit 0cbc7eb into NVIDIA:main May 9, 2026
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 10, 2026
NVIDIA/srt-slurm#144 (``sa-bench: make SGLangDeepseekV4Tokenizer
callable``) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin
that was only there while #144 was in review and pin to the upstream
merge commit instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 11, 2026
…g MTP disagg benchmarks (#1297)

* add mtp configs

* Add sbatch_directives to MTP recipes (root-cause fix)

Without `cpus-per-task: 144` and `mem: 0`, slurm hands out 1 CPU and
~4 MB per task, and the dynamo cold source build (~500 rust crates)
is OOM-killed before any worker comes up. Manifests as
`Sweep failed (exit code: 137)` ~30 s after orchestrator start.

Mirrors the block already present in the working main 8k1k recipes
(e.g. disagg-gb300-1p1d-tp4-tp4-2-c1.yaml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Change deepgemm flags

* Move MTP recipes up to 8k1k/ with -mtp filename suffix

Mirrors the convention used elsewhere in the repo: per-config files at
the same depth as their non-MTP siblings, distinguished only by the
-mtp suffix. CONFIG_FILE references in nvidia-master.yaml updated
accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix

* Drop custom_tokenizer from MTP recipes — incompatible with sa-bench

sa-bench's calculate_metrics calls `tokenizer(text)` to count output
tokens, but `sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer`
doesn't implement __call__:

  TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
    File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
      num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)

This is the actual cause of the benchmark-task failures while eval-only
tasks succeed (lm-eval doesn't go through this path).

Removing custom_tokenizer falls back to AutoTokenizer.from_pretrained(/model).
The chat_template is stored in the model's tokenizer_config.json, so
`use_chat_template: true` continues to apply via the HF tokenizer
(required for MTP correctness per AGENTS.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin srt-slurm to fork w/ SGLangDeepseekV4Tokenizer callable + restore custom_tokenizer

NVIDIA/srt-slurm#144 adds __call__ / __getattr__ to
SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics
(benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count
generated tokens for DSv4-Pro multi-node MTP runs without throwing
``TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable``.

Until that PR merges, pin gb300-cw's sglang launcher to
``ch-wan/srt-slurm @ c901ad38`` (the same fix), and restore
``custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer``
in the 6 MTP recipes. ``use_chat_template: true`` is required by
AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw
random tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump sglang container to nightly-dev-cu13-20260508-2cf1a4ab (latest main)

Pinned to the multi-arch image produced by sgl-project/sglang Build and
Push Development Docker Images run #25574279419 (head_sha 2cf1a4ab,
HEAD of sglang main). Replaces the older staging image
lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev (May 7).

The nightly-dev-cu13 image carries the full sglang main as of 2026-05-08
21:06 UTC, including upstream fixes since the May-7 staging snapshot.
Multi-arch manifest covers amd64 + arm64, so it works on the gb300
(Grace) compute nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Restore base dsv4-fp4-gb300-dynamo-sglang image to staging tag

The previous commit accidentally bumped the non-MTP base entry's
image too. The base 8k1k recipes still pin
``container: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev``,
and the launcher requires the matrix's ``image:`` to match the
recipe's ``container:`` (it templates ``\"\${IMAGE}\": \${SQUASH_FILE}``
into srtslurm.yaml). Mismatching them would break the base sweep.

Only the dsv4-fp4-gb300-dynamo-sglang-mtp entry needs the
nightly-dev-cu13 bump (paired with the MTP recipe ``container:``
field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin MTP recipes to dynamo 81d0555e (matches working base recipes)

The 6 MTP recipes were imported with dynamo hash 9d3c913d from the
upstream srt-slurm fork, but the working non-MTP base recipes already
on this branch use 81d0555ee23519cea80a42b4fe824e30368b7300 — paired
with the sglang nightly cu13 main builds.

The 9d3c913d wheel is incompatible with sglang main 2cf1a4ab: the
decode scheduler subprocess (rank 0) is SIGQUIT'd during sgl.Engine()
init at dynamo.sglang.init_llm:77, surfacing as "Rank 0 scheduler died
during initialization (exit code: -3)" in CI run 25580956722.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Explicitly disable CAR_V2 in multi-node decode MTP recipes

The 4 multi-node decode MTP recipes had a comment saying
SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 was "intentionally NOT set", but
sglang main 2cf1a4ab defaults this on. CAR_V2 is single-node only,
and on multi-node decode it silently fails to construct its backing
``self.obj``, then segfaults during cuda graph capture:

  AttributeError: 'CustomAllReduceV2' object has no attribute 'obj'
    at custom_all_reduce_v2.py:97 in capture()

The scheduler is SIGQUIT'd, surfacing as
"Rank 0 scheduler died during initialization (exit code: -3)" in
dynamo's wrapper. Explicitly setting the env to "0" matches the
intent of the pre-existing comment.

Affects: dep4-dep8, dep4-dep16, 2p1d-dep4-dep8, 4p1d-dep4-dep8.
Single-node decode recipes (1p1d-tp4-tp4, 1p6d-dep4-tp4) keep the
default since CAR_V2 works in single-node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Explicitly disable CAR_V2 in 8k1k base decode recipes too

Apply the same explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"``
to the existing 8k1k base decode recipes that had only the
``intentionally NOT set`` comment. The MTP fix in 6d28994 proved the
comment-only pattern is brittle: sglang main 2cf1a4ab defaults the
env on, and CAR_V2 segfaults during cuda graph capture on multi-node
decode. Make the disable explicit so a future image bump on the base
sweep can't regress the same way.

Affects 6 recipes: 1p1d-tp4-tp4-2-c1, 1p1d-dep4-dep16-5-c1024,
4p1d-dep4-dep16-8-c1024, 8p1d-dep4-dep16-12-c4096,
10p1d-dep4-dep16-14-c8192, 12p1d-dep4-dep12-15-c21504.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Set both old and new sglang thinking/reasoning env vars in MTP recipes

sglang main 2cf1a4ab moved ``SGLANG_ENABLE_THINKING`` →
``SGLANG_DEFAULT_THINKING`` and ``SGLANG_REASONING_EFFORT`` →
``SGLANG_DSV4_REASONING_EFFORT``. The deprecation helper
``_print_deprecated_env`` (environ.py:642) only emits a warning — it
does NOT propagate the value to the new name. So the old env vars
were silently ignored: server defaulted to non-thinking mode with
empty reasoning effort, dropping GSM8K accuracy from ~95% to ~40%
(eval_results_all from run 25583345967: em_strict=0.4291 for
1p6d-dep4-tp4 conc=64, 0.4056 for 4p1d-dep4-dep8 conc=1024).

Set both names in prefill_environment and decode_environment of all
six MTP recipes:
  * old names — read by the sa-bench client tokenizer
    (sa_bench_tokenizers.sglang_deepseek_v4) for prompt-rendering
    parity with the server.
  * new names — read by the sglang server in 2cf1a4ab+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Set tool-call-parser=deepseekv4 to enable DSV4 chat encoding (gsm8k regression fix)

GSM8K accuracy on the latest sweep dropped from the expected ~95% to
~40% (em_strict=0.4291 for 1p6d-dep4-tp4 conc=64; 0.4056 for
4p1d-dep4-dep8 conc=1024 — run 25583345967 eval_results_all).

Inspecting samples_gsm8k_*.jsonl revealed every response was prefixed
with junk like "Weapon:" / "Weaponized" / "We黑白颠倒", and the
reasoning often answered a different question than what was asked —
classic symptom of a malformed chat-template prompt.

Root cause in sglang main 2cf1a4ab
(entrypoints/openai/serving_chat.py:296):

    def _resolve_chat_encoding_spec(self) -> Optional[str]:
        if self.tool_call_parser == "deepseekv4":
            return "dsv4"
        if self.tool_call_parser == "deepseekv32":
            return "dsv32"

The dsv4 chat-encoding spec — which routes DSV4 prompts through
``encoding_dsv4.encode_messages`` with thinking-mode and
reasoning-effort handling — only activates when
``--tool-call-parser deepseekv4`` is set. Without it the server falls
back to the vanilla HF chat template (``apply_chat_template``), which
doesn't know about DSV4's special tokens, ``<think>`` blocks, or the
``thinking_mode`` argument. The MTP recipes never set this flag, so
ServerArgs reports ``tool_call_parser=None`` and the model receives a
malformed prompt.

Add ``tool-call-parser: deepseekv4`` to both prefill and decode
``sglang_config`` blocks in all 6 MTP recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
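The gating this commit describes can be reproduced as a standalone sketch of the quoted `_resolve_chat_encoding_spec` logic (the free function below is a simplified rewrite for illustration, not sglang's actual method): with `tool_call_parser=None` the function returns `None`, which is why the server falls back to the vanilla HF chat template.

```python
from typing import Optional


def resolve_chat_encoding_spec(tool_call_parser: Optional[str]) -> Optional[str]:
    # Mirrors the branching quoted above from serving_chat.py:296.
    if tool_call_parser == "deepseekv4":
        return "dsv4"
    if tool_call_parser == "deepseekv32":
        return "dsv32"
    # No spec: the server falls back to apply_chat_template.
    return None
```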

* Revert CAR_V2 explicit-disable in non-MTP base 8k1k recipes

Restore the 6 base recipes to their state on origin/main; the
explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: \"0\"`` was added
defensively in 9c4c244, but the base sweep is happy on its current
staging-dev image and shouldn't be touched in this PR.

Reverts files:
  disagg-gb300-1p1d-tp4-tp4-2-c1.yaml
  disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml
  disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml
  disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml
  disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml
  disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Trim verbose comments and drop deprecated env var names in MTP recipes

- Drop ``SGLANG_ENABLE_THINKING`` / ``SGLANG_REASONING_EFFORT``
  (deprecated since sglang main 2cf1a4ab); keep only the new names
  ``SGLANG_DEFAULT_THINKING`` / ``SGLANG_DSV4_REASONING_EFFORT``.
- Bump the srt-slurm fork pin to 51847632 so the sa-bench client
  tokenizer reads the new env names (with old names as fallback).
- Trim multi-line block comments down to one-line tail comments
  for the CAR_V2 disable and ``tool-call-parser: deepseekv4`` flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert MTP recipes to staging-dev container (gsm8k accuracy fix)

The bump to ``lmsysorg/sglang:nightly-dev-cu13-20260508-2cf1a4ab``
introduced an MTP-path accuracy regression: gsm8k em_strict dropped
from the expected ~0.93 to ~0.42 (run 25585766931 eval_results_all
shows 0.4200 for 4p1d-dep4-dep8 conc=1024). Local repro on the
cluster: the failed 5-shot prompt sent through plain sglang chat
completion returns the correct answer; through the dynamo+nightly
pipeline it returns garbage prefixed with junk tokens.

Restore the same staging-dev container the base
``dsv4-fp4-gb300-dynamo-sglang`` sweep already runs on. Drop the
dependent flags that
only existed because of the nightly bump:

- container: nightly-dev-cu13-20260508-2cf1a4ab → sglang-staging:
  deepseek-v4-grace-blackwell-dev (matches the matrix entry's image)
- ``tool-call-parser: deepseekv4`` removed (the chat-encoding-spec
  routing it gated on doesn't exist in staging-dev; HF chat_template
  handles DSV4 prompts directly via dynamo's native Rust formatter).
- Env vars reverted to ``SGLANG_ENABLE_THINKING`` /
  ``SGLANG_REASONING_EFFORT`` (the names staging-dev recognizes).
- nvidia-master.yaml MTP entry image updated to match.

The dynamo hash, srt-slurm fork pin, sbatch_directives, and
multi-node CAR_V2 disable all stay (still required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump dynamo hash to 34d55a5 to fix DSV4 chat-template formatter

Local repro on the cluster (job 2226, slurm-gb300-139-009) confirmed
the regression is in dynamo's wrapper, not in sglang main:

  - sglang main (2cf1a4ab) standalone, same failed 5-shot:
    prompt_tokens=1128, answer=18 (correct).
  - sglang main + dynamo 81d0555e (CI):
    answer="Weapon:#### 16" (em_strict=0.42).

The pinned dynamo at 81d0555e ships an older Rust DSV4 prompt
formatter whose ``render()`` always calls ``encode_messages(...)`` —
which hardcodes ``reasoning_effort=None`` and ignores
``chat_template_kwargs`` entirely. That produces a prompt the model
fails on under MTP.

Dynamo PR #9322 (commit 34d55a5, "Deduplicate DeepSeek prompt
encoders v3.2 and v4") rewrote ``render()`` to read
``reasoning_effort`` and ``drop_thinking`` from
``chat_template_args`` and plumb them into
``encode_messages_with_options``, fixing the DSV4 prompt rendering.

Restore the changes the staging-dev revert had to undo:

  - container: nightly-dev-cu13-20260508-2cf1a4ab
  - tool-call-parser: deepseekv4 (gates the dsv4 chat-encoding spec)
  - SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT
  - dynamo.hash 81d0555e -> 34d55a5
  - nvidia-master.yaml MTP entry image

CAR_V2 disable on multi-node decode and the srt-slurm fork pin
remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump sglang container to nightly-dev-cu13-20260509-9ee83034

Latest sglang main build (sgl-project/sglang Actions run 25586829316,
head_sha 9ee83034, completed 2026-05-09 00:51 UTC). Pairs with the
dynamo bump in 9b06113 (commit 34d55a5, PR #9322 — DSV4
chat-template formatter rewrite).

Updated all 6 MTP recipe ``container:`` fields and the
``dsv4-fp4-gb300-dynamo-sglang-mtp`` matrix entry's ``image:`` in
nvidia-master.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Switch DSV4 MTP recipes to nixl KV transfer backend

The mooncake backend has a KV-transfer bug that produces wrong gsm8k
answers when prompts end on the `<think>` token (id 128821).
Empirically: same input on monolithic sglang gives correct answer,
mooncake-disagg gives wrong, nixl-disagg gives correct. Bug filed
upstream; using nixl as workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "Switch DSV4 MTP recipes to nixl KV transfer backend"

This reverts commit 3275282.

* Bump MTP recipes to sglang nightly with mooncake DSv4 fix

Picks up sgl-project/sglang#24878 (merged as c7f674e4),
which adds the missing dsv4 state_type branch to
MooncakeKVManager.maybe_send_extra. Combined with the prior
revert of #1297's nixl switch (commit daa6785), the mooncake
backend now correctly transfers DSv4's flat heterogeneous
state pool for both non-MTP and MTP runs.

Validated on GB300 1P+1D: comp_with_think.json (the prompt
ending on the literal `<think>` token that previously surfaced
the corruption) now returns the correct gsm8k Janet answer
(`#### 18`) on mooncake disagg, matching mono and the NIXL
control. MTP sa-bench delivers ~136 tok/s output throughput
(~1.7x non-MTP), confirming draft acceptance is working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gb300-cw: switch srt-slurm pin to NVIDIA/srt-slurm main (#144 merged)

NVIDIA/srt-slurm#144 (``sa-bench: make SGLangDeepseekV4Tokenizer
callable``) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin
that was only there while #144 was in review and pin to the upstream
merge commit instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gb300-cw: track NVIDIA/srt-slurm main instead of pinning a commit

Now that #144 is merged, there is no longer a need to pin a specific commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump MTP recipes to sglang nightly 20260510-2473659e

Picks up sgl-project/sglang main commit 2473659e (built via
upstream workflow run 25639473178).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: use shared gb300 dsv4 model path

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>