
sa-bench: make SGLangDeepseekV4Tokenizer callable#144

Merged
ishandhanani merged 2 commits into NVIDIA:main from
ch-wan:cwan/fix-sglang-dsv4-tokenizer-callable
May 9, 2026

Conversation

ch-wan (Contributor) commented May 8, 2026

Summary

  • Add __call__ and __getattr__ to SGLangDeepseekV4Tokenizer so it delegates to the wrapped HF tokenizer.
  • Without __call__, sa-bench's calculate_metrics (benchmark_serving.py:657, num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)) fails on multi-node MTP runs with TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable.
  • VLLMDeepseekV4Tokenizer already sidesteps this by returning the HF PreTrainedTokenizerFast subclass directly; the SGLang wrapper has to keep the sglang-specific apply_chat_template override, so explicit delegation is required.

Test plan

  • Multi-node DSv4-Pro SGLang MTP recipe completes a sa-bench run end-to-end (token counting at line 657 succeeds).
  • Eval-only path still passes through (lm-eval doesn't go through calculate_metrics).
  • apply_chat_template still wins via normal attribute lookup (defined on the class before __getattr__ runs).

🤖 Generated with Claude Code

sa-bench's `calculate_metrics` in benchmark_serving.py:657 counts
generated tokens with `tokenizer(text, add_special_tokens=False).input_ids`.
SGLangDeepseekV4Tokenizer is a wrapper around an HF tokenizer
(`self._hf`) but doesn't implement `__call__`, so a multi-node MTP
sa-bench run fails with:

  TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
    File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
      num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)

VLLMDeepseekV4Tokenizer sidesteps this by returning the HF
PreTrainedTokenizerFast subclass directly; the SGLang wrapper needs
explicit delegation since it must keep the sglang-specific
`apply_chat_template` override.

Add:
  __call__   -> delegate to self._hf so token-counting works.
  __getattr__ -> proxy any other attribute (encode, pad_token, ...)
                 through to the HF tokenizer so callers that expect a
                 full PreTrainedTokenizerFast API work without knowing
                 about this wrapper.

apply_chat_template is defined below and wins via normal attribute
lookup before __getattr__ runs.
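The delegation pattern the commit describes can be sketched as follows. This is an illustrative mock, not the PR's actual code: `FakeHFTokenizer` and its toy tokenization are invented stand-ins for the real HF `PreTrainedTokenizerFast`, while the wrapper mirrors the described `self._hf` delegation.

```python
class FakeHFTokenizer:
    """Invented stand-in for an HF PreTrainedTokenizerFast."""

    pad_token = "<pad>"

    class _Encoding:
        def __init__(self, ids):
            self.input_ids = ids

    def __call__(self, text, add_special_tokens=True):
        # Toy tokenization: one id per whitespace-separated word.
        return self._Encoding(list(range(len(text.split()))))


class SGLangDeepseekV4Tokenizer:
    def __init__(self, hf_tokenizer):
        self._hf = hf_tokenizer

    def __call__(self, *args, **kwargs):
        # Delegate so sa-bench's `tokenizer(text).input_ids` works.
        return self._hf(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so methods defined on
        # this class (e.g. apply_chat_template) still win.
        return getattr(self._hf, name)

    def apply_chat_template(self, messages, **kwargs):
        # sglang-specific override kept on the wrapper (simplified here).
        return " ".join(m["content"] for m in messages)


tok = SGLangDeepseekV4Tokenizer(FakeHFTokenizer())
num_tokens = len(tok("hello brave new world", add_special_tokens=False).input_ids)
```

Because `__getattr__` fires only after normal attribute lookup fails, `apply_chat_template` (defined on the class) shadows the HF version, while everything else (`pad_token`, `encode`, ...) falls through to the wrapped tokenizer.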
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 8, 2026
… custom_tokenizer

NVIDIA/srt-slurm#144 adds __call__ / __getattr__ to
SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics
(benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count
generated tokens for DSv4-Pro multi-node MTP runs without throwing
``TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable``.

Until that PR merges, pin gb300-cw's sglang launcher to
``ch-wan/srt-slurm @ c901ad38`` (the same fix), and restore
``custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer``
in the 6 MTP recipes. ``use_chat_template: true`` is required by
AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw
random tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sglang main 2cf1a4ab renamed SGLANG_ENABLE_THINKING ->
SGLANG_DEFAULT_THINKING and SGLANG_REASONING_EFFORT ->
SGLANG_DSV4_REASONING_EFFORT. Read the new names primarily, with the
old deprecated names as fallback for backward compatibility.
ishandhanani merged commit 0cbc7eb into NVIDIA:main May 9, 2026
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 10, 2026
NVIDIA/srt-slurm#144 (``sa-bench: make SGLangDeepseekV4Tokenizer
callable``) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin
that was only there while #144 was in review and pin to the upstream
merge commit instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx pushed a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 11, 2026
…g MTP disagg benchmarks (#1297)

* add mtp configs

* Add sbatch_directives to MTP recipes (root-cause fix)

Without `cpus-per-task: 144` and `mem: 0`, slurm hands out 1 CPU and
~4 MB per task, and the dynamo cold source build (~500 rust crates)
is OOM-killed before any worker comes up. Manifests as
`Sweep failed (exit code: 137)` ~30 s after orchestrator start.

Mirrors the block already present in the working main 8k1k recipes
(e.g. disagg-gb300-1p1d-tp4-tp4-2-c1.yaml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Change deepgemm flags

* Move MTP recipes up to 8k1k/ with -mtp filename suffix

Mirrors the convention used elsewhere in the repo: per-config files at
the same depth as their non-MTP siblings, distinguished only by the
-mtp suffix. CONFIG_FILE references in nvidia-master.yaml updated
accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix

* Drop custom_tokenizer from MTP recipes — incompatible with sa-bench

sa-bench's calculate_metrics calls `tokenizer(text)` to count output
tokens, but `sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer`
doesn't implement __call__:

  TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
    File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
      num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)

This is the actual cause of the benchmark-task failures while eval-only
tasks succeed (lm-eval doesn't go through this path).

Removing custom_tokenizer falls back to AutoTokenizer.from_pretrained(/model).
The chat_template is stored in the model's tokenizer_config.json, so
`use_chat_template: true` continues to apply via the HF tokenizer
(required for MTP correctness per AGENTS.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin srt-slurm to fork w/ SGLangDeepseekV4Tokenizer callable + restore custom_tokenizer

NVIDIA/srt-slurm#144 adds __call__ / __getattr__ to
SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics
(benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count
generated tokens for DSv4-Pro multi-node MTP runs without throwing
``TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable``.

Until that PR merges, pin gb300-cw's sglang launcher to
``ch-wan/srt-slurm @ c901ad38`` (the same fix), and restore
``custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer``
in the 6 MTP recipes. ``use_chat_template: true`` is required by
AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw
random tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump sglang container to nightly-dev-cu13-20260508-2cf1a4ab (latest main)

Pinned to the multi-arch image produced by sgl-project/sglang Build and
Push Development Docker Images run #25574279419 (head_sha 2cf1a4ab,
HEAD of sglang main). Replaces the older staging image
lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev (May 7).

The nightly-dev-cu13 image carries the full sglang main as of 2026-05-08
21:06 UTC, including upstream fixes since the May-7 staging snapshot.
Multi-arch manifest covers amd64 + arm64, so it works on the gb300
(Grace) compute nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Restore base dsv4-fp4-gb300-dynamo-sglang image to staging tag

The previous commit accidentally bumped the non-MTP base entry's
image too. The base 8k1k recipes still pin
``container: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev``,
and the launcher requires the matrix's ``image:`` to match the
recipe's ``container:`` (it templates ``\"\${IMAGE}\": \${SQUASH_FILE}``
into srtslurm.yaml). Mismatching them would break the base sweep.

Only the dsv4-fp4-gb300-dynamo-sglang-mtp entry needs the
nightly-dev-cu13 bump (paired with the MTP recipe ``container:``
field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin MTP recipes to dynamo 81d0555e (matches working base recipes)

The 6 MTP recipes were imported with dynamo hash 9d3c913d from the
upstream srt-slurm fork, but the working non-MTP base recipes already
on this branch use 81d0555ee23519cea80a42b4fe824e30368b7300 — paired
with the sglang nightly cu13 main builds.

The 9d3c913d wheel is incompatible with sglang main 2cf1a4ab: the
decode scheduler subprocess (rank 0) is SIGQUIT'd during sgl.Engine()
init at dynamo.sglang.init_llm:77, surfacing as "Rank 0 scheduler died
during initialization (exit code: -3)" in CI run 25580956722.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Explicitly disable CAR_V2 in multi-node decode MTP recipes

The 4 multi-node decode MTP recipes had a comment saying
SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 was "intentionally NOT set", but
sglang main 2cf1a4ab defaults this on. CAR_V2 is single-node only,
and on multi-node decode it silently fails to construct its backing
``self.obj``, then segfaults during cuda graph capture:

  AttributeError: 'CustomAllReduceV2' object has no attribute 'obj'
    at custom_all_reduce_v2.py:97 in capture()

The scheduler is SIGQUIT'd, surfacing as
"Rank 0 scheduler died during initialization (exit code: -3)" in
dynamo's wrapper. Explicitly setting the env to "0" matches the
intent of the pre-existing comment.

Affects: dep4-dep8, dep4-dep16, 2p1d-dep4-dep8, 4p1d-dep4-dep8.
Single-node decode recipes (1p1d-tp4-tp4, 1p6d-dep4-tp4) keep the
default since CAR_V2 works in single-node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Explicitly disable CAR_V2 in 8k1k base decode recipes too

Apply the same explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"``
to the existing 8k1k base decode recipes that had only the
``intentionally NOT set`` comment. The MTP fix in 6d28994 proved the
comment-only pattern is brittle: sglang main 2cf1a4ab defaults the
env on, and CAR_V2 segfaults during cuda graph capture on multi-node
decode. Make the disable explicit so a future image bump on the base
sweep can't regress the same way.

Affects 6 recipes: 1p1d-tp4-tp4-2-c1, 1p1d-dep4-dep16-5-c1024,
4p1d-dep4-dep16-8-c1024, 8p1d-dep4-dep16-12-c4096,
10p1d-dep4-dep16-14-c8192, 12p1d-dep4-dep12-15-c21504.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Set both old and new sglang thinking/reasoning env vars in MTP recipes

sglang main 2cf1a4ab moved ``SGLANG_ENABLE_THINKING`` →
``SGLANG_DEFAULT_THINKING`` and ``SGLANG_REASONING_EFFORT`` →
``SGLANG_DSV4_REASONING_EFFORT``. The deprecation helper
``_print_deprecated_env`` (environ.py:642) only emits a warning — it
does NOT propagate the value to the new name. So the old env vars
were silently ignored: server defaulted to non-thinking mode with
empty reasoning effort, dropping GSM8K accuracy from ~95% to ~40%
(eval_results_all from run 25583345967: em_strict=0.4291 for
1p6d-dep4-tp4 conc=64, 0.4056 for 4p1d-dep4-dep8 conc=1024).

Set both names in prefill_environment and decode_environment of all
six MTP recipes:
  * old names — read by the sa-bench client tokenizer
    (sa_bench_tokenizers.sglang_deepseek_v4) for prompt-rendering
    parity with the server.
  * new names — read by the sglang server in 2cf1a4ab+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Set tool-call-parser=deepseekv4 to enable DSV4 chat encoding (gsm8k regression fix)

GSM8K accuracy on the latest sweep dropped from the expected ~95% to
~40% (em_strict=0.4291 for 1p6d-dep4-tp4 conc=64; 0.4056 for
4p1d-dep4-dep8 conc=1024 — run 25583345967 eval_results_all).

Inspecting samples_gsm8k_*.jsonl revealed every response was prefixed
with junk like "Weapon:" / "Weaponized" / "We黑白颠倒", and the
reasoning often answered a different question than what was asked —
classic symptom of a malformed chat-template prompt.

Root cause in sglang main 2cf1a4ab
(entrypoints/openai/serving_chat.py:296):

    def _resolve_chat_encoding_spec(self) -> Optional[str]:
        if self.tool_call_parser == "deepseekv4":
            return "dsv4"
        if self.tool_call_parser == "deepseekv32":
            return "dsv32"

The dsv4 chat-encoding spec — which routes DSV4 prompts through
``encoding_dsv4.encode_messages`` with thinking-mode and
reasoning-effort handling — only activates when
``--tool-call-parser deepseekv4`` is set. Without it the server falls
back to the vanilla HF chat template (``apply_chat_template``), which
doesn't know about DSV4's special tokens, ``<think>`` blocks, or the
``thinking_mode`` argument. The MTP recipes never set this flag, so
ServerArgs reports ``tool_call_parser=None`` and the model receives a
malformed prompt.

Add ``tool-call-parser: deepseekv4`` to both prefill and decode
``sglang_config`` blocks in all 6 MTP recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
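The gating this commit describes can be reproduced as a standalone sketch of the quoted `_resolve_chat_encoding_spec` logic (the free function below is a simplified rewrite for illustration, not sglang's actual method): with `tool_call_parser=None` the function returns `None`, which is why the server falls back to the vanilla HF chat template.

```python
from typing import Optional


def resolve_chat_encoding_spec(tool_call_parser: Optional[str]) -> Optional[str]:
    # Mirrors the branching quoted above from serving_chat.py:296.
    if tool_call_parser == "deepseekv4":
        return "dsv4"
    if tool_call_parser == "deepseekv32":
        return "dsv32"
    # No spec: the server falls back to apply_chat_template.
    return None
```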

* Revert CAR_V2 explicit-disable in non-MTP base 8k1k recipes

Restore the 6 base recipes to their state on origin/main; the
explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: \"0\"`` was added
defensively in 9c4c244, but the base sweep is happy on its current
staging-dev image and shouldn't be touched in this PR.

Reverts files:
  disagg-gb300-1p1d-tp4-tp4-2-c1.yaml
  disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml
  disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml
  disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml
  disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml
  disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Trim verbose comments and drop deprecated env var names in MTP recipes

- Drop ``SGLANG_ENABLE_THINKING`` / ``SGLANG_REASONING_EFFORT``
  (deprecated since sglang main 2cf1a4ab); keep only the new names
  ``SGLANG_DEFAULT_THINKING`` / ``SGLANG_DSV4_REASONING_EFFORT``.
- Bump the srt-slurm fork pin to 51847632 so the sa-bench client
  tokenizer reads the new env names (with old names as fallback).
- Trim multi-line block comments down to one-line tail comments
  for the CAR_V2 disable and ``tool-call-parser: deepseekv4`` flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert MTP recipes to staging-dev container (gsm8k accuracy fix)

The bump to ``lmsysorg/sglang:nightly-dev-cu13-20260508-2cf1a4ab``
introduced an MTP-path accuracy regression: gsm8k em_strict dropped
from the expected ~0.93 to ~0.42 (run 25585766931 eval_results_all
shows 0.4200 for 4p1d-dep4-dep8 conc=1024). Local repro on the
cluster: the failed 5-shot prompt sent through plain sglang chat
completion returns the correct answer; through the dynamo+nightly
pipeline it returns garbage prefixed with junk tokens.

Restore the same staging-dev container the base
``dsv4-fp4-gb300-dynamo-sglang`` sweep already runs on. Drop the
dependent flags that
only existed because of the nightly bump:

- container: nightly-dev-cu13-20260508-2cf1a4ab → sglang-staging:
  deepseek-v4-grace-blackwell-dev (matches the matrix entry's image)
- ``tool-call-parser: deepseekv4`` removed (the chat-encoding-spec
  routing it gated on doesn't exist in staging-dev; HF chat_template
  handles DSV4 prompts directly via dynamo's native Rust formatter).
- Env vars reverted to ``SGLANG_ENABLE_THINKING`` /
  ``SGLANG_REASONING_EFFORT`` (the names staging-dev recognizes).
- nvidia-master.yaml MTP entry image updated to match.

The dynamo hash, srt-slurm fork pin, sbatch_directives, and
multi-node CAR_V2 disable all stay (still required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump dynamo hash to 34d55a5 to fix DSV4 chat-template formatter

Local repro on the cluster (job 2226, slurm-gb300-139-009) confirmed
the regression is in dynamo's wrapper, not in sglang main:

  - sglang main (2cf1a4ab) standalone, same failed 5-shot:
    prompt_tokens=1128, answer=18 (correct).
  - sglang main + dynamo 81d0555e (CI):
    answer="Weapon:#### 16" (em_strict=0.42).

The pinned dynamo at 81d0555e ships an older Rust DSV4 prompt
formatter whose ``render()`` always calls ``encode_messages(...)`` —
which hardcodes ``reasoning_effort=None`` and ignores
``chat_template_kwargs`` entirely. That produces a prompt the model
fails on under MTP.

Dynamo PR #9322 (commit 34d55a5, "Deduplicate DeepSeek prompt
encoders v3.2 and v4") rewrote ``render()`` to read
``reasoning_effort`` and ``drop_thinking`` from
``chat_template_args`` and plumb them into
``encode_messages_with_options``, fixing the DSV4 prompt rendering.

Restore the changes the staging-dev revert had to undo:

  - container: nightly-dev-cu13-20260508-2cf1a4ab
  - tool-call-parser: deepseekv4 (gates the dsv4 chat-encoding spec)
  - SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT
  - dynamo.hash 81d0555e -> 34d55a5
  - nvidia-master.yaml MTP entry image

CAR_V2 disable on multi-node decode and the srt-slurm fork pin
remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump sglang container to nightly-dev-cu13-20260509-9ee83034

Latest sglang main build (sgl-project/sglang Actions run 25586829316,
head_sha 9ee83034, completed 2026-05-09 00:51 UTC). Pairs with the
dynamo bump in 9b06113 (commit 34d55a5, PR #9322 — DSV4
chat-template formatter rewrite).

Updated all 6 MTP recipe ``container:`` fields and the
``dsv4-fp4-gb300-dynamo-sglang-mtp`` matrix entry's ``image:`` in
nvidia-master.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Switch DSV4 MTP recipes to nixl KV transfer backend

The mooncake backend has a KV-transfer bug that produces wrong gsm8k
answers when prompts end on the `<think>` token (id 128821).
Empirically: same input on monolithic sglang gives correct answer,
mooncake-disagg gives wrong, nixl-disagg gives correct. Bug filed
upstream; using nixl as workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "Switch DSV4 MTP recipes to nixl KV transfer backend"

This reverts commit 3275282.

* Bump MTP recipes to sglang nightly with mooncake DSv4 fix

Picks up sgl-project/sglang#24878 (merged as c7f674e4),
which adds the missing dsv4 state_type branch to
MooncakeKVManager.maybe_send_extra. Combined with the prior
revert of #1297's nixl switch (commit daa6785), the mooncake
backend now correctly transfers DSv4's flat heterogeneous
state pool for both non-MTP and MTP runs.

Validated on GB300 1P+1D: comp_with_think.json (the prompt
ending on the literal `<think>` token that previously surfaced
the corruption) now returns the correct gsm8k Janet answer
(`#### 18`) on mooncake disagg, matching mono and the NIXL
control. MTP sa-bench delivers ~136 tok/s output throughput
(~1.7x non-MTP), confirming draft acceptance is working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gb300-cw: switch srt-slurm pin to NVIDIA/srt-slurm main (#144 merged)

NVIDIA/srt-slurm#144 (``sa-bench: make SGLangDeepseekV4Tokenizer
callable``) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin
that was only there while #144 was in review and pin to the upstream
merge commit instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gb300-cw: track NVIDIA/srt-slurm main instead of pinning a commit

Now that #144 is merged, there is no longer a need to pin a specific commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump MTP recipes to sglang nightly 20260510-2473659e

Picks up sgl-project/sglang main commit 2473659e (built via
upstream workflow run 25639473178).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: use shared gb300 dsv4 model path

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>