[CI] Stabilize Hybrid SSM disagg PD gsm8k accuracy test (#43301) by haosdent · Pull Request #43570 · vllm-project/vllm

haosdent · 2026-05-25T06:04:40Z

Proposal

The granite-4.0-h-tiny gsm8k drop reported in #43301 is caused by ULP-tie variance in the bf16 cuBLAS lm_head GEMM on L4 (SM89): when the top-1 / top-2 logit gap falls below one bf16 ULP, the sampler argmax flips on 1-2 prompts under TP=2. Stabilize the test rather than touch lm_head: greedy decoding plus N=200 prompts averages the ULP-tie noise down to ~0.015, well within the strict RTOL=0.03.
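To make the ULP-tie mechanism concrete, here is a minimal pure-Python sketch. It is not vLLM or cuBLAS code: `to_bf16`, `bf16_ulp`, `argmax`, and the two logit values are hypothetical, chosen only so that their gap (0.01) is smaller than the bf16 spacing near 12 (0.0625). Once both logits round to the same bf16 value, which token wins depends on tie-breaking or on tiny upstream differences such as reduction order.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Sketch only: NaN, infinity, and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent bfloat16 values near x (normal range only)."""
    _, exp = math.frexp(abs(x))  # abs(x) = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 1 - 7)  # bfloat16 stores 7 explicit mantissa bits


def argmax(values):
    """Index of the first maximum, i.e. a deterministic greedy pick."""
    return max(range(len(values)), key=values.__getitem__)


logits_fp32 = [12.01, 12.02]  # hypothetical top-2 logits, gap = 0.01
logits_bf16 = [to_bf16(v) for v in logits_fp32]

print(bf16_ulp(12.0))       # 0.0625
print(logits_bf16)          # [12.0, 12.0]
print(argmax(logits_fp32))  # 1
print(argmax(logits_bf16))  # 0
```

In higher precision the second token wins; after rounding to bf16 the two candidates are indistinguishable and the pick falls to the first index.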

Reverts #43186's threshold loosening. The Qwen3.5-0.8B expected value changes from 0.33 to 0.36 to match its measurement under the new deterministic settings.

Test plan

Buildkite #67901 (passed) under the prior NUM_CONCURRENT=1 setting:

| config | measured | margin (measured - expected) |
| --- | --- | --- |
| granite TP=1 | 0.790 | -0.010 |
| granite TP=2 | 0.775 | -0.025 |
| Qwen3.5-0.8B TP=1 | 0.355 | -0.005 |
| Qwen3.5-0.8B TP=2 | 0.370 | +0.010 |
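The margins above can be recomputed from the expected values stated in this PR (granite 0.80, Qwen3.5-0.8B 0.36). The sketch below is not the test's actual code: `within_tolerance` and `RUNS` are hypothetical names, and it assumes the gate compares the absolute difference against RTOL. That assumption is consistent with the granite TP=2 row passing, since a purely relative check (0.03 × 0.80 = 0.024) would reject a margin of -0.025.

```python
RTOL = 0.03  # tolerance named in the PR description

# (config, expected, measured). Expected values are the ones stated in this PR.
RUNS = [
    ("granite TP=1", 0.80, 0.790),
    ("granite TP=2", 0.80, 0.775),
    ("Qwen3.5-0.8B TP=1", 0.36, 0.355),
    ("Qwen3.5-0.8B TP=2", 0.36, 0.370),
]


def within_tolerance(measured: float, expected: float, tol: float = RTOL) -> bool:
    """Assumed gate: absolute difference strictly below the tolerance."""
    return abs(measured - expected) < tol


for name, expected, measured in RUNS:
    margin = round(measured - expected, 3)
    ok = within_tolerance(measured, expected)
    print(f"{name}: measured={measured:.3f} margin={margin:+.3f} pass={ok}")
```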

Buildkite #67997 re-verifies with NUM_CONCURRENT=100 (current form).

gemini-code-assist

Code Review

This pull request updates the accuracy integration tests by switching to sequential, greedy decoding to ensure reproducible measurements and eliminate variance caused by concurrent batching and random sampling. Key changes include setting NUM_CONCURRENT to 1, increasing the GSM8K sample limit to 200, tightening the relative tolerance (RTOL) to 0.03, and updating the expected accuracy values for the Granite and Qwen models. Additionally, the test now explicitly logs measured values to facilitate tracking in CI environments. I have no feedback to provide as there were no review comments.

The granite-4.0-h-tiny gsm8k drop reported in vllm-project#43301 is caused by bf16 cuBLAS lm_head GEMM picking a different reduction-order kernel for the per-rank vocab shard (TP=2) vs the full vocab (TP=1) on L4 SM89. When top-1 / top-2 logit gap is below bf16 ULP, sampler argmax flips and a small fraction of prompts generate different answers, dropping gsm8k from 0.78 (TP=1) to 0.74 (TP=2) on the same hardware.

Rather than introduce an fp32 lm_head fix (extra memory + latency for all users to remove an L4-specific symptom), stabilize the test so it is robust to the underlying ULP-level variance:

- gen_kwargs temperature=0/top_k=1: greedy decode, no temperature sampling noise.
- GSM8K_LIMIT=200: bf16 cuBLAS at lm_head argmax has ULP-tie sensitive prompts (~2-4% on math-reasoning where top-1/top-2 are both numerical tokens). On 50 prompts these flips contribute ~0.04 per-config variance; on 200 prompts they average to ~0.015, within RTOL=0.03.

This reverts vllm-project#43186's threshold loosening (granite expected 0.77 -> 0.80, RTOL 0.05 -> 0.03). Qwen/Qwen3.5-0.8B expected updated 0.33 -> 0.36 to match its measurement under the new deterministic settings.

Always-print [ACCURACY] line added so future tuning can read measured values from CI logs without parsing assertion errors.

Signed-off-by: haosdent <haosdent@gmail.com>
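One way to see why the larger GSM8K_LIMIT helps is score granularity: a single flipped prompt moves accuracy by 1/N. The sketch below does not reproduce the ~0.04 and ~0.015 estimates in the commit message; it only counts how many flips fit under the tolerance. It assumes, hypothetically, that the score starts exactly at the expected value, that every flip turns a correct answer into a wrong one, and that the gate uses a strict inequality. `max_flips_tolerated` is an illustrative name.

```python
from fractions import Fraction

RTOL = Fraction(3, 100)  # 0.03, kept exact to avoid float edge cases


def max_flips_tolerated(num_prompts: int, tol: Fraction) -> int:
    """Largest k such that k / num_prompts is strictly below tol."""
    k = 0
    while Fraction(k + 1, num_prompts) < tol:
        k += 1
    return k


for n in (50, 200):
    step = Fraction(1, n)  # accuracy change caused by one flipped prompt
    flips = max_flips_tolerated(n, RTOL)
    print(f"N={n}: one flip moves accuracy by {float(step):.3f}, "
          f"up to {flips} flip(s) stay under RTOL")
```

Under these assumptions a 50-prompt run tolerates a single flip, while a 200-prompt run tolerates five.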

haosdent requested review from ApostaC and orozery as code owners May 25, 2026 06:04

mergify Bot added v1 kv-connector labels May 25, 2026

gemini-code-assist Bot reviewed May 25, 2026


haosdent force-pushed the fix/43301-stabilize-disagg-gsm8k branch from faf2e64 to 7d09bf3 Compare May 25, 2026 06:06

haosdent closed this May 25, 2026

This was referenced Jun 6, 2026

[CI] Add opt-in statistically-calibrated lm-eval accuracy gate (Wilson lower bound) #44704

Draft

[RFC]: Automated baselining and degradation detection #34333

Open

haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix/43301-stabilize-disagg-gsm8k
