[CI] Stabilize Hybrid SSM disagg PD gsm8k accuracy test (#43301) #43570

Closed
haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix/43301-stabilize-disagg-gsm8k

@haosdent haosdent commented May 25, 2026

Proposal

Fixes #43301

The granite-4.0-h-tiny gsm8k drop reported in #43301 comes from bf16 cuBLAS lm_head ULP-tie variance on L4 (SM89): when the top-1 / top-2 logit gap falls below one bf16 ULP, the sampler's argmax flips on 1-2 prompts under TP=2. This PR stabilizes the test rather than touching lm_head: greedy decoding plus N=200 prompts averages the ULP-tie noise down to ~0.015, well within the strict RTOL=0.03.

Reverts #43186's threshold loosening. The expected value for Qwen3.5-0.8B changes from 0.33 to 0.36 to match its measurement under the new deterministic settings.

Test plan

Buildkite #67901 (passed) under the prior NUM_CONCURRENT=1 setting:

| config | measured | margin |
| --- | --- | --- |
| granite TP=1 | 0.790 | -0.010 |
| granite TP=2 | 0.775 | -0.025 |
| Qwen3.5-0.8B TP=1 | 0.355 | -0.005 |
| Qwen3.5-0.8B TP=2 | 0.370 | +0.010 |
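
The margin column is the measured accuracy minus the expected value (0.80 for granite, 0.36 for Qwen3.5-0.8B, as stated in the description). A minimal sketch of the corresponding check follows. It assumes RTOL is applied as an absolute bound on |measured - expected|; the test's actual assertion may be written differently, and the names `EXPECTED`, `MEASURED`, `margin`, and `within_tolerance` are hypothetical.

```python
RTOL = 0.03  # tolerance quoted in this PR

# Expected accuracies stated in the PR description.
EXPECTED = {"granite": 0.80, "Qwen3.5-0.8B": 0.36}

# (model, TP size, measured accuracy) rows from the test-plan table.
MEASURED = [
    ("granite", 1, 0.790),
    ("granite", 2, 0.775),
    ("Qwen3.5-0.8B", 1, 0.355),
    ("Qwen3.5-0.8B", 2, 0.370),
]


def margin(model: str, measured: float) -> float:
    """Signed distance of the measurement from the expected value."""
    return measured - EXPECTED[model]


def within_tolerance(model: str, measured: float, tol: float = RTOL) -> bool:
    # Assumption: the tolerance is an absolute bound on |measured - expected|.
    return abs(margin(model, measured)) < tol


for model, tp, measured in MEASURED:
    print(
        f"{model} TP={tp}: margin={margin(model, measured):+.3f} "
        f"ok={within_tolerance(model, measured)}"
    )
```

Under that assumption all four rows pass, with granite TP=2 the closest to the bound.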

Buildkite #67997 re-verifies with NUM_CONCURRENT=100 (current form).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the accuracy integration tests by switching to sequential, greedy decoding to ensure reproducible measurements and eliminate variance caused by concurrent batching and random sampling. Key changes include setting NUM_CONCURRENT to 1, increasing the GSM8K sample limit to 200, tightening the relative tolerance (RTOL) to 0.03, and updating the expected accuracy values for the Granite and Qwen models. Additionally, the test now explicitly logs measured values to facilitate tracking in CI environments. I have no feedback to provide as there were no review comments.

The granite-4.0-h-tiny gsm8k drop reported in vllm-project#43301 is caused by bf16
cuBLAS lm_head GEMM picking a different reduction-order kernel for the
per-rank vocab shard (TP=2) vs the full vocab (TP=1) on L4 SM89. When
top-1 / top-2 logit gap is below bf16 ULP, sampler argmax flips and a
small fraction of prompts generate different answers, dropping gsm8k
from 0.78 (TP=1) to 0.74 (TP=2) on the same hardware.
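
The tie mechanism described above can be illustrated without a GPU. The sketch below emulates bf16 rounding with the standard library and uses made-up logit values (not taken from the model): two logits closer than one bf16 ULP collapse to the same value, and a tiny perturbation, standing in for a different reduction order, changes which token wins the argmax. The helper names `to_bf16` and `argmax` and both logit lists are hypothetical.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of the float32 pattern; round the dropped half.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def argmax(values):
    """Index of the largest value; the first index wins on exact ties."""
    return max(range(len(values)), key=lambda i: values[i])


# In [8, 16) one bf16 ULP is 2**-4 = 0.0625, so these two logits collapse.
tie_a, tie_b = to_bf16(12.30), to_bf16(12.32)
print(tie_a, tie_b, tie_a == tie_b)

# Hypothetical logits for the same two tokens from two kernels whose
# accumulation order differs slightly (values chosen for illustration).
logits_run_a = [to_bf16(12.2815), to_bf16(12.2800)]
logits_run_b = [to_bf16(12.2805), to_bf16(12.2820)]
print(argmax(logits_run_a), argmax(logits_run_b))
```

The perturbations here are about 0.001, far below the 0.0625 ULP, yet they move a logit across a rounding boundary and change the selected token.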

Rather than introduce an fp32 lm_head fix (extra memory + latency for
all users to remove an L4-specific symptom), stabilize the test so it
is robust to the underlying ULP-level variance:

- gen_kwargs temperature=0/top_k=1: greedy decode, no temperature
  sampling noise.
- GSM8K_LIMIT=200: bf16 cuBLAS at lm_head argmax has ULP-tie sensitive
  prompts (~2-4% on math-reasoning where top-1/top-2 are both numerical
  tokens). On 50 prompts these flips contribute ~0.04 per-config
  variance; on 200 prompts they average to ~0.015, within RTOL=0.03.
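
Why a larger GSM8K_LIMIT helps can be seen in a simplified model. The sketch assumes a fixed fraction of prompts is tie-sensitive and that each such prompt flips between correct and incorrect independently with probability 0.5; both are assumptions, and the function `flip_std` is hypothetical. The absolute numbers it produces are not meant to reproduce the ~0.04 and ~0.015 figures above, which come from measurement; it only illustrates that the noise shrinks as 1/sqrt(N).

```python
import math


def flip_std(n_prompts: int, tie_fraction: float, flip_prob: float = 0.5) -> float:
    """Std dev of measured accuracy contributed by independent tie flips.

    Assumes n_prompts * tie_fraction prompts each flip correctness
    independently with probability flip_prob.
    """
    per_prompt_var = tie_fraction * flip_prob * (1.0 - flip_prob)
    return math.sqrt(per_prompt_var / n_prompts)


for tie_fraction in (0.02, 0.04):
    small = flip_std(50, tie_fraction)
    large = flip_std(200, tie_fraction)
    # The ratio is sqrt(200 / 50) = 2 regardless of tie_fraction.
    print(f"tie_fraction={tie_fraction}: N=50 -> {small:.4f}, "
          f"N=200 -> {large:.4f}, ratio={small / large:.2f}")
```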

This reverts vllm-project#43186's threshold loosening (granite expected 0.77 ->
0.80, RTOL 0.05 -> 0.03). Qwen/Qwen3.5-0.8B expected updated 0.33 ->
0.36 to match its measurement under the new deterministic settings.

Always-print [ACCURACY] line added so future tuning can read measured
values from CI logs without parsing assertion errors.
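
As a rough sketch of how such a line could be emitted and later read back from CI logs: only the `[ACCURACY]` prefix comes from this commit message; the field layout (`model=`, `measured=`, `expected=`) and the helpers `format_accuracy_line` and `parse_accuracy_lines` are hypothetical and may not match the format the test actually prints.

```python
import re

# Hypothetical line layout; only the "[ACCURACY]" prefix is from the PR.
ACCURACY_RE = re.compile(
    r"^\[ACCURACY\] model=(\S+) measured=([0-9.]+) expected=([0-9.]+)$"
)


def format_accuracy_line(model: str, measured: float, expected: float) -> str:
    """Build the line a test would print before asserting on the tolerance."""
    return f"[ACCURACY] model={model} measured={measured:.3f} expected={expected:.3f}"


def parse_accuracy_lines(log_text: str) -> dict:
    """Collect {model: (measured, expected)} from every matching log line."""
    results = {}
    for line in log_text.splitlines():
        match = ACCURACY_RE.match(line.strip())
        if match:
            model, measured, expected = match.groups()
            results[model] = (float(measured), float(expected))
    return results


log = "\n".join([
    "some unrelated CI output",
    format_accuracy_line("Qwen/Qwen3.5-0.8B", 0.355, 0.36),
])
print(parse_accuracy_lines(log))
```

Printing before the assertion means the measured value is in the log whether or not the check passes.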

Signed-off-by: haosdent <haosdent@gmail.com>

Development

Successfully merging this pull request may close these issues.

[Hybrid SSM] Investigate accuracy divergence between mamba_chunk_scan and selective_state_update kernels
