[CI] Stabilize Hybrid SSM disagg PD gsm8k accuracy test (#43301) #43570
Closed
haosdent wants to merge 1 commit into
Code Review
This pull request updates the accuracy integration tests by switching to sequential, greedy decoding to ensure reproducible measurements and eliminate variance caused by concurrent batching and random sampling. Key changes include setting NUM_CONCURRENT to 1, increasing the GSM8K sample limit to 200, tightening the relative tolerance (RTOL) to 0.03, and updating the expected accuracy values for the Granite and Qwen models. Additionally, the test now explicitly logs measured values to facilitate tracking in CI environments. I have no feedback to provide as there were no review comments.
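The "always log the measured value" pattern mentioned in the summary could look roughly like the sketch below. The function name `check_accuracy`, the exact `[ACCURACY]` line layout, and the choice of an absolute tolerance band are assumptions for illustration; the actual test code is not shown on this page.

```python
def check_accuracy(model: str, measured: float, expected: float, rtol: float) -> bool:
    """Log the measured value unconditionally, then report whether it is in tolerance."""
    # Print before any assertion so the number reaches the CI log even when the test passes.
    print(
        f"[ACCURACY] model={model} measured={measured:.4f} "
        f"expected={expected:.4f} rtol={rtol}"
    )
    # Assumption: the tolerance is an absolute band around the expected value.
    return abs(measured - expected) <= rtol


if __name__ == "__main__":
    # Hypothetical values, not taken from any CI run.
    assert check_accuracy("example-model", measured=0.79, expected=0.80, rtol=0.03)
```

Logging before asserting means a passing run still leaves a record, so thresholds can be retuned later from the logs alone.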
The granite-4.0-h-tiny gsm8k drop reported in vllm-project#43301 is caused by bf16 cuBLAS lm_head GEMM picking a different reduction-order kernel for the per-rank vocab shard (TP=2) vs the full vocab (TP=1) on L4 SM89. When top-1 / top-2 logit gap is below bf16 ULP, sampler argmax flips and a small fraction of prompts generate different answers, dropping gsm8k from 0.78 (TP=1) to 0.74 (TP=2) on the same hardware.

Rather than introduce an fp32 lm_head fix (extra memory + latency for all users to remove an L4-specific symptom), stabilize the test so it is robust to the underlying ULP-level variance:

- gen_kwargs temperature=0/top_k=1: greedy decode, no temperature sampling noise.
- GSM8K_LIMIT=200: bf16 cuBLAS at lm_head argmax has ULP-tie sensitive prompts (~2-4% on math-reasoning where top-1/top-2 are both numerical tokens). On 50 prompts these flips contribute ~0.04 per-config variance; on 200 prompts they average to ~0.015, within RTOL=0.03.

This reverts vllm-project#43186's threshold loosening (granite expected 0.77 -> 0.80, RTOL 0.05 -> 0.03). Qwen/Qwen3.5-0.8B expected updated 0.33 -> 0.36 to match its measurement under the new deterministic settings.

Always-print [ACCURACY] line added so future tuning can read measured values from CI logs without parsing assertion errors.

Signed-off-by: haosdent <haosdent@gmail.com>
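The reduction-order effect described in the commit message can be illustrated with a toy example. This is only an analogy: it uses Python float64 adds rather than bf16 cuBLAS kernels, and the values, the names `reduce_in_order` and `argmax`, and the "kernel A / kernel B" labels are all made up for the sketch. It shows that summing the same terms in a different order can change the last bit of a logit, which is enough to flip an argmax when two logits are nearly tied.

```python
def reduce_in_order(terms):
    """Accumulate left to right with plain float adds (no compensated summation)."""
    total = 0.0
    for t in terms:
        total += t
    return total


def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


# Partial products feeding one logit; the competing logit is held fixed.
terms = [0.1, 0.2, 0.3]
rival = 0.6

logits_a = [rival, reduce_in_order(terms)]            # order used by hypothetical "kernel A"
logits_b = [rival, reduce_in_order(reversed(terms))]  # order used by hypothetical "kernel B"

# Same inputs, different accumulation order, different greedy token.
print(argmax(logits_a), argmax(logits_b))  # prints "1 0"
```

Greedy decoding removes sampling randomness but not this kind of flip, which is why the change also raises the prompt count instead of relying on temperature=0 alone.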
faf2e64 to 7d09bf3
This was referenced Jun 6, 2026
Proposal
Fixes #43301
The granite-4.0-h-tiny gsm8k drop reported in #43301 is bf16 cuBLAS lm_head ULP-tie variance on L4 SM89 (top-1 / top-2 logit gap below bf16 ULP, sampler argmax flips on 1-2 prompts under TP=2). Stabilize the test rather than touch lm_head: greedy decode + N=200 averages out the ULP-tie noise to ~0.015, well within strict RTOL=0.03.
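As a rough aid for reading the N=200 and RTOL=0.03 choice, the sketch below computes how much a single flipped prompt moves the accuracy and how many such flips fit inside the tolerance. It assumes the tolerance is applied as an absolute band on the accuracy, which this page does not confirm, and it does not reproduce the ~0.04 and ~0.015 estimates quoted in the PR. The helper name `flips_within_band` is hypothetical.

```python
import math


def flips_within_band(n_prompts: int, tol: float) -> int:
    """Number of single-prompt flips that fit inside an absolute band of width tol."""
    # Each prompt contributes 1/n_prompts to accuracy, so tol covers tol * n_prompts flips.
    # The small epsilon guards against float rounding just below an integer.
    return math.floor(tol * n_prompts + 1e-9)


for n in (50, 200):
    step = 1.0 / n  # accuracy change caused by one flipped prompt
    print(f"N={n}: step={step:.3f}, flips within 0.03: {flips_within_band(n, 0.03)}")
```

Under that assumption one flipped prompt moves the score by 0.02 at N=50 but only 0.005 at N=200, so the 0.03 band tolerates one flip in the first case and six in the second.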
Reverts #43186's threshold loosening. Qwen3.5-0.8B expected 0.33 → 0.36 to match its measurement under the new deterministic settings.
Test plan
Buildkite #67901 (passed) under the prior NUM_CONCURRENT=1 setting:
Buildkite #67997 re-verifies with NUM_CONCURRENT=100 (current form).