[Model Runner v2] Fix NixlConnector PD + Spec Decode acceptance (2 GPUs) issue by yewentao256 · Pull Request #44227 · vllm-project/vllm

yewentao256 · 2026-06-01T15:29:04Z

Purpose

VLLM_USE_V2_MODEL_RUNNER=1 bash tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh

Originally

=========================================================================== FAILURES ===========================================================================
______________________________________________________________ test_spec_decode_acceptance_length ______________________________________________________________

    def test_spec_decode_acceptance_length():
        """Validate PD+SD acceptance length against standalone baseline.

        Sends MT-Bench prompts through the PD proxy (completions API),
        then checks that the decode server's speculative decoding metrics
        match the known standalone baselines.
        """
        config = _get_model_config()
        rtol = config.rtol if config.rtol is not None else DEFAULT_RTOL

        prompts = _get_mt_bench_prompts()
        assert len(prompts) == DEFAULT_NUM_PROMPTS, (
            f"Expected {DEFAULT_NUM_PROMPTS} prompts, got {len(prompts)}"
        )

        client = openai.OpenAI(api_key="EMPTY", base_url=PROXY_BASE_URL)
        for i, prompt in enumerate(prompts):
            resp = client.completions.create(
                model=MODEL_NAME,
                prompt=prompt,
                max_tokens=DEFAULT_OUTPUT_LEN,
                temperature=0.0,
                top_p=1.0,
            )
            if i < 3:
                text = resp.choices[0].text.strip()[:100]
                print(f"  [{i}] {prompt[:60]}... -> {text}...")

        # ── Extract metrics from decode server ────────────────────────────
        n_drafts = _fetch_metric("vllm:spec_decode_num_drafts_total")
        n_accepted = _fetch_metric("vllm:spec_decode_num_accepted_tokens_total")

        assert n_drafts > 0, "No spec-decode drafts were generated"

        acceptance_length = 1 + (n_accepted / n_drafts)
        expected = config.expected_acceptance_length

        print(
            f"\n{config.id}: acceptance_length={acceptance_length:.3f} "
            f"(expected={expected:.3f})"
        )
        print(f"  Drafts: {n_drafts:.0f}, Accepted: {n_accepted:.0f}")

        # ── Assert acceptance length (all methods) ────────────────────────
        rel_error = abs(acceptance_length - expected) / expected
        assert rel_error <= rtol, (
            f"Acceptance length regression for {config.id}! "
            f"Expected: {expected:.3f}, "
            f"Got: {acceptance_length:.3f}, "
            f"Relative error: {rel_error:.2%} (tolerance: {rtol:.0%})"
        )

        # ── Assert per-position acceptance (EAGLE3) ───────────────────────
        if config.expected_acceptance_lengths_per_pos:
            per_pos_counts = _fetch_per_position_acceptance()
            per_pos_rates = [
                per_pos_counts.get(i, 0) / n_drafts
                for i in range(len(config.expected_acceptance_lengths_per_pos))
            ]
            for i, (actual, exp) in enumerate(
                zip(per_pos_rates, config.expected_acceptance_lengths_per_pos)
            ):
                print(f"  Position {i}: {actual:.4f} (expected: {exp:.4f})")
                if exp > 0:
                    pos_err = abs(actual - exp) / exp
>                   assert pos_err <= rtol, (
                        f"Per-position regression at pos {i} for {config.id}! "
                        f"Expected: {exp:.4f}, Got: {actual:.4f}, "
                        f"Relative error: {pos_err:.2%} (tolerance: {rtol:.0%})"
                    )
E                   AssertionError: Per-position regression at pos 2 for llama3-8b-eagle3! Expected: 0.3545, Got: 0.3360, Relative error: 5.21% (tolerance: 5%)
E                   assert 0.05210165713168825 <= 0.05

v1/kv_connector/nixl_integration/test_spec_decode_acceptance.py:203: AssertionError
======================================================================= warnings summary =======================================================================

FAILED v1/kv_connector/nixl_integration/test_spec_decode_acceptance.py::test_spec_decode_acceptance_length - AssertionError: Per-position regression at pos 2 for llama3-8b-eagle3! Expected: 0.3545, Got: 0.3360, Relative error: 5.21% (tolerance: 5%)
  assert 0.05210165713168825 <= 0.05
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed, 22 warnings in 160.03s (0:02:40) ==========================================================
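
For readers who want to see how narrowly the check failed, the per-position assertion can be replayed by hand from the numbers in the log. This is a standalone sketch, not the test's own code: the helper name `relative_error` is hypothetical, and the inputs are the four-decimal values printed in the failure message, so the result differs slightly from the unrounded 5.21% that pytest reports.

```python
def relative_error(actual: float, expected: float) -> float:
    """Relative deviation of `actual` from `expected`, as in the per-position check."""
    return abs(actual - expected) / expected


RTOL = 0.05  # tolerance shown in the log ("tolerance: 5%")

# Position 2 for llama3-8b-eagle3, using the rounded values from the log.
pos_err = relative_error(actual=0.3360, expected=0.3545)
print(f"relative error: {pos_err:.2%}, within tolerance: {pos_err <= RTOL}")
```

The measured rate only needs to fall a little under 5% below the baseline for the assertion to trip, which is why a small numerical disturbance is enough to fail this test.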

This PR fixes the issue

Now

================= 1 passed, 16 warnings in 138.72s (0:02:18) ==================

NickLucche

Hey @yewentao256.
We actually discussed this recently in #43733.
What we fixed there was partly due to num_computed_tokens==0 being too generic a check.
When the async load request is first scheduled, it should have num_computed_tokens=0, but we should also query the connector as to whether that request should be loaded.
If that request is going to be loaded, it should not load the drafter's KV.

So I don't quite understand why this is failing right now.
More importantly, this should be failing for V1 just as well.

Some more context in #43996.

njhill · 2026-06-02T19:59:08Z

@NickLucche Actually I spent a while digging into the failure. It turned out to be two different bugs contributing to the acceptance rate drop:

1. Double-BOS, which presumably mismatched the baseline, fixed in "[Test][BugFix] Fix double-BOS in PD+specdec acceptance test" (#44234). This affected both MRV1 and MRV2 but doesn't quite breach the 5% tolerance.
2. The lookahead cache-trimming issue: this also corrupts the KV cache for both MRV1 and MRV2, but in the MRV1 case it involved fresh zeroed blocks, so the invalid KV cache did not have a huge numerical impact. In the MRV2 case we actually run a warmup at startup which uses some KV blocks, so they were no longer zeroed when this test reused them.

The larger impact of (2) in the MRV2 case was enough to breach the test tolerance (the fix for (1) was also enough by itself to fix this particular test failure, but the more subtle KV cache corruption should still be fixed of course!).
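
As a rough intuition for why zeroed blocks hide the problem while reused blocks expose it, the toy example below attends over one extra, invalid KV slot. It is plain Python and not vLLM code: the `attend` helper, the vectors, and the "stale" values are all made up for illustration, and how the invalid slots actually reach the drafter's attention is not shown in this thread.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention over small Python lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


query = [1.0, 0.5]
keys = [[0.9, 0.1], [0.2, 0.8]]
values = [[1.0, 0.0], [0.0, 1.0]]

clean = attend(query, keys, values)
# One invalid slot that is still all zeros (fresh block).
zeroed = attend(query, keys + [[0.0, 0.0]], values + [[0.0, 0.0]])
# One invalid slot holding leftover data (arbitrary values standing in for warmup residue).
stale = attend(query, keys + [[3.0, 3.0]], values + [[5.0, -5.0]])

drift_zeroed = max_abs_diff(clean, zeroed)
drift_stale = max_abs_diff(clean, stale)
print(f"zeroed slot drift: {drift_zeroed:.3f}, stale slot drift: {drift_stale:.3f}")
```

In this toy setup the zeroed slot still perturbs the output a little, because it takes a share of the softmax weight, but the leftover slot dominates the result. That matches the qualitative picture above: both cases are wrong, only one of them is loud.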

NickLucche · 2026-06-03T17:06:06Z

Thanks for the work and for the breakdown @njhill!

"run a warmup at startup which uses some kv blocks and so they were no longer zeroed when this test reused"

This thing looks particularly nasty.

So, the case that, in the fix as proposed here, makes you ditch load_kv_async in favor of request.num_computed_tokens == 0 is the warmup bit, where the former condition wouldn't trigger?

yewentao256

@NickLucche Thanks! I took a deeper look: load_kv_async may not work for the prefill node, so I have narrowed the condition.

NickLucche

I am sorry but this looks messier than before.

    is_pd_prefill_producer = (  # is this P?
        request.num_computed_tokens == 0
        and request.kv_transfer_params is not None
        and request.kv_transfer_params.get("do_remote_decode", False)  # ? This is True on P
    )

Also it won't trigger for requests with partial prefix cache hits. num_computed_tokens==0 was meant to be used (perhaps confusingly) to detect load requests on D.
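
The prefix-cache concern can be reproduced in isolation. The sketch below copies the condition from the snippet above onto a hypothetical `RequestSketch` dataclass that carries only the two fields the condition reads. It assumes that a partial prefix cache hit shows up as a non-zero `num_computed_tokens` when the request is first seen, and the value 16 is arbitrary.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RequestSketch:
    """Hypothetical stand-in for a request; not the real vLLM class."""

    num_computed_tokens: int
    kv_transfer_params: Optional[dict[str, Any]] = None


def is_pd_prefill_producer(request: RequestSketch) -> bool:
    """Same three-part condition as in the reviewed snippet."""
    return (
        request.num_computed_tokens == 0
        and request.kv_transfer_params is not None
        and bool(request.kv_transfer_params.get("do_remote_decode", False))
    )


params = {"do_remote_decode": True}
cold = RequestSketch(num_computed_tokens=0, kv_transfer_params=params)
# Assumed shape of a partial prefix cache hit: some tokens already count as computed.
warm = RequestSketch(num_computed_tokens=16, kv_transfer_params=params)

print(is_pd_prefill_producer(cold), is_pd_prefill_producer(warm))
```

Both requests are headed for remote decode, yet only the cold one is classified as a prefill producer, which is the gap being pointed out.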

For P (or is_pd_prefill_producer) I would still prefer the fix proposed in #43996: on P (identified by its role) we set self.num_lookahead_tokens to 0 and skip sampling. I can put a PR up ASAP.

If the PD test is still blocking your development (@yewentao256 can you double check after @njhill's changes?) could you relax the acceptance rate for the time being?

This is a part of the code where I would like to keep complexity as low as possible, because every time I come back to it, it takes me a while to load the whole context back 😅

yewentao256

@NickLucche ok, feel free to directly submit your PR.

Fix NixlConnector PD + Spec Decode acceptance (2 GPUs) issue

a76c77e

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners June 1, 2026 15:29

mergify bot added the v1 and kv-connector labels Jun 1, 2026

fix dflash

39494b0

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jun 1, 2026

Merge branch 'main' into wentao-fix-NixlConnector-PD-+-Spec-Decode-acceptance-(2-GPUs)

eddbdc0

NickLucche reviewed Jun 1, 2026

Merge branch 'main' into wentao-fix-NixlConnector-PD-+-Spec-Decode-acceptance-(2-GPUs)

7c935ad

yewentao256 added 2 commits June 3, 2026 18:14

update to prefill node only

61b3b6b

Signed-off-by: yewentao256 <zhyanwentao@126.com>

Merge branch 'main' into wentao-fix-NixlConnector-PD-+-Spec-Decode-acceptance-(2-GPUs)

1aa72fa

yewentao256 commented Jun 3, 2026

NickLucche requested changes Jun 4, 2026

yewentao256 commented Jun 4, 2026

yewentao256 closed this Jun 4, 2026

yewentao256 wants to merge 6 commits into main from wentao-fix-NixlConnector-PD-+-Spec-Decode-acceptance-(2-GPUs)
