[Model Runner v2] Fix NixlConnector PD + Spec Decode acceptance (2 GPUs) issue #44227

Closed
yewentao256 wants to merge 6 commits into main from wentao-fix-NixlConnector-PD-+-Spec-Decode-acceptance-(2-GPUs)

Conversation

@yewentao256
Member

@yewentao256 yewentao256 commented Jun 1, 2026

Purpose

Part of #41286

VLLM_USE_V2_MODEL_RUNNER=1 bash tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh

Originally

=========================================================================== FAILURES ===========================================================================
______________________________________________________________ test_spec_decode_acceptance_length ______________________________________________________________

    def test_spec_decode_acceptance_length():
        """Validate PD+SD acceptance length against standalone baseline.

        Sends MT-Bench prompts through the PD proxy (completions API),
        then checks that the decode server's speculative decoding metrics
        match the known standalone baselines.
        """
        config = _get_model_config()
        rtol = config.rtol if config.rtol is not None else DEFAULT_RTOL

        prompts = _get_mt_bench_prompts()
        assert len(prompts) == DEFAULT_NUM_PROMPTS, (
            f"Expected {DEFAULT_NUM_PROMPTS} prompts, got {len(prompts)}"
        )

        client = openai.OpenAI(api_key="EMPTY", base_url=PROXY_BASE_URL)
        for i, prompt in enumerate(prompts):
            resp = client.completions.create(
                model=MODEL_NAME,
                prompt=prompt,
                max_tokens=DEFAULT_OUTPUT_LEN,
                temperature=0.0,
                top_p=1.0,
            )
            if i < 3:
                text = resp.choices[0].text.strip()[:100]
                print(f"  [{i}] {prompt[:60]}... -> {text}...")

        # ── Extract metrics from decode server ────────────────────────────
        n_drafts = _fetch_metric("vllm:spec_decode_num_drafts_total")
        n_accepted = _fetch_metric("vllm:spec_decode_num_accepted_tokens_total")

        assert n_drafts > 0, "No spec-decode drafts were generated"

        acceptance_length = 1 + (n_accepted / n_drafts)
        expected = config.expected_acceptance_length

        print(
            f"\n{config.id}: acceptance_length={acceptance_length:.3f} "
            f"(expected={expected:.3f})"
        )
        print(f"  Drafts: {n_drafts:.0f}, Accepted: {n_accepted:.0f}")

        # ── Assert acceptance length (all methods) ────────────────────────
        rel_error = abs(acceptance_length - expected) / expected
        assert rel_error <= rtol, (
            f"Acceptance length regression for {config.id}! "
            f"Expected: {expected:.3f}, "
            f"Got: {acceptance_length:.3f}, "
            f"Relative error: {rel_error:.2%} (tolerance: {rtol:.0%})"
        )

        # ── Assert per-position acceptance (EAGLE3) ───────────────────────
        if config.expected_acceptance_lengths_per_pos:
            per_pos_counts = _fetch_per_position_acceptance()
            per_pos_rates = [
                per_pos_counts.get(i, 0) / n_drafts
                for i in range(len(config.expected_acceptance_lengths_per_pos))
            ]
            for i, (actual, exp) in enumerate(
                zip(per_pos_rates, config.expected_acceptance_lengths_per_pos)
            ):
                print(f"  Position {i}: {actual:.4f} (expected: {exp:.4f})")
                if exp > 0:
                    pos_err = abs(actual - exp) / exp
>                   assert pos_err <= rtol, (
                        f"Per-position regression at pos {i} for {config.id}! "
                        f"Expected: {exp:.4f}, Got: {actual:.4f}, "
                        f"Relative error: {pos_err:.2%} (tolerance: {rtol:.0%})"
                    )
E                   AssertionError: Per-position regression at pos 2 for llama3-8b-eagle3! Expected: 0.3545, Got: 0.3360, Relative error: 5.21% (tolerance: 5%)
E                   assert 0.05210165713168825 <= 0.05

v1/kv_connector/nixl_integration/test_spec_decode_acceptance.py:203: AssertionError
======================================================================= warnings summary =======================================================================

FAILED v1/kv_connector/nixl_integration/test_spec_decode_acceptance.py::test_spec_decode_acceptance_length - AssertionError: Per-position regression at pos 2 for llama3-8b-eagle3! Expected: 0.3545, Got: 0.3360, Relative error: 5.21% (tolerance: 5%)
assert 0.05210165713168825 <= 0.05
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed, 22 warnings in 160.03s (0:02:40) ==========================================================
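
For readers skimming the log, the two quantities the test checks can be reproduced with a few lines of arithmetic. This is only a sketch: the helper names `acceptance_length` and `relative_error` are made up here, and the inputs are the rounded values printed in the failure message, so the computed error comes out slightly different from the 5.21% the test reported on unrounded metrics.

```python
def acceptance_length(n_drafts: float, n_accepted: float) -> float:
    """Mean tokens produced per draft step: 1 + accepted / drafts, as in the test."""
    return 1 + (n_accepted / n_drafts)


def relative_error(actual: float, expected: float) -> float:
    """Relative deviation of a measured value from its baseline."""
    return abs(actual - expected) / expected


RTOL = 0.05  # the 5% tolerance shown in the failure message

# Rounded values copied from the assertion message for position 2.
expected_pos2 = 0.3545
actual_pos2 = 0.3360

err = relative_error(actual_pos2, expected_pos2)
exceeds_tolerance = err > RTOL
print(f"relative error: {err:.2%}, exceeds tolerance: {exceeds_tolerance}")
```

The margin is narrow: the position-2 rate misses the tolerance by roughly a fifth of a percentage point, which is why a small numerical perturbation is enough to flip the test.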

This PR fixes the issue.

Now

================= 1 passed, 16 warnings in 138.72s (0:02:18) ==================


Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 added the ready label Jun 1, 2026
Member

@NickLucche NickLucche left a comment


Hey @yewentao256.
We actually discussed this recently in #43733.
Part of what we fixed there was caused by the num_computed_tokens == 0 check being too generic.
When the async load request is first scheduled, it has num_computed_tokens == 0, but we should also query the connector about whether that request is going to be loaded.
If that request is going to be loaded, it should not load the drafter's KV.

So I don't quite understand why this is failing right now.
More importantly, this should be failing for V1 just as well.

Some more context in #43996.
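
The distinction drawn above can be sketched as follows. Everything in this snippet is a hypothetical stand-in rather than vLLM code: `RequestSketch`, `ConnectorSketch`, `will_load_async`, and `is_async_load_request` are invented for illustration, and the real connector interface may look quite different. It only shows why `num_computed_tokens == 0` alone matches every fresh request, while pairing it with a connector query singles out the requests whose KV is actually going to be loaded.

```python
from dataclasses import dataclass


@dataclass
class RequestSketch:
    """Hypothetical request holding only the fields the check needs."""
    request_id: str
    num_computed_tokens: int = 0


class ConnectorSketch:
    """Hypothetical connector that knows which requests receive remote KV."""

    def __init__(self, remote_request_ids):
        self._remote_request_ids = set(remote_request_ids)

    def will_load_async(self, request: RequestSketch) -> bool:
        return request.request_id in self._remote_request_ids


def is_async_load_request(request: RequestSketch, connector: ConnectorSketch) -> bool:
    # num_computed_tokens == 0 on its own is too generic: any request that has
    # not been scheduled yet satisfies it, so the connector is consulted too.
    return request.num_computed_tokens == 0 and connector.will_load_async(request)


connector = ConnectorSketch({"remote-1"})
fresh_local = RequestSketch("local-1")
fresh_remote = RequestSketch("remote-1")

print(is_async_load_request(fresh_local, connector))
print(is_async_load_request(fresh_remote, connector))
```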

@njhill
Member

njhill commented Jun 2, 2026

@NickLucche Actually I spent a while digging into the failure. It turned out to be two different bugs contributing to the acceptance rate drop:

  1. A double BOS token, which presumably made the prompts mismatch the baseline, fixed in [Test][BugFix] Fix double-BOS in PD+specdec acceptance test #44234. This affected both MRV1 and MRV2 but on its own doesn't quite breach the 5% tolerance.
  2. The lookahead cache-trimming issue: this also corrupts the KV cache for both MRV1 and MRV2, but in the MRV1 case it involved fresh zeroed blocks, so the invalid KV cache did not have a large numerical impact. In the MRV2 case we actually run a warmup at startup which uses some KV blocks, so they were no longer zeroed when this test reused them.

The larger impact of (2) in the MRV2 case was enough to breach the test tolerance (the fix for (1) was also enough by itself to fix this particular test failure, but the more subtle KV cache corruption should still be fixed, of course!)
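
A toy calculation may help show why zeroed and stale blocks behave so differently. The snippet below is not vLLM code and makes a simplifying assumption: one invalid KV slot is attended alongside three valid ones in a single-head softmax attention, with made-up vectors. A zeroed slot only dilutes the softmax weights slightly, whereas a slot holding leftover data can dominate the output.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention over a list of key/value slots."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(dim)
    ]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up toy data: three valid slots plus one invalid slot appended later.
query = [1.0, 0.5]
valid_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
valid_values = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

reference = attend(query, valid_keys, valid_values)
with_zeroed_slot = attend(
    query, valid_keys + [[0.0, 0.0]], valid_values + [[0.0, 0.0]]
)
with_stale_slot = attend(
    query, valid_keys + [[2.0, 2.0]], valid_values + [[5.0, -4.0]]
)

err_zeroed = l2_distance(reference, with_zeroed_slot)
err_stale = l2_distance(reference, with_stale_slot)
print(f"zeroed slot error: {err_zeroed:.3f}, stale slot error: {err_stale:.3f}")
```

In this toy setup the stale slot perturbs the output far more than the zeroed one, which is consistent with the same bug staying under the tolerance on MRV1 and breaching it on MRV2.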

@NickLucche
Member

Thanks for the work and for the breakdown @njhill!

"run a warmup at startup which uses some kv blocks and so they were no longer zeroed when this test reused"

This looks particularly nasty.

So, the case that would make you ditch load_kv_async in favor of request.num_computed_tokens == 0, as proposed in this fix, is the warmup bit, where the former condition wouldn't trigger?

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Member Author

@yewentao256 yewentao256 left a comment


@NickLucche Thanks! I took a deeper look: the load_kv_async check may not work for the prefill node, so I've made the condition narrower now.

Member

@NickLucche NickLucche left a comment


I am sorry but this looks messier than before.

                is_pd_prefill_producer = ( # is this P?
                    request.num_computed_tokens == 0
                    and request.kv_transfer_params is not None
                    and request.kv_transfer_params.get("do_remote_decode", False) # ? This is True on P
                )

Also, it won't trigger for requests with partial prefix cache hits. num_computed_tokens == 0 was meant to be used (perhaps confusingly) to detect load requests on D.
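
The partial-hit case can be reproduced in isolation. In the sketch below, the predicate is copied from the snippet quoted above, while `RequestSketch` is a hypothetical stand-in that carries only the two fields the predicate reads, and the value 16 is an arbitrary number of cached tokens.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RequestSketch:
    """Hypothetical stand-in exposing only the fields the predicate reads."""
    num_computed_tokens: int
    kv_transfer_params: Optional[dict[str, Any]] = None


def is_pd_prefill_producer(request: RequestSketch) -> bool:
    # Same condition as the reviewed snippet, wrapped in bool() for clarity.
    return bool(
        request.num_computed_tokens == 0
        and request.kv_transfer_params is not None
        and request.kv_transfer_params.get("do_remote_decode", False)
    )


params = {"do_remote_decode": True}
cold_request = RequestSketch(num_computed_tokens=0, kv_transfer_params=params)
# A request on P whose first 16 tokens (arbitrary) were served from the prefix cache.
partial_hit_request = RequestSketch(num_computed_tokens=16, kv_transfer_params=params)

print(is_pd_prefill_producer(cold_request))
print(is_pd_prefill_producer(partial_hit_request))
```

Both requests run on P with do_remote_decode set, yet only the cold one is classified as a prefill producer, so the prefix-cached request would skip whatever handling the predicate guards.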

For P (is_pd_prefill_producer), I would still prefer the fix proposed in #43996: on P (identified by its role), we set self.num_lookahead_tokens to 0 and skip sampling. I can put a PR up asap.
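
A rough sketch of the role-based idea, under stated assumptions: `KVRole`, `SchedulerSketch`, `num_speculative_tokens`, and `slots_to_allocate` are hypothetical names invented for illustration, and only `num_lookahead_tokens` comes from the discussion. Skipping sampling on P is not modeled here.

```python
from dataclasses import dataclass, field
from enum import Enum


class KVRole(Enum):
    """Hypothetical role marker for a PD deployment."""
    PRODUCER = "producer"  # prefill node (P)
    CONSUMER = "consumer"  # decode node (D)


@dataclass
class SchedulerSketch:
    role: KVRole
    num_speculative_tokens: int
    num_lookahead_tokens: int = field(init=False)

    def __post_init__(self):
        # P never runs the drafter, so it reserves no lookahead slots.
        # The decision depends only on the node's role, not on any request field.
        if self.role is KVRole.PRODUCER:
            self.num_lookahead_tokens = 0
        else:
            self.num_lookahead_tokens = self.num_speculative_tokens

    def slots_to_allocate(self, num_new_tokens: int) -> int:
        return num_new_tokens + self.num_lookahead_tokens


prefill = SchedulerSketch(role=KVRole.PRODUCER, num_speculative_tokens=3)
decode = SchedulerSketch(role=KVRole.CONSUMER, num_speculative_tokens=3)

print(prefill.slots_to_allocate(10), decode.slots_to_allocate(10))
```

Because the role is fixed per node, this avoids per-request predicates entirely, so requests with partial prefix cache hits are handled the same way as cold ones.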

If the PD test is still blocking your development (@yewentao256 can you double check after @njhill's changes?), could you relax the acceptance rate tolerance for the time being?

This is a part of the code where I would like to keep complexity as low as possible, because every time I come back to it, it takes me a while to load the whole context back 😅

Member Author

@yewentao256 yewentao256 left a comment


@NickLucche OK, feel free to submit your PR directly.

@yewentao256 yewentao256 closed this Jun 4, 2026

Labels

kv-connector, ready, v1


3 participants