[Model Runner v2] Fix NixlConnector PD + Spec Decode acceptance (2 GPUs) issue#44227
[Model Runner v2] Fix NixlConnector PD + Spec Decode acceptance (2 GPUs) issue#44227yewentao256 wants to merge 6 commits into
Conversation
issue Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ceptance-(2-GPUs)
NickLucche
left a comment
There was a problem hiding this comment.
Hey @yewentao256 .
We actually discussed this recently here #43733.
What we fixed there was partially even due to num_computed_tokens==0 being too generic.
When the async load request is first scheduled, it should have num_computed_tokens=0 but should also query the connector as to whether that request should be loaded.
If that request is going to be loaded, it should not load the drafter's kv.
So I don't quite understand why this is failing rn.
More importantly, this should be failing for V1 just as well.
Some more context here #43996.
|
@NickLucche Actually I spent a while digging into the failure. It turned out to be two different bugs contributing to the acceptance rate drop:
The larger impact of (2) in the MRV2 case was enough to breach the test tolerance (the fix of 1 was also enough by itself to fix this particular test failure, but the more subtle kv cache corruption should still be fixed of course!) |
…ceptance-(2-GPUs)
|
Thanks for the work and for the breakdown @njhill !
this thing looks particularly nasty. So, the case which in this fix as proposed here would you make you ditch |
Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ceptance-(2-GPUs)
yewentao256
left a comment
There was a problem hiding this comment.
@NickLucche Thanks! Took a deeper look, the load_kv_asyncing may not work for prefill node, so make it more narrow now.
NickLucche
left a comment
There was a problem hiding this comment.
I am sorry but this looks messier than before.
is_pd_prefill_producer = ( # is this P?
request.num_computed_tokens == 0
and request.kv_transfer_params is not None
and request.kv_transfer_params.get("do_remote_decode", False) # ? This is True on P
)
Also it won't trigger for request with partial prefix cache hits.num_computed_tokens==0 was meant to be used (perhaps confusingly) to detect load requests on D.
For P or is_pd_prefill_producer I would still prefer the proposed fix from here #43996, that is on P (identified by the role) we set self.num_lookahead_tokens to 0 and skip sampling. I can put a PR up asap.
If the PD test is still blocking your development (@yewentao256 can you double check after @njhill changes?) could you relax the acceptance rate for the time being?
This is a part of the code where I would like to keep complexity as low as possible if possible, because every time I come back to it it takes me a while to load the whole context back 😅
yewentao256
left a comment
There was a problem hiding this comment.
@NickLucche ok, feel free to directly submit your PR.
Purpose
Part of #41286
VLLM_USE_V2_MODEL_RUNNER=1 bash tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.shOriginally
This PR fixes the issue
Now
================= 1 passed, 16 warnings in 138.72s (0:02:18) ==================