[Bugfix] Account for GQA replication in NIXL handshake block_len validation #45337
waynehacking8 wants to merge 1 commit into
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 8db69e0 to 3b0c2ee.
Rebased onto main to resolve the conflict with #44583 (per-region KV transfer classification). The rebase is not mechanical, so summarizing what changed:
#44583 rewrote this validation as a per-region REPLICATE/SPLIT loop, where
Ported the fix onto the new structure: SPLIT regions now compare against the actual per-rank head ratio (
Tests after rebase:
Full-file run has one failure,
…dation
_validate_remote_agent_handshake assumed block_len scales linearly with 1/tp_size. When tp_size > total_num_kv_heads, GQA replication caps the per-rank head count at max(1, heads // tp), so block_len stops scaling: with 8 KV heads, a D_TP=16 worker pulling from P_TP=8 sees identical block_lens on both sides (one head per rank each), and the assertion expecting local_block_len * tp_ratio rejected the valid handshake.
Compare against the actual per-rank head ratio instead. It is identical to the raw tp_ratio whenever neither side is capped, so all currently passing configurations are unaffected.
The transfer path already handles the capped case (tp_mapping rank_offset_factor for tp_size > total_num_kv_heads); only the validation was wrong. The P_TP > D_TP with replicated-remote combination remains explicitly rejected by the existing guard at the top of the validation (unchanged).
Fixes vllm-project#45330
Co-authored-by: Claude
Signed-off-by: Wayne Chiu <waynehacking8@gmail.com>
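
The arithmetic behind the rejected handshake can be reproduced in a few lines. This is a standalone sketch, not code from the connector: `heads_per_rank`, `block_len_for`, and the `BYTES_PER_HEAD` constant are hypothetical names, and the sketch assumes only that a rank's block_len is proportional to the number of KV heads it holds.

```python
# Hypothetical helpers for illustration; none of these names come from vLLM.
BYTES_PER_HEAD = 4096  # assumed per-head bytes in one block; the value is arbitrary


def heads_per_rank(total_kv_heads: int, tp_size: int) -> int:
    """KV heads held by one TP rank. GQA replication keeps this at 1 or more."""
    return max(1, total_kv_heads // tp_size)


def block_len_for(total_kv_heads: int, tp_size: int) -> int:
    """Per-rank block length, assumed proportional to the per-rank head count."""
    return heads_per_rank(total_kv_heads, tp_size) * BYTES_PER_HEAD


TOTAL_KV_HEADS = 8
D_TP, P_TP = 16, 8           # decode (local) and prefill (remote) TP sizes
tp_ratio = D_TP // P_TP      # 2

local_block_len = block_len_for(TOTAL_KV_HEADS, D_TP)   # one head per rank
remote_block_len = block_len_for(TOTAL_KV_HEADS, P_TP)  # one head per rank

# The old expectation scales by the raw tp_ratio and therefore does not hold.
old_check_passes = remote_block_len == local_block_len * tp_ratio
print(local_block_len, remote_block_len, old_check_passes)
```

Both sides hold one head per rank, so the two block lengths match while the raw `tp_ratio` predicts a factor of two.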
Force-pushed from 3b0c2ee to 0f68a02.
Purpose
Fixes #45330.
`_validate_remote_agent_handshake` assumes `block_len` scales linearly with `1/tp_size`. When `tp_size > total_num_kv_heads`, GQA replication caps the per-rank head count at `max(1, heads // tp)`, so `block_len` stops scaling. With the reporter's matrix (8 KV heads): D_TP=16 ← P_TP=8 has one head per rank on both sides → identical block_lens, but the assertion expects `local_block_len * tp_ratio` (2×) and rejects the valid handshake.
Fix
Compute the expected remote `block_len` from the actual per-rank head ratio (`max(1, heads // tp)` on each side, the same quantity the topology already exposes as `local_physical_heads`) instead of the raw `tp_ratio`.
The transfer path already handles the capped case (`tp_mapping` computes `rank_offset_factor` explicitly for `tp_size > total_num_kv_heads`); only the validation was wrong.
Test Plan
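
As a rough illustration of the property under test, the head-ratio expectation can be checked standalone against both capped and uncapped configurations. This sketch does not use the connector's fixtures or the PR's actual code: `physical_heads_per_rank` and `expected_remote_block_len` are hypothetical names, and it assumes D_TP >= P_TP with TP sizes that divide evenly.

```python
# Hypothetical sketch of a head-ratio based expectation; not the PR's code.
def physical_heads_per_rank(total_kv_heads: int, tp_size: int) -> int:
    """KV heads physically present on one TP rank (at least 1 under GQA)."""
    return max(1, total_kv_heads // tp_size)


def expected_remote_block_len(
    local_block_len: int, total_kv_heads: int, local_tp: int, remote_tp: int
) -> int:
    """Scale the local block_len by the per-rank head ratio, not by tp_ratio."""
    local_heads = physical_heads_per_rank(total_kv_heads, local_tp)
    remote_heads = physical_heads_per_rank(total_kv_heads, remote_tp)
    # Assumed here: local_tp >= remote_tp, so remote_heads is a multiple
    # of local_heads and the division below is exact.
    assert remote_heads % local_heads == 0
    return local_block_len * (remote_heads // local_heads)


LOCAL_BLOCK_LEN = 4096  # arbitrary illustrative value

# (total_kv_heads, local_tp, remote_tp)
for heads, local_tp, remote_tp in [(8, 4, 2), (8, 8, 4), (8, 16, 8), (8, 16, 4)]:
    tp_ratio = local_tp // remote_tp
    by_heads = expected_remote_block_len(LOCAL_BLOCK_LEN, heads, local_tp, remote_tp)
    by_tp_ratio = LOCAL_BLOCK_LEN * tp_ratio
    # The two agree unless a side is capped at one head per rank.
    print(heads, local_tp, remote_tp, by_heads, by_tp_ratio)
```

In the uncapped rows the two expectations coincide, which matches the claim that passing configurations are unaffected. They diverge only when `tp_size` exceeds the KV head count on one side.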
New `test_handshake_validates_gqa_replicated_block_len`: local TP=16 with 8 total KV heads (capped at 1 head/rank), remote TP=8 (also 1 head/rank), identical block_lens; the handshake must validate cleanly.
Test Result
`TestNixlHandshake`: 12/12 pass.
`test_nixl_connector.py`: 8 pass, 1 pre-existing env failure (`test_abort_timeout_on_prefiller[ray]`, "FlashInfer requires GPUs with sm75 or higher" inside the ray worker; verified identical on a clean tree in this cu128/SM120 env, unrelated).
`pre-commit` (ruff check/format, mypy) passes on changed files.
Duplicate-work check
`gh issue view 45330 --comments`: 0 comments, 0 assignees (filed 2026-06-12 by kannakAWS).
`gh pr list --search "45330 in:body"` / `--search "nixl handshake block_len"`: no open PRs.
No competing work as of this submission.
AI assistance disclosure
Developed with AI assistance (Claude Code). I reviewed every changed line and ran the tests above.
🤖 Generated with Claude Code