[BugFix][Nixl][PD] Fix heterogeneous TP #22663
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Thanks @wseaton for spotting the correctness issue 🙏🏻
Code Review
This pull request effectively addresses a critical bug that caused garbled output in heterogeneous tensor parallelism setups. The fix correctly determines the KV cache layout by refactoring the KVConnectorFactory to expose a get_connector_class method, which is a clean and effective solution. Additionally, the inclusion of a null check in KVOutputAggregator is a good defensive measure that prevents potential runtime errors. The changes are well-implemented, improving both the correctness and maintainability of the code.
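The refactor described above can be sketched roughly as follows. This is a minimal illustration, not the actual vLLM implementation: the class names mirror those mentioned in the review, but the registry mechanics and the `prefer_kv_cache_layout` attribute are assumptions made for the example.

```python
# Sketch: a factory that exposes the connector *class* (not an instance),
# so class-level preferences such as the KV cache layout can be queried
# before any connector is constructed.
from typing import Callable, Dict, Type


class KVConnectorBase:
    """Stand-in base class; the layout attribute name is hypothetical."""
    prefer_kv_cache_layout: str = "NHD"


class NixlConnector(KVConnectorBase):
    # NixlConnector prefers a different layout than the default.
    prefer_kv_cache_layout = "HND"


class KVConnectorFactory:
    # Loaders are stored lazily so importing the factory stays cheap.
    _registry: Dict[str, Callable[[], Type[KVConnectorBase]]] = {}

    @classmethod
    def register_connector(
            cls, name: str,
            loader: Callable[[], Type[KVConnectorBase]]) -> None:
        cls._registry[name] = loader

    @classmethod
    def get_connector_class(cls, name: str) -> Type[KVConnectorBase]:
        # Resolving the class (rather than an instance) lets callers
        # read class-level attributes during engine setup.
        if name not in cls._registry:
            raise ValueError(f"Unknown connector: {name}")
        return cls._registry[name]()


KVConnectorFactory.register_connector("NixlConnector", lambda: NixlConnector)
layout = KVConnectorFactory.get_connector_class(
    "NixlConnector").prefer_kv_cache_layout
```

The key design point is that the layout decision happens before connector instantiation, which is why exposing the class itself, rather than an instance method, fixes the bug.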
Can you merge from main to fix CI?
Signed-off-by: NickLucche <nlucches@redhat.com>
Force-pushed cfdc991 to fdb247d
@DarkLight1337 rebased
Please fix pre-commit
Why isn't the pre-commit CI job showing the diff of the changes? I think yapf and isort are conflicting: yapf changes the file and isort brings it back to how it was, lol, so no changes are left but pre-commit still fails. @hmellor what do you think? Also, I don't like that isort behaves differently based on whether the imports are installed in your venv or not, which is causing the CI to not match my local env.
If they are conflicting, you can disable
Signed-off-by: NickLucche <nlucches@redhat.com>
OK, done. Still, it would be nice to sort this out with configs
We are slowly migrating to using
This is failing on main, but it really shouldn't. Is someone already looking into it?
I put up another PR for that failing test: #22727 @DarkLight1337
Follow-on from vllm-project#22663. Signed-off-by: Nick Hill <nhill@redhat.com>
That's the plan! While so much is changing in
Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>

In the effort to deprecate V0 (#21785), we introduced a subtle bug that prevented a KVConnector from forwarding its KV cache layout preference.
This would result in the following log:
In the case of NixlConnector with decode TP != prefill TP, this breaks one of the core assumptions about the layout, resulting in garbled output.
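To illustrate why a layout mismatch garbles output (this is a toy example with plain Python lists, not vLLM code): the same flat buffer reinterpreted with swapped leading dimensions yields a different logical tensor, so a reader assuming the wrong layout pulls the wrong values.

```python
# Toy illustration of a KV cache layout mismatch: the writer lays the
# buffer out as [heads, blocks, dim] ("HND"-style), but the reader
# assumes [blocks, heads, dim] ("NHD"-style). Same bytes, wrong tensor.
def reshape(flat, a, b, c):
    """Group a flat buffer into an a x b x c nested list (row-major)."""
    it = iter(flat)
    return [[[next(it) for _ in range(c)] for _ in range(b)]
            for _ in range(a)]


flat = list(range(24))
hnd = reshape(flat, 2, 3, 4)  # written as [heads=2, blocks=3, dim=4]
nhd = reshape(flat, 3, 2, 4)  # same bytes read as [blocks=3, heads=2, dim=4]

# Under a *correct* layout conversion, reader[block][head] should equal
# writer[head][block]. Reinterpreting the raw buffer does not do this:
# nhd[0][1] holds hnd[0][1]'s data, not hnd[1][0]'s, so the KV values
# the decoder reads for (block 0, head 1) belong to the wrong position.
```

When the two sides of a disaggregated prefill/decode setup run different TP sizes, an unannounced layout difference produces exactly this kind of silent corruption, which is why the connector must be able to forward its layout preference.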
This PR brings things back to
Also added @njhill's fix for KVConnectorOutput; without it, execution would break even earlier than the point where the correctness regression could be observed.
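The null check mentioned in the review can be sketched as follows. This is a hypothetical simplification: the field names mirror the worker-output fields discussed around this PR, but the actual aggregator signature in vLLM may differ.

```python
# Sketch of the defensive null check in the output aggregator: workers
# that produced no KV transfer output contribute None, and the
# aggregator must skip them instead of crashing on attribute access.
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class KVConnectorOutput:
    # Request IDs whose KV send/receive transfers completed this step.
    finished_sending: Set[str] = field(default_factory=set)
    finished_recving: Set[str] = field(default_factory=set)


def aggregate(
        outputs: List[Optional[KVConnectorOutput]]) -> KVConnectorOutput:
    merged = KVConnectorOutput()
    for out in outputs:
        if out is None:  # the added null check: skip empty worker outputs
            continue
        merged.finished_sending |= out.finished_sending
        merged.finished_recving |= out.finished_recving
    return merged
```

Without the `None` guard, the first worker with no connector output would raise an `AttributeError` during aggregation, well before any layout-related corruption could surface in the generated text.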