Skip to content

[BugFix][Nixl][PD] Fix heterogenous TP#22663

Merged
vllm-bot merged 3 commits intovllm-project:mainfrom
NickLucche:pd-fix-hetero
Aug 12, 2025
Merged

[BugFix][Nixl][PD] Fix heterogenous TP#22663
vllm-bot merged 3 commits intovllm-project:mainfrom
NickLucche:pd-fix-hetero

Conversation

@NickLucche
Copy link
Collaborator

In the attempt to deprecate V0 here #21785, we introduced a subtle bug that would prevent a KVConnector from forwarding its KV cache layout preference.
This would result in the following log:

(EngineCore_0 pid=1204777) INFO 08-11 16:32:10 [utils.py:113] Connectors do not specify a kv cache layout, defaulting to NHD.

which, in the case of NixlConnector for D TP != P TP means breaking one of the single core assumption about the layout. This results in a garbled output.

This PR brings things back to

(EngineCore_0 pid=1306559) INFO 08-11 17:17:38 [nixl_connector.py:151] NixlConnector setting KV cache layout to HND for better xfer performance.

Also added @njhill fix for KVConnectorOutput or it would break even earlier than observing the correctness regression.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@NickLucche
Copy link
Collaborator Author

Thanks @wseaton for spotting the correctness issue 🙏🏻

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical bug that caused garbled output in heterogeneous tensor parallelism setups. The fix correctly determines the KV cache layout by refactoring the KVConnectorFactory to expose a get_connector_class method, which is a clean and effective solution. Additionally, the inclusion of a null check in KVOutputAggregator is a good defensive measure that prevents potential runtime errors. The changes are well-implemented, improving both the correctness and maintainability of the code.

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NickLucche

@njhill njhill changed the title [Nixl][PD] Fix heterogenous TP [BugFix][Nixl][PD] Fix heterogenous TP Aug 11, 2025
@njhill njhill added bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed labels Aug 11, 2025
@DarkLight1337
Copy link
Member

Can you merge from main to fix CI?

NickLucche and others added 2 commits August 12, 2025 07:27
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche
Copy link
Collaborator Author

@DarkLight1337 rebased

@DarkLight1337
Copy link
Member

Please fix pre-commit

@NickLucche
Copy link
Collaborator Author

NickLucche commented Aug 12, 2025

Why isn't the pre-commit CI job showing the diff of the changes?
image

I think yapf and isort are conflicting , yapf changes the file and isort brings it back to how it was lol so no changes are left but pre-commit still fails

(tmp) ➜  vllm git:(pd-fix-hetero) ✗ pre-commit run yapf --files vllm/distributed/kv_transfer/kv_connector/factory.py  
yapf.....................................................................Failed
- hook id: yapf
- files were modified by this hook

Reformatting vllm/distributed/kv_transfer/kv_connector/factory.py

(tmp) ➜  vllm git:(pd-fix-hetero) ✗ pre-commit run isort --files vllm/distributed/kv_transfer/kv_connector/factory.py 
isort....................................................................Failed
- hook id: isort
- files were modified by this hook

Fixing /home/nicolo/llmd/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py

(tmp) ➜  vllm git:(pd-fix-hetero) ✗ gd vllm/distributed/kv_transfer/kv_connector/factory.py | cat                                            

@hmellor what do you think?

Also I don't like that isort behaves differently based on whether the imports are installed or not in your venv, which is causing the CI to not match my local env..

@DarkLight1337
Copy link
Member

If they are conflicting, you can disable yapf for those lines.

Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche
Copy link
Collaborator Author

ok done, still it would be nice to sort this out with configs

@DarkLight1337
Copy link
Member

DarkLight1337 commented Aug 12, 2025

We are slowly migrating to using ruff to do both so there will be no more conflicts. Hopefully the process will speed up once V0 is removed

@NickLucche
Copy link
Collaborator Author

This is failing on main

v1/kv_connector/unit/test_remote_decode_lifecycle.py::test_short_prompt_lifecycle

but it really shouldn't, is someone looking into it already?

@NickLucche
Copy link
Collaborator Author

I put up another PR for that failing test.. #22727 @DarkLight1337

@vllm-bot vllm-bot merged commit d030b01 into vllm-project:main Aug 12, 2025
39 of 45 checks passed
njhill added a commit to njhill/vllm that referenced this pull request Aug 12, 2025
Follow-on from vllm-project#22663.

Signed-off-by: Nick Hill <nhill@redhat.com>
@hmellor
Copy link
Member

hmellor commented Aug 12, 2025

Hopefully the process will speed up once V0 is removed

That's the plan! While so much is changing in tests/ and vllm/ it'd cause too many conflicts. Everything outside of these directorie should already be using Ruff though

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants