[BugFix][Nixl][PD] Fix heterogenous TP by NickLucche · Pull Request #22663 · vllm-project/vllm

NickLucche · 2025-08-11T17:25:32Z

While deprecating V0 in #21785, we introduced a subtle bug that prevents a KVConnector from forwarding its KV cache layout preference.
This results in the following log:

(EngineCore_0 pid=1204777) INFO 08-11 16:32:10 [utils.py:113] Connectors do not specify a kv cache layout, defaulting to NHD.

which, in the case of NixlConnector with D TP != P TP, breaks one of the core assumptions about the layout. This results in garbled output.

This PR brings things back to

(EngineCore_0 pid=1306559) INFO 08-11 17:17:38 [nixl_connector.py:151] NixlConnector setting KV cache layout to HND for better xfer performance.

Also added @njhill's fix for KVConnectorOutput; without it, things would break even before the correctness regression could be observed.
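To illustrate why the layout matters when D TP != P TP, here is a minimal sketch in plain Python. It is not vLLM or NIXL code: the helper names (`flat_offset`, `contiguous_runs`) and the toy block dimensions are assumptions made up for this example. It counts how many contiguous memory runs a subset of KV heads occupies inside one cache block under each layout, on the assumption that a worker with a different TP size needs only a slice of the heads.

```python
def flat_offset(layout: str, n: int, h: int, d: int, N: int, H: int, D: int) -> int:
    """Flat element offset of (token n, head h, dim d) inside one block.

    N = tokens per block, H = KV heads, D = head size (all hypothetical).
    """
    if layout == "NHD":  # tokens outermost, then heads, then head dim
        return (n * H + h) * D + d
    if layout == "HND":  # heads outermost, then tokens, then head dim
        return (h * N + n) * D + d
    raise ValueError(f"unknown layout: {layout}")


def contiguous_runs(layout: str, N: int, H: int, D: int,
                    head_start: int, head_end: int) -> int:
    """Count contiguous runs covered by heads [head_start, head_end)."""
    offsets = sorted(
        flat_offset(layout, n, h, d, N, H, D)
        for n in range(N)
        for h in range(head_start, head_end)
        for d in range(D)
    )
    runs = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:  # a gap starts a new run
            runs += 1
    return runs


# Toy block: 16 tokens, 8 KV heads, head size 4; select the first 4 heads.
hnd_runs = contiguous_runs("HND", N=16, H=8, D=4, head_start=0, head_end=4)
nhd_runs = contiguous_runs("NHD", N=16, H=8, D=4, head_start=0, head_end=4)
print(hnd_runs, nhd_runs)
```

In this toy setup the head slice is a single contiguous run under HND and one run per token under NHD, which is one plausible reading of why a connector that splits blocks by head would want HND.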

github-actions · 2025-08-11T17:25:41Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

NickLucche · 2025-08-11T17:26:36Z

Thanks @wseaton for spotting the correctness issue 🙏🏻

gemini-code-assist

Code Review

This pull request effectively addresses a critical bug that caused garbled output in heterogeneous tensor parallelism setups. The fix correctly determines the KV cache layout by refactoring the KVConnectorFactory to expose a get_connector_class method, which is a clean and effective solution. Additionally, the inclusion of a null check in KVOutputAggregator is a good defensive measure that prevents potential runtime errors. The changes are well-implemented, improving both the correctness and maintainability of the code.
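The two changes summarized above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual vLLM code: every class and function here (`ToyConnectorFactory`, `ToyNixlConnector`, `resolve_layout`, `aggregate_finished`) is a hypothetical stand-in, and the real signatures may differ. The idea is that the layout preference is queried on the connector class, without instantiating it, and that the aggregator skips workers that reported no connector output.

```python
from dataclasses import dataclass, field
from typing import Optional


class BaseConnector:
    @classmethod
    def get_required_kvcache_layout(cls) -> Optional[str]:
        # By default a connector expresses no layout preference.
        return None


class ToyNixlConnector(BaseConnector):
    @classmethod
    def get_required_kvcache_layout(cls) -> Optional[str]:
        return "HND"


class ToyConnectorFactory:
    _registry: dict[str, type[BaseConnector]] = {}

    @classmethod
    def register(cls, name: str, connector_cls: type[BaseConnector]) -> None:
        cls._registry[name] = connector_cls

    @classmethod
    def get_connector_class(cls, name: str) -> type[BaseConnector]:
        # Return the class itself so callers can use classmethods on it.
        try:
            return cls._registry[name]
        except KeyError:
            raise ValueError(f"unsupported connector: {name}") from None


ToyConnectorFactory.register("toy_nixl", ToyNixlConnector)


def resolve_layout(connector_name: Optional[str]) -> str:
    """Pick the KV cache layout, falling back to NHD when nothing is specified."""
    if connector_name is None:
        return "NHD"
    connector_cls = ToyConnectorFactory.get_connector_class(connector_name)
    return connector_cls.get_required_kvcache_layout() or "NHD"


@dataclass
class ToyConnectorOutput:
    finished_sending: set[str] = field(default_factory=set)


def aggregate_finished(outputs: list[Optional[ToyConnectorOutput]]) -> set[str]:
    """Union per-worker results, tolerating workers that returned None."""
    finished: set[str] = set()
    for out in outputs:
        if out is None:  # the defensive null check
            continue
        finished |= out.finished_sending
    return finished
```

With this shape, `resolve_layout("toy_nixl")` yields the connector's preference, while a missing connector or a connector without a preference falls back to the default.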

njhill

Thanks @NickLucche

vllm/distributed/kv_transfer/kv_connector/utils.py

DarkLight1337 · 2025-08-12T07:22:04Z

Can you merge from main to fix CI?


NickLucche · 2025-08-12T07:28:05Z

@DarkLight1337 rebased

DarkLight1337 · 2025-08-12T08:10:10Z

Please fix pre-commit

NickLucche · 2025-08-12T08:44:40Z

Why isn't the pre-commit CI job showing the diff of the changes?

I think yapf and isort are conflicting: yapf changes the file and isort brings it back to how it was lol, so no changes are left but pre-commit still fails

(tmp) ➜  vllm git:(pd-fix-hetero) ✗ pre-commit run yapf --files vllm/distributed/kv_transfer/kv_connector/factory.py  
yapf.....................................................................Failed
- hook id: yapf
- files were modified by this hook

Reformatting vllm/distributed/kv_transfer/kv_connector/factory.py

(tmp) ➜  vllm git:(pd-fix-hetero) ✗ pre-commit run isort --files vllm/distributed/kv_transfer/kv_connector/factory.py 
isort....................................................................Failed
- hook id: isort
- files were modified by this hook

Fixing /home/nicolo/llmd/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py

(tmp) ➜  vllm git:(pd-fix-hetero) ✗ gd vllm/distributed/kv_transfer/kv_connector/factory.py | cat

@hmellor what do you think?

Also, I don't like that isort behaves differently depending on whether the imported packages are installed in your venv, which is causing the CI to not match my local env.

DarkLight1337 · 2025-08-12T08:48:20Z

If they are conflicting, you can disable yapf for those lines.
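For reference, yapf can be switched off for a span of lines with its `# yapf: disable` / `# yapf: enable` comment directives, which leaves isort as the only tool formatting that span. The snippet below is a generic sketch using standard-library imports and a made-up helper (`count_words`), not the actual change made to factory.py.

```python
# yapf: disable
from collections import (
    Counter,
    defaultdict,
)

# yapf: enable


def count_words(words: list[str]) -> dict[str, int]:
    """Toy function so the imports above are actually used."""
    counts: defaultdict[str, int] = defaultdict(int)
    for word in words:
        counts[word] += 1
    # Counter is used only to return a plain, comparable mapping.
    return dict(Counter(counts))
```

Everything between the two directives is left exactly as written by yapf, so only the import block needs to be fenced off.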


NickLucche · 2025-08-12T08:51:14Z

ok done, still it would be nice to sort this out with configs

DarkLight1337 · 2025-08-12T08:54:46Z

We are slowly migrating to using ruff to do both so there will be no more conflicts. Hopefully the process will speed up once V0 is removed

NickLucche · 2025-08-12T11:33:36Z

This is failing on main

v1/kv_connector/unit/test_remote_decode_lifecycle.py::test_short_prompt_lifecycle

but it really shouldn't, is someone looking into it already?

NickLucche · 2025-08-12T11:48:13Z

I put up another PR for that failing test: #22727 @DarkLight1337


hmellor · 2025-08-12T14:08:14Z

> Hopefully the process will speed up once V0 is removed

That's the plan! While so much is changing in tests/ and vllm/, it'd cause too many conflicts. Everything outside of these directories should already be using Ruff, though.


gemini-code-assist bot reviewed Aug 11, 2025


njhill approved these changes Aug 11, 2025


njhill changed the title ~~[Nixl][PD] Fix heterogenous TP~~ [BugFix][Nixl][PD] Fix heterogenous TP Aug 11, 2025

njhill added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Aug 11, 2025

njhill mentioned this pull request Aug 11, 2025

[V1][P/D]Bug fix: handle edge case where KVConnectorOutput is None #22473

Closed

NickLucche and others added 2 commits August 12, 2025 07:27

fix layout detection

704a210

Signed-off-by: NickLucche <nlucches@redhat.com>

fix kvoutput none

fdb247d

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche force-pushed the pd-fix-hetero branch from cfdc991 to fdb247d Compare August 12, 2025 07:27

disable yapf

1392fef

Signed-off-by: NickLucche <nlucches@redhat.com>

vllm-bot merged commit d030b01 into vllm-project:main Aug 12, 2025
39 of 45 checks passed

njhill added a commit to njhill/vllm that referenced this pull request Aug 12, 2025

[BugFix][KVConn] Fix use of get_required_kvcache_layout

4be1eed

Follow-on from vllm-project#22663. Signed-off-by: Nick Hill <nhill@redhat.com>

njhill mentioned this pull request Aug 12, 2025

[BugFix][KVConn] Fix use of get_required_kvcache_layout #22734

Merged

NickLucche mentioned this pull request Aug 12, 2025

[CI][Nixl] Check kv cache layout during handshake #22745

Merged

orozery mentioned this pull request Aug 14, 2025

KVOutputAggregator: Fix handling of empty output #22899

Closed

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025

[BugFix][Nixl][PD] Fix heterogenous TP (vllm-project#22663)

49013ce

Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[BugFix][Nixl][PD] Fix heterogenous TP (vllm-project#22663)

8a02248

Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>

xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025

[BugFix][Nixl][PD] Fix heterogenous TP (vllm-project#22663)

b3f06c4

Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

[BugFix][Nixl][PD] Fix heterogenous TP (vllm-project#22663)

88a9c3b

Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>
