[CI] Add P2pNccl integration test + rename nixl_integration to pd_integration by eicherseiji · Pull Request #34050 · vllm-project/vllm

eicherseiji · 2026-02-07T09:21:39Z

Summary

Rename tests/v1/kv_connector/nixl_integration/ to pd_integration/ — the directory now hosts tests for multiple PD connectors, not just Nixl
Add run_p2p_nccl_accuracy_test.sh for P2pNcclConnector PD accuracy testing on 2 GPUs
Add CI entry in distributed.yaml (marked optional)

Test plan

Existing NixlConnector CI tests pass with renamed directory
P2pNccl accuracy test passes on NVLink GPUs (2 GPU)

…egration Rename nixl_integration/ to pd_integration/ since the directory now hosts tests for multiple PD connectors, not just Nixl. Add run_p2p_nccl_accuracy_test.sh for P2pNcclConnector PD accuracy testing on 2 GPUs (marked optional in CI). Signed-off-by: Seiji Eicher <seiji@anyscale.com>

gemini-code-assist

Code Review

This pull request refactors the integration test directory for PD connectors from nixl_integration to the more general pd_integration and adds a new integration test for the P2pNcclConnector. The changes look good, but I've identified a potential race condition in the new test script run_p2p_nccl_accuracy_test.sh. My feedback focuses on making the server readiness checks more robust to prevent flaky tests by using health check endpoints instead of relying on less reliable methods and fixed sleep durations.

gemini-code-assist · 2026-02-07T09:23:17Z

tests/v1/kv_connector/pd_integration/run_p2p_nccl_accuracy_test.sh

+wait_for_server() {
+  local port=$1
+  timeout 600 bash -c "until curl -s localhost:${port}/v1/completions > /dev/null; do sleep 1; done"
+}


The current wait_for_server implementation is not robust. It uses curl to check the /v1/completions endpoint. A GET request to this POST-only endpoint will receive a 405 "Method Not Allowed" response, but curl without the -f flag will still exit with code 0. This can cause a race condition where the script proceeds before the server is fully ready.

A better approach is to use the /health endpoint, which is a GET endpoint, and the curl -f flag to ensure curl fails on HTTP errors. This makes the server readiness check much more reliable and prevents flaky tests.

Suggested change

-wait_for_server() {
-  local port=$1
-  timeout 600 bash -c "until curl -s localhost:${port}/v1/completions > /dev/null; do sleep 1; done"
-}
+wait_for_server() {
+  local port=$1
+  local endpoint=${2:-/health}
+  timeout 600 bash -c "until curl -fs http://localhost:${port}${endpoint} > /dev/null; do sleep 1; done"
+}
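The readiness check discussed in this comment can also be written as a bounded retry loop instead of `timeout` wrapping a nested `bash -c`. The following is a minimal sketch under stated assumptions, not the PR's code: `wait_for_url`, its `max_attempts` and `delay` arguments, and the `PROBE` override are hypothetical names introduced here for illustration. Only the curl options `-f`, `-s`, and `-o` are taken from curl itself.

```shell
# Hypothetical helper (not part of the PR): poll a URL until the probe succeeds.
# PROBE is an assumption added so the loop can be exercised without a live server.
PROBE=${PROBE:-"curl -fs -o /dev/null"}

wait_for_url() {
  local url="$1"
  local max_attempts="${2:-600}"   # upper bound on polls, roughly the 600 s budget at 1 s delay
  local delay="${3:-1}"            # seconds to sleep between polls
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # With -f, curl exits non-zero on HTTP status >= 400, so a 405 from a
    # POST-only endpoint is no longer mistaken for "server is ready".
    if $PROBE "$url"; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "Timed out waiting for ${url}" >&2
  return 1
}
```

A call such as `wait_for_url "http://localhost:${port}/health"` would play the same role as `wait_for_server` in the suggestion. The non-zero return on timeout lets the calling script stop early rather than continue against a server that never came up.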

gemini-code-assist · 2026-02-07T09:23:17Z

tests/v1/kv_connector/pd_integration/run_p2p_nccl_accuracy_test.sh

+  --prefill-kv-port 14579 \
+  --decode-kv-port 14580 &
+
+sleep 5


Instead of a fixed sleep 5, it's more reliable to actively wait for the proxy server to be ready. The proxy server exposes a /healthcheck endpoint. Using an improved wait_for_server function to poll this endpoint ensures that the script only proceeds once the proxy is fully initialized, avoiding potential race conditions.

Suggested change

-sleep 5
+echo "Waiting for proxy instance on port 8192 to start..."
+wait_for_server 8192 /healthcheck
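Polling a health endpoint only helps if the background process is still alive. If the proxy crashes on startup, a plain polling loop waits for the full timeout. The sketch below illustrates one way to fail fast in that case. It is an assumption-laden example rather than anything in the PR: `wait_for_ready_or_exit` and all of its parameters are hypothetical, and the readiness command is passed in as a string so that any check (for instance the `/healthcheck` poll on port 8192 mentioned above) can be plugged in.

```shell
# Hypothetical helper (not part of the PR): wait until a readiness command
# succeeds, but give up immediately if the background process has exited.
wait_for_ready_or_exit() {
  local pid="$1"                 # PID of the background server, e.g. from $!
  local check_cmd="$2"           # command string that exits 0 once the server is ready
  local max_attempts="${3:-60}"  # upper bound on polls
  local delay="${4:-1}"          # seconds between polls
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # kill -0 sends no signal; it only tests whether the process still exists.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "Process ${pid} exited before becoming ready" >&2
      return 2
    fi
    if sh -c "$check_cmd" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "Timed out waiting for process ${pid} to become ready" >&2
  return 1
}

# Example use (commented out because it needs a running proxy):
#   <proxy launch command> &
#   wait_for_ready_or_exit "$!" "curl -fs http://localhost:8192/healthcheck"
```

The distinct return codes (1 for timeout, 2 for early exit) are a design choice in this sketch so that CI logs can distinguish a slow start from a crash.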

mergify bot added the ci/build, v1, tpu, and kv-connector labels Feb 7, 2026

eicherseiji mentioned this pull request Feb 7, 2026

[Bugfix][KV Transfer] Use kv_transfer_params for P2pNcclConnector coordination #33947

Open

gemini-code-assist bot reviewed Feb 7, 2026

eicherseiji wants to merge 1 commit into vllm-project:main from eicherseiji:ci/p2p-nccl-integration-test
