Skip to content

[CI] Add P2pNccl integration test + rename nixl_integration to pd_integration#34050

Draft
eicherseiji wants to merge 1 commit intovllm-project:mainfrom
eicherseiji:ci/p2p-nccl-integration-test
Draft

[CI] Add P2pNccl integration test + rename nixl_integration to pd_integration#34050
eicherseiji wants to merge 1 commit intovllm-project:mainfrom
eicherseiji:ci/p2p-nccl-integration-test

Conversation

@eicherseiji
Copy link
Contributor

Summary

  • Rename tests/v1/kv_connector/nixl_integration/ to pd_integration/ — the directory now hosts tests for multiple PD connectors, not just Nixl
  • Add run_p2p_nccl_accuracy_test.sh for P2pNcclConnector PD accuracy testing on 2 GPUs
  • Add CI entry in distributed.yaml (marked optional)

Test plan

  • Existing NixlConnector CI tests pass with renamed directory
  • P2pNccl accuracy test passes on NVLink GPUs (2 GPU)

…egration

Rename nixl_integration/ to pd_integration/ since the directory now
hosts tests for multiple PD connectors, not just Nixl.

Add run_p2p_nccl_accuracy_test.sh for P2pNcclConnector PD accuracy
testing on 2 GPUs (marked optional in CI).

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the integration test directory for PD connectors from nixl_integration to the more general pd_integration and adds a new integration test for the P2pNcclConnector. The changes look good, but I've identified a potential race condition in the new test script run_p2p_nccl_accuracy_test.sh. My feedback focuses on making the server readiness checks more robust to prevent flaky tests by using health check endpoints instead of relying on less reliable methods and fixed sleep durations.

Comment on lines +24 to +27
wait_for_server() {
local port=$1
timeout 600 bash -c "until curl -s localhost:${port}/v1/completions > /dev/null; do sleep 1; done"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current wait_for_server implementation is not robust. It uses curl to check the /v1/completions endpoint. A GET request to this POST-only endpoint will receive a 405 "Method Not Allowed" response, but curl without the -f flag will still exit with code 0. This can cause a race condition where the script proceeds before the server is fully ready.

A better approach is to use the /health endpoint, which is a GET endpoint, and the curl -f flag to ensure curl fails on HTTP errors. This makes the server readiness check much more reliable and prevents flaky tests.

Suggested change
wait_for_server() {
local port=$1
timeout 600 bash -c "until curl -s localhost:${port}/v1/completions > /dev/null; do sleep 1; done"
}
wait_for_server() {
local port=$1
local endpoint=${2:-/health}
timeout 600 bash -c "until curl -fs http://localhost:${port}${endpoint} > /dev/null; do sleep 1; done"
}

--prefill-kv-port 14579 \
--decode-kv-port 14580 &

sleep 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Instead of a fixed sleep 5, it's more reliable to actively wait for the proxy server to be ready. The proxy server exposes a /healthcheck endpoint. Using an improved wait_for_server function to poll this endpoint ensures that the script only proceeds once the proxy is fully initialized, avoiding potential race conditions.

Suggested change
sleep 5
echo "Waiting for proxy instance on port 8192 to start..."
wait_for_server 8192 /healthcheck

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant