[CI] Add P2pNccl integration test + rename nixl_integration to pd_integration#34050
[CI] Add P2pNccl integration test + rename nixl_integration to pd_integration#34050eicherseiji wants to merge 1 commit intovllm-project:mainfrom
Conversation
…egration Rename nixl_integration/ to pd_integration/ since the directory now hosts tests for multiple PD connectors, not just Nixl. Add run_p2p_nccl_accuracy_test.sh for P2pNcclConnector PD accuracy testing on 2 GPUs (marked optional in CI). Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the integration test directory for PD connectors from nixl_integration to the more general pd_integration and adds a new integration test for the P2pNcclConnector. The changes look good, but I've identified a potential race condition in the new test script run_p2p_nccl_accuracy_test.sh. My feedback focuses on making the server readiness checks more robust to prevent flaky tests by using health check endpoints instead of relying on less reliable methods and fixed sleep durations.
| wait_for_server() { | ||
| local port=$1 | ||
| timeout 600 bash -c "until curl -s localhost:${port}/v1/completions > /dev/null; do sleep 1; done" | ||
| } |
There was a problem hiding this comment.
The current wait_for_server implementation is not robust. It uses curl to check the /v1/completions endpoint. A GET request to this POST-only endpoint will receive a 405 "Method Not Allowed" response, but curl without the -f flag will still exit with code 0. This can cause a race condition where the script proceeds before the server is fully ready.
A better approach is to use the /health endpoint, which is a GET endpoint, and the curl -f flag to ensure curl fails on HTTP errors. This makes the server readiness check much more reliable and prevents flaky tests.
| wait_for_server() { | |
| local port=$1 | |
| timeout 600 bash -c "until curl -s localhost:${port}/v1/completions > /dev/null; do sleep 1; done" | |
| } | |
| wait_for_server() { | |
| local port=$1 | |
| local endpoint=${2:-/health} | |
| timeout 600 bash -c "until curl -fs http://localhost:${port}${endpoint} > /dev/null; do sleep 1; done" | |
| } |
| --prefill-kv-port 14579 \ | ||
| --decode-kv-port 14580 & | ||
|
|
||
| sleep 5 |
There was a problem hiding this comment.
Instead of a fixed sleep 5, it's more reliable to actively wait for the proxy server to be ready. The proxy server exposes a /healthcheck endpoint. Using an improved wait_for_server function to poll this endpoint ensures that the script only proceeds once the proxy is fully initialized, avoiding potential race conditions.
| sleep 5 | |
| echo "Waiting for proxy instance on port 8192 to start..." | |
| wait_for_server 8192 /healthcheck |
Summary
tests/v1/kv_connector/nixl_integration/topd_integration/— the directory now hosts tests for multiple PD connectors, not just Nixlrun_p2p_nccl_accuracy_test.shfor P2pNcclConnector PD accuracy testing on 2 GPUsdistributed.yaml(marked optional)Test plan