[Test] Add PD disagg + SD acceptance tests#35760
[Test] Add PD disagg + SD acceptance tests#35760ZhanqiuHu wants to merge 1 commit intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new test script (pd_spec_decode_eagle3_test.sh) and a corresponding pytest file (test_pd_spec_decode_eagle3.py) to validate the acceptance length of PD disaggregation with EAGLE3 speculative decoding via NixlConnector. The changes aim to ensure that the acceptance lengths match standalone SD baselines for specific model configurations. The review focuses on identifying potential issues related to script logic, error handling, and the correctness of the acceptance length validation.
|
|
||
| GIT_ROOT=$(git rev-parse --show-toplevel) | ||
|
|
||
| SMI_BIN=$(which nvidia-smi || which rocm-smi || echo "") |
There was a problem hiding this comment.
The which command can return an empty string if the command is not found, which might lead to unexpected behavior. It's safer to provide a default value directly within the command substitution to ensure SMI_BIN always has a valid value, even if the binaries are not found.
| SMI_BIN=$(which nvidia-smi || which rocm-smi || echo "") | |
| SMI_BIN=$(which nvidia-smi || which rocm-smi || echo "/usr/bin/nvidia-smi") |
| cleanup_instances() { | ||
| echo "" | ||
| echo "Cleaning up..." | ||
| kill $(jobs -pr) 2>/dev/null || true |
There was a problem hiding this comment.
Using kill $(jobs -pr) might not always terminate all background processes, especially if they are not direct children of the script. Consider using pkill with a more specific pattern to ensure all relevant processes are terminated, or storing the PIDs of the started processes and killing them directly.
| kill $(jobs -pr) 2>/dev/null || true | |
| kill "$PREFILL_PID" "$DECODE_PID" "$PROXY_PID" 2>/dev/null || true |
| local deadline=600 | ||
| local elapsed=0 | ||
| echo "Waiting for server on port ${port}..." | ||
| while [ $elapsed -lt $deadline ]; do |
There was a problem hiding this comment.
The curl command lacks error handling. If curl fails (e.g., due to network issues or the server not being ready), the script will continue, potentially leading to incorrect test results. Add a check for the curl exit code to ensure the server is properly running.
| while [ $elapsed -lt $deadline ]; do | |
| if ! curl -s "localhost:${port}/v1/completions" > /dev/null 2>&1; then | |
| echo "curl failed on port ${port}" | |
| return 1 | |
| fi |
| pytest -xvs \ | ||
| "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py" |
There was a problem hiding this comment.
The pytest command does not include a timeout. If the tests hang, the script will not terminate, potentially causing CI failures. Add a timeout to the pytest command to ensure it completes within a reasonable time.
| pytest -xvs \ | |
| "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py" | |
| pytest -xvs --timeout=600 "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py" |
e6a2732 to
3864b8f
Compare
|
Documentation preview: https://vllm--35760.org.readthedocs.build/en/35760/ |
6fa746b to
8995473
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
8995473 to
e8ac25d
Compare
Summary
Follow-up of #35158.
Add integration tests for PD disaggregation + speculative decoding via
NixlConnector.Currently covers two configs: Qwen3-8B + EAGLE3 (FLASH_ATTN, acceptance length 2.245 vs 2.260 expected) and GPT-OSS-20B + EAGLE3 (TRITON_ATTN, 2.549 vs 2.560 expected), both validated on H100.
Not wired into default CI — run manually with
bash tests/v1/kv_connector/nixl_integration/pd_spec_decode_eagle3_test.sh --all. Future work includes attention backend sweeps and MTP model configs once #35158 merges.Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.