[Test] Add PD disagg + SD acceptance tests #35760

Open
ZhanqiuHu wants to merge 1 commit into vllm-project:main from ZhanqiuHu:pd-sd-eagle3-nixl-tests

ZhanqiuHu (Contributor) commented Mar 2, 2026

Summary

Follow-up to #35158.
Adds integration tests for PD disaggregation + speculative decoding via NixlConnector.

Currently covers two configs, both validated on H100: Qwen3-8B + EAGLE3 (FLASH_ATTN, measured acceptance length 2.245 vs. 2.260 expected) and GPT-OSS-20B + EAGLE3 (TRITON_ATTN, measured 2.549 vs. 2.560 expected).
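
For a sense of scale, the gap between measured and expected acceptance length can be expressed as a percentage of the expected value. The sketch below is illustrative arithmetic only: the helper name relative_gap_pct is hypothetical, and the tolerance the tests actually apply is not stated in this description.

```shell
# Hypothetical helper: absolute gap between measured and expected
# acceptance length, as a percentage of the expected value.
relative_gap_pct() {
    awk -v m="$1" -v e="$2" 'BEGIN {
        d = m - e
        if (d < 0) d = -d
        printf "%.2f\n", d / e * 100
    }'
}

relative_gap_pct 2.245 2.260   # Qwen3-8B + EAGLE3      -> 0.66
relative_gap_pct 2.549 2.560   # GPT-OSS-20B + EAGLE3   -> 0.43
```

Both configs land within one percent of their expected values.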

Not wired into default CI — run manually with bash tests/v1/kv_connector/nixl_integration/pd_spec_decode_eagle3_test.sh --all. Future work includes attention backend sweeps and MTP model configs once #35158 merges.


gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces a new test script (pd_spec_decode_eagle3_test.sh) and a corresponding pytest file (test_pd_spec_decode_eagle3.py) to validate the acceptance length of PD disaggregation with EAGLE3 speculative decoding via NixlConnector. The changes aim to ensure that the acceptance lengths match standalone SD baselines for specific model configurations. The review focuses on identifying potential issues related to script logic, error handling, and the correctness of the acceptance length validation.


GIT_ROOT=$(git rev-parse --show-toplevel)

SMI_BIN=$(which nvidia-smi || which rocm-smi || echo "")

high

The which command can return an empty string if the command is not found, which might lead to unexpected behavior. It's safer to provide a default value directly within the command substitution to ensure SMI_BIN always has a valid value, even if the binaries are not found.

Suggested change
- SMI_BIN=$(which nvidia-smi || which rocm-smi || echo "")
+ SMI_BIN=$(which nvidia-smi || which rocm-smi || echo "/usr/bin/nvidia-smi")
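
A hard-coded fallback path may not exist on every host, so an alternative is to make the "not found" case explicit and let the caller decide what to do. The following is a minimal sketch, not the script's actual code: find_first_cmd is a hypothetical helper, and it uses the POSIX command -v builtin in place of which.

```shell
# Hypothetical helper: print the path of the first candidate command
# found on PATH and return 0; return 1 if none of them is available.
find_first_cmd() {
    local candidate
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage: keep SMI_BIN empty, but say so, when neither tool is installed.
if SMI_BIN=$(find_first_cmd nvidia-smi rocm-smi); then
    echo "Using ${SMI_BIN}"
else
    SMI_BIN=""
    echo "Neither nvidia-smi nor rocm-smi found on PATH" >&2
fi
```

Later code can then test for an empty SMI_BIN before invoking it, rather than running a path that might not be present.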

cleanup_instances() {
echo ""
echo "Cleaning up..."
kill $(jobs -pr) 2>/dev/null || true

high

Using kill $(jobs -pr) might not always terminate all background processes, especially if they are not direct children of the script. Consider using pkill with a more specific pattern to ensure all relevant processes are terminated, or storing the PIDs of the started processes and killing them directly.

Suggested change
- kill $(jobs -pr) 2>/dev/null || true
+ kill "$PREFILL_PID" "$DECODE_PID" "$PROXY_PID" 2>/dev/null || true
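
The suggestion assumes variables such as PREFILL_PID already exist. One way to get there is to record each PID at launch time. The sketch below is an assumption about how that could look, not the script's current structure: PIDS and start_bg are hypothetical names, and the list is kept as a space-separated string so it also works in plain POSIX shells.

```shell
# Hypothetical bookkeeping: remember the PID of every background process
# this script starts so cleanup does not depend on the job table.
PIDS=""

start_bg() {
    "$@" &
    PIDS="$PIDS $!"
}

cleanup_instances() {
    local pid
    echo ""
    echo "Cleaning up..."
    # Signal every recorded process first, then reap each one.
    for pid in $PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    for pid in $PIDS; do
        wait "$pid" 2>/dev/null || true
    done
    PIDS=""
}

# Usage (illustrative):
#   trap cleanup_instances EXIT
#   start_bg sleep 300
```

This only tracks direct children. Processes spawned by those children would still need a process-group kill or similar, which this sketch does not attempt.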

local deadline=600
local elapsed=0
echo "Waiting for server on port ${port}..."
while [ $elapsed -lt $deadline ]; do

high

The curl command lacks error handling. If curl fails (e.g., due to network issues or the server not being ready), the script will continue, potentially leading to incorrect test results. Add a check for the curl exit code to ensure the server is properly running.

Suggested change
- while [ $elapsed -lt $deadline ]; do
+ if ! curl -s "localhost:${port}/v1/completions" > /dev/null 2>&1; then
+     echo "curl failed on port ${port}"
+     return 1
+ fi
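
A readiness loop usually has to tolerate failed probes until the deadline, since connection failures are expected while the server is still starting. The sketch below shows one way to combine polling with an explicit failure path. It is an assumption about the intended behavior, not the script's actual function: the optional deadline and interval arguments are additions for illustration.

```shell
# Sketch: poll until the port answers or the deadline passes.
# wait_for_server PORT [DEADLINE_SECONDS] [INTERVAL_SECONDS]
wait_for_server() {
    local port="$1"
    local deadline="${2:-600}"
    local interval="${3:-1}"
    local elapsed=0
    echo "Waiting for server on port ${port}..."
    while [ "$elapsed" -lt "$deadline" ]; do
        # Without --fail, curl exits 0 on any HTTP response (even 4xx),
        # so success here only means that something is listening.
        if curl -s -o /dev/null "localhost:${port}/v1/completions"; then
            echo "Server on port ${port} is up after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "Timed out after ${deadline}s waiting for port ${port}" >&2
    return 1
}
```

Callers can then write wait_for_server 8100 || exit 1 (the port number is illustrative) so that a server that never comes up fails the run instead of letting the tests proceed.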

Comment on lines +257 to +258
pytest -xvs \
"${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py"

high

The pytest command does not include a timeout. If the tests hang, the script will not terminate, potentially causing CI failures. Add a timeout to the pytest command to ensure it completes within a reasonable time.

Suggested change
- pytest -xvs \
- "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py"
+ pytest -xvs --timeout=600 "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py"
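
Note that --timeout is provided by the pytest-timeout plugin rather than by pytest itself, so the suggestion only works where that plugin is installed. If it is not, the whole invocation can be bounded from the shell instead. A minimal sketch, assuming GNU coreutils timeout is available; run_with_deadline is a hypothetical helper name.

```shell
# Hypothetical wrapper: run a command under GNU coreutils `timeout` and
# report when the deadline, rather than the command itself, ended the run.
run_with_deadline() {
    local seconds="$1"
    local status=0
    shift
    timeout "$seconds" "$@" || status=$?
    # GNU timeout exits with 124 when it had to terminate the command.
    if [ "$status" -eq 124 ]; then
        echo "Command exceeded ${seconds}s and was terminated" >&2
    fi
    return "$status"
}

# Usage (GIT_ROOT as defined earlier in the script under review):
#   run_with_deadline 600 pytest -xvs \
#     "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/test_pd_spec_decode_eagle3.py"
```

Unlike a per-test timeout, this bounds the entire pytest run, so the 600-second figure would need to cover all tests in the file.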

ZhanqiuHu force-pushed the pd-sd-eagle3-nixl-tests branch from e6a2732 to 3864b8f on March 9, 2026 16:23

mergify bot commented Mar 9, 2026

Documentation preview: https://vllm--35760.org.readthedocs.build/en/35760/

mergify bot added the documentation and ci/build labels Mar 9, 2026
ZhanqiuHu changed the title from "[Test][WIP] Add PD disagg + SD acceptance tests" to "[Test] Add PD disagg + SD acceptance tests" Mar 9, 2026
ZhanqiuHu marked this pull request as ready for review March 9, 2026 21:10
ZhanqiuHu force-pushed the pd-sd-eagle3-nixl-tests branch from 6fa746b to 8995473 on March 10, 2026 13:40

mergify bot commented Mar 16, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ZhanqiuHu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 16, 2026
NickLucche added the ready label Mar 19, 2026
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
ZhanqiuHu force-pushed the pd-sd-eagle3-nixl-tests branch from 8995473 to e8ac25d on March 19, 2026 13:25
mergify bot removed the needs-rebase label Mar 19, 2026
NickLucche enabled auto-merge (squash) March 19, 2026 21:35
Labels

ci/build, documentation, kv-connector, ready, v1

2 participants