
[BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10#31915

Merged
zou3519 merged 1 commit into vllm-project:main from zou3519:fix_test_eagle_dp2
Jan 8, 2026

Conversation

Collaborator

@zou3519 zou3519 commented Jan 7, 2026

Purpose

Fix flakiness in this test for PyTorch 2.10. The test can fail in PyTorch 2.9 too; see #31913 for an explanation.

Test Plan

Ran `TP_SIZE=2 DP_SIZE=2 pytest tests/v1/distributed/test_eagle_dp.py::test_run_eagle_dp` on 2x L4 GPUs and verified it passed.

Test Result

Pass

Signed-off-by: Richard Zou <zou3519@gmail.com>
@mergify mergify bot added the v1 label Jan 7, 2026
Contributor

@atalman atalman left a comment


lgtm

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses flakiness in the test_eagle_dp test by reducing the number of expected tokens from 100 to 20. While this change is likely to make the test more stable, I have a concern that it also reduces the test's coverage. A bug that only appears in longer generation sequences might be missed with this change. I've left a comment suggesting to investigate the root cause of the flakiness, such as increasing timeouts if it's a performance issue, rather than reducing the test's scope.

Comment on lines +50 to +52
```python
# This test might be flaky, see
# https://github.com/vllm-project/vllm/issues/31913
num_expected_tokens = 20
```
Contributor


high

While reducing num_expected_tokens from 100 to 20 might fix the test flakiness, it also reduces the test's coverage. A bug in the data parallel logic for Eagle that only manifests with longer sequences (more than 20 tokens) might now be missed. This could be problematic for ensuring correctness.

Consider investigating the root cause of the flakiness. If it's a timeout issue (see line 75), increasing the timeout might be a better solution. If it's a deeper race condition or non-determinism, that should be addressed directly. Reducing the test's scope should be a last resort. Reverting this change is suggested if a better fix can be found.

Suggested change

```diff
-# This test might be flaky, see
-# https://github.com/vllm-project/vllm/issues/31913
-num_expected_tokens = 20
+num_expected_tokens = 100
```
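A possible middle ground between the bot's suggestion and the merged fix, sketched below as a hypothetical helper (`min_generated_tokens` is not part of vLLM's test suite): keep requesting the longer sequence but assert only a minimum token count, preserving some long-sequence coverage while tolerating runs that stop early.

```python
# Hypothetical sketch, not vLLM's actual test code: instead of lowering the
# requested length, assert a floor on how many tokens were produced.

def min_generated_tokens(token_ids: list[int], requested: int,
                         min_fraction: float = 0.2) -> bool:
    """Return True if at least `min_fraction` of the requested tokens arrived."""
    return len(token_ids) >= int(requested * min_fraction)
```

With `requested=100` and the default fraction, a run that emits 25 tokens would pass while a 2-token run would fail, so the check still exercises longer generations when the engine is healthy.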

Collaborator

@LucasWilkinson LucasWilkinson left a comment


lgtm

@zou3519 zou3519 added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 8, 2026
@zou3519 zou3519 enabled auto-merge (squash) January 8, 2026 02:17
@zou3519 zou3519 merged commit a79079f into vllm-project:main Jan 8, 2026
27 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…t#31915)

Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants