[Bugfix][CI] Skip flaky test_eagle test #38566

MatthewBonanni merged 2 commits into vllm-project:main

Conversation
Code Review
This pull request modifies `GPUModelRunner` to disable asynchronous speculative decoding when data parallelism is enabled, by requiring `data_parallel_size == 1` in the enabling condition. The review feedback suggests adding a comment or a TODO to document this restriction and to track future work to support this combination.
vllm/v1/worker/gpu_model_runner.py (outdated)

```diff
-            self.use_async_scheduling and self.num_spec_tokens > 0
+            self.use_async_scheduling
+            and self.num_spec_tokens > 0
+            and self.vllm_config.parallel_config.data_parallel_size == 1
```
This condition is critical for correctness when using data parallelism. To prevent future regressions and improve code clarity, please add a comment explaining why asynchronous speculative decoding is disabled when `data_parallel_size > 1`. A TODO comment would also be appropriate to track future work to support this feature.

```python
# TODO: Support async speculative decoding with data parallelism.
and self.vllm_config.parallel_config.data_parallel_size == 1
```
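For context, the combined guard discussed above can be sketched as a standalone predicate. This is an illustrative simplification, not vLLM's actual code: `ParallelConfig` here is a minimal stand-in dataclass, and `async_spec_decode_enabled` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    # Simplified stand-in for vLLM's parallel config; only the field
    # relevant to this guard is modeled.
    data_parallel_size: int = 1


def async_spec_decode_enabled(use_async_scheduling: bool,
                              num_spec_tokens: int,
                              parallel_config: ParallelConfig) -> bool:
    # Async speculative decoding is gated on three conditions:
    #  - async scheduling is on,
    #  - there are speculative tokens to verify, and
    #  - data parallelism is disabled (DP size == 1).
    # TODO: Support async speculative decoding with data parallelism.
    return (use_async_scheduling
            and num_spec_tokens > 0
            and parallel_config.data_parallel_size == 1)


# With DP > 1 the feature is gated off even if the other conditions hold.
assert async_spec_decode_enabled(True, 2, ParallelConfig(1)) is True
assert async_spec_decode_enabled(True, 2, ParallelConfig(2)) is False
```

The point of contention in the review below is whether this third condition belongs in the guard at all, or whether the flakiness has a different root cause.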
Like you mention, I don't think disabling async spec decode for DP > 1 is the right move, especially because that test was known to be flaky prior to #32951, and this is primarily an issue of batch invariance.

Running with batch invariance doesn't really help here. Also, I can occasionally repro with a single request too.
xref #38584
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
Tentative fix for #31913; do not merge until reviewed and approved locally.

EDIT: I think disabling async with DP > 1 is too harsh; still, I would appreciate any comment with more context to clarify that this is indeed not needed.