[Bugfix] fix logging and d2h bug for flash comm1 by realliujiaxu · Pull Request #3505 · vllm-project/vllm-ascend

realliujiaxu · 2025-10-16T11:23:14Z

What this PR does / why we need it?

Fix three bugs in flash comm1 of Allgather EP (#3334):

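1. Calling `enable_sp()` with the `vllm_config` argument triggers a large number of warning logs; this PR caches its return value.
2. `num_tokens_after_padding` should be a CPU tensor, because it is used as `num_tokens_across_dp_cpu` in `DPMetadata`. Otherwise it causes many device-to-host (d2h) copies while the model is running.
3. In PD (prefill/decode disaggregation), the model runner executes `kv_connector_no_forward`, where `num_tokens` is None, so the check that decides whether to enable sequence parallelism must handle `num_tokens` being None.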

Does this PR introduce any user-facing change?

No

How was this patch tested?

vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: realliujiaxu <realliujiaxu@163.com>

github-actions · 2025-10-16T11:23:24Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request introduces two bug fixes. The first caches the result of enable_sp to avoid repeated warnings, and the second corrects the device for num_tokens_after_padding to CPU to prevent a device mismatch.

My review focuses on the caching implementation for enable_sp. While the intent to cache is good, the current implementation has a flaw that could lead to incorrect behavior by returning stale cached data when the function is called with different configurations. I've provided a critical comment with a suggested fix to address this. The other change correctly fixes a device mismatch and looks good.
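The staleness concern can be reproduced with a simplified process-global cache. This is a sketch under the assumption that the cache is a single module-level value like `_ENABLE_SP`; `FakeVllmConfig` and `reset_sp_cache` are hypothetical and not part of vllm-ascend, and this is not the fix the reviewer proposed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeVllmConfig:
    # Hypothetical stand-in for vLLM's config object.
    enable_sequence_parallelism: bool = False


_ENABLE_SP: Optional[bool] = None


def enable_sp(vllm_config: FakeVllmConfig) -> bool:
    """Cache the first answer for the lifetime of the process."""
    global _ENABLE_SP
    if _ENABLE_SP is None:
        _ENABLE_SP = vllm_config.enable_sequence_parallelism
    return _ENABLE_SP


def reset_sp_cache() -> None:
    """Hypothetical helper, e.g. for tests that build several configs."""
    global _ENABLE_SP
    _ENABLE_SP = None


on, off = FakeVllmConfig(True), FakeVllmConfig(False)
first = enable_sp(on)
second = enable_sp(off)   # stale: still the value computed from `on`
reset_sp_cache()
third = enable_sp(off)    # recomputed after the cache is cleared
```

Whether this matters in practice depends on whether a single process ever evaluates more than one config.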

MengqingCao · 2025-10-16T11:37:57Z

-        # Flash comm 1 should be enabled by env VLLM_ASCEND_ENABLE_FLASHCOMM1
-        # We retain the env VLLM_ASCEND_ENABLE_FLASHCOMM here for backward compatibility.
-        or bool(int(os.getenv("VLLM_ASCEND_ENABLE_FLASHCOMM", '0'))))
+    global _ENABLE_SP


Caching _ENABLE_SP is a good idea, but it seems it will still print a lot of logs when both _ENABLE_SP and vllm_config are None. Maybe we could assert that vllm_config is not None when _ENABLE_SP is None, to remind developers to pass in this parameter.

realliujiaxu · Oct 16, 2025

The first call of enable_sp() happens in linear_op during model initialization, which runs inside the set_current_vllm_config context. So when _ENABLE_SP is None, vllm_config can only be obtained from get_current_vllm_config.
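The fallback described in the reply can be sketched as follows. `set_current_config` and `get_current_config` are hypothetical stand-ins for vLLM's `set_current_vllm_config` and `get_current_vllm_config`; unlike the real getter, this stand-in simply returns None when no config is set. The assert reflects the reviewer's suggestion and is an illustration, not necessarily what was merged.

```python
import contextlib
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class FakeVllmConfig:
    # Hypothetical stand-in for vLLM's config object.
    enable_sequence_parallelism: bool = False


_current_config: Optional[FakeVllmConfig] = None


@contextlib.contextmanager
def set_current_config(cfg: FakeVllmConfig) -> Iterator[None]:
    """Stand-in for a "current config" context manager."""
    global _current_config
    prev, _current_config = _current_config, cfg
    try:
        yield
    finally:
        _current_config = prev  # restore whatever was active before


def get_current_config() -> Optional[FakeVllmConfig]:
    """Stand-in getter: returns None outside the context."""
    return _current_config


_ENABLE_SP: Optional[bool] = None


def enable_sp(vllm_config: Optional[FakeVllmConfig] = None) -> bool:
    global _ENABLE_SP
    if _ENABLE_SP is None:
        if vllm_config is None:
            # First call is expected during model init, inside the context.
            vllm_config = get_current_config()
        assert vllm_config is not None, "enable_sp() first called without a config"
        _ENABLE_SP = vllm_config.enable_sequence_parallelism
    return _ENABLE_SP
```

Once the first call inside the context has filled the cache, callers outside it never touch the config again.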

Signed-off-by: realliujiaxu <realliujiaxu@163.com>

MengqingCao

LGTM

jianzs · 2025-10-17T09:16:10Z

        if is_moe_model(vllm_config):
            sp_enabled = enable_sp(vllm_config) and \
-                tp_world_size > 1
+                tp_world_size > 1 and num_tokens is not None
        else:
            sp_enabled = enable_sp(vllm_config) and \
                tp_world_size > 1 and \
                num_tokens is not None and num_tokens > 1000


Suggested change

        sp_enabled = (enable_sp(vllm_config) and tp_world_size > 1 and
                      num_tokens is not None) and (is_moe_model(vllm_config)
                                                   or num_tokens > 1000)
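A standalone sketch of the suggested condition, with the config lookups replaced by plain arguments; the function name `sp_enabled` and the parameters `sp_flag` and `is_moe` are hypothetical. Because `num_tokens is not None` is checked first, `num_tokens > 1000` is never evaluated on None, which would raise TypeError in Python 3.

```python
from typing import Optional


def sp_enabled(sp_flag: bool, tp_world_size: int,
               num_tokens: Optional[int], is_moe: bool) -> bool:
    # Short-circuit: if num_tokens is None (e.g. in kv_connector_no_forward),
    # the left group is False and `num_tokens > 1000` is never evaluated.
    return (sp_flag and tp_world_size > 1 and
            num_tokens is not None) and (is_moe or num_tokens > 1000)
```

For MoE models the token threshold is skipped, so the None guard is the only new requirement in that branch.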


fix bug

9a29b88

Signed-off-by: realliujiaxu <realliujiaxu@163.com>

github-actions bot added the module:core label Oct 16, 2025

gemini-code-assist bot reviewed Oct 16, 2025


Comment thread vllm_ascend/utils.py

jianzs added the ready (ready for review) and ready-for-test (start test by label for PR) labels Oct 16, 2025

MengqingCao reviewed Oct 16, 2025


realliujiaxu closed this Oct 17, 2025

realliujiaxu reopened this Oct 17, 2025

fix PD bug

defafef

Signed-off-by: realliujiaxu <realliujiaxu@163.com>

MengqingCao approved these changes Oct 17, 2025


jianzs reviewed Oct 17, 2025


jianzs approved these changes Oct 17, 2025


MengqingCao merged commit b154a8e into vllm-project:main Oct 17, 2025
17 checks passed


MengqingCao merged 2 commits into vllm-project:main from realliujiaxu:fix-fc-bug


+        sp_enabled = (enable_sp(vllm_config) and tp_world_size > 1 and
+                      num_tokens is not None) and (is_moe_model(vllm_config)
+                                                   or num_tokens > 1000)
