[Bugfix] fix logging and d2h bug for flash comm1#3505

Merged
MengqingCao merged 2 commits into vllm-project:main from realliujiaxu:fix-fc-bug
Oct 17, 2025
Conversation

@realliujiaxu
Collaborator

@realliujiaxu realliujiaxu commented Oct 16, 2025

What this PR does / why we need it?

Fix 3 bugs in flash comm1 of Allgather EP(#3334):

  1. Calling `enable_sp()` with the `vllm_config` argument triggers a flood of warning logs; this PR caches its return value.
  2. `num_tokens_after_padding` should be a CPU tensor, since it is used as `num_tokens_across_dp_cpu` in `DPMetadata`; leaving it on the device causes many d2h copies when running the model.
  3. In PD, the model runner executes `kv_connector_no_forward`, where `num_tokens` is None.
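Bug 2 can be illustrated with a small, purely invented mock (`FakeTensor` is not torch or vllm-ascend code; the real tensors live on NPU devices): a device-resident scalar that the host reads every step pays a d2h copy per read, while a CPU tensor does not.

```python
class FakeTensor:
    """Illustrative stand-in for a framework tensor that counts the
    device-to-host (d2h) copies triggered by host-side reads."""
    d2h_copies = 0

    def __init__(self, value: int, device: str):
        self.value = value
        self.device = device

    def item(self) -> int:
        # Reading a device-resident tensor from the host forces a d2h copy.
        if self.device != "cpu":
            FakeTensor.d2h_copies += 1
        return self.value

# Buggy pattern: the padded token count lives on the device and is read
# on the host every model step.
num_tokens_device = FakeTensor(8, device="npu")
for _ in range(3):
    num_tokens_device.item()

# Fixed pattern: create it on the CPU, matching how DPMetadata consumes
# it as num_tokens_across_dp_cpu; host reads then cost nothing.
num_tokens_cpu = FakeTensor(8, device="cpu")
for _ in range(3):
    num_tokens_cpu.item()

print(FakeTensor.d2h_copies)  # 3 — all paid by the device tensor
```

The fix in the PR is the second pattern: allocate `num_tokens_after_padding` on the CPU once, so per-step reads never cross the device boundary.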

Does this PR introduce any user-facing change?

No

How was this patch tested?

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two bug fixes. The first caches the result of enable_sp to avoid repeated warnings, and the second corrects the device for num_tokens_after_padding to CPU to prevent a device mismatch.

My review focuses on the caching implementation for enable_sp. While the intent to cache is good, the current implementation has a flaw that could lead to incorrect behavior by returning stale cached data when the function is called with different configurations. I've provided a critical comment with a suggested fix to address this. The other change correctly fixes a device mismatch and looks good.

@jianzs jianzs added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 16, 2025
Comment thread vllm_ascend/utils.py
```python
# Flash comm 1 should be enabled by env VLLM_ASCEND_ENABLE_FLASHCOMM1
# We retain the env VLLM_ASCEND_ENABLE_FLASHCOMM here for backward compatibility.
or bool(int(os.getenv("VLLM_ASCEND_ENABLE_FLASHCOMM", '0'))))
global _ENABLE_SP
```
Collaborator


Caching _ENABLE_SP is a good idea, but it seems it will still print a lot of logs when both _ENABLE_SP and vllm_config are None. Maybe we could assert vllm_config is not None when _ENABLE_SP is None, to remind developers to pass in this parameter.

Collaborator Author

@realliujiaxu realliujiaxu Oct 16, 2025


The first call of enable_sp() is in linear_op when initializing the model, which runs inside the set_current_vllm_config context. So vllm_config can only be obtained from get_current_vllm_config when _ENABLE_SP is None.
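The caching pattern under discussion can be sketched as follows. This is a minimal illustration, not the actual vllm_ascend/utils.py code: `_compute_enable_sp` and the dict-shaped config stand in for the real check (which also honors VLLM_ASCEND_ENABLE_FLASHCOMM for backward compatibility), and the assert follows the reviewer's suggestion.

```python
import logging

logger = logging.getLogger("vllm_ascend.sketch")

_ENABLE_SP = None  # module-level cache, as in the snippet quoted above

def _compute_enable_sp(vllm_config) -> bool:
    # Stand-in for the real check; before the fix, this warning-logging
    # path ran on every call instead of once.
    logger.warning("flash comm1 enablement check ran")
    return bool(vllm_config.get("flashcomm1"))

def enable_sp(vllm_config=None) -> bool:
    global _ENABLE_SP
    if _ENABLE_SP is None:
        # The first call happens during model init, inside the
        # set_current_vllm_config context, so a config must be available;
        # assert to remind callers to pass it in.
        assert vllm_config is not None, "pass vllm_config on first call"
        _ENABLE_SP = _compute_enable_sp(vllm_config)
    return _ENABLE_SP  # cached: later calls compute and log nothing
```

The first call, e.g. `enable_sp({"flashcomm1": True})`, computes and caches the flag; every later `enable_sp()` returns the cached value without touching the logger.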

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Collaborator

@MengqingCao MengqingCao left a comment


LGTM

Comment on lines 115 to 121

```diff
 if is_moe_model(vllm_config):
     sp_enabled = enable_sp(vllm_config) and \
-        tp_world_size > 1
+        tp_world_size > 1 and num_tokens is not None
 else:
     sp_enabled = enable_sp(vllm_config) and \
         tp_world_size > 1 and \
         num_tokens is not None and num_tokens > 1000
```
Collaborator


Suggested change

```python
sp_enabled = (enable_sp(vllm_config) and tp_world_size > 1 and
              num_tokens is not None) and (is_moe_model(vllm_config)
                                           or num_tokens > 1000)
```

@MengqingCao MengqingCao merged commit b154a8e into vllm-project:main Oct 17, 2025
17 checks passed
ZYang6263 pushed a commit to rjg-lyh/vllm-ascend that referenced this pull request Oct 23, 2025
### What this PR does / why we need it?

Fix 3 bugs in flash comm1 of Allgather
EP(vllm-project#3334):
1. Calling `enable_sp()` with the `vllm_config` argument triggers a flood
of warning logs; this PR caches its return value.
2. `num_tokens_after_padding` should be a CPU tensor, since it is used as
`num_tokens_across_dp_cpu` in `DPMetadata`; leaving it on the device
causes many d2h copies when running the model.
3. In PD, the model runner executes `kv_connector_no_forward`, where
`num_tokens` is None.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025

Labels

module:core, ready (read for review), ready-for-test (start test by label for PR)


3 participants