[Bugfix] fix logging and d2h bug for flash comm1#3505
[Bugfix] fix logging and d2h bug for flash comm1#3505MengqingCao merged 2 commits intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
This pull request introduces two bug fixes. The first caches the result of enable_sp to avoid repeated warnings, and the second corrects the device for num_tokens_after_padding to CPU to prevent a device mismatch.
My review focuses on the caching implementation for enable_sp. While the intent to cache is good, the current implementation has a flaw that could lead to incorrect behavior by returning stale cached data when the function is called with different configurations. I've provided a critical comment with a suggested fix to address this. The other change correctly fixes a device mismatch and looks good.
| # Flash comm 1 should be enabled by env VLLM_ASCEND_ENABLE_FLASHCOMM1 | ||
| # We retain the env VLLM_ASCEND_ENABLE_FLASHCOMM here for backward compatibility. | ||
| or bool(int(os.getenv("VLLM_ASCEND_ENABLE_FLASHCOMM", '0')))) | ||
| global _ENABLE_SP |
There was a problem hiding this comment.
caching _ENABLE_SP is a good idea, but it seems will also print a lot of logs when both _ENABLE_SP and vllm_config are None, maybe we chould assert vllm_config is not None when _ENABLE_SP is None to remind developers to pass in this parameter
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
| if is_moe_model(vllm_config): | ||
| sp_enabled = enable_sp(vllm_config) and \ | ||
| tp_world_size > 1 | ||
| tp_world_size > 1 and num_tokens is not None | ||
| else: | ||
| sp_enabled = enable_sp(vllm_config) and \ | ||
| tp_world_size > 1 and \ | ||
| num_tokens is not None and num_tokens > 1000 |
There was a problem hiding this comment.
| sp_enabled = (enable_sp(vllm_config) and tp_world_size > 1 and | |
| num_tokens is not None) and (is_moe_model(vllm_config) | |
| or num_tokens > 1000) |
### What this PR does / why we need it? Fix 3 bugs in flash comm1 of Allgather EP(vllm-project#3334): 1. call `enable_sp()` with argument `vllm_config` trigger a lot of warning log, this PR caches its return value. 2. `num_tokens_after_padding` should be cpu tensor as it will used as `num_tokens_across_dp_cpu` in `DPMetadata`. It will causes may d2h copy when running model. 3. In PD, model runner will execute `kv_connector_no_forward`,where `num_tokens` is None - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it? Fix 3 bugs in flash comm1 of Allgather EP(vllm-project#3334): 1. call `enable_sp()` with argument `vllm_config` trigger a lot of warning log, this PR caches its return value. 2. `num_tokens_after_padding` should be cpu tensor as it will used as `num_tokens_across_dp_cpu` in `DPMetadata`. It will causes may d2h copy when running model. 3. In PD, model runner will execute `kv_connector_no_forward`,where `num_tokens` is None - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Signed-off-by: luolun <luolun1995@cmbchina.com>
### What this PR does / why we need it? Fix 3 bugs in flash comm1 of Allgather EP(vllm-project#3334): 1. call `enable_sp()` with argument `vllm_config` trigger a lot of warning log, this PR caches its return value. 2. `num_tokens_after_padding` should be cpu tensor as it will used as `num_tokens_across_dp_cpu` in `DPMetadata`. It will causes may d2h copy when running model. 3. In PD, model runner will execute `kv_connector_no_forward`,where `num_tokens` is None - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Signed-off-by: hwhaokun <haokun0405@163.com>
### What this PR does / why we need it? Fix 3 bugs in flash comm1 of Allgather EP(vllm-project#3334): 1. call `enable_sp()` with argument `vllm_config` trigger a lot of warning log, this PR caches its return value. 2. `num_tokens_after_padding` should be cpu tensor as it will used as `num_tokens_across_dp_cpu` in `DPMetadata`. It will causes may d2h copy when running model. 3. In PD, model runner will execute `kv_connector_no_forward`,where `num_tokens` is None - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Signed-off-by: nsdie <yeyifan@huawei.com>
### What this PR does / why we need it? Fix 3 bugs in flash comm1 of Allgather EP(vllm-project#3334): 1. call `enable_sp()` with argument `vllm_config` trigger a lot of warning log, this PR caches its return value. 2. `num_tokens_after_padding` should be cpu tensor as it will used as `num_tokens_across_dp_cpu` in `DPMetadata`. It will causes may d2h copy when running model. 3. In PD, model runner will execute `kv_connector_no_forward`,where `num_tokens` is None - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>

What this PR does / why we need it?
Fix 3 bugs in flash comm1 of Allgather EP(#3334):
enable_sp()with argumentvllm_configtrigger a lot of warning log, this PR caches its return value.num_tokens_after_paddingshould be cpu tensor as it will used asnum_tokens_across_dp_cpuinDPMetadata. It will causes may d2h copy when running model.kv_connector_no_forward,wherenum_tokensis NoneDoes this PR introduce any user-facing change?
No
How was this patch tested?