[Refactor] MLP weight prefetch for consistency with MoE Model's prefetching in terms of code and usage #6442
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
The pull request successfully refactors the MLP weight prefetch mechanism, moving from environment variables to a unified additional-config approach consistent with the MoE model's prefetching. This change streamlines configuration and aligns the prefetching logic across model types. The documentation and test cases have been updated to reflect the new configuration method. However, a few critical issues in the new prefetching logic and in some test cases need to be addressed to ensure correctness and maintainability.
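For orientation, here is the migration under review, reconstructed from the PR description; the key names are verbatim from the PR, while the mapping from the removed environment variables is inferred:

```python
# Inferred mapping (the old absolute-size env vars become ratios):
#   VLLM_ASCEND_ENABLE_PREFETCH_MLP       -> weight_prefetch_config["enabled"]
#   VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE -> prefetch_ratio["mlp"]["gate_up"]
#   VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE    -> prefetch_ratio["mlp"]["down"]
weight_prefetch_config = {
    "enabled": True,
    "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
}
```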
elif prefetch_layer_name == self.MLP_DOWN:
    self._maybe_prefetch_mlp_down_weight_preprocess(x_dependency, forward_context)
else:
    raise ValueError(f"Unsupported prefetch weight name: {prefetch_weight_name}")
There is a typo in the ValueError message. The variable prefetch_weight_name is used, but it is not defined in this scope. It should be prefetch_layer_name.
| raise ValueError(f"Unsupported prefetch weight name: {prefetch_weight_name}") | |
| raise ValueError(f"Unsupported prefetch weight name: {prefetch_layer_name}") |
| raise ValueError("curr_layer_prefix must been specified when prefetching mlp gate_up_proj weight") | ||
|
|
||
| # start point of gate_up_proj weight prefetch | ||
| if curr_layer_prefix.split('.')[-2] == "self_attn": |
The condition curr_layer_prefix.split('.')[-2] == "self_attn" is used to determine if MLP gate_up_proj weight prefetching should occur. However, "self_attn" refers to the attention mechanism, not the MLP layer. This logic seems incorrect for MLP prefetching and could lead to the prefetching being triggered at the wrong time or not at all for MLP layers. The condition should accurately identify MLP layers.
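To make the concern concrete, here is a small illustration of what the prefix check actually selects, assuming HuggingFace-style module paths (the helper name is hypothetical):

```python
# Hypothetical illustration: split('.')[-2] picks the sublayer name out of a
# module path such as "model.layers.3.self_attn.qkv_proj".
def sublayer_of(prefix: str) -> str:
    return prefix.split('.')[-2]

assert sublayer_of("model.layers.3.self_attn.qkv_proj") == "self_attn"
assert sublayer_of("model.layers.3.mlp.gate_up_proj") == "mlp"
```

If the intent is to issue the gate_up_proj prefetch while the attention sublayer is still executing, so that the weights arrive before the MLP runs, a comment to that effect would make the condition self-explanatory; otherwise, matching on "mlp" would align the check with its stated purpose.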
@pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_PREFETCH_MLP": "1"})
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
The test test_qwen3_dense_prefetch_mlp_weight_tp2 is intended to test MLP weight prefetching, yet it patches the environment variable VLLM_ASCEND_ENABLE_FLASHCOMM1, which controls the FlashComm optimization, not MLP prefetching. Since MLP prefetching is now configured via additional_config (as correctly done at line 240 of this test), this patch is misleading: it enables an unrelated feature rather than the one under test, which could lead to false positives or incorrect test coverage.
Suggested change:
- @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
+ @pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
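A hedged sketch of the shape this test could take once the env-var patch is dropped; QWEN_DENSE_MODELS and VllmRunner stand in for this repo's test helpers, and their real signatures may differ:

```python
# A sketch only: configure MLP prefetch via additional_config instead of
# environment variables. VllmRunner is a stand-in for the repo's test helper.
import pytest

QWEN_DENSE_MODELS = ["Qwen/Qwen3-8B"]  # placeholder for the repo's list

@pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
def test_qwen3_dense_prefetch_mlp_weight_tp2(model):
    additional_config = {
        "weight_prefetch_config": {
            "enabled": True,
            "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
        }
    }
    with VllmRunner(model,
                    tensor_parallel_size=2,
                    additional_config=additional_config) as runner:
        runner.generate_greedy(["Hello, my name is"], max_tokens=8)
```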
SUPPORTED_MODULES = ["attn", "mlp", "moe"]
MOE_PREFETCH_TOKEN_THRESHOLD = 96

MAX_PREFETCH_WEIGHT_SIZE = 18 * 1024 * 1024
The constant MAX_PREFETCH_WEIGHT_SIZE is defined with a magic number, 18 * 1024 * 1024. While this value was previously a default in the environment variables, such a tuning parameter deserves a comment explaining its origin or purpose, and it may be worth exposing through WeightPrefetchConfig if it is meant to be dynamic.
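If it were made configurable, a minimal sketch of the shape this could take, assuming WeightPrefetchConfig is a plain dataclass in this repo (the field name max_weight_size and its wiring are illustrative, not the actual implementation):

```python
# A sketch only: expose the 18 MiB cap as a tunable field instead of a
# module-level magic number. Field names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class WeightPrefetchConfig:
    enabled: bool = False
    prefetch_ratio: dict = field(default_factory=dict)
    # Upper bound on bytes prefetched per weight. 18 MiB matches the
    # previous env-var default cited in this review.
    max_weight_size: int = 18 * 1024 * 1024
```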
prefetch_ratio=weight_prefetch_config.prefetch_ratio.get(
    "mlp", {}) or {'gate_up': 1.0, 'down': 1.0})

print(f'mlp prefetch config: {self.mlp} self.is_moe:{self.is_moe} ==============================================================')
weight_size = weight.data.element_size() * weight.data.numel() * self.mlp.prefetch_ratio.get("gate_up", 0)
if weight_size > MAX_PREFETCH_WEIGHT_SIZE:
    weight_size = MAX_PREFETCH_WEIGHT_SIZE
print(f'mlp prefetch gate_up current layer prefix:{curr_layer_prefix}, weight size: {weight_size} ==============================================================')

max_weight_size=int(weight_size))
forward_context.prefetch_mlp_down_proj = True
forward_context.layer_idx += 1
print(f'mlp prefetch down layer idx:{layer_idx}, layer_idx for next forward:{forward_context.layer_idx}, weight size: {weight_size} ==============================================================')
|
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: leo-pony <nengjunma@outlook.com>
class AscendSiluAndMul310(AscendSiluAndMul):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        torch.ops.vllm.maybe_prefetch_mlp_down_proj(x)
        weight_prefetch_method = get_weight_prefetch_method()
Maybe we should drop support for 310P first.
It is important to emphasize that, since we use vector computations to hide the weight prefetching pipeline, the setting of the prefetch buffer size is crucial. If the buffer size is too small, the optimization benefits will not be fully realized, while a larger buffer size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have exposed two environment variables `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` to allow flexible buffer size configuration based on the specific workload.

This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP = 1` to be enabled.

The environment variables `VLLM_ASCEND_ENABLE_PREFETCH_MLP` (used to enable MLP weight prefetch), `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` (used to set the weight prefetch sizes for MLP gate_up_proj and down_proj) are deprecated. Please use the following configuration instead: "weight_prefetch_config": { "enabled": true, "prefetch_ratio": { "mlp": { "gate_up": 1.0, "down": 1.0}}}. See User Guide->Feature Guide->Weight Prefetch Guide for details.
The sentence "See User Guide->Feature Guide->Weight Prefetch Guide for details." can be replaced with a link instead.
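To make the ratio semantics concrete, a small sketch of how a prefetch ratio maps to bytes, mirroring the capping logic visible in the diff above; the 18 MiB constant is quoted from this review, while the tensor shape and dtype are made up for the example:

```python
import torch

# Quoted from the diff above; everything else here is illustrative.
MAX_PREFETCH_WEIGHT_SIZE = 18 * 1024 * 1024

def effective_prefetch_size(weight: torch.Tensor, ratio: float) -> int:
    # ratio scales the full weight size, then the cap is applied.
    size = int(weight.element_size() * weight.numel() * ratio)
    return min(size, MAX_PREFETCH_WEIGHT_SIZE)

w = torch.empty(4096, 11008, dtype=torch.bfloat16)  # ~86 MiB, e.g. a gate_up weight
print(effective_prefetch_size(w, 1.0))   # 18874368: capped at 18 MiB
print(effective_prefetch_size(w, 0.1))   # ~8.6 MiB: under the cap, kept as-is
```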
…ching in terms of code and usage (vllm-project#6442) Refactor MLP weight prefetch to be consistent with the MoE model's prefetching in terms of code and usage. The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is as follows: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' - vLLM version: v0.14.1 - vLLM main: vllm-project/vllm@dc917cc --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: ZYang6263 <zy626375@gmail.com>
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend: (59 commits)
[Feat.]: 310p support MOE models (vllm-project#6530)
[Doc] backport 0.13.0 release note (vllm-project#6584)
[CI] Update UT CANN version to 8.5.0 for main branch (vllm-project#6564)
[CI] Change A2 runner (vllm-project#6557)
[Bugfix] Fix the incorrect use of the output parameter in _forward_fia_slidingwindow (vllm-project#6469)
[main2main] upgrade vllm main 0202 (vllm-project#6560)
[CI][npugraph_ex]Fix npugraph ex e2e test (vllm-project#6553)
[Feature]KV pool supports sparse attention (vllm-project#6339)
[bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (vllm-project#6491)
perf: adaptive block size selection in linear_persistent kernel (vllm-project#6537)
[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (vllm-project#6475)
[Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (vllm-project#6126)
[Fusion] Add rmsnorm dynamic quant fusion pass (vllm-project#6274)
[Bugfix] Synchronize only the current stream to avoid device sync (vllm-project#6432)
[CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6499)
[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (vllm-project#6442)
[bugfix][npugraph_ex]duplicate pattern issue (vllm-project#6513)
[bugfix][npugraph_ex]add the extra check for allreduce rmsnorm fusion pass (vllm-project#6430)
[Quant] GLM4.7-Flash Support W8A8 (vllm-project#6492)
[Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (vllm-project#6505)
...
What this PR does / why we need it?
Refactor MLP weight prefetch to be consistent with the MoE model's prefetching in terms of code and usage.
The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is as follows:
--additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'
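For completeness, a hedged offline-inference sketch of the new usage, assuming vLLM's LLM entry point forwards additional_config through EngineArgs to the Ascend platform; the model name is just an example:

```python
# Minimal sketch: enabling MLP weight prefetch via additional_config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",  # illustrative model choice
    additional_config={
        "weight_prefetch_config": {
            "enabled": True,
            "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
        }
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```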
Does this PR introduce any user-facing change?
How was this patch tested?