[Refactor] MLP weight prefetch for consistency with MoE Model's prefetching in terms of code and usage #6442
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
The pull request successfully refactors the MLP weight prefetch mechanism, moving from environment variables to a unified additional-config approach consistent with the MoE model's prefetching. This change streamlines configuration and aligns the prefetching logic across model types. The documentation and test cases have been updated to reflect the new configuration method. However, a few critical issues in the new prefetching logic and in some test cases need to be addressed to ensure correctness and maintainability.
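For orientation, here is the migration under review, reconstructed from the PR description; the key names are verbatim from the PR, while the mapping from the removed environment variables is inferred:

```python
# Inferred mapping (the old absolute-size env vars become ratios):
#   VLLM_ASCEND_ENABLE_PREFETCH_MLP       -> weight_prefetch_config["enabled"]
#   VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE -> prefetch_ratio["mlp"]["gate_up"]
#   VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE    -> prefetch_ratio["mlp"]["down"]
weight_prefetch_config = {
    "enabled": True,
    "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
}
```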
elif prefetch_layer_name == self.MLP_DOWN:
    self._maybe_prefetch_mlp_down_weight_preprocess(x_dependency, forward_context)
else:
    raise ValueError(f"Unsupported prefetch weight name: {prefetch_weight_name}")
There is a typo in the ValueError message. The variable prefetch_weight_name is used, but it is not defined in this scope. It should be prefetch_layer_name.
| raise ValueError(f"Unsupported prefetch weight name: {prefetch_weight_name}") | |
| raise ValueError(f"Unsupported prefetch weight name: {prefetch_layer_name}") |
| raise ValueError("curr_layer_prefix must been specified when prefetching mlp gate_up_proj weight") | ||
|
|
||
| # start point of gate_up_proj weight prefetch | ||
| if curr_layer_prefix.split('.')[-2] == "self_attn": |
The condition curr_layer_prefix.split('.')[-2] == "self_attn" is used to determine if MLP gate_up_proj weight prefetching should occur. However, "self_attn" refers to the attention mechanism, not the MLP layer. This logic seems incorrect for MLP prefetching and could lead to the prefetching being triggered at the wrong time or not at all for MLP layers. The condition should accurately identify MLP layers.
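To make the concern concrete, here is a small illustration of what the prefix check actually selects, assuming HuggingFace-style module paths (the helper name is hypothetical):

```python
# Hypothetical illustration: split('.')[-2] picks the sublayer name out of a
# module path such as "model.layers.3.self_attn.qkv_proj".
def sublayer_of(prefix: str) -> str:
    return prefix.split('.')[-2]

assert sublayer_of("model.layers.3.self_attn.qkv_proj") == "self_attn"
assert sublayer_of("model.layers.3.mlp.gate_up_proj") == "mlp"
```

If the intent is to issue the gate_up_proj prefetch while the attention sublayer is still executing, so that the weights arrive before the MLP runs, a comment to that effect would make the condition self-explanatory; otherwise, matching on "mlp" would align the check with its stated purpose.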
@pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_PREFETCH_MLP": "1"})
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
The test test_qwen3_dense_prefetch_mlp_weight_tp2 is intended to test MLP weight prefetching, yet it patches the environment variable VLLM_ASCEND_ENABLE_FLASHCOMM1, which controls the FlashComm optimization, not MLP prefetching. Since MLP prefetching is now configured via additional_config (as correctly done at line 240 of this test), this patch is misleading: it enables an unrelated feature rather than the one under test, which could lead to false positives or incorrect test coverage.
Suggested change:
- @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
+ @pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
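A hedged sketch of the shape this test could take once the env-var patch is dropped; QWEN_DENSE_MODELS and VllmRunner stand in for this repo's test helpers, and their real signatures may differ:

```python
# A sketch only: configure MLP prefetch via additional_config instead of
# environment variables. VllmRunner is a stand-in for the repo's test helper.
import pytest

QWEN_DENSE_MODELS = ["Qwen/Qwen3-8B"]  # placeholder for the repo's list

@pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
def test_qwen3_dense_prefetch_mlp_weight_tp2(model):
    additional_config = {
        "weight_prefetch_config": {
            "enabled": True,
            "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
        }
    }
    with VllmRunner(model,
                    tensor_parallel_size=2,
                    additional_config=additional_config) as runner:
        runner.generate_greedy(["Hello, my name is"], max_tokens=8)
```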
SUPPORTED_MODULES = ["attn", "mlp", "moe"]
MOE_PREFETCH_TOKEN_THRESHOLD = 96

MAX_PREFETCH_WEIGHT_SIZE = 18 * 1024 * 1024
The constant MAX_PREFETCH_WEIGHT_SIZE is defined with a magic number, 18 * 1024 * 1024. While this value was previously a default in the environment variables, such a tuning parameter deserves a comment explaining its origin or purpose, and it may be worth exposing through WeightPrefetchConfig if it is meant to be dynamic.
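If it were made configurable, a minimal sketch of the shape this could take, assuming WeightPrefetchConfig is a plain dataclass in this repo (the field name max_weight_size and its wiring are illustrative, not the actual implementation):

```python
# A sketch only: expose the 18 MiB cap as a tunable field instead of a
# module-level magic number. Field names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class WeightPrefetchConfig:
    enabled: bool = False
    prefetch_ratio: dict = field(default_factory=dict)
    # Upper bound on bytes prefetched per weight. 18 MiB matches the
    # previous env-var default cited in this review.
    max_weight_size: int = 18 * 1024 * 1024
```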
prefetch_ratio=weight_prefetch_config.prefetch_ratio.get(
    "mlp", {}) or {'gate_up': 1.0, 'down': 1.0})

print(f'mlp prefetch config: {self.mlp} self.is_moe:{self.is_moe} ==============================================================')
weight_size = weight.data.element_size() * weight.data.numel() * self.mlp.prefetch_ratio.get("gate_up", 0)
if weight_size > MAX_PREFETCH_WEIGHT_SIZE:
    weight_size = MAX_PREFETCH_WEIGHT_SIZE
print(f'mlp prefetch gate_up current layer prefix:{curr_layer_prefix}, weight size: {weight_size} ==============================================================')

max_weight_size=int(weight_size))
forward_context.prefetch_mlp_down_proj = True
forward_context.layer_idx += 1
print(f'mlp prefetch down layer idx:{layer_idx}, layer_idx for next forward:{forward_context.layer_idx}, weight size: {weight_size} ==============================================================')
|
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: leo-pony <nengjunma@outlook.com>
class AscendSiluAndMul310(AscendSiluAndMul):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        torch.ops.vllm.maybe_prefetch_mlp_down_proj(x)
        weight_prefetch_method = get_weight_prefetch_method()
Maybe we should drop support for 310P first.
It is important to emphasize that, since we use vector computations to hide the weight prefetching pipeline, the setting of the prefetch buffer size is crucial. If the buffer size is too small, the optimization benefits will not be fully realized, while a larger buffer size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have exposed two environment variables `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` to allow flexible buffer size configuration based on the specific workload.

This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP = 1` to be enabled.

The environment variables `VLLM_ASCEND_ENABLE_PREFETCH_MLP` (used to enable MLP weight prefetch), `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` (used to set the weight prefetch sizes for MLP gate_up_proj and down_proj) are deprecated. Please use the following configuration instead: "weight_prefetch_config": { "enabled": true, "prefetch_ratio": { "mlp": { "gate_up": 1.0, "down": 1.0}}}. See User Guide->Feature Guide->Weight Prefetch Guide for details.
The sentence "See User Guide->Feature Guide->Weight Prefetch Guide for details." can be replaced with a link instead.
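To make the ratio semantics concrete, a small sketch of how a prefetch ratio maps to bytes, mirroring the capping logic visible in the diff above; the 18 MiB constant is quoted from this review, while the tensor shape and dtype are made up for the example:

```python
import torch

# Quoted from the diff above; everything else here is illustrative.
MAX_PREFETCH_WEIGHT_SIZE = 18 * 1024 * 1024

def effective_prefetch_size(weight: torch.Tensor, ratio: float) -> int:
    # ratio scales the full weight size, then the cap is applied.
    size = int(weight.element_size() * weight.numel() * ratio)
    return min(size, MAX_PREFETCH_WEIGHT_SIZE)

w = torch.empty(4096, 11008, dtype=torch.bfloat16)  # ~86 MiB, e.g. a gate_up weight
print(effective_prefetch_size(w, 1.0))   # 18874368: capped at 18 MiB
print(effective_prefetch_size(w, 0.1))   # ~8.6 MiB: under the cap, kept as-is
```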
…ching in terms of code and usage (vllm-project#6442) Refactor MLP weight prefetch to be consistent with the MoE model's prefetching in terms of code and usage. The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is as follows: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' - vLLM version: v0.14.1 - vLLM main: vllm-project/vllm@dc917cc --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: ZYang6263 <zy626375@gmail.com>
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend: (59 commits)
[Feat.]: 310p support MOE models (vllm-project#6530)
[Doc] backport 0.13.0 release note (vllm-project#6584)
[CI] Update UT CANN version to 8.5.0 for main branch (vllm-project#6564)
[CI] Change A2 runner (vllm-project#6557)
[Bugfix] Fix the incorrect use of the output parameter in _forward_fia_slidingwindow (vllm-project#6469)
[main2main] upgrade vllm main 0202 (vllm-project#6560)
[CI][npugraph_ex]Fix npugraph ex e2e test (vllm-project#6553)
[Feature]KV pool supports sparse attention (vllm-project#6339)
[bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (vllm-project#6491)
perf: adaptive block size selection in linear_persistent kernel (vllm-project#6537)
[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (vllm-project#6475)
[Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (vllm-project#6126)
[Fusion] Add rmsnorm dynamic quant fusion pass (vllm-project#6274)
[Bugfix] Synchronize only the current stream to avoid device sync (vllm-project#6432)
[CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6499)
[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (vllm-project#6442)
[bugfix][npugraph_ex]duplicate pattern issue (vllm-project#6513)
[bugfix][npugraph_ex]add the extra check for allreduce rmsnorm fusion pass (vllm-project#6430)
[Quant] GLM4.7-Flash Support W8A8 (vllm-project#6492)
[Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (vllm-project#6505)
...
What this PR does / why we need it?
Refactor MLP weight prefetch to be consistent with the MoE model's prefetching in terms of code and usage.
The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is as follows:
--additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'
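For completeness, a hedged offline-inference sketch of the new usage, assuming vLLM's LLM entry point forwards additional_config through EngineArgs to the Ascend platform; the model name is just an example:

```python
# Minimal sketch: enabling MLP weight prefetch via additional_config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",  # illustrative model choice
    additional_config={
        "weight_prefetch_config": {
            "enabled": True,
            "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
        }
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```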
Does this PR introduce any user-facing change?
How was this patch tested?