[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching #6442
@@ -171,9 +171,6 @@ export TASK_QUEUE_ENABLE=1
 # Enable the AIVector core to directly schedule ROCE communication
 export HCCL_OP_EXPANSION_MODE="AIV"
 
-# Enable MLP prefetch for better performance.
-export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
-
 # Enable FlashComm_v1 optimization when tensor parallel is enabled.
 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

@@ -187,7 +184,7 @@ vllm serve /model/Qwen3-32B-W8A8 \
     --max-model-len 5500 \
     --max-num-batched-tokens 40960 \
     --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
-    --additional-config '{"pa_shape_list":[48,64,72,80]}' \
+    --additional-config '{"pa_shape_list":[48,64,72,80], "weight_prefetch_config":{"enabled":true}}' \
     --port 8113 \
     --block-size 128 \
     --gpu-memory-utilization 0.9
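
Once a server launched with the command above is running, it can be exercised through vLLM's standard OpenAI-compatible endpoint; the prompt and token count below are arbitrary, and the port matches the example:

```shell
curl http://localhost:8113/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model/Qwen3-32B-W8A8", "prompt": "Hello, my name is", "max_tokens": 16}'
```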
@@ -348,9 +345,7 @@ Weight prefetching optimizes memory usage by preloading weights into the cache b
 In dense model scenarios, the MLP's gate_up_proj and down_proj linear layers often exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline (such as RMSNorm and SiLU) before the MLP. This approach allows the weights to be preloaded into the L2 cache ahead of time, reducing MTE utilization during the MLP computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
 
 It is important to emphasize that, since we use vector computations to hide the weight prefetching pipeline, the prefetch buffer size setting is crucial. If the buffer size is too small, the optimization benefits will not be fully realized, while a buffer that is too large may lead to resource contention and performance degradation. To accommodate different scenarios, two environment variables, `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE`, allowed flexible buffer size configuration based on the specific workload.
 
-This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` to be enabled.
+The environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP` (used to enable MLP weight prefetch) and the variables `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` (used to set the weight prefetch sizes for the MLP gate_up_proj and down_proj) are deprecated. Please use the following configuration instead: `"weight_prefetch_config": {"enabled": true, "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}}}`. See User Guide -> Feature Guide -> Weight Prefetch Guide for details.
 
 ### 6. Zerolike Elimination

@@ -23,4 +23,5 @@ layer_sharding
 speculative_decoding
 context_parallel
 npugraph_ex
+weight_prefetch
 :::

@@ -0,0 +1,73 @@
# Weight Prefetch Guide

Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline (such as quantize, MoE gating top_k, RMSNorm, and SwiGLU). This approach allows the weights to be preloaded into the L2 cache ahead of time, reducing MTE utilization during the linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.

Since we use vector computations to hide the weight prefetching pipeline, prefetching has an effect on computation; if you prioritize low latency over high throughput, it is best not to enable prefetching.

## Quick Start

Pass `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to enable weight prefetching.
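
For example, a minimal launch might look like the following; the model path is illustrative and should match your deployment:

```shell
vllm serve /model/Qwen3-32B-W8A8 \
  --additional-config '{"weight_prefetch_config": {"enabled": true}}'
```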

## Fine-tuning the Prefetch Ratio

Since weight prefetching uses vector computations to hide the prefetching pipeline, the prefetch size setting is crucial. If the size is too small, the optimization benefits will not be fully realized, while a size that is too large may lead to resource contention and performance degradation. To accommodate different scenarios, we have added `prefetch_ratio` to allow flexible size configuration based on the specific workload, as detailed below.

Use `prefetch_ratio` inside `"weight_prefetch_config"` to customize the weight prefetch ratio for specific linear layers.

The `attn` and `moe` options apply to MoE models:

`"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}`

The `mlp` option applies to dense models:

`"mlp": {"gate_up": 1.0, "down": 1.0}`

The values above are the defaults; they perform well for Qwen3-235B-A22B-W8A8 when `--max-num-seqs` is 144, and for Qwen3-32B-W8A8 when `--max-num-seqs` is 72.

However, this may not be the optimal configuration for your scenario. For higher concurrency, you can try increasing the prefetch size. For lower concurrency, prefetching may not offer any advantage, so you can decrease the size or disable prefetching. To determine whether the prefetch size is appropriate, collect profiling data. Specifically, check whether the time required for the prefetch operation (e.g., MLP down_proj weight prefetching) overlaps with the parallel vector computation operators (e.g., the SwiGLU computation), and whether the prefetch finishes no later than the vector computation operator. In the profiling timeline, a prefetch operation appears as a CMO operation on a single stream.

Notes:

1) Weight prefetching for the MLP `down` projection depends on sequence parallelism; if you want to enable prefetching for the MLP `down` projection, also enable sequence parallelism.
2) Due to the current size of the L2 cache, the maximum prefetch cannot exceed 18 MB. If `prefetch_ratio * linear_layer_weight_size >= 18 * 1024 * 1024` bytes, the backend will only prefetch 18 MB.
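
The clamping rule in note 2 can be sketched as follows. This is only an illustration of the documented arithmetic (the helper name is made up; the real clamp happens inside the backend):

```python
# Illustration of the 18 MB L2-cache prefetch cap described above.
CAP_BYTES = 18 * 1024 * 1024

def effective_prefetch_bytes(weight_size_bytes: int, prefetch_ratio: float) -> int:
    """Requested prefetch size, clamped to the 18 MB maximum."""
    requested = int(prefetch_ratio * weight_size_bytes)
    return min(requested, CAP_BYTES)

# A 24 MiB weight with ratio 1.0 exceeds the cap and is clamped:
print(effective_prefetch_bytes(24 * 1024 * 1024, 1.0))   # 18874368 (= 18 MiB)
# A 16 MiB weight with ratio 0.8 fits under the cap:
print(effective_prefetch_bytes(16 * 1024 * 1024, 0.8))   # 13421772
```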

## Examples

1) For a MoE model:

```shell
--additional-config \
'{
  "weight_prefetch_config": {
    "enabled": true,
    "prefetch_ratio": {
      "attn": {
        "qkv": 1.0,
        "o": 1.0
      },
      "moe": {
        "gate_up": 0.8
      }
    }
  }
}'
```

2) For a dense model:

The following is the default configuration, which performs well for Qwen3-32B-W8A8 with `--max-num-seqs` set to 72:

```shell
--additional-config \
'{
  "weight_prefetch_config": {
    "enabled": true,
    "prefetch_ratio": {
      "mlp": {
        "gate_up": 1.0,
        "down": 1.0
      }
    }
  }
}'
```
@@ -222,7 +222,7 @@ def test_qwen3_dense_fc1_tp2(model):
 
 @pytest.mark.parametrize("model", QWEN_DENSE_MODELS)
-@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_PREFETCH_MLP": "1"})
 @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_FLASHCOMM1": "1"})
 def test_qwen3_dense_prefetch_mlp_weight_tp2(model):
     example_prompts = [
         "Hello, my name is",

@@ -236,6 +236,7 @@ def test_qwen3_dense_prefetch_mlp_weight_tp2(model):
         tensor_parallel_size=2,
         cudagraph_capture_sizes=[1, 2, 4, 8],
         quantization="ascend",
+        additional_config={"weight_prefetch_config": {"enabled": True}},
     ) as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)

@@ -19,12 +19,16 @@
 import torch.nn.functional as F
 
 from vllm_ascend.ops.activation import AscendSiluAndMul
+from vllm_ascend.utils import get_weight_prefetch_method
 
 
 class AscendSiluAndMul310(AscendSiluAndMul):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        torch.ops.vllm.maybe_prefetch_mlp_down_proj(x)
+        weight_prefetch_method = get_weight_prefetch_method()
+        if weight_prefetch_method:
+            weight_prefetch_method.maybe_prefetch_mlp_weight_preprocess(weight_prefetch_method.MLP_DOWN, x)
         h = x.shape[-1] // 2
-        out = F.silu(x[..., :h]) * x[..., h:]
-        torch.ops.vllm.maybe_wait_prefetch_done(out)
+        out = (F.silu(x[..., :h].to(torch.float32)) * x[..., h:].to(torch.float32)).to(torch.float16)
+        if weight_prefetch_method:
+            weight_prefetch_method.maybe_prefetch_mlp_weight_postprocess(out)
         return out