[Perf] Deepseekv3 performance optimization for eager mode #598
wangxiyuan merged 15 commits into vllm-project:main from
Conversation
Testing this PR with a [2048-in, 128-out] * 16 workload, tp=2, ep=2, on deepseekv3-lite in eager mode, we got roughly a 50% perf boost.
…0 scheduler Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
… cache Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
7a49e27 to
a61cf81
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Found that the custom rotary_embedding was not enabled before; after enabling the custom rotary_embedding, the performance gain reaches roughly 100% compared to the main branch.
BTW, this PR enables plugging the v0 scheduler into engine v1 by default. Maybe we should discuss this @wangxiyuan @wuhuikx .
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
@wuhuikx @wangxiyuan please help review this PR.
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Disabled default v0 scheduler
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
# TODO: Patch when aclnn ops available
RotaryEmbedding.forward_oot = rope_forward_oot
DeepseekScalingRotaryEmbedding.__init__ = deepseek_rope_init_func
For this kind of change which overrides vLLM, we should add a note to describe why. L269 is the same.
Is the DeepseekScalingRotaryEmbedding not a custom op in stock vLLM? Can we file a PR to vLLM to make it a custom op, so that after that we can use forward_oot, right?
It is a custom op, but there are a few things that make these changes necessary:
- The native DeepSeek rope caches cos_sin differently from the naive implementation in DeepSeek's HuggingFace repo, which we found more Ascend friendly than vLLM's implementation.
- It just overrides forward rather than reusing the custom op's forward_oot interface.

I'll add some comments on those lines to explain the specific reasons (see the sketch below).
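A minimal, self-contained sketch of the monkey-patch pattern being discussed here; the class and function bodies are illustrative stand-ins, not the actual vllm-ascend implementations:

```python
# Illustrative stand-in for vLLM's rotary embedding custom op; the real class
# lives in vllm.model_executor.layers.rotary_embedding.
class RotaryEmbedding:
    def forward_oot(self, positions, query, key):
        raise NotImplementedError


def rope_forward_oot(self, positions, query, key):
    # NOTE(vllm-ascend): we override vLLM here because
    #  1. the native DeepSeek rope caches cos/sin in the HuggingFace layout,
    #     which we found more Ascend friendly than vLLM's layout;
    #  2. the upstream class overrides forward() directly instead of routing
    #     through the custom-op forward_oot interface.
    # TODO: drop this patch once the aclnn ops are available.
    return query, key  # placeholder for the NPU rope kernel call


# The patch itself: rebind the method on the vLLM class at import time.
RotaryEmbedding.forward_oot = rope_forward_oot
```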
if self._num_prefills > 0:
    reqs_start = self._num_decodes  # prefill_start
    tokens_start = self._num_decode_tokens
    max_query_len = query_lens[tokens_start:].max().item()
Is query_lens a device tensor? If so, there are many D2H copies here; is this operation necessary?
query_lens is actually a CPU tensor, so no D2H operation will happen here; you can refer to line 220.
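A quick illustration of this point: since query_lens lives on the CPU, `.max().item()` is a pure host-side operation and involves no device-to-host copy (the tensor values below are made up):

```python
import torch

# CPU tensor, mirroring how the metadata builder keeps query_lens on the host.
query_lens = torch.tensor([1, 1, 512, 384], dtype=torch.int32)
assert query_lens.device.type == "cpu"

tokens_start = 2  # e.g. number of decode tokens ahead of the prefill requests
max_query_len = query_lens[tokens_start:].max().item()  # host-only, no D2H
print(max_query_len)  # 512
```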
# TODO: the padding below should be removed once the kernel is ready
# We found npu_flash_attention only works on a 128-divisible head_dim, so we pad it to the target size here
# and slice the final result to guarantee its functionality.
self.padding_head_dim = (
In prefill, we use MHA for computation, so head_dim = nope_dim + rope_dim (192), while in decode the absorbed and move_elision strategies are adopted, so head_dim = nope_dim and we don't need padding, am I right?
You are definitely right, this padding dim is used in prefill to pad the tensor. It is not just for v_head_dim vs (qk_rope + qk_nope), but also for the 128-divisible head_dim alignment requirement of _npu_flash_attention.
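A hedged sketch of that padding strategy (the helper name and shapes are illustrative, not the actual backend code): pad the head dim up to the next multiple of 128 for the kernel, then slice the output back afterwards.

```python
import torch
import torch.nn.functional as F


def pad_to_multiple(x: torch.Tensor, multiple: int = 128) -> torch.Tensor:
    """Zero-pad the last dim of x up to the next multiple of `multiple`."""
    head_dim = x.shape[-1]
    padded_dim = (head_dim + multiple - 1) // multiple * multiple
    return F.pad(x, (0, padded_dim - head_dim))


# Prefill MHA: head_dim = qk_nope (128) + qk_rope (64) = 192, not 128-divisible.
q = torch.randn(16, 8, 192)
q_padded = pad_to_multiple(q)      # shape (16, 8, 256), acceptable to the kernel
attn_out = q_padded                # placeholder for the _npu_flash_attention call
attn_out = attn_out[..., :128]     # slice back to v_head_dim (128 for DeepSeek)
```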
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
The above data is measured under the v0 scheduler + v1 engine. If the v0 scheduler is not enabled, vanilla chunked prefill will be used to guarantee functionality, which is composed of a bunch of small ops. That vanilla path hurts performance enormously because of the many additional host and device operations, so we strongly recommend adopting scheduler v0 for perf tests by adding one single line. To enable the v0 scheduler in engine v1, you can pass:

```python
llm = LLM(model="/data/weights/deepseek-ai/deepseekv3-lite-base-latest",
          tensor_parallel_size=2,
          enforce_eager=True,
          trust_remote_code=True,
          max_model_len=1024,
          additional_config={"ascend_scheduler_config": {}})
```
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
else:
    CUSTOM_OP_ENABLED = True
except ImportError:
    logging.warning(
Using the logger from vllm is better. And line 39 can be moved to L33. We can fix the nit later.
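A minimal sketch of the suggested change, using vLLM's `init_logger` instead of the stdlib logging module; the imported extension module name is an assumption for illustration, not necessarily the real one:

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

try:
    # Module name assumed for illustration: the compiled custom-op extension.
    import vllm_ascend.vllm_ascend_C  # noqa: F401
    CUSTOM_OP_ENABLED = True
except ImportError:
    CUSTOM_OP_ENABLED = False
    logger.warning(
        "Failed to import custom ops; falling back to native implementations.")
```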
This info is very useful. Considering we plan to write a feature doc about the ascend scheduler, this info can be included there.
### What this PR does / why we need it?
As a follow-up to #1070, this patch adds a `Nominating and Removing Maintainers` section (referencing some of the design from [PyTorch Governance](https://docs.pytorch.org/docs/stable/community/governance.html)). Below is key info about the existing maintainers:

## @wangxiyuan:
- Super active, high quality code reviewer: [450+ PRs reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Awangxiyuan).
- One of the top contributors: he has also actively contributed [50+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Awangxiyuan+) of good quality, and he dares to [refactor the code](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Awangxiyuan+is%3Aclosed+refactor), which also shows his deep understanding of vllm and vllm-ascend.
- He leads the [[RFC]: Hardware pluggable](vllm-project/vllm#11162) feature, which made the vllm-ascend project possible.
- Active community involvement across the WeChat group, Slack, and GitHub issues: involved in [150+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Awangxiyuan) helping users. He was also a speaker at the vLLM Beijing meetup, helping more users understand vLLM Ascend.
- Release manager of [v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2), [v0.8.4rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc1), [v0.7.3.post1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3.post1).

## @Yikun:
- Highly active code reviewer: [190+ PRs reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AYikun), especially helping new developers onboard.
- One of the top contributors with sustained contributions: [50+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3AYikun+) since the first day of vLLM Ascend.
- High quality contributions around the vLLM compatibility guarantee; he also maintains the [CI](#1040) and [test framework](#730).
- Active community involvement across the local group and GitHub issues: involved in [170+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3AYikun). He is also the main organizer of the vLLM Beijing Meetup and a speaker at [PyTorch Day China 2025](https://pytorchdaychina2025.sched.com/event/2401V/poster-session), helping vLLM Ascend grow.
- Release manager of [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc2), [v0.8.5rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.5rc1), [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3).

## @ganyi1996ppo
- Highly active, high quality code reviewer: [90+ PRs reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Aganyi1996ppo). He has a deep understanding of Ascend operators and can always find key issues, understands the codebase deeply, and shows good code quality and sound judgement.
- Major, high quality contributions: [10+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Aganyi1996ppo). He is the main contributor of [Custom AscendC op support](#371) and [Deepseekv3 performance optimization](#598).
- Community involvement: involved in [11+ issues helping users](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Aganyi1996ppo), and shared a [custom ops topic](https://www.bilibili.com/video/BV1Z25az3EqS/?share_source=copy_web&vd_source=72ef9c665af5f2f1370abe26ce1f719f&t=1342) at the vLLM Ascend weekly meeting.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
What this PR does / why we need it?
DeepSeek V3 currently adopts vanilla chunked prefill for the MLA part, which is inefficient for computation but necessary for chunked prefill. Since PR #543 brought the v0 scheduler into vllm-ascend, we can now adopt torch_npu._npu_flash_attention inside the MLA backend for a further performance boost. There was also some redundant computation inside the rope, which is removed as well. This PR should bring some performance gain for DeepSeek eager-mode inference.
Does this PR introduce any user-facing change?
How was this patch tested?
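The benchmark described at the top of the conversation ([2048-in, 128-out] * 16 requests, tp=2, eager mode on deepseekv3-lite). A rough sketch of such an offline throughput run is below; it is not the authors' exact script, the model path is a placeholder taken from the example above, and the prompts would need to be roughly 2048 tokens each:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/weights/deepseek-ai/deepseekv3-lite-base-latest",  # placeholder path
    tensor_parallel_size=2,
    enforce_eager=True,
    trust_remote_code=True,
    max_model_len=4096,
    additional_config={"ascend_scheduler_config": {}},  # enable the v0-style scheduler
)

prompts = ["..."] * 16  # in practice, ~2048-token prompts
params = SamplingParams(max_tokens=128, temperature=0.0, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_out = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_out / elapsed:.1f} output tokens/s")
```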