[Perf] Deepseekv3 performance optimization for eager mode #598
wangxiyuan merged 15 commits into vllm-project:main from
Conversation
Testing this PR with a [2048-in, 128-out] * 16 workload, tp=2, ep=2, on deepseekv3-lite in eager mode, we got roughly a 50% perf boost.
…0 scheduler Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
… cache Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
7a49e27 to
a61cf81
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Found that the custom rotary_embedding was not enabled before; after enabling the custom rotary_embedding, the performance gain reaches roughly 100% compared to the main branch.
BTW, this PR enables plugging the v0 scheduler into engine v1 by default. Maybe we should discuss this @wangxiyuan @wuhuikx .
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
@wuhuikx @wangxiyuan please help review this PR.
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Disabled default v0 scheduler
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
# TODO: Patch when aclnn ops available
RotaryEmbedding.forward_oot = rope_forward_oot
DeepseekScalingRotaryEmbedding.__init__ = deepseek_rope_init_func
For this kind of change which overrides vLLM, we should add a note to describe why. L269 is the same.
Is the DeepseekScalingRotaryEmbedding not a custom op in stock vLLM? Can we file a PR to vLLM to make it a custom op, so that after that we can use forward_oot, right?
It is a custom op, but there are a few things that make these changes necessary:
- The native DeepSeek rope caches cos_sin differently from the naive implementation in DeepSeek's HuggingFace repo, which we found more Ascend friendly than vLLM's implementation.
- It just overrides forward rather than reusing the custom op's forward_oot interface.

I'll add some comments on those lines to explain the specific reasons (see the sketch below).
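A minimal, self-contained sketch of the monkey-patch pattern being discussed here; the class and function bodies are illustrative stand-ins, not the actual vllm-ascend implementations:

```python
# Illustrative stand-in for vLLM's rotary embedding custom op; the real class
# lives in vllm.model_executor.layers.rotary_embedding.
class RotaryEmbedding:
    def forward_oot(self, positions, query, key):
        raise NotImplementedError


def rope_forward_oot(self, positions, query, key):
    # NOTE(vllm-ascend): we override vLLM here because
    #  1. the native DeepSeek rope caches cos/sin in the HuggingFace layout,
    #     which we found more Ascend friendly than vLLM's layout;
    #  2. the upstream class overrides forward() directly instead of routing
    #     through the custom-op forward_oot interface.
    # TODO: drop this patch once the aclnn ops are available.
    return query, key  # placeholder for the NPU rope kernel call


# The patch itself: rebind the method on the vLLM class at import time.
RotaryEmbedding.forward_oot = rope_forward_oot
```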
if self._num_prefills > 0:
    reqs_start = self._num_decodes  # prefill_start
    tokens_start = self._num_decode_tokens
    max_query_len = query_lens[tokens_start:].max().item()
Is query_lens a device tensor? If so, there are many D2H copies here; is this operation necessary?
query_lens is actually a CPU tensor, so no D2H operation will happen here; you can refer to line 220.
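A quick illustration of this point: since query_lens lives on the CPU, `.max().item()` is a pure host-side operation and involves no device-to-host copy (the tensor values below are made up):

```python
import torch

# CPU tensor, mirroring how the metadata builder keeps query_lens on the host.
query_lens = torch.tensor([1, 1, 512, 384], dtype=torch.int32)
assert query_lens.device.type == "cpu"

tokens_start = 2  # e.g. number of decode tokens ahead of the prefill requests
max_query_len = query_lens[tokens_start:].max().item()  # host-only, no D2H
print(max_query_len)  # 512
```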
# TODO: the padding below should be removed once the kernel is ready
# We found npu_flash_attention only works on a 128-divisible head_dim, so we pad it to the target size here
# and slice the final result to guarantee its functionality.
self.padding_head_dim = (
In prefill, we use MHA for computation, so head_dim = nope_dim + rope_dim (192), while in decode the absorbed and move_elision strategies are adopted, so head_dim = nope_dim and we don't need padding, am I right?
You are definitely right, this padding dim is used in prefill to pad the tensor. It is not just for v_head_dim vs (qk_rope + qk_nope), but also for the 128-divisible head_dim alignment requirement of _npu_flash_attention.
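A hedged sketch of that padding strategy (the helper name and shapes are illustrative, not the actual backend code): pad the head dim up to the next multiple of 128 for the kernel, then slice the output back afterwards.

```python
import torch
import torch.nn.functional as F


def pad_to_multiple(x: torch.Tensor, multiple: int = 128) -> torch.Tensor:
    """Zero-pad the last dim of x up to the next multiple of `multiple`."""
    head_dim = x.shape[-1]
    padded_dim = (head_dim + multiple - 1) // multiple * multiple
    return F.pad(x, (0, padded_dim - head_dim))


# Prefill MHA: head_dim = qk_nope (128) + qk_rope (64) = 192, not 128-divisible.
q = torch.randn(16, 8, 192)
q_padded = pad_to_multiple(q)      # shape (16, 8, 256), acceptable to the kernel
attn_out = q_padded                # placeholder for the _npu_flash_attention call
attn_out = attn_out[..., :128]     # slice back to v_head_dim (128 for DeepSeek)
```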
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
The above data is measured under the v0 scheduler + v1 engine. If the v0 scheduler is not enabled, vanilla chunked prefill will be used to guarantee functionality, which is composed of a bunch of small ops. That vanilla path hurts performance enormously because of the many additional host and device operations, so we strongly recommend adopting scheduler v0 for perf tests by adding one single line. To enable the v0 scheduler in engine v1, you can pass:

```python
llm = LLM(model="/data/weights/deepseek-ai/deepseekv3-lite-base-latest",
          tensor_parallel_size=2,
          enforce_eager=True,
          trust_remote_code=True,
          max_model_len=1024,
          additional_config={"ascend_scheduler_config": {}})
```
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
else:
    CUSTOM_OP_ENABLED = True
except ImportError:
    logging.warning(
Using the logger from vllm is better. And line 39 can be moved to L33. We can fix the nit later.
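A minimal sketch of the suggested change, using vLLM's `init_logger` instead of the stdlib logging module; the imported extension module name is an assumption for illustration, not necessarily the real one:

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

try:
    # Module name assumed for illustration: the compiled custom-op extension.
    import vllm_ascend.vllm_ascend_C  # noqa: F401
    CUSTOM_OP_ENABLED = True
except ImportError:
    CUSTOM_OP_ENABLED = False
    logger.warning(
        "Failed to import custom ops; falling back to native implementations.")
```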
This info is very useful. Considering we plan to write a feature doc about the ascend scheduler, this info can be included there.
### What this PR does / why we need it?
As a follow-up to #1070, this patch adds a `Nominating and Removing Maintainers` section (referencing some of the design from [PyTorch Governance](https://docs.pytorch.org/docs/stable/community/governance.html)). Below is key info about the existing maintainers:

## @wangxiyuan:
- Super active, high quality code reviewer: [450+ PRs reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Awangxiyuan).
- One of the top contributors: he has also actively contributed [50+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Awangxiyuan+) of good quality, and he dares to [refactor the code](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Awangxiyuan+is%3Aclosed+refactor), which also shows his deep understanding of vllm and vllm-ascend.
- He leads the [[RFC]: Hardware pluggable](vllm-project/vllm#11162) feature, which made the vllm-ascend project possible.
- Active community involvement across the WeChat group, Slack, and GitHub issues: involved in [150+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Awangxiyuan) helping users. He was also a speaker at the vLLM Beijing meetup, helping more users understand vLLM Ascend.
- Release manager of [v0.7.1rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.1rc1), [v0.7.3rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc1), [v0.7.3rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3rc2), [v0.8.4rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc1), [v0.7.3.post1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3.post1).

## @Yikun:
- Highly active code reviewer: [190+ PRs reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3AYikun), especially helping new developers onboard.
- One of the top contributors with sustained contributions: [50+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3AYikun+) since the first day of vLLM Ascend.
- High quality contributions around the vLLM compatibility guarantee; he also maintains the [CI](#1040) and [test framework](#730).
- Active community involvement across the local group and GitHub issues: involved in [170+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3AYikun). He is also the main organizer of the vLLM Beijing Meetup and a speaker at [PyTorch Day China 2025](https://pytorchdaychina2025.sched.com/event/2401V/poster-session), helping vLLM Ascend grow.
- Release manager of [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.4rc2), [v0.8.5rc1](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.8.5rc1), [v0.7.3](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.7.3).

## @ganyi1996ppo
- Highly active, high quality code reviewer: [90+ PRs reviewed](https://github.com/vllm-project/vllm-ascend/pulls?q=commenter%3Aganyi1996ppo). He has a deep understanding of Ascend operators and can always find key issues, understands the codebase deeply, and shows good code quality and sound judgement.
- Major, high quality contributions: [10+ commits](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Aclosed+review%3Aapproved+author%3Aganyi1996ppo). He is the main contributor of [Custom AscendC op support](#371) and [Deepseekv3 performance optimization](#598).
- Community involvement: involved in [11+ issues helping users](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20state%3Aopen%20commenter%3Aganyi1996ppo), and shared a [custom ops topic](https://www.bilibili.com/video/BV1Z25az3EqS/?share_source=copy_web&vd_source=72ef9c665af5f2f1370abe26ce1f719f&t=1342) at the vLLM Ascend weekly meeting.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
What this PR does / why we need it?
DeepSeek V3 currently adopts vanilla chunked prefill for the MLA part, which is inefficient for computation but necessary for chunked prefill. Since PR #543 brought the v0 scheduler into vllm-ascend, we can now adopt torch_npu._npu_flash_attention inside the MLA backend for a further performance boost. There was also some redundant computation inside the rope, which is removed as well. This PR should bring some performance gain for DeepSeek eager-mode inference.
Does this PR introduce any user-facing change?
How was this patch tested?
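The benchmark described at the top of the conversation ([2048-in, 128-out] * 16 requests, tp=2, eager mode on deepseekv3-lite). A rough sketch of such an offline throughput run is below; it is not the authors' exact script, the model path is a placeholder taken from the example above, and the prompts would need to be roughly 2048 tokens each:

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/weights/deepseek-ai/deepseekv3-lite-base-latest",  # placeholder path
    tensor_parallel_size=2,
    enforce_eager=True,
    trust_remote_code=True,
    max_model_len=4096,
    additional_config={"ascend_scheduler_config": {}},  # enable the v0-style scheduler
)

prompts = ["..."] * 16  # in practice, ~2048-token prompts
params = SamplingParams(max_tokens=128, temperature=0.0, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_out = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_out / elapsed:.1f} output tokens/s")
```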