Support multistream of MLA vector operations #1135
ganyi1996ppo merged 1 commit into vllm-project:main
Conversation
This pull request has conflicts, please resolve those before we can evaluate the pull request.
With the expected overlapping being:
```
| q_rmsnorm | | kv_norm_rope_cache | | q_rope |
| matmul W_DQ | matmul W_DKV | matmul W_UQ | split | matmul W_KV_T |
```
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults
to False.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
There is a PR for refactoring (#1169); I suggest blocking this PR until that one is merged.
```
decode_k_pe, decode_k_nope = self.exec_kv(
    hidden_states_or_kv_c_normed, cos, sin, kv_cache,
    attn_metadata.slot_mapping)
with npu_stream_switch("mla_secondary",
```
So this `npu_stream_switch` can be used inside torchair, right? If so, can we further optimize the dbo path of DeepSeek to gain more of a performance boost?
Discussed with torchair colleagues: the `with torch.npu.stream(...):` scheme is not supported by torchair graph mode, while `npu_stream_switch` only works under graph mode, so I recommend leaving this to later PRs:
- providing a joint version of `npu_stream_switch` encapsulating both cases, as well as
- adding graph mode support for dbo
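A joint version along those lines might look like the following minimal sketch. The stub `_graph_mode_switch` is hypothetical and stands in for the real torchair API; only the eager-mode fallback to a no-op context is taken from the discussion above:

```python
import contextlib

def npu_stream_switch(tag: str, graph_mode: bool):
    """Hypothetical joint wrapper sketch: delegate to a graph-mode
    stream switch when in graph mode, otherwise fall back to a no-op
    context so eager-mode callers need no special-casing."""
    if not graph_mode:
        return contextlib.nullcontext()
    return _graph_mode_switch(tag)

@contextlib.contextmanager
def _graph_mode_switch(tag):
    # Stand-in for the real torchair graph-mode stream switch; a real
    # implementation would redirect subsequent ops to the stream `tag`.
    yield tag
```

Callers could then use `with npu_stream_switch("mla_secondary", graph_mode=...)` uniformly, without branching on the execution mode.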
```
hidden_states_or_q_c = self.q_a_layernorm(ckq)
use_multistream_mla = (self.enable_multistream_mla
                       and attn_metadata is not None
                       and attn_metadata.num_decodes > 0)
```
So multistream MLA will not be triggered if there are only prefill requests?
Yes, the current multistream utilities only support graph mode, i.e. `npu_stream_switch` returns `contextlib.nullcontext()` in non-graph mode, which is the case for prefill.
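The gating in the diff snippet above can be captured as a small predicate; a minimal pure-Python sketch, with `AttnMetadata` as a stand-in for the real attention metadata object:

```python
from dataclasses import dataclass

@dataclass
class AttnMetadata:
    # Stand-in for the real attention metadata; only the field used
    # by the gate is modeled here.
    num_decodes: int

def use_multistream_mla(enable_multistream_mla: bool, attn_metadata) -> bool:
    # Mirrors the condition in the diff: feature flag on, metadata
    # present, and at least one decode request in the batch.
    return (enable_multistream_mla
            and attn_metadata is not None
            and attn_metadata.num_decodes > 0)
```

A prefill-only batch has `num_decodes == 0`, so the predicate is false and the multistream path is skipped, consistent with the answer above.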
…)" This reverts commit e72f94e. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
| q_rmsnorm | | kv_norm_rope_cache | | q_rope |
| matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T |
```
Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. So we instead keep them in
the main stream, only requiring that they be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by a later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.
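The ordering requirement above (`cos`/`sin` ready before `matmul W_UQ`) can be illustrated with a toy dependency check. This is a pure-Python illustration of the constraint, not the actual graph-pass logic; the dependency map is assumed from the diagram:

```python
# Ops that must have completed before a given op may start; only the
# constraint discussed above is modeled.
DEPS = {
    "matmul W_UQ": {"index(cos)", "index(sin)"},
}

def respects_deps(schedule, deps):
    # Walk the main-stream schedule in order and verify every op's
    # prerequisites have already been issued.
    seen = set()
    for op in schedule:
        if not deps.get(op, set()) <= seen:
            return False
        seen.add(op)
    return True
```

For instance, the main-stream order from the diagram, `["matmul W_DQ", "matmul W_DKV", "index(cos)", "index(sin)", "matmul W_UQ", "split", "matmul W_KV_T"]`, satisfies the constraint, while hoisting `matmul W_UQ` before the index ops would not.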
### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults
to False.
### How was this patch tested?
Tested on a 1x16 910 node, with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>