[Main][Ops] Make triton rope support index_selecting from cos_sin_cache #5450
whx-sjtu merged 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the Rotary Positional Embedding (RoPE) implementation to use a combined cos_sin_cache, which is a good simplification. However, I've found a few critical issues. The new Triton kernel for RoPE does not correctly use the token positions, leading to incorrect calculations. Additionally, an old RoPE function (rope_forward_triton) is now broken due to using undefined variables. Finally, the shape of the tensors returned by the new RoPE implementation is incorrect, which will likely cause errors in downstream attention operations. These issues need to be addressed before this PR can be merged.
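To make the review's first point concrete: a position-aware RoPE kernel must gather each token's cos/sin row from the cache by that token's position, not assume row `i` belongs to token `i`. A minimal NumPy sketch of the reference semantics the Triton kernel should match (the function name `rope_ref` and the interleaved pair layout are illustrative, not taken from the PR):

```python
import numpy as np

def rope_ref(q, positions, cos_sin_cache):
    """Reference rotary embedding: gather per-token cos/sin rows by
    position, then rotate each (even, odd) pair of the head dim."""
    # cos_sin_cache: [max_position, rot_dim]; first half cos, second half sin
    rot_dim = cos_sin_cache.shape[1]
    cos = cos_sin_cache[positions, : rot_dim // 2]   # [num_tokens, rot_dim//2]
    sin = cos_sin_cache[positions, rot_dim // 2 :]
    q1, q2 = q[..., 0::2], q[..., 1::2]              # interleaved pair split
    out = np.empty_like(q)
    out[..., 0::2] = q1 * cos - q2 * sin
    out[..., 1::2] = q1 * sin + q2 * cos
    return out
```

If the kernel ignores `positions` and indexes the cache by token order, the two paths diverge as soon as positions are non-contiguous (e.g. during decode), which is exactly the kind of silent numerical error the review flags.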
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Any progress? If this PR is still alive, please rebase onto main and make CI happy; otherwise you can close it. Thanks.
6b0ec47 to df33b34
e6a1b57 to b3f5744
Signed-off-by: Angazenn <supperccell@163.com>
…cache (#6602)

### What this PR does / why we need it?

This PR adapts #5450 and #6523 to v0.13.0, to fix #6612. It extends the original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` and `positions` as inputs, fully aligning with the vLLM RoPE API interface. Compared with the earlier RoPE implementation, the benefits are:

1. Avoiding pre-computation of `cos` and `sin` before model execution, which helps to remove redundant code.
2. Allowing the eagle3 draft model to have different RoPE parameters from the main model (see #6612). This helps to recover accept rate and accuracy in that case.

In addition, this kernel change introduces only a very small performance degradation: the `index_select` and `chunk` operations are now replaced by simple memory accesses inside the triton kernel.

**Highlights**

- **RoPE Cache Unification**: Replaced the separate _sin and _cos global tensors with a unified cos_sin_cache and an explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated the Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified rope_forward_oot to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system.

---------

Signed-off-by: Angazenn <supperccell@163.com>
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [Docs] Fix GLM-5 deploy command (vllm-project#6711)
  [npugraph_ex] enable npugraph_ex by default (vllm-project#6664)
  [doc] add GLM5.md (vllm-project#6709)
  [Model] GLM5 adaptation (vllm-project#6642)
  [Bugfix] Update target probs to target logits in rejection sample (vllm-project#6685)
  [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (vllm-project#5450)
  [CI] fix nightly multi node test error for wait for pod ready (vllm-project#6675)
  [main to main] upgrade main 0210 (vllm-project#6673)
  [main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (vllm-project#6648)
  [Test][BugFix] Fix torch.rand usage in triton penalty test (vllm-project#6680)
  Add Worker Interface: check_health (vllm-project#6681)
…he (vllm-project#5450)

### What this PR does / why we need it?

This PR extends the original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` and `positions` as inputs, fully aligning with the vLLM RoPE API interface. Compared with the earlier RoPE implementation, the benefits are:

1. Avoiding pre-computation of `cos` and `sin` before model execution, which helps to remove redundant code.
2. Allowing the eagle3 draft model to have different RoPE parameters from the main model (see vllm-project#6612). This helps to recover accept rate and accuracy in that case.

In addition, this kernel change introduces only a very small performance degradation: the `index_select` and `chunk` operations are now replaced by simple memory accesses inside the triton kernel (for example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).

**Highlights**

- **RoPE Cache Unification**: Replaced the separate _sin and _cos global tensors with a unified cos_sin_cache and an explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated the Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified rope_forward_oot to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@5326c89

Additional test on Qwen3-235b accuracy:

| Aime2024 | GSM8K | Livecodebench |
| -------- | ----- | ------------- |
| 83.33    | 96.26 | 70.23         |

---------

Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
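The "simple memory access in triton kernel" claim above can be sketched per token in plain Python. Each Triton program would read its own position and load cos/sin directly from the flat cache buffer, replacing the host-side `index_select` + `chunk` pair; the function name and offset arithmetic here are illustrative of the kernel logic, not the actual Triton code in the PR:

```python
import numpy as np

def gather_cos_sin_per_token(cos_sin_cache, positions, rot_dim):
    """Per-token view of the in-kernel load: each 'program' reads its
    own position, then loads cos/sin straight out of the flat cache."""
    flat = cos_sin_cache.reshape(-1)            # the kernel sees a flat buffer
    half = rot_dim // 2
    cos_rows, sin_rows = [], []
    for pid in range(len(positions)):           # one program per token
        row_off = positions[pid] * rot_dim      # base offset of this token's row
        cos_rows.append(flat[row_off : row_off + half])
        sin_rows.append(flat[row_off + half : row_off + rot_dim])
    return np.stack(cos_rows), np.stack(sin_rows)
```

Because the offset is a pure function of the position already held in registers, the gather costs one extra load per token instead of a separate kernel launch and intermediate tensor, which is consistent with the "very small performance degradation" reported above.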
What this PR does / why we need it?
This PR extends the original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` and `positions` as inputs, fully aligning with the vLLM RoPE API interface. Compared with the earlier RoPE implementation, the benefits are:

1. Avoiding pre-computation of `cos` and `sin` before model execution, which helps to remove redundant code.
2. Allowing the eagle3 draft model to have different RoPE parameters from the main model (see #6612). This helps to recover accept rate and accuracy in that case.

In addition, this kernel change introduces only a very small performance degradation: the `index_select` and `chunk` operations are now replaced by simple memory accesses inside the triton kernel (for example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81).

Highlights

- RoPE Cache Unification: Replaced the separate _sin and _cos global tensors with a unified cos_sin_cache and an explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling.
- Triton Kernel Integration: Updated the Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations.
- Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.
- Refactored RoPE Forward Pass: Modified rope_forward_oot to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
Additional test on Qwen3-235b accuracy:

| Aime2024 | GSM8K | Livecodebench |
| -------- | ----- | ------------- |
| 83.33    | 96.26 | 70.23         |