[Kernel] Enable fused_qknorm_rope_kernel supports partial rope #30821
vllm-bot merged 12 commits into vllm-project:main
Code Review
This pull request successfully introduces support for partial Rotary Positional Embeddings (RoPE) in the fused_qknorm_rope_kernel. The changes correctly propagate the rotary_dim parameter through the kernel and launch functions, ensuring that RoPE is applied only to the specified dimensions. The input validation for cos_sin_cache has been updated to reflect the new rotary_dim constraint, and the test suite has been extended to cover both full and partial RoPE scenarios. The logic for calculating pairOffset and conditionally applying RoPE within the kernel appears to be correctly adapted for this functionality.
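To make the review's description of partial RoPE concrete, here is a minimal pure-Python sketch, not the kernel's actual code. It assumes NeoX-style pairing (element i is rotated together with element i + rotary_dim / 2) and a cos/sin row holding rotary_dim / 2 values each; the function name apply_partial_rope and the sample values are hypothetical. Elements at index rotary_dim and above pass through untouched, which is the behavior the rotary_dim parameter is meant to control.

```python
import math


def apply_partial_rope(head, cos, sin, rotary_dim):
    """Rotate the first `rotary_dim` elements of one head vector.

    Hypothetical reference helper: NeoX-style pairing is an assumption,
    and this is not the CUDA kernel's implementation.
    """
    half = rotary_dim // 2
    if rotary_dim % 2 != 0 or rotary_dim > len(head):
        raise ValueError("rotary_dim must be even and no larger than head_dim")
    if len(cos) != half or len(sin) != half:
        raise ValueError("cos and sin must each hold rotary_dim // 2 values")

    out = list(head)  # indices >= rotary_dim are copied through unchanged
    for i in range(half):
        x1, x2 = head[i], head[i + half]
        out[i] = x1 * cos[i] - x2 * sin[i]
        out[i + half] = x2 * cos[i] + x1 * sin[i]
    return out


# Example: head_dim = 8, but only the first 4 elements are rotated.
head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotary_dim = 4
position = 3
base = 10000.0  # common RoPE base, assumed here for illustration

# One frequency per rotated pair, derived from rotary_dim rather than head_dim.
inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]
cos = [math.cos(position * f) for f in inv_freq]
sin = [math.sin(position * f) for f in inv_freq]

rotated = apply_partial_rope(head, cos, sin, rotary_dim)
```

Setting rotary_dim equal to the head dimension recovers full RoPE, so the same code path covers both cases that the extended test suite is said to exercise.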
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee This PR broke ROCm builds. I will open an issue to track this.
…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Enable fused_qknorm_rope_kernel to support partial RoPE, which can slightly improve the performance of models like GLM4.6-MoE. The gains are not very noticeable because the torch combo kernel has already fused q and norm into a single kernel, meaning this fusion only consolidates 2 kernels into 1 kernel; see #27165.

cc @mgoin @ProExpertProg @yewentao256
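As a rough illustration of what consolidating the two steps means, the sketch below compares a two-step path (normalize, then rotate) with a single function that does both in one pass over a head vector. It is a pure-Python sketch under stated assumptions: the norm is taken to be RMSNorm over the full head dimension, the pairing is NeoX-style, and the names rms_norm, rope_neox and fused_norm_rope are hypothetical, not vLLM APIs.

```python
import math


def rms_norm(head, weight, eps=1e-6):
    """RMSNorm over the whole head vector (assumed norm type)."""
    mean_sq = sum(v * v for v in head) / len(head)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(head, weight)]


def rope_neox(head, position, rotary_dim, base=10000.0):
    """Rotate only the first `rotary_dim` elements (NeoX-style pairing)."""
    half = rotary_dim // 2
    out = list(head)
    for i in range(half):
        angle = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = head[i] * c - head[i + half] * s
        out[i + half] = head[i + half] * c + head[i] * s
    return out


def fused_norm_rope(head, weight, position, rotary_dim, base=10000.0, eps=1e-6):
    """Single function standing in for one fused kernel launch."""
    mean_sq = sum(v * v for v in head) / len(head)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    half = rotary_dim // 2
    out = [0.0] * len(head)
    for i in range(len(head)):
        if i < half:
            # Normalize both members of the pair, then rotate them together.
            x1 = head[i] * scale * weight[i]
            x2 = head[i + half] * scale * weight[i + half]
            angle = position * base ** (-2.0 * i / rotary_dim)
            c, s = math.cos(angle), math.sin(angle)
            out[i] = x1 * c - x2 * s
            out[i + half] = x2 * c + x1 * s
        elif i >= rotary_dim:
            # Outside the rotary range: normalization only.
            out[i] = head[i] * scale * weight[i]
    return out


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0, 1.0]
w = [1.0, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 1.0]
two_step = rope_neox(rms_norm(q, w), position=5, rotary_dim=4)
fused = fused_norm_rope(q, w, position=5, rotary_dim=4)
```

Both paths should agree up to floating-point rounding; the fused variant simply avoids materializing the normalized vector between two separate launches, which is consistent with the modest gain described above.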
Metrics
Test Script
vllm serve zai-org/GLM-4.6 -tp 8 -dp 1 --max-num-seqs 128 --served-model-name glm46 --compilation-config '{"pass_config": {"enable_qk_norm_rope_fusion": "1"}}' --no-enable-prefix-caching

lm_eval --model local-completions --model_args "model=glm46,base_url=http://localhost:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5

GSM8K
disable
enable
Performance
disable
enable
Test Plan
Test Result