[Perf] add packed recurrent fast path for decode #36596
vllm-bot merged 9 commits into vllm-project:main
Conversation
Signed-off-by: hdj <1293066020@qq.com>
Hi @caozuoba, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request introduces a significant performance optimization for Qwen3Next models by adding a packed recurrent fast path for the decode phase. The changes are well-structured and include a new Triton kernel that directly consumes a packed QKV tensor, which avoids materializing separate Q, K, and V tensors and fuses the gating logic. This new functionality is controlled by an environment variable and includes robust fallback mechanisms to the baseline implementation, ensuring safety. The pull request also adds a new test to verify the correctness of the new kernel. The integration into the existing model code is clean and follows established patterns. Overall, the changes appear to be correct, safe, and a valuable performance enhancement.
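To illustrate the control flow the review describes, here is a minimal, self-contained sketch of an env-var-gated fast path with a baseline fallback. All function names here (`forward_core`, `run_packed_decode_kernel`, `run_baseline_kernel`) are hypothetical stand-ins, not the actual vLLM entry points; only the environment variable name comes from the PR.

```python
import os


def packed_decode_enabled() -> bool:
    # Mirrors the VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE gate (default off).
    return bool(int(os.getenv("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "0")))


# Stand-ins for the real Triton kernels (hypothetical, for illustration only).
def run_packed_decode_kernel(mixed_qkv):
    return ("packed", mixed_qkv)


def run_baseline_kernel(q, k, v):
    return ("baseline", q, k, v)


def forward_core(mixed_qkv, packed_supported: bool):
    # Fast path: consume the packed QKV buffer directly, avoiding the
    # materialization of separate contiguous q/k/v tensors.
    if packed_decode_enabled() and packed_supported:
        return run_packed_decode_kernel(mixed_qkv)
    # Safe fallback: split into q/k/v and run the baseline recurrence.
    third = len(mixed_qkv) // 3
    q, k, v = mixed_qkv[:third], mixed_qkv[third:2 * third], mixed_qkv[2 * third:]
    return run_baseline_kernel(q, k, v)
```

The key point is that the gate is checked once, up front, so the baseline path stays untouched when the flag is off.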
Signed-off-by: hdj <1293066020@qq.com>
@mgoin @tlrmchlsmth @WoosukKwon @ZJY0516 Hello everyone, could you please take a look at this PR and provide some feedback when you have a moment? This is a follow-up to PR #35739, submitted last week. Due to conflicts with an already-merged PR, I've created a new PR and re-run the benchmark tests. Thanks!
```python
tl.store(p_ht, b_h.to(p_ht.dtype.element_ty), mask=mask_h)
```
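For readers unfamiliar with Triton, `tl.store(..., mask=...)` writes back only the in-bounds lanes of the recurrent state block. A NumPy analogue of that masked write (sizes and names here are illustrative, not taken from the kernel):

```python
import numpy as np

BLOCK = 8
state = np.zeros(BLOCK, dtype=np.float32)        # destination buffer (p_ht analogue)
b_h = np.arange(BLOCK, dtype=np.float32) + 1.0   # block-local result (b_h analogue)
valid = 5                                        # logical extent smaller than BLOCK
mask_h = np.arange(BLOCK) < valid                # in-bounds lanes

# Like tl.store(p_ht, b_h.to(p_ht.dtype.element_ty), mask=mask_h):
# only the masked lanes are written; out-of-bounds lanes are untouched.
state[mask_h] = b_h[mask_h].astype(state.dtype)
```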
```python
def fused_recurrent_gated_delta_rule_packed_decode_fwd(
```

Suggested change:

```diff
-def fused_recurrent_gated_delta_rule_packed_decode_fwd(
+def fused_recurrent_gated_delta_rule_packed_decode(
```

I prefer this name.
```python
        use_qk_l2norm_in_kernel=True,
    )
    return
except ValueError as exc:
```

When will this fail? I don't think this needs a try/except here.
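The reviewer's point can be sketched as checking the kernel's preconditions up front and taking the baseline path on failure, rather than invoking the kernel and catching `ValueError`. The checks below are hypothetical examples, not the kernel's actual requirements:

```python
def can_use_packed_decode(num_tokens: int, num_seqs: int, head_dim: int) -> bool:
    # Cheap, explicit precondition checks: if any fail, the caller simply
    # dispatches to the baseline kernel instead of relying on try/except.
    if num_tokens != num_seqs:   # pure decode means exactly one token per sequence
        return False
    if head_dim % 16 != 0:       # hypothetical alignment requirement
        return False
    return True
```

This keeps the error path explicit and avoids masking unrelated `ValueError`s raised inside the kernel.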
```python
else:
    core_attn_out[:num_actual_tokens] = core_attn_out_non_spec.squeeze(0)


def _forward_core_packed(
```

Suggested change:

```diff
-def _forward_core_packed(
+def _forward_core_decode_non_spec(
```
vllm/envs.py (outdated)

```python
    if "VLLM_DISABLED_KERNELS" not in os.environ
    else os.environ["VLLM_DISABLED_KERNELS"].split(","),
    "VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE": lambda: bool(
        int(os.getenv("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "0"))
```

Could we enable this by default?
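For context on the snippet above: `vllm/envs.py` keeps a dict mapping each variable name to a lambda, so the environment is read lazily at lookup time rather than at import time. A simplified sketch of that pattern (not the real module; `get_env` is a hypothetical accessor):

```python
import os

# Each entry is a thunk, so os.environ is consulted when the value is
# requested, not when this module is imported.
environment_variables = {
    "VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE": lambda: bool(
        int(os.getenv("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "0"))
    ),
    "VLLM_DISABLED_KERNELS": lambda: (
        []
        if "VLLM_DISABLED_KERNELS" not in os.environ
        else os.environ["VLLM_DISABLED_KERNELS"].split(",")
    ),
}


def get_env(name: str):
    # Evaluate the thunk now, so changes to os.environ are picked up.
    return environment_variables[name]()
```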
Signed-off-by: hdj <1293066020@qq.com>
Accuracy Testing

Command:

```shell
python3 -m lm_eval --model local-completions \
  --model_args model=Qwen3.5-35B-A3B,base_url=http://127.0.0.1:19000/v1/completions,num_concurrent=80,tokenizer=/nas/disk1/Qwen3.5-35B-A3B \
  --tasks gsm8k
```

Results table: Baseline vs. This PR.
@ZJY0516 Hi, I’ve addressed the review comments and added an accuracy test. When you have time, could you please take another look and share your thoughts on the next step? |
```python
else:
    core_attn_out[:num_actual_tokens] = core_attn_out_non_spec.squeeze(0)


def _forward_core_decode_non_spec(
```

Can we switch to this inside `_forward_core` instead of falling back inside this decode method?

> Can we switch to this inside `_forward_core` instead of falling back inside this decode method?

Good point, thanks. Moving this selection into `_forward_core` does make the flow cleaner. Let me update that.

Updated accordingly, thanks!
also cc @vadiklyutiy |
Signed-off-by: hdj <1293066020@qq.com>
```python
attn_metadata = attn_metadata[self.prefix]
assert isinstance(attn_metadata, GDNAttentionMetadata)


if (
```

I prefer this:

```python
if is_decode:
    return self._forward_core_decode_non_spec(...)
# original logic here
```

No need to introduce `_forward_core_baseline`.

Makes sense, updated accordingly. Thanks.
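The early-return shape the reviewer prefers can be sketched as follows. The class and method bodies here are stand-ins (only `_forward_core` and `_forward_core_decode_non_spec` are names from the PR; `_forward_core_general` and the return values are hypothetical):

```python
from types import SimpleNamespace


class GDNAttention:
    def _forward_core(self, attn_metadata):
        # Early return for the pure-decode, non-speculative case; the
        # original logic follows below, with no extra wrapper method.
        if attn_metadata.is_decode:
            return self._forward_core_decode_non_spec(attn_metadata)
        return self._forward_core_general(attn_metadata)  # original logic

    def _forward_core_decode_non_spec(self, attn_metadata):
        return "packed-decode"

    def _forward_core_general(self, attn_metadata):
        return "baseline"
```

This keeps the dispatch decision in one place instead of having the decode method fall back internally.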
Signed-off-by: hdj <1293066020@qq.com>
@ZJY0516 Thanks again for the review and approval! I also have a draft follow-up PR for the spec path, and I’m planning to submit it soon. I’d really appreciate it if you could take a look when you have time. |
mgoin left a comment:
LGTM, I just don't see the need for an env var
| "VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE": lambda: bool( | ||
| int(os.getenv("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "1")) | ||
| ), |
There was a problem hiding this comment.
Why do we need this env var at all if it is enabled by default?
There was a problem hiding this comment.
Why do we need this env var at all if it is enabled by default?
Good point. I plan to clean this up together with the spec-path follow-up so both paths stay consistent.
@mgoin When you have time, could you please help retrigger that failed CI check? It seems to have hit a 403 a few times already. |
We can just force merge for now.
Signed-off-by: hdj <1293066020@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: hdj <1293066020@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Signed-off-by: hdj <1293066020@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: hdj <1293066020@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose

- Add a packed recurrent fast path for decode (T=1): consume `mixed_qkv` directly in a decode-only fast path instead of materializing contiguous `q`/`k`/`v`.
- Gated by `VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE` (default: `0`).

Test Result
Correctness (pytest)
Command
Result
Performance
Compared to the main branch, this PR improves Output token throughput (tok/s) by ~6.37%, reduces Mean TPOT (ms) by ~5.34%, reduces Mean E2EL (ms) by ~6.56%, and reduces Mean TTFT (ms) by ~9.79%.
Command