[Perf] Enable dual stream execution of input projection for Qwen3 #36795
DarkLight1337 merged 3 commits into vllm-project:main
Conversation
Code Review
This PR introduces dual-stream execution for input projection in the Qwen3 Next model to improve performance by parallelizing in_proj_qkvz and in_proj_ba operations. It also wraps the implementation in a custom op for torch.compile. The changes include adding an auxiliary stream, modifying the forward pass to use the custom op, and introducing a new function for dual-stream execution. I have identified a critical issue related to potential deadlocks when using auxiliary streams.
Force-pushed from 5d76ed2 to 911a451
This pull request has merge conflicts that must be resolved before it can be merged.
Could you please also apply this to qwen 3.5?
I would avoid passing the
Force-pushed from 01d0d02 to df1763c
Consider leveraging a maybe_execute_in_parallel primitive as in #35968.
@ZJY0516 Thanks for the review! I have put the benchmark and accuracy testing results in the PR description.
@robertgshaw2-redhat Thanks for the review! I have removed passing
I don't see the Qwen 3.5-related code change.
Force-pushed from 2cbaae6 to 4e75f43
Hi @xyang16, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
I have pushed the change for Qwen 3.5. Thanks!
@benchislett I have added
Signed-off-by: Xin Yang <xyangx@amazon.com>
@robertgshaw2-redhat Could you please review again? Thanks!
Thanks for the implementation! #32828
```python
def _forward_in_proj(
    self, hidden_states: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    projected_states_qkvz, projected_states_ba = maybe_execute_in_parallel(
```
I have a small question about the naming here. maybe means it may not run in parallel, but in this case, we always run in parallel, right?
@ZJY0516 In maybe_execute_in_parallel, if aux_stream is not None it runs in parallel; otherwise it runs sequentially. aux_stream is None on non-CUDA platforms. Thanks!
```python
def maybe_execute_in_parallel(
    fn0: Callable[[], Any],
    fn1: Callable[[], Any],
    event0: torch.cuda.Event,
    event1: torch.cuda.Event,
    aux_stream: torch.cuda.Stream | None = None,
) -> tuple[Any, Any]:
    """Run two functions potentially in parallel on separate CUDA streams.

    When aux_stream is provided, fn0 runs on the current (default) stream and
    fn1 runs on aux_stream, synchronized via CUDA events. When aux_stream is
    None, both functions execute sequentially on the current stream.
    """
```
I do like this utility as a pattern to apply generally
```python
def gdn_in_proj(
    hidden_states: torch.Tensor,
    qkvz_output_size: int,
    ba_output_size: int,
    layer_name: str,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Custom op for the input projection.
    """
    forward_context: ForwardContext = get_forward_context()
    self = forward_context.no_compile_layers[layer_name]
    return self._forward_in_proj(hidden_states)
```
This indirection is pretty gross though. Could we avoid this somehow? I found it very confusing that you were passing in self.in_proj_qkvz.weight.shape[0] to this op instead of the module itself.
Also there is the concern of wrapping these MergedColumnParallelLinear modules that could be quantized - it seems we would lose the potential of torch.compile fusing the input quantization with previous ops or reaching inside of the linear op itself (less valid concern)
@mgoin Thanks for review!
Actually this layer is already wrapped here https://github.com/vllm-project/vllm/blob/v0.18.0rc0/vllm/model_executor/models/qwen3_next.py#L1673-L1692. And I agree this should be improved once torch.compile supports multi stream.
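The indirection under discussion, stashing the module in a side table keyed by its prefix and passing only the string through the op boundary, can be shown with a small registry sketch. All names here are hypothetical stand-ins, not vLLM's API:

```python
from typing import Any

# Hypothetical side table mimicking forward_context.no_compile_layers:
# modules register themselves under their prefix, and the custom op
# recovers `self` from a string that torch.compile treats as a constant.
_NO_COMPILE_LAYERS: dict[str, Any] = {}

class FakeLayer:
    def __init__(self, prefix: str, scale: int) -> None:
        self.scale = scale
        _NO_COMPILE_LAYERS[prefix] = self  # register in the side table

    def _forward_in_proj(self, x: int) -> tuple[int, int]:
        # Stand-in for the real dual projection; returns two outputs.
        return x * self.scale, x + self.scale

def gdn_in_proj_sketch(x: int, layer_name: str) -> tuple[int, int]:
    # The string argument gets baked into the compiled graph, which is
    # exactly the cold-compile concern raised later in this thread.
    layer = _NO_COMPILE_LAYERS[layer_name]
    return layer._forward_in_proj(x)
```

The op body never sees the module directly, which is why integer output sizes must also be passed in for the fake implementation.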
Can we add a tracking issue somewhere for porting this over to native Inductor multi-stream support?
I have created issue #37372 to track this. Thanks!
…with LoRA

The `gdn_in_proj` custom op (introduced in f174000 / PR vllm-project#36795) uses `self.in_proj_qkvz.weight.shape[0]` to communicate the output tensor size to torch.compile's fake implementation. With LoRA + AWQ/GPTQ quantization, `.weight` returns the quantized `qweight`, whose shape is packed (e.g. input_size // 8 for 4-bit), causing a dimension mismatch in the subsequent `.split()` call.

Fix: compute output sizes analytically from model dimensions (key_dim, value_dim, num_v_heads, tp_size) instead of reading them from the weight tensor shape. These computed values are identical to weight.shape[0] for non-quantized models, so there is no regression.

Tested with:
- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
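The analytical size computation that fix describes might look like the sketch below. The formulas are an assumption inferred from the commit message (q/k sized by key_dim, v/z by value_dim, b/a one element per value head, all sharded over tp_size), not code copied from vLLM:

```python
# Hypothetical sketch: derive the per-rank output sizes of the two
# projections from model dimensions instead of weight.shape[0], which
# is packed (e.g. input_size // 8) for AWQ/GPTQ qweights.
def gdn_proj_output_sizes(
    key_dim: int, value_dim: int, num_v_heads: int, tp_size: int
) -> tuple[int, int]:
    # in_proj_qkvz produces q, k (key_dim each) and v, z (value_dim each).
    qkvz_output_size = (2 * key_dim + 2 * value_dim) // tp_size
    # in_proj_ba produces b and a, one element per value head.
    ba_output_size = (2 * num_v_heads) // tp_size
    return qkvz_output_size, ba_output_size
```

For an unquantized model these values match weight.shape[0] of the corresponding linear layers, which is why the fix is regression-free there.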
```python
mixed_qkvz, ba = torch.ops.vllm.gdn_in_proj(
    hidden_states,
    self.in_proj_qkvz.weight.shape[0],
    self.in_proj_ba.weight.shape[0],
    self.prefix,
)
```
This regresses cold compile times by baking in a string into the compiled graph. We should really make a lint rule for this or something
@zou3519 Thanks for your comment. Looking into this. By the way, I was actually following the torch.ops.vllm.gdn_attention_core op in the same forward().
torch.ops.vllm.gdn_attention_core is not included in the subgraph so it doesn't cause problems with compile times. I'm trying to figure out what to do with this. In theory we have a fix for this in PyTorch 2.11
I can revert using this torch.ops.vllm.gdn_in_proj op and wait for PyTorch 2.11. Please let me know what you think. Thanks!
@xyang16 are you able to refactor this so that the gdn_in_proj op does NOT need to pass a string as an input? Basically we would avoid stashing state into a side table. How difficult do you think that would be?
@zou3519 Sure, I will look into this today.
@zou3519 I see your PR #38123. Will it fix this issue?
It will fix the issue for PyTorch 2.11. But vLLM is going to do one more release (0.19.0, branch cut this Monday) without PyTorch 2.11.
If we can wait for the performance improvement in this PR, the easiest thing for us to do is just revert this PR and then re-merge it after #38123 lands and we upgrade to 2.11 (probably Tuesday).
@zou3519 Thanks for the help. I have created #38152 to revert this PR. cc @benchislett
Purpose
This PR enables dual-stream execution of the input projection for Qwen3 Next: run in_proj_qkvz and in_proj_ba in 2 streams, because their outputs are independent.
Profiling
Main: nvjet_tst_64x8_64x16_4x2_h_bz_TNT (in_proj_qkvz) and nvjet_tst_64x8_64x16_1x2_h_bz_TNT (in_proj_ba) kernels are launched sequentially.
PR: the same kernels are launched in parallel.
(Profiler screenshots not reproduced here.)
Benchmarking
Benchmarked on H200.
(Main vs. PR benchmark outputs not reproduced here.)
Accuracy Testing
(Main vs. PR accuracy results not reproduced here.)