[Perf] Enable dual stream execution of input projection for Qwen3 #36795
DarkLight1337 merged 3 commits into vllm-project:main
Conversation
Code Review
This PR introduces dual-stream execution for input projection in the Qwen3 Next model to improve performance by parallelizing in_proj_qkvz and in_proj_ba operations. It also wraps the implementation in a custom op for torch.compile. The changes include adding an auxiliary stream, modifying the forward pass to use the custom op, and introducing a new function for dual-stream execution. I have identified a critical issue related to potential deadlocks when using auxiliary streams.
Force-pushed from 5d76ed2 to 911a451
This pull request has merge conflicts that must be resolved before it can be merged.
Could you please also apply this to qwen 3.5?
I would avoid passing the
Force-pushed from 01d0d02 to df1763c
Consider leveraging a maybe_execute_in_parallel primitive as in #35968.
@ZJY0516 Thanks for the review! I have put the benchmark and accuracy testing results in the PR description.
@robertgshaw2-redhat Thanks for the review! I have removed passing
I don't see the Qwen 3.5-related code change.
Force-pushed from 2cbaae6 to 4e75f43
Hi @xyang16, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
I have pushed the change for Qwen 3.5. Thanks!
@benchislett I have added
Signed-off-by: Xin Yang <xyangx@amazon.com>
@robertgshaw2-redhat Could you please review again? Thanks!
Thanks for the implementation! #32828
```python
def _forward_in_proj(
    self, hidden_states: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    projected_states_qkvz, projected_states_ba = maybe_execute_in_parallel(
```
I have a small question about the naming here. maybe means it may not run in parallel, but in this case, we always run in parallel, right?
@ZJY0516 In maybe_execute_in_parallel, if aux_stream is not None it runs in parallel; otherwise it runs sequentially. aux_stream is None on non-CUDA platforms. Thanks!
```python
def maybe_execute_in_parallel(
    fn0: Callable[[], Any],
    fn1: Callable[[], Any],
    event0: torch.cuda.Event,
    event1: torch.cuda.Event,
    aux_stream: torch.cuda.Stream | None = None,
) -> tuple[Any, Any]:
    """Run two functions potentially in parallel on separate CUDA streams.

    When aux_stream is provided, fn0 runs on the current (default) stream and
    fn1 runs on aux_stream, synchronized via CUDA events. When aux_stream is
    None, both functions execute sequentially on the current stream.
    """
```
I do like this utility as a pattern to apply generally
```python
def gdn_in_proj(
    hidden_states: torch.Tensor,
    qkvz_output_size: int,
    ba_output_size: int,
    layer_name: str,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Custom op for the input projection.
    """
    forward_context: ForwardContext = get_forward_context()
    self = forward_context.no_compile_layers[layer_name]
    return self._forward_in_proj(hidden_states)
```
This indirection is pretty gross though. Could we avoid this somehow? I found it very confusing that you were passing in self.in_proj_qkvz.weight.shape[0] to this op instead of the module itself.
Also there is the concern of wrapping these MergedColumnParallelLinear modules that could be quantized - it seems we would lose the potential of torch.compile fusing the input quantization with previous ops or reaching inside of the linear op itself (less valid concern)
@mgoin Thanks for review!
Actually this layer is already wrapped here https://github.com/vllm-project/vllm/blob/v0.18.0rc0/vllm/model_executor/models/qwen3_next.py#L1673-L1692. And I agree this should be improved once torch.compile supports multi stream.
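The indirection under discussion, stashing the module in a side table keyed by its prefix and passing only the string through the op boundary, can be shown with a small registry sketch. All names here are hypothetical stand-ins, not vLLM's API:

```python
from typing import Any

# Hypothetical side table mimicking forward_context.no_compile_layers:
# modules register themselves under their prefix, and the custom op
# recovers `self` from a string that torch.compile treats as a constant.
_NO_COMPILE_LAYERS: dict[str, Any] = {}

class FakeLayer:
    def __init__(self, prefix: str, scale: int) -> None:
        self.scale = scale
        _NO_COMPILE_LAYERS[prefix] = self  # register in the side table

    def _forward_in_proj(self, x: int) -> tuple[int, int]:
        # Stand-in for the real dual projection; returns two outputs.
        return x * self.scale, x + self.scale

def gdn_in_proj_sketch(x: int, layer_name: str) -> tuple[int, int]:
    # The string argument gets baked into the compiled graph, which is
    # exactly the cold-compile concern raised later in this thread.
    layer = _NO_COMPILE_LAYERS[layer_name]
    return layer._forward_in_proj(x)
```

The op body never sees the module directly, which is why integer output sizes must also be passed in for the fake implementation.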
Can we add a tracking issue somewhere for porting this over to native Inductor multi-stream support?
I have created issue #37372 to track this. Thanks!
…with LoRA

The `gdn_in_proj` custom op (introduced in f174000 / PR vllm-project#36795) uses `self.in_proj_qkvz.weight.shape[0]` to communicate the output tensor size to torch.compile's fake implementation. With LoRA + AWQ/GPTQ quantization, `.weight` returns the quantized `qweight`, whose shape is packed (e.g. input_size // 8 for 4-bit), causing a dimension mismatch in the subsequent `.split()` call.

Fix: compute output sizes analytically from model dimensions (key_dim, value_dim, num_v_heads, tp_size) instead of reading them from the weight tensor shape. These computed values are identical to weight.shape[0] for non-quantized models, so there is no regression.

Tested with:
- cyankiwi/Qwen3.5-9B-AWQ-4bit + LoRA adapters (torch.compile)
- Qwen/Qwen3.5-9B without quantization (torch.compile)
- Qwen/Qwen3.5-9B + LoRA adapters without quantization (eager)
- Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 (torch.compile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jake Writer <writer.j@northeastern.edu>
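The analytical size computation that fix describes might look like the sketch below. The formulas are an assumption inferred from the commit message (q/k sized by key_dim, v/z by value_dim, b/a one element per value head, all sharded over tp_size), not code copied from vLLM:

```python
# Hypothetical sketch: derive the per-rank output sizes of the two
# projections from model dimensions instead of weight.shape[0], which
# is packed (e.g. input_size // 8) for AWQ/GPTQ qweights.
def gdn_proj_output_sizes(
    key_dim: int, value_dim: int, num_v_heads: int, tp_size: int
) -> tuple[int, int]:
    # in_proj_qkvz produces q, k (key_dim each) and v, z (value_dim each).
    qkvz_output_size = (2 * key_dim + 2 * value_dim) // tp_size
    # in_proj_ba produces b and a, one element per value head.
    ba_output_size = (2 * num_v_heads) // tp_size
    return qkvz_output_size, ba_output_size
```

For an unquantized model these values match weight.shape[0] of the corresponding linear layers, which is why the fix is regression-free there.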
```python
mixed_qkvz, ba = torch.ops.vllm.gdn_in_proj(
    hidden_states,
    self.in_proj_qkvz.weight.shape[0],
    self.in_proj_ba.weight.shape[0],
    self.prefix,
)
```
This regresses cold compile times by baking in a string into the compiled graph. We should really make a lint rule for this or something
@zou3519 Thanks for your comment. Looking into this. By the way, I was actually following the torch.ops.vllm.gdn_attention_core op in the same forward().
torch.ops.vllm.gdn_attention_core is not included in the subgraph so it doesn't cause problems with compile times. I'm trying to figure out what to do with this. In theory we have a fix for this in PyTorch 2.11
I can revert using this torch.ops.vllm.gdn_in_proj op and wait for PyTorch 2.11. Please let me know what you think. Thanks!
@xyang16 are you able to refactor this so that the gdn_in_proj op does NOT need to pass a string as an input? Basically we would avoid stashing state into a side table. How difficult do you think that would be?
@zou3519 Sure, I will look into this today.
@zou3519 I see your PR #38123. Will it fix this issue?
It will fix the issue for PyTorch 2.11. But vLLM is going to do one more release (0.19.0, branch cut this Monday) without PyTorch 2.11.
If we can wait for the performance improvement in this PR, the easiest thing for us to do is just revert this PR and then re-merge it after #38123 lands and we upgrade to 2.11 (probably Tuesday).
@zou3519 Thanks for the help. I have created #38152 to revert this PR. cc @benchislett
Purpose
This PR enables dual-stream execution of the input projection for Qwen3 Next: run in_proj_qkvz and in_proj_ba in 2 streams, because their outputs are independent.
Profiling
Main: nvjet_tst_64x8_64x16_4x2_h_bz_TNT (in_proj_qkvz) and nvjet_tst_64x8_64x16_1x2_h_bz_TNT (in_proj_ba) kernels are launched sequentially.
PR: the same kernels are launched in parallel.
(Profiler screenshots not reproduced here.)
Benchmarking
Benchmarked on H200.
(Main vs. PR benchmark outputs not reproduced here.)
Accuracy Testing
(Main vs. PR accuracy results not reproduced here.)