[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE #30647
Conversation
Code Review
This pull request introduces a performance optimization for Mixture-of-Experts (MoE) layers in GPT-OSS models using Flashinfer with MXFP4/MXFP8 quantization. The key changes involve eliminating explicit padding and slicing operations around the MoE computation. This is achieved by leveraging new capabilities in the Flashinfer library to handle padding within the quantization kernel and to write to an unpadded output buffer directly.
The main changes are:
- Elimination of Padding/Slicing: The `FusedMoE` layer no longer performs manual padding before the MoE kernel for supported backends. Instead, the padding is handled by `flashinfer::mxfp8_quantize`, and the subsequent slicing is effectively done by the MoE kernel writing to a smaller, pre-allocated output tensor. This change enables better fusion opportunities, as seen by the `all-reduce + norm` fusion now being possible.
- Code Refactoring: The logic for rounding up hidden sizes for MXFP4 quantization has been moved from the generic `fused_moe/layer.py` to the specific `quantization/mxfp4.py`, which is a more appropriate location. This removes duplicated code and improves modularity.
- Conditional Logic: The new behavior is controlled by a `support_padded_mxfp8_quant` flag, ensuring that it only applies to the `SM100_FI_MXFP4_MXFP8_TRTLLM` backend on Blackwell GPUs, maintaining compatibility with other configurations.
- Testing: New test cases have been added to `test_fusions_e2e.py` to validate the fusions and performance improvements for GPT-OSS models on Blackwell.
The changes are well-implemented and align with the stated goals of improving performance. The code is clean and the new logic is properly encapsulated. The performance benchmarks in the PR description show a significant 6% end-to-end improvement, which is a great result.
I have reviewed the code and found no critical or high-severity issues. The changes are correct and contribute to better performance and code structure.
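The pad-then-slice pattern being removed can be illustrated with a small sketch. The `moe_kernel` callable and its `out=` signature here are hypothetical stand-ins, not vLLM's actual API: the old path pads the activation up to the kernel's aligned hidden size and slices the result back, while the new path lets the quantization step absorb the padding and has the MoE kernel write directly into an unpadded output buffer.

```python
import torch


def moe_forward_old(x: torch.Tensor, padded_hidden: int, moe_kernel) -> torch.Tensor:
    # Old path: explicitly pad the activation up to the aligned hidden size...
    pad = padded_hidden - x.shape[-1]
    x_padded = torch.nn.functional.pad(x, (0, pad))
    out = moe_kernel(x_padded)
    # ...then slice the padding back off, producing an extra copy.
    return out[..., : x.shape[-1]].contiguous()


def moe_forward_new(x: torch.Tensor, padded_hidden: int, moe_kernel) -> torch.Tensor:
    # New path: padding is folded into the quantization kernel, and the MoE
    # kernel writes into a pre-allocated buffer of the unpadded hidden size,
    # so no explicit pad or slice appears in the forward pass.
    out = torch.empty_like(x)
    moe_kernel(x, out=out)  # hypothetical out-buffer signature
    return out
```

The old path materializes a padded copy of both the input and the output; the new path avoids both, which is also what makes the downstream all-reduce + norm fusion possible.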
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 4fd26b1 to 1fdd5ec
Force-pushed 1fdd5ec to 3648f8a
💡 Codex Review
https://github.com/vllm-project/vllm/blob/3648f8ab8e1f75350586bd226d8a55778f7e3ebc/vllm/model_executor/layers/fused_moe/layer.py#L510-L514
Keep MoE config hidden size in sync with MXFP4 padding
Here the MoE config is built with whatever hidden_size was passed in, but the MXFP4 backend now rounds hidden_size up later in Mxfp4MoEMethod.create_weights (e.g., to 256-aligned for SM100 FlashInfer, see vllm/model_executor/layers/quantization/mxfp4.py around lines 298-309). Because moe_config.hidden_dim stays at the unpadded value, any DP+EP run that uses the all2all kernels will size dispatch buffers from the smaller hidden_dim (see maybe_make_prepare_finalize in all2all_utils.py), while the kernel operates on the larger padded hidden size, leading to under-sized buffers and potential memory corruption for models whose hidden size is not already aligned. Please update the config’s hidden_dim after padding or pad before creating the config.
Force-pushed 5c08ae1 to fc9b00f
Hi @elvischenv, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
Force-pushed fc9b00f to ad3ff99
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed ad3ff99 to 938bf35
@elvischenv could you post the new benchmarking numbers once you have them?
Force-pushed 938bf35 to 02510d0
These are the perf numbers based on main ToT (main vs. this PR), along with the accuracy of the PR:
This pull request has merge conflicts that must be resolved before it can be merged.
```python
elif (
    current_platform.is_rocm()
    or current_mxfp4_backend == Mxfp4Backend.SM100_FI_MXFP4_MXFP8_TRTLLM
    or current_mxfp4_backend == Mxfp4Backend.SM100_FI_MXFP4_BF16
):
```
I just ran into a situation where the mxfp4 marlin kernels require 256-element padding. Will this PR also address that, or is it premature to remove this function?
tests/entrypoints/openai/responses/test_harmony.py fails if marlin is used for mxfp4.
Or maybe this needs to be incorporated into maybe_roundup_layer_hidden_size?
If you look into `create_weights()` inside vllm/model_executor/layers/quantization/mxfp4.py, you will see logic completely duplicated with this function.
So the current padding logic inside the `FusedMoE` `__init__()` will first call `maybe_roundup_hidden_size()`, which includes a small part of the padding logic for mxfp4. Then it will call `get_quant_method()` and `quant_method.create_weights()`. `create_weights()` will go through the whole padding logic again if it is using `Mxfp4MoEMethod`. cc @robertgshaw2-redhat
Can we keep all the logic here in layer.py instead of having it in two places?
```python
moe_quant_params["intermediate_size_full"] = intermediate_size

self.quant_method.create_weights(layer=self, **moe_quant_params)
# hidden_size may be padded in create_weights
```
Can you point to where this happens?
Answered in the previous comment.
The calling order is: `maybe_roundup_hidden_size()` -> create the MoE config (`self.moe_config: FusedMoEConfig = FusedMoEConfig(...)`) -> `get_quant_method()` -> `quant_method.create_weights()`.
Some padding may happen inside `create_weights()`, so the MoE config needs to be updated afterwards.
The problem is where we should put the padding logic. Currently some of it is in vllm/model_executor/layers/fused_moe/layer.py and some is in vllm/model_executor/layers/quantization/mxfp4.py. And it looks like PR #29008 depends on the logic in layers/fused_moe/layer.py.
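One way to keep the config in sync is to propagate the possibly padded hidden size back after `create_weights()` returns. This is a sketch with stand-in `Fake*` dataclasses, not the actual vLLM types; the exact update site in `layer.py` is an assumption:

```python
from dataclasses import dataclass


@dataclass
class FakeMoEConfig:
    hidden_dim: int


@dataclass
class FakeLayer:
    hidden_size: int


def sync_config_after_create_weights(layer: FakeLayer, moe_config: FakeMoEConfig) -> None:
    # create_weights() may have rounded layer.hidden_size up; propagate the
    # padded value so dispatch buffers (e.g. for the DP+EP all2all path) are
    # sized for the hidden dim the kernel actually operates on.
    if layer.hidden_size != moe_config.hidden_dim:
        moe_config.hidden_dim = layer.hidden_size


cfg = FakeMoEConfig(hidden_dim=2880)
layer = FakeLayer(hidden_size=3072)  # as if create_weights() padded to 256-aligned
sync_config_after_create_weights(layer, cfg)
assert cfg.hidden_dim == 3072
```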
FYI, I have a (WIP) draft to refactor the roundup logic at #34285, moving the kernel-dependent rounding logic to the quant method; fused_moe/layer.py still needs to invoke it to update the sizes.
I think delegating the decision to the quant methods is the correct approach.
Force-pushed 02510d0 to c0a5ab3
Looks good to me, but I would want @mgoin or @robertgshaw2-redhat or @bnell to check the MoE code.
```python
# The padding in the forward pass can be skipped
self.skip_forward_padding = (
    hasattr(self.quant_method, "support_skip_forward_padding")
    and self.quant_method.support_skip_forward_padding
)
```
Can you make this a method on FusedMoEMethodBase instead of an attribute? It could default to False.
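The suggested shape of that change might look like the following sketch. `FusedMoEMethodBase` is the real vLLM base class, but the method name, the `Sketch` subclass, and the backend-string comparison are assumptions for illustration, not the merged code:

```python
class FusedMoEMethodBase:
    def supports_skip_forward_padding(self) -> bool:
        # Default: backends must opt in explicitly.
        return False


class Mxfp4MoEMethodSketch(FusedMoEMethodBase):
    def __init__(self, backend: str):
        self.backend = backend

    def supports_skip_forward_padding(self) -> bool:
        # Only the Blackwell FlashInfer MXFP4/MXFP8 path can skip padding.
        return self.backend == "SM100_FI_MXFP4_MXFP8_TRTLLM"


# Call sites then query the method instead of probing for an attribute:
#   self.skip_forward_padding = self.quant_method.supports_skip_forward_padding()
```

A method with a `False` default avoids the `hasattr` probe and gives every backend a single, overridable extension point.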
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Force-pushed c0a5ab3 to d195774
I am getting an error where the fusion pass triggers an assertion error when running …
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE (vllm-project#30647) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Purpose
Cleaned up the padding logic: for mxfp4 quant, the padded hidden size is calculated in `create_weights()`; the `maybe_roundup_hidden_size()` in vllm/model_executor/layers/fused_moe/layer.py seems like a dup.

Test Plan && Test Result (GPT-OSS-120b TP8)
Accuracy
PR:
main:
Kernel
PR:
main:
Perf (GPT-OSS-120b TP8 con8)
PR: 5% E2E improvement
main: