
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE #30647

Merged
ProExpertProg merged 3 commits into vllm-project:main from elvischenv:elvischenv/eliminate-padding-slicing
Mar 18, 2026

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Dec 14, 2025

Purpose

Test Plan & Test Results (GPT-OSS-120b TP8)

Accuracy

PR:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20251213_233136', 'metric': 0.7803030303030303}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251213_234320', 'metric': 0.8875}]

main:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20251214_002509', 'metric': 0.7891414141414141}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251214_001505', 'metric': 0.8875}]

Kernel

PR:

void cublasLt::splitKreduce_kernel                          2.400 μs
void tensorrt_llm::kernels::quantize_with_block_size        2.944 μs
void moe::dev::routing::routingRenormalize                  5.216 μs
bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2                 13.216 μs
bmm_Bfloat16_MxE2m1MxE4m3_Fp32_t128x16x256u2                9.504 μs
void moe::dev::finalize::finalizeKernel                     3.168 μs
void flashinfer::trtllm_allreduce_fusion                    7.744 μs (ar+norm)
nvjet_tst_32x64_64x16_4x1_v_bz_splitK_TNN                   4.448 μs

main:

void cublasLt::splitKreduce_kernel                          2.048 μs
triton_poi_fused_constant_pad_nd_moe_forward_0              1.407 μs (pad)
void tensorrt_llm::kernels::quantize_with_block_size        2.432 μs
void moe::dev::routing::routingRenormalize                  5.056 μs
bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2                 10.368 μs
bmm_Bfloat16_MxE2m1MxE4m3_Fp32_t128x16x256u2                8.512 μs
void moe::dev::finalize::finalizeKernel                     2.112 μs
void vllm::cross_device_reduce_1stage                       8.320 μs (ar)
triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_slice_1    2.336 μs (norm, slice)
nvjet_tst_32x64_64x16_4x1_v_bz_splitK_TNN                   4.448 μs

Perf (GPT-OSS-120b TP8, concurrency 8)

PR: 5% E2E improvement

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  28.90
Total input tokens:                      81920
Total generated tokens:                  81920
Request throughput (req/s):              2.77
Output token throughput (tok/s):         2834.13
Peak output token throughput (tok/s):    158.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5668.25
---------------Time to First Token----------------
Mean TTFT (ms):                          50.37
Median TTFT (ms):                        58.07
P99 TTFT (ms):                           75.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.77
Median TPOT (ms):                        2.77
P99 TPOT (ms):                           2.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.59
Median ITL (ms):                         55.28
P99 ITL (ms):                            68.89
==================================================

main:

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  30.43
Total input tokens:                      81920
Total generated tokens:                  81920
Request throughput (req/s):              2.63
Output token throughput (tok/s):         2692.11
Peak output token throughput (tok/s):    149.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5384.22
---------------Time to First Token----------------
Mean TTFT (ms):                          52.61
Median TTFT (ms):                        57.20
P99 TTFT (ms):                           88.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.92
Median TPOT (ms):                        2.92
P99 TPOT (ms):                           2.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.47
Median ITL (ms):                         58.17
P99 ITL (ms):                            70.35
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a performance optimization for Mixture-of-Experts (MoE) layers in GPT-OSS models using Flashinfer with MXFP4/MXFP8 quantization. The key changes involve eliminating explicit padding and slicing operations around the MoE computation. This is achieved by leveraging new capabilities in the Flashinfer library to handle padding within the quantization kernel and to write to an unpadded output buffer directly.

The main changes are:

  1. Elimination of Padding/Slicing: The FusedMoE layer no longer performs manual padding before the MoE kernel for supported backends. Instead, the padding is handled by flashinfer::mxfp8_quantize, and the subsequent slicing is effectively done by the MoE kernel writing to a smaller, pre-allocated output tensor. This change enables better fusion opportunities, as seen by the all-reduce + norm fusion now being possible.
  2. Code Refactoring: The logic for rounding up hidden sizes for MXFP4 quantization has been moved from the generic fused_moe/layer.py to the specific quantization/mxfp4.py, which is a more appropriate location. This removes duplicated code and improves modularity.
  3. Conditional Logic: The new behavior is controlled by a support_padded_mxfp8_quant flag, ensuring that it only applies to the SM100_FI_MXFP4_MXFP8_TRTLLM backend on Blackwell GPUs, maintaining compatibility with other configurations.
  4. Testing: New test cases have been added to test_fusions_e2e.py to validate the fusions and performance improvements for GPT-OSS models on Blackwell.

The changes are well-implemented and align with the stated goals of improving performance. The code is clean and the new logic is properly encapsulated. The performance benchmarks in the PR description show a significant 5% end-to-end improvement, which is a great result.

I have reviewed the code and found no critical or high-severity issues. The changes are correct and contribute to better performance and code structure.
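The before/after flow the review describes can be sketched in a few lines (a toy illustration, not vLLM's actual code; the `* 2.0` stands in for the MoE computation and all names are hypothetical):

```python
HIDDEN = 2880   # example hidden size that is not 256-aligned
ALIGN = 256     # alignment the MXFP4 kernel requires

def round_up(x: int, align: int) -> int:
    return (x + align - 1) // align * align

# main (before): an explicit pad kernel runs before the MoE kernel,
# and an explicit slice runs after it
def moe_pad_then_slice(row: list[float]) -> list[float]:
    padded = row + [0.0] * (round_up(HIDDEN, ALIGN) - HIDDEN)  # pad op
    out = [v * 2.0 for v in padded]                            # stand-in MoE kernel
    return out[:HIDDEN]                                        # slice op

# PR (after): padding happens inside the quantize kernel and the MoE
# kernel writes straight into an unpadded output buffer, so neither
# the pad nor the slice appears in the compiled graph
def moe_padded_quant(row: list[float]) -> list[float]:
    return [v * 2.0 for v in row]

row = [float(i) for i in range(HIDDEN)]
assert moe_pad_then_slice(row) == moe_padded_quant(row)
```

Dropping the pad/slice nodes from the graph is what lets the allreduce + rmsnorm fusion pass match, consistent with the kernel traces above.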

@mergify

mergify bot commented Dec 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 17, 2025
@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from 4fd26b1 to 1fdd5ec Compare December 17, 2025 10:00
@mergify mergify bot removed the needs-rebase label Dec 17, 2025
@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from 1fdd5ec to 3648f8a Compare December 18, 2025 23:52
@elvischenv elvischenv marked this pull request as ready for review December 18, 2025 23:53
@mergify mergify bot added the nvidia label Dec 18, 2025
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

https://github.com/vllm-project/vllm/blob/3648f8ab8e1f75350586bd226d8a55778f7e3ebc/vllm/model_executor/layers/fused_moe/layer.py#L510-L514
P1: Keep MoE config hidden size in sync with MXFP4 padding

Here the MoE config is built with whatever hidden_size was passed in, but the MXFP4 backend now rounds hidden_size up later in Mxfp4MoEMethod.create_weights (e.g., to 256-aligned for SM100 FlashInfer, see vllm/model_executor/layers/quantization/mxfp4.py around lines 298-309). Because moe_config.hidden_dim stays at the unpadded value, any DP+EP run that uses the all2all kernels will size dispatch buffers from the smaller hidden_dim (see maybe_make_prepare_finalize in all2all_utils.py), while the kernel operates on the larger padded hidden size, leading to under-sized buffers and potential memory corruption for models whose hidden size is not already aligned. Please update the config’s hidden_dim after padding or pad before creating the config.
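The hazard can be made concrete with a toy calculation (illustrative numbers; `round_up` and the 256 alignment mirror what the comment describes, not vLLM's real code):

```python
def round_up(x: int, align: int) -> int:
    return (x + align - 1) // align * align

hidden_size = 2880                           # unpadded model hidden size
kernel_hidden = round_up(hidden_size, 256)   # what the padded MXFP4 kernel uses

# buffers sized from the unpadded config value are smaller than what the
# kernel writes -- the under-sized-buffer hazard the review points at
assert hidden_size < kernel_hidden

# the suggested fix: pad before building the config (or update it after),
# so dispatch buffers and the kernel agree on one size
moe_config_hidden_dim = round_up(hidden_size, 256)
assert moe_config_hidden_dim == kernel_hidden == 3072
```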


@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from 5c08ae1 to fc9b00f Compare February 4, 2026 06:08
@mergify

mergify bot commented Feb 4, 2026

Hi @elvischenv, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from fc9b00f to ad3ff99 Compare February 4, 2026 06:20
@mergify

mergify bot commented Feb 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 5, 2026
@mgoin mgoin self-assigned this Feb 5, 2026
@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from ad3ff99 to 938bf35 Compare February 7, 2026 15:21
@ProExpertProg
Collaborator

@elvischenv could you post the new benchmarking numbers once you have them?

@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from 938bf35 to 02510d0 Compare February 9, 2026 03:25
@elvischenv
Contributor Author

@elvischenv could you post the new benchmarking numbers once you have them?

These are the perf numbers based on main ToT (tip of tree):
PR

============ Serving Benchmark Result ============
Successful requests:                     40
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  15.67
Total input tokens:                      40960
Total generated tokens:                  40960
Request throughput (req/s):              2.55
Output token throughput (tok/s):         2613.70
Peak output token throughput (tok/s):    141.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          5227.39
---------------Time to First Token----------------
Mean TTFT (ms):                          61.13
Median TTFT (ms):                        52.89
P99 TTFT (ms):                           93.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.00
Median TPOT (ms):                        3.00
P99 TPOT (ms):                           3.05
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.05
Median ITL (ms):                         59.72
P99 ITL (ms):                            83.17
==================================================

main

============ Serving Benchmark Result ============
Successful requests:                     40
Failed requests:                         0
Maximum request concurrency:             8
Benchmark duration (s):                  16.46
Total input tokens:                      40960
Total generated tokens:                  40960
Request throughput (req/s):              2.43
Output token throughput (tok/s):         2488.50
Peak output token throughput (tok/s):    136.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          4977.00
---------------Time to First Token----------------
Mean TTFT (ms):                          64.64
Median TTFT (ms):                        64.21
P99 TTFT (ms):                           101.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.15
Median TPOT (ms):                        3.15
P99 TPOT (ms):                           3.22
---------------Inter-token Latency----------------
Mean ITL (ms):                           62.04
Median ITL (ms):                         63.02
P99 ITL (ms):                            64.60
==================================================

Accuracy of PR:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20260208_183703', 'metric': 0.7992424242424242}]

@mergify

mergify bot commented Feb 10, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 10, 2026
Comment on lines -269 to -273
elif (
current_platform.is_rocm()
or current_mxfp4_backend == Mxfp4Backend.SM100_FI_MXFP4_MXFP8_TRTLLM
or current_mxfp4_backend == Mxfp4Backend.SM100_FI_MXFP4_BF16
):
Collaborator

@bnellnm bnellnm Feb 10, 2026

I just ran into a situation where the mxfp4 marlin kernels require 256-element padding. Will this PR also address that, or is it premature to remove this function?

tests/entrypoints/openai/responses/test_harmony.py fails if marlin is used for mxfp4.

See https://github.com/vllm-project/vllm/pull/32344/changes#diff-eddafffeb6f159f8c75f635d18a502fcfbf662a562b1ae7a8683a9790161a10b

Collaborator

Or maybe this needs to be incorporated into maybe_roundup_layer_hidden_size?

Contributor Author

If you look into create_weights() inside vllm/model_executor/layers/quantization/mxfp4.py, you will see logic that completely duplicates this function.

So the current padding logic inside FusedMoE init() first calls maybe_roundup_hidden_size, which includes a small part of the mxfp4 padding logic. Then it calls get_quant_method() and quant_method.create_weights(); create_weights() goes through the whole padding logic again if it is using Mxfp4MoEMethod. cc @robertgshaw2-redhat

Collaborator

Can we keep all the logic here in layer.py instead of having it in two places?

moe_quant_params["intermediate_size_full"] = intermediate_size

self.quant_method.create_weights(layer=self, **moe_quant_params)
# hidden_size may be padded in create_weights
Collaborator

Can you point to where this happens?

Contributor Author

@elvischenv elvischenv Feb 11, 2026

Answered in the previous comment.
The calling order is: maybe_roundup_hidden_size -> create the MoE config (self.moe_config: FusedMoEConfig = FusedMoEConfig()) -> get_quant_method() -> quant_method.create_weights().

Some padding may happen inside create_weights(), so the MoE config needs to be updated afterwards.

The problem is where we should put the padding logic. Currently some of it is in vllm/model_executor/layers/fused_moe/layer.py and some is in vllm/model_executor/layers/quantization/mxfp4.py. It also looks like PR #29008 depends on the logic in layers/fused_moe/layer.py.

cc @robertgshaw2-redhat
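The ordering described above can be sketched with simplified stand-in classes (not vLLM's real API; the 256 rounding mirrors the SM100 FlashInfer case):

```python
class ToyMxfp4Method:
    """Stand-in for a quant method whose create_weights() pads hidden_size."""

    def create_weights(self, hidden_size: int) -> int:
        return (hidden_size + 255) // 256 * 256  # e.g. 2880 -> 3072

def init_fused_moe(hidden_size: int) -> dict:
    # 1. maybe_roundup_hidden_size (partial mxfp4 logic) would run here
    config = {"hidden_dim": hidden_size}    # 2. MoE config built *before* ...
    quant_method = ToyMxfp4Method()         # 3. get_quant_method()
    padded = quant_method.create_weights(hidden_size)  # 4. may pad again
    config["hidden_dim"] = padded           # ... so the config must be updated
    return config

assert init_fused_moe(2880)["hidden_dim"] == 3072
```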

Contributor

FYI, I have a WIP draft at #34285 to refactor the roundup logic, moving the kernel-dependent rounding into quant_method. fused_moe/layer.py still needs to invoke it to update the sizes.

Collaborator

I think delegating the decision to the quant methods is the correct approach.

@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from 02510d0 to c0a5ab3 Compare March 16, 2026 11:40
@mergify mergify bot removed the needs-rebase label Mar 16, 2026
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 16, 2026
@ProExpertProg
Collaborator

Looks good to me but would want @mgoin or @robertgshaw2-redhat or @bnell to check the Moe code

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Mar 16, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 16, 2026

# The padding in the forward pass can be skipped
self.skip_forward_padding = (
hasattr(self.quant_method, "support_skip_forward_padding")
Collaborator

Can you make this a method on FusedMoEMethodBase instead of an attribute? It could default to False.
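A minimal sketch of the suggested shape (hypothetical class names; in vLLM the base class is FusedMoEMethodBase):

```python
class ToyMoEMethodBase:
    def supports_skip_forward_padding(self) -> bool:
        return False  # safe default: padding stays in the forward pass

class ToyMxfp4MoEMethod(ToyMoEMethodBase):
    def supports_skip_forward_padding(self) -> bool:
        # only backends whose quantize kernel pads internally opt in
        return True

assert not ToyMoEMethodBase().supports_skip_forward_padding()
assert ToyMxfp4MoEMethod().supports_skip_forward_padding()
```

A method with a False default avoids hasattr checks at the call site and lets each backend opt in explicitly.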

Contributor Author

Done

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/eliminate-padding-slicing branch from c0a5ab3 to d195774 Compare March 18, 2026 06:29
@ProExpertProg ProExpertProg enabled auto-merge (squash) March 18, 2026 13:57
@ProExpertProg ProExpertProg merged commit 296839a into vllm-project:main Mar 18, 2026
74 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 18, 2026
@elvischenv elvischenv deleted the elvischenv/eliminate-padding-slicing branch March 18, 2026 15:55
@zyongye
Member

zyongye commented Mar 18, 2026

I am getting an assertion error from the fusion pass when running VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 vllm serve openai/gpt-oss-120b -tp 2 on B200. Only --enforce-eager works.

(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932] WorkerProc hit an exception.
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932] Traceback (most recent call last):
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/executor/multiproc_executor.py", line 927, in worker_busy_loop
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     output = func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/worker/gpu_worker.py", line 388, in determine_available_memory
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.model_runner.profile_run()
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/worker/gpu_model_runner.py", line 5527, in profile_run
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                                         ^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/worker/gpu_model_runner.py", line 5221, in _dummy_run
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     outputs = self.model(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]               ^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/cuda_graph.py", line 241, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self.runnable(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/model_executor/models/gpt_oss.py", line 1218, in forward
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/decorators.py", line 583, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.aot_compiled_fn = self.aot_compile(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/wrapper.py", line 206, in aot_compile
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self._compiled_callable.aot_compile((args, kwargs))
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in aot_compile
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return aot_compile_fullgraph(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 239, in aot_compile_fullgraph
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     compiled_fn = backend(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                   ^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/__init__.py", line 2509, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self.compiler_fn(model_, inputs_, **self.kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/usr/lib/python3.12/contextlib.py", line 81, in inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwds)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/backends.py", line 1063, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.configure_post_pass()
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.pass_manager.configure(self.vllm_config)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/pass_manager.py", line 125, in configure
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.passes += [AllReduceFusionPass(config)]
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 791, in __init__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.register_patterns()
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/inductor_pass.py", line 130, in fn_new
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     result = fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]              ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 817, in register_patterns
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     ).register(self.patterns)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]       ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 605, in register
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     pm.register_replacement(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1602, in register_replacement
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     pattern, gm = gen_pattern_and_search_gm(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/usr/lib/python3.12/contextlib.py", line 81, in inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwds)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1811, in gen_pattern_and_search_gm
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     search_gm = trace_fn(search_fn, flat_inputs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 2190, in fwd_only
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2701, in wrapped
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return make_fx_tracer.trace(f, *args)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2626, in trace
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self._trace_inner(f, *args)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2588, in _trace_inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     t = dispatch_trace(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]         ^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return disable_fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1460, in dispatch_trace
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/_symbolic_trace.py", line 879, in trace
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     (self.create_arg(fn(*args)),),
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                      ^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1526, in wrapped
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     out = f(*tensors)  # type:ignore[call-arg]
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]           ^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 563, in pattern
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     quant_out_tuple = auto_functionalized(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                       ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_higher_order_ops/auto_functionalize.py", line 357, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     assert can_auto_functionalize(_mutable_op)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932] AssertionError
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932] Traceback (most recent call last):
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/executor/multiproc_executor.py", line 927, in worker_busy_loop
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     output = func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/worker/gpu_worker.py", line 388, in determine_available_memory
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.model_runner.profile_run()
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/worker/gpu_model_runner.py", line 5527, in profile_run
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                                         ^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/v1/worker/gpu_model_runner.py", line 5221, in _dummy_run
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     outputs = self.model(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]               ^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/cuda_graph.py", line 241, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self.runnable(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/model_executor/models/gpt_oss.py", line 1218, in forward
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/decorators.py", line 583, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.aot_compiled_fn = self.aot_compile(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/wrapper.py", line 206, in aot_compile
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self._compiled_callable.aot_compile((args, kwargs))
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in aot_compile
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return aot_compile_fullgraph(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 239, in aot_compile_fullgraph
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     compiled_fn = backend(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                   ^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/__init__.py", line 2509, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self.compiler_fn(model_, inputs_, **self.kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/usr/lib/python3.12/contextlib.py", line 81, in inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwds)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/backends.py", line 1063, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.configure_post_pass()
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.pass_manager.configure(self.vllm_config)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/pass_manager.py", line 125, in configure
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.passes += [AllReduceFusionPass(config)]
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 791, in __init__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     self.register_patterns()
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/inductor_pass.py", line 130, in fn_new
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     result = fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]              ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 817, in register_patterns
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     ).register(self.patterns)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]       ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 605, in register
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     pm.register_replacement(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1602, in register_replacement
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     pattern, gm = gen_pattern_and_search_gm(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/usr/lib/python3.12/contextlib.py", line 81, in inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwds)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1811, in gen_pattern_and_search_gm
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     search_gm = trace_fn(search_fn, flat_inputs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 2190, in fwd_only
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2701, in wrapped
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return make_fx_tracer.trace(f, *args)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2626, in trace
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return self._trace_inner(f, *args)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2588, in _trace_inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     t = dispatch_trace(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]         ^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return disable_fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1460, in dispatch_trace
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     return fn(*args, **kwargs)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/_symbolic_trace.py", line 879, in trace
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     (self.create_arg(fn(*args)),),
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                      ^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1526, in wrapped
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     out = f(*tensors)  # type:ignore[call-arg]
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]           ^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/vllm/compilation/passes/fusion/allreduce_rms_fusion.py", line 563, in pattern
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     quant_out_tuple = auto_functionalized(
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]                       ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]   File "/home/yongye/vllm/.venv/lib/python3.12/site-packages/torch/_higher_order_ops/auto_functionalize.py", line 357, in __call__
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]     assert can_auto_functionalize(_mutable_op)
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=2452919) ERROR 03-18 20:35:19 [multiproc_executor.py:932] AssertionError
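The `AssertionError` at the bottom of the trace fires in torch's `auto_functionalized` higher-order op: the pattern being registered wraps a custom op that `can_auto_functionalize` rejects. Roughly, auto-functionalization expects an op whose schema mutates at least one `Tensor(a!)`-annotated argument and returns nothing, so the mutation can be rewritten into a functional form. Below is a hypothetical, simplified sketch of that kind of schema check (`can_auto_functionalize_toy` is an invented name, not torch's actual implementation):

```python
# Toy illustration of a schema check in the spirit of torch's
# can_auto_functionalize(). NOT torch's real implementation: it only
# inspects a schema string for a mutable Tensor arg and an empty return.
import re


def can_auto_functionalize_toy(schema: str) -> bool:
    """True iff the toy schema mutates >=1 Tensor arg and returns nothing."""
    args_part, _, ret_part = schema.partition("->")
    # "(a!)" marks an in-place mutated Tensor argument in op schemas.
    has_mutable_arg = bool(re.search(r"Tensor\([a-z]!\)", args_part))
    returns_nothing = ret_part.strip() in ("()", "")
    return has_mutable_arg and returns_nothing


# An op that writes into its first argument and returns nothing: OK.
print(can_auto_functionalize_toy("myns::quant(Tensor(a!) out, Tensor x) -> ()"))
# A pure op returning a fresh tensor: rejected by this style of check.
print(can_auto_functionalize_toy("myns::gemm(Tensor a, Tensor b) -> Tensor"))
```

If the real check follows this shape, a custom quant op registered with a return value (or without the `(a!)` mutation annotation) would trip exactly this assertion when the allreduce-RMS fusion pattern is traced.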

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…XFP4 MXFP8 MoE (vllm-project#30647)

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

Labels

ci/build gpt-oss Related to GPT-OSS models nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

7 participants