[Fusion] Adopt inductor fusion and define quantization fusion pass #4168

wangxiyuan merged 25 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge: if CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

This pull request has conflicts; please resolve those before we can evaluate it.
Currently, operator fusion is achieved through pattern matching with Inductor, but we have found that using aot-autograd causes accuracy issues. @whx-sjtu would you be willing to review it?
whx-sjtu left a comment:

Nice work. Finally we get to utilize Inductor's pattern_matcher to fuse our add_rms_norm_quant kernel into the FX graph. The whole idea looks good to me, with some questions about details in the review below.
```python
    return shape_list


class AscendAdaptor(CompilerInterface):
```

The name AscendAdaptor is too vague; I suggest a more specific one like AscendCompiler.

I have changed it to AscendCompiler; it's definitely a better fit.
```python
    Pattern for AddRMSNormQuant fusion.
    """
    output = torch.ops.npu.npu_add_rms_norm(rms_norm_input, residual,
                                            rms_norm_weight, 1e-6)
```

Instead of being fixed to 1e-6, the eps should be defined as a static variable of AddRMSNormQuantPattern, with different values of eps corresponding to different pattern objects. Some models might use a different eps like 1e-5.

Thank you for your suggestion. I have revised it.
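The suggested fix (one pattern object per eps value) can be sketched without torch; `AddRMSNormQuantPattern` below is an illustrative stand-in, not the PR's actual class body:

```python
class AddRMSNormQuantPattern:
    """Stand-in for the PR's pattern class: eps is parameterized instead of hard-coded."""

    def __init__(self, eps: float):
        self.eps = eps  # 1e-6 for most models, 1e-5 for some others

    def key(self):
        # distinct key per eps, so each eps gets its own registered pattern
        return ("add_rms_norm_quant", self.eps)


def register_patterns(eps_values):
    # one pattern object per eps, so models using 1e-6 and 1e-5 both match
    return {AddRMSNormQuantPattern(e).key(): AddRMSNormQuantPattern(e)
            for e in eps_values}


registry = register_patterns([1e-6, 1e-5])
```

In the real pass, each registered pattern would trace `npu_add_rms_norm` with its own eps so the matcher only fires on graphs using that exact value.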
```python
    def __init__(self, vllm_config):
        super().__init__(vllm_config)
        self.patterns: PatternMatcherPass = PatternMatcherPass(
```

The name of self.patterns is a bit confusing here; it should be named something like self.pattern_match_pass.
```python
        arg_dtypes, list) and len(arg_dtypes) > 0 else arg_dtypes
    # We found that the kernel npu_add_rms_norm_quant accepts varying data formats
    # for different dtypes; therefore, we only provide the solution for bfloat16 here.
    return dtype in (torch.bfloat16, )
```

I don't quite understand here. Does the format of the data also influence pattern matching? Maybe we can define patterns separately for bf16 and fp16 to support them both?

Right, we usually don't decide the application of graph passes based on the concrete input. If we really have to do so, we have to add "guards" to make sure that the graph is recompiled when the input changes.

Thanks. I have removed this judgment. Currently, the fusion operator supports float16 and bfloat16, so no special processing is required.
whx-sjtu left a comment:

I have another question here. With the current proposal, can we reuse the ready-made fusion passes defined in vLLM, like the SequenceParallel fusion pass? Because I'm not very familiar with the stack of the current fusion pass in vLLM, I'm confirming it here. Reusability is what we expect.
This feature is very important for vllm-ascend. I also hope @jgong5 can take some time to review this PR. Thanks.

Thank you for your reply. The current PR aims to define our own compiler backend to implement custom fusion. Reusing the fusion passes in vLLM is my next goal. I will submit an RFC once the solution is finalized.
```python
class AscendCompilationConfig:
    """
    Configuration Object for ascend_compilation_config from additional_config
```

This comment doesn't bring extra info about this class; in fact, we can get that from the class name. If you want to explain anything meaningful here, consider adding why we need this configuration, what the rules are for adding more configurations under it, etc.
```python
        self.enable_graph_fusion = enable_graph_fusion
        self.fx_graph_eager = fx_graph_eager
        self.enable_quantization_fusion = enable_quantization_fusion
```

Add the meaning of each field as code doc.

Thanks. I have added it.
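A documented version of such a config object might look like the sketch below. The field names come from this PR; the docstrings and defaults are assumptions for illustration, not the merged code:

```python
from dataclasses import dataclass


@dataclass
class AscendCompilationConfig:
    """Controls the Ascend graph-fusion compilation path (illustrative sketch).

    enable_graph_fusion: enable the Inductor-pattern-matcher fusion passes.
    fx_graph_eager: run the captured FX graph eagerly instead of compiling it
        (useful for debugging pass behavior).
    enable_quantization_fusion: enable quantization op fusion, e.g. fusing
        AddRMSNorm with the Quant operator for w8a8 models.
    """
    enable_graph_fusion: bool = False
    fx_graph_eager: bool = False
    enable_quantization_fusion: bool = False


cfg = AscendCompilationConfig(enable_graph_fusion=True)
```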
```python
    logger.info(
        "graph fusion enabled! Automatic kernel fusion is expected."
    )

    if ascend_config.ascend_compilation_config.enable_quantization_fusion:
        logger.info(
            "Quantization fusion enabled! op fusion on quantization are expected. "
        )
```

Take care of your grammar.
```python
    if is_310p():
        orig_dtype = residual.dtype
        x = x + residual.to(x.dtype)
        residual = x.to(orig_dtype)
        x, _ = torch_npu.npu_rms_norm(x, self.weight,
                                      self.variance_epsilon)
    else:
        x, _, residual = torch_npu.npu_add_rms_norm(
            x, residual, self.weight, self.variance_epsilon)
    return x, residual
```

I don't quite follow the logic here. Why do we need such a check here?

The check on 310P preserves the original logic; see https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/ops/layernorm.py#L71. But I do not know why the 310P needs special processing.
```python
    def compile(
        self,
        graph: fx.GraphModule,
```

Is the graph processed by the AoT dispatcher before being passed here to the compiler backend?

Yes, I used aot-autograd.
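The exchange above can be illustrated with a minimal custom backend. This is a sketch under assumptions (recent PyTorch; names like `my_inner_compiler` are made up here): wrapping an inner compiler with `aot_autograd` means the inner compiler receives the graph after AoT dispatch, which is the point the reviewer is asking about.

```python
import torch
from torch._dynamo.backends.common import aot_autograd
from torch._functorch.aot_autograd import make_boxed_func


def my_inner_compiler(gm: torch.fx.GraphModule, example_inputs):
    # gm here is the post-AoT-dispatch graph; a fusion pass manager
    # would rewrite gm.graph at this point before returning it.
    return make_boxed_func(gm.forward)


# aot_autograd wraps the inner compiler so Dynamo's captured graph is
# first processed by the AoT dispatcher.
my_backend = aot_autograd(fw_compiler=my_inner_compiler)


@torch.compile(backend=my_backend)
def f(x):
    return torch.relu(x) + 1


out = f(torch.ones(3))
```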
```python
from vllm.compilation.vllm_inductor_pass import VllmInductorPass


class AddRMSNormQuantPattern:
```

Can we add a directory called passes or fx_passes specifically to store these passes?

Of course, I've already added it.
```python
        return "graph_fusion_manager"

    @classmethod
    def get_pass_manager_cls(cls) -> str:
```

Does this interface have any requirements for the vLLM version?

I'm trying to understand what you mean. We're defining our own pass manager and compiler backend here, which should be independent of the vLLM version.

vLLM 0.12.0 and later.
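The separation discussed here (a pass manager that just applies fusion passes in order, independent of the vLLM version) can be sketched in plain Python. The class name mirrors the PR's `GraphFusionPassManager`, but the body and the toy "graph" representation are illustrative only:

```python
class GraphFusionPassManager:
    """Applies registered fusion passes to a graph, in registration order (sketch)."""

    def __init__(self):
        self._passes = []

    def add_pass(self, p):
        self._passes.append(p)

    def __call__(self, graph):
        for p in self._passes:
            graph = p(graph)
        return graph


# Toy stand-in for an FX graph: a flat list of op names.
def fuse_add_rms_norm_quant(ops):
    # collapse adjacent (add_rms_norm, quant) pairs into the fused kernel name
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + 2] == ["add_rms_norm", "quant"]:
            out.append("add_rms_norm_quant")
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out


manager = GraphFusionPassManager()
manager.add_pass(fuse_add_rms_norm_quant)
fused = manager(["add_rms_norm", "quant", "matmul"])
```

In the real implementation the passes operate on `fx.GraphModule` objects via Inductor's pattern matcher rather than lists, but the manager's role is the same.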
```python
        return "vllm_ascend.compilation.graph_fusion_pass_manager.GraphFusionPassManager"

    @classmethod
    def get_compile_backend(self) -> str:
```

Please see the explanation above.
Signed-off-by: wxsIcey <1790571317@qq.com>
```python
    @classmethod
    def get_compile_backend(self) -> str:
        from vllm_ascend.compilation.compiler_interface import AscendAdaptor
        return AscendAdaptor.__module__ + "." + AscendAdaptor.__name__
```

Use a string instead, like the others, to make the code clearer.

Thanks. I will change it in the next PR.
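The suggested change amounts to returning the dotted path as a literal string instead of importing the class and joining `__module__` and `__name__`. A small sketch (the `AscendCompiler` class here is a local stand-in, and the dotted path is the one discussed in this thread):

```python
class AscendCompiler:  # stand-in for the real compiler interface class
    pass


def get_compile_backend() -> str:
    # A plain string avoids importing the compiler module at config time
    # and reads the same as the other backends' entries.
    return "vllm_ascend.compilation.compiler_interface.AscendCompiler"


# The string form names the same class the computed form would produce:
computed = AscendCompiler.__module__ + "." + AscendCompiler.__name__
```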
```python
@@ -0,0 +1,219 @@
#
```

This should be added to the .github workflow to enable testing by CI.

Thanks. I think this PR can be merged first; I will enable it in the next fusion PR.
…lm-project#4168)" This reverts commit 178ca16.
What this PR does / why we need it? (#5077)

1. In addition to #4168 and #5011, this PR adds two more patterns for AddRmsnormQuant with SP enabled. The key difference is inserting an additional `maybe_all_gather_and_maybe_unpad` between `addrmsnorm` and `quantize`.
2. This PR also introduces another API, `torch.ops.vllm.quantize`, so that we pass `input_scale` and `input_scale_reciprocal` at the same time. This is because `npu_add_rms_norm_quant` and `npu_quantize` require different `div_mode`. To avoid introducing an additional reciprocal calculation at runtime, we have to pass both of them to the quantize API.
3. Removes the redundant `AscendQuantRmsnorm`.

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: Angazenn <supperccell@163.com>
This PR seems to break the LoRA e2e test. The error log is here: https://github.com/vllm-project/vllm-ascend/actions/runs/20450707034/job/58763094223?pr=4075 Does anybody know why this occurs or have any advice? Reproduce:
What this PR does / why we need it?

Part of: #4239

The main goal of this PR is to alleviate the high maintenance burden from model duplication when we do model optimization. Some of our optimized models diverge only a little from vLLM's modeling, yet they need to rewrite several parts of the original one, which brings a non-negligible maintenance burden to vllm-ascend. To solve that, we propose to leverage `torch.compile` and the Inductor pattern matcher to automatically fuse the patterns we want to merge. For more details, refer to the RFC #4239.

This PR fuses the `AddRMSNorm` and `Quant` operators, which can improve the inference speed of models using `w8a8` quantization. Performance improvement results:

Does this PR introduce any user-facing change?

Yes, it adds a new additional_config.

How was this patch tested?

Co-authored-by: @ganyi1996ppo, inspired by #2389
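Based on the config fields discussed in this thread, the new additional_config might be passed like the hedged sketch below. The key names come from this PR's review discussion; whether all of them are accepted together, and their exact spelling in the merged code, should be checked against the final diff:

```python
# Illustrative additional_config for enabling the fusion passes on Ascend.
additional_config = {
    "ascend_compilation_config": {
        "enable_graph_fusion": True,          # Inductor pattern-matcher fusion
        "enable_quantization_fusion": True,   # AddRMSNorm + Quant fusion (w8a8)
    }
}

# Usage sketch (not executed here; requires vllm-ascend on an NPU host):
# llm = LLM(model=..., quantization="ascend", additional_config=additional_config)
```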