[Fusion] Adopt inductor fusion and define quantization fusion pass #4168

wangxiyuan merged 25 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge: if CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

This pull request has conflicts; please resolve those before we can evaluate it.
Currently, operator fusion is achieved through pattern matching with Inductor, but we have found that using aot-autograd causes accuracy issues. @whx-sjtu would you be willing to review it?
whx-sjtu left a comment:

Nice work. Finally we get to utilize Inductor's pattern_matcher to fuse our add_rms_norm_quant kernel into the FX graph. The whole idea looks good to me, with some questions about details in the review below.
```python
    return shape_list


class AscendAdaptor(CompilerInterface):
```

The name AscendAdaptor is too vague; I suggest a more specific one like AscendCompiler.

I have changed it to AscendCompiler; it's definitely a better fit.
```python
    Pattern for AddRMSNormQuant fusion.
    """
    output = torch.ops.npu.npu_add_rms_norm(rms_norm_input, residual,
                                            rms_norm_weight, 1e-6)
```

Instead of being fixed to 1e-6, the eps should be defined as a static variable of AddRMSNormQuantPattern, with different values of eps corresponding to different pattern objects. Some models might use a different eps like 1e-5.

Thank you for your suggestion. I have revised it.
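The suggested fix (one pattern object per eps value) can be sketched without torch; `AddRMSNormQuantPattern` below is an illustrative stand-in, not the PR's actual class body:

```python
class AddRMSNormQuantPattern:
    """Stand-in for the PR's pattern class: eps is parameterized instead of hard-coded."""

    def __init__(self, eps: float):
        self.eps = eps  # 1e-6 for most models, 1e-5 for some others

    def key(self):
        # distinct key per eps, so each eps gets its own registered pattern
        return ("add_rms_norm_quant", self.eps)


def register_patterns(eps_values):
    # one pattern object per eps, so models using 1e-6 and 1e-5 both match
    return {AddRMSNormQuantPattern(e).key(): AddRMSNormQuantPattern(e)
            for e in eps_values}


registry = register_patterns([1e-6, 1e-5])
```

In the real pass, each registered pattern would trace `npu_add_rms_norm` with its own eps so the matcher only fires on graphs using that exact value.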
```python
    def __init__(self, vllm_config):
        super().__init__(vllm_config)
        self.patterns: PatternMatcherPass = PatternMatcherPass(
```

The name of self.patterns is a bit confusing here; it should be named something like self.pattern_match_pass.
```python
        arg_dtypes, list) and len(arg_dtypes) > 0 else arg_dtypes
    # We found that the kernel npu_add_rms_norm_quant accepts varying data formats
    # for different dtypes; therefore, we only provide the solution for bfloat16 here.
    return dtype in (torch.bfloat16, )
```

I don't quite understand here. Does the format of the data also influence pattern matching? Maybe we can define patterns separately for bf16 and fp16 to support them both?

Right, we usually don't decide the application of graph passes based on the concrete input. If we really have to do so, we have to add "guards" to make sure that the graph is recompiled when the input changes.

Thanks. I have removed this judgment. Currently, the fusion operator supports float16 and bfloat16, so no special processing is required.
whx-sjtu left a comment:

I have another question here. With the current proposal, can we reuse the ready-made fusion passes defined in vLLM, like the SequenceParallel fusion pass? Because I'm not very familiar with the stack of the current fusion pass in vLLM, I'm confirming it here. Reusability is what we expect.
This feature is very important for vllm-ascend. I also hope @jgong5 can take some time to review this PR. Thanks.

Thank you for your reply. The current PR aims to define our own compiler backend to implement custom fusion. Reusing the fusion passes in vLLM is my next goal. I will submit an RFC once the solution is finalized.
```python
class AscendCompilationConfig:
    """
    Configuration Object for ascend_compilation_config from additional_config
```

This comment doesn't bring extra info about this class; in fact, we can get that from the class name. If you want to explain anything meaningful here, consider adding why we need this configuration, what the rules are for adding more configurations under it, etc.
```python
        self.enable_graph_fusion = enable_graph_fusion
        self.fx_graph_eager = fx_graph_eager
        self.enable_quantization_fusion = enable_quantization_fusion
```

Add the meaning of each field as code doc.

Thanks. I have added it.
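A documented version of such a config object might look like the sketch below. The field names come from this PR; the docstrings and defaults are assumptions for illustration, not the merged code:

```python
from dataclasses import dataclass


@dataclass
class AscendCompilationConfig:
    """Controls the Ascend graph-fusion compilation path (illustrative sketch).

    enable_graph_fusion: enable the Inductor-pattern-matcher fusion passes.
    fx_graph_eager: run the captured FX graph eagerly instead of compiling it
        (useful for debugging pass behavior).
    enable_quantization_fusion: enable quantization op fusion, e.g. fusing
        AddRMSNorm with the Quant operator for w8a8 models.
    """
    enable_graph_fusion: bool = False
    fx_graph_eager: bool = False
    enable_quantization_fusion: bool = False


cfg = AscendCompilationConfig(enable_graph_fusion=True)
```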
```python
    logger.info(
        "graph fusion enabled! Automatic kernel fusion is expected."
    )

    if ascend_config.ascend_compilation_config.enable_quantization_fusion:
        logger.info(
            "Quantization fusion enabled! op fusion on quantization are expected. "
        )
```

Take care of your grammar.
```python
    if is_310p():
        orig_dtype = residual.dtype
        x = x + residual.to(x.dtype)
        residual = x.to(orig_dtype)
        x, _ = torch_npu.npu_rms_norm(x, self.weight,
                                      self.variance_epsilon)
    else:
        x, _, residual = torch_npu.npu_add_rms_norm(
            x, residual, self.weight, self.variance_epsilon)
    return x, residual
```

I don't quite follow the logic here. Why do we need such a check here?

The check on 310P preserves the original logic; see https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/ops/layernorm.py#L71. But I do not know why the 310P needs special processing.
```python
    def compile(
        self,
        graph: fx.GraphModule,
```

Is the graph processed by the AoT dispatcher before being passed here to the compiler backend?

Yes, I used aot-autograd.
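The exchange above can be illustrated with a minimal custom backend. This is a sketch under assumptions (recent PyTorch; names like `my_inner_compiler` are made up here): wrapping an inner compiler with `aot_autograd` means the inner compiler receives the graph after AoT dispatch, which is the point the reviewer is asking about.

```python
import torch
from torch._dynamo.backends.common import aot_autograd
from torch._functorch.aot_autograd import make_boxed_func


def my_inner_compiler(gm: torch.fx.GraphModule, example_inputs):
    # gm here is the post-AoT-dispatch graph; a fusion pass manager
    # would rewrite gm.graph at this point before returning it.
    return make_boxed_func(gm.forward)


# aot_autograd wraps the inner compiler so Dynamo's captured graph is
# first processed by the AoT dispatcher.
my_backend = aot_autograd(fw_compiler=my_inner_compiler)


@torch.compile(backend=my_backend)
def f(x):
    return torch.relu(x) + 1


out = f(torch.ones(3))
```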
```python
from vllm.compilation.vllm_inductor_pass import VllmInductorPass


class AddRMSNormQuantPattern:
```

Can we add a directory called passes or fx_passes specifically to store these passes?

Of course, I've already added it.
```python
        return "graph_fusion_manager"

    @classmethod
    def get_pass_manager_cls(cls) -> str:
```

Does this interface have any requirements for the vLLM version?

I'm trying to understand what you mean. We're defining our own pass manager and compiler backend here, which should be independent of the vLLM version.

vLLM 0.12.0 and later.
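The separation discussed here (a pass manager that just applies fusion passes in order, independent of the vLLM version) can be sketched in plain Python. The class name mirrors the PR's `GraphFusionPassManager`, but the body and the toy "graph" representation are illustrative only:

```python
class GraphFusionPassManager:
    """Applies registered fusion passes to a graph, in registration order (sketch)."""

    def __init__(self):
        self._passes = []

    def add_pass(self, p):
        self._passes.append(p)

    def __call__(self, graph):
        for p in self._passes:
            graph = p(graph)
        return graph


# Toy stand-in for an FX graph: a flat list of op names.
def fuse_add_rms_norm_quant(ops):
    # collapse adjacent (add_rms_norm, quant) pairs into the fused kernel name
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + 2] == ["add_rms_norm", "quant"]:
            out.append("add_rms_norm_quant")
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out


manager = GraphFusionPassManager()
manager.add_pass(fuse_add_rms_norm_quant)
fused = manager(["add_rms_norm", "quant", "matmul"])
```

In the real implementation the passes operate on `fx.GraphModule` objects via Inductor's pattern matcher rather than lists, but the manager's role is the same.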
```python
        return "vllm_ascend.compilation.graph_fusion_pass_manager.GraphFusionPassManager"

    @classmethod
    def get_compile_backend(self) -> str:
```

Please see the explanation above.
Signed-off-by: wxsIcey <1790571317@qq.com>
```python
    @classmethod
    def get_compile_backend(self) -> str:
        from vllm_ascend.compilation.compiler_interface import AscendAdaptor
        return AscendAdaptor.__module__ + "." + AscendAdaptor.__name__
```

Use a string instead, like the others, to make the code clearer.

Thanks. I will change it in the next PR.
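The suggested change amounts to returning the dotted path as a literal string instead of importing the class and joining `__module__` and `__name__`. A small sketch (the `AscendCompiler` class here is a local stand-in, and the dotted path is the one discussed in this thread):

```python
class AscendCompiler:  # stand-in for the real compiler interface class
    pass


def get_compile_backend() -> str:
    # A plain string avoids importing the compiler module at config time
    # and reads the same as the other backends' entries.
    return "vllm_ascend.compilation.compiler_interface.AscendCompiler"


# The string form names the same class the computed form would produce:
computed = AscendCompiler.__module__ + "." + AscendCompiler.__name__
```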
```python
@@ -0,0 +1,219 @@
#
```

This should be added to the .github workflow to enable testing by CI.

Thanks. I think this PR can be merged first; I will enable it in the next fusion PR.
…lm-project#4168)" This reverts commit 178ca16.
What this PR does / why we need it? (#5077)

1. In addition to #4168 and #5011, this PR adds two more patterns for AddRmsnormQuant with SP enabled. The key difference is inserting an additional `maybe_all_gather_and_maybe_unpad` between `addrmsnorm` and `quantize`.
2. This PR also introduces another API, `torch.ops.vllm.quantize`, so that we pass `input_scale` and `input_scale_reciprocal` at the same time. This is because `npu_add_rms_norm_quant` and `npu_quantize` require different `div_mode`. To avoid introducing an additional reciprocal calculation at runtime, we have to pass both of them to the quantize API.
3. Removes the redundant `AscendQuantRmsnorm`.

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: Angazenn <supperccell@163.com>
This PR seems to break the LoRA e2e test. The error log is here: https://github.com/vllm-project/vllm-ascend/actions/runs/20450707034/job/58763094223?pr=4075 Does anybody know why this occurs or have any advice? Reproduce:
What this PR does / why we need it?

Part of: #4239

The main goal of this PR is to alleviate the high maintenance burden from model duplication when we do model optimization. Some of our optimized models diverge only a little from vLLM's modeling, yet they need to rewrite several parts of the original one, which brings a non-negligible maintenance burden to vllm-ascend. To solve that, we propose to leverage `torch.compile` and the Inductor pattern matcher to automatically fuse the patterns we want to merge. For more details, refer to the RFC #4239.

This PR fuses the `AddRMSNorm` and `Quant` operators, which can improve the inference speed of models using `w8a8` quantization. Performance improvement results:

Does this PR introduce any user-facing change?

Yes, it adds a new additional_config.

How was this patch tested?

Co-authored-by: @ganyi1996ppo, inspired by #2389
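Based on the config fields discussed in this thread, the new additional_config might be passed like the hedged sketch below. The key names come from this PR's review discussion; whether all of them are accepted together, and their exact spelling in the merged code, should be checked against the final diff:

```python
# Illustrative additional_config for enabling the fusion passes on Ascend.
additional_config = {
    "ascend_compilation_config": {
        "enable_graph_fusion": True,          # Inductor pattern-matcher fusion
        "enable_quantization_fusion": True,   # AddRMSNorm + Quant fusion (w8a8)
    }
}

# Usage sketch (not executed here; requires vllm-ascend on an NPU host):
# llm = LLM(model=..., quantization="ascend", additional_config=additional_config)
```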