
[npugraph_ex] enable npugraph_ex by default #6664

Merged
wangxiyuan merged 1 commit into vllm-project:main from huyq:npugraph_ex_switch_on
Feb 12, 2026

Conversation

@huyq
Contributor

@huyq huyq commented Feb 10, 2026

What this PR does / why we need it?

This pull request enables the npugraph_ex backend by default to improve performance on Ascend NPUs, as proposed in the RFC.

Does this PR introduce any user-facing change?

Yes. npugraph_ex is now enabled by default. Users can disable it by setting enable: false in the npugraph_ex_config section of the additional_config.
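
As a concrete illustration, the opt-out described above comes down to a small config fragment (key names taken from this PR; how the dict reaches vLLM — constructor argument or CLI flag — depends on your deployment and is not shown here):

```python
import json

# additional_config fragment that restores the old behavior by turning
# the npugraph_ex backend back off (key names from this PR).
additional_config = {"npugraph_ex_config": {"enable": False}}

# When driving the OpenAI-compatible server, the same fragment is
# typically passed as a JSON string on the command line.
print(json.dumps(additional_config))
```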

How was this patch tested?

CI passed. The changes are covered by existing and new E2E tests (test_aclgraph_accuracy.py) and unit tests (test_ascend_config.py) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with npugraph_ex enabled and disabled, as well as with the new static kernel option.

@gemini-code-assist
Contributor

Summary of Changes

Hello @huyq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enabling the npugraph_ex backend by default, aligning with a previously discussed RFC. It introduces significant optimizations for batch-invariant operations through a new FX pass that leverages Triton kernels and ensures compatibility by conditionally disabling certain fusion passes. Additionally, it includes a critical patch to improve npugraph_ex's interaction with Triton within the torchair framework, enhancing overall stability and performance for NPU-based inference.

Highlights

  • Default NPUGraph_EX Enablement: The npugraph_ex backend is now enabled by default, streamlining its use for improved performance. This change is reflected in both the configuration and the user guide documentation.
  • Batch-Invariant FX Pass Integration: A new BatchInvariantFXPass has been introduced. This pass automatically replaces standard PyTorch aten matrix operations with specialized batch-invariant Triton kernel implementations during npugraph_ex compilation, optimizing performance for specific batching scenarios.
  • Conditional Fusion Pass Disablement: Existing fusion passes (fuse_norm_quant, fuse_qknorm_rope, fuse_allreduce_rms) are now conditionally disabled when the batch-invariant mode is active. This prevents potential conflicts and ensures compatibility with the new batch-invariant optimizations.
  • Triton Compatibility Patch for NPUGraph_EX: A new patch (patch_npugraph_ex_triton.py) has been added to address compatibility issues with torchair's ValuePack handling in Triton environments, ensuring correct processing of input parameters during npugraph_ex compilation.
  • Test Updates: Unit and end-to-end tests have been updated to reflect the new default enablement of npugraph_ex. Specific accuracy tests now explicitly disable npugraph_ex to maintain their original testing conditions.
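
The batch-invariant pass described above is, at its core, an FX graph rewrite. The real pass substitutes Triton kernels; the sketch below uses `torch.fx.subgraph_rewriter` with a plain Python stand-in (`batch_invariant_mm` is illustrative, not the actual kernel) to show the mechanism:

```python
import torch
import torch.fx as fx
from torch.fx import subgraph_rewriter

def batch_invariant_mm(a, b):
    # Stand-in for a batch-invariant Triton kernel; numerically it just
    # delegates to torch.mm, so the rewrite is behavior-preserving here.
    return torch.mm(a, b)

class Toy(torch.nn.Module):
    def forward(self, a, b):
        return torch.mm(a, b) + 1.0

# Pattern and replacement are traced by the rewriter and matched
# structurally against the module's FX graph.
def pattern(a, b):
    return torch.mm(a, b)

def replacement(a, b):
    return batch_invariant_mm(a, b)

traced = fx.symbolic_trace(Toy())
matches = subgraph_rewriter.replace_pattern(traced, pattern, replacement)

a, b = torch.randn(2, 3), torch.randn(3, 4)
out = traced(a, b)
```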


Changelog
  • docs/source/user_guide/configuration/additional_config.md
    • Updated the default value for the npugraph_ex enable option from False to True in the documentation.
  • tests/e2e/singlecard/test_aclgraph_accuracy.py
    • Added npugraph_ex_config to a test case to explicitly disable npugraph_ex for accuracy validation.
  • tests/ut/test_ascend_config.py
    • Modified unit test assertion to expect npugraph_ex_config.enable to be True by default.
  • vllm_ascend/ascend_config.py
    • Changed the default value of the enable parameter in NpugraphExConfig from False to True.
    • Updated the docstring for the enable parameter to reflect the new default value.
    • Implemented logic to disable fuse_norm_quant, fuse_qknorm_rope, and fuse_allreduce_rms when the VLLM_BATCH_INVARIANT environment variable is set.
  • vllm_ascend/compilation/compiler_interface.py
    • Integrated the BatchInvariantFXPass to apply batch-invariant transformations if the batch-invariant mode is enabled.
    • Corrected the configuration parameter name from num_speculative_token to num_speculative_tokens.
  • vllm_ascend/compilation/npu_graph_ex_pass_manager.py
    • Imported necessary modules for batch-invariant checks and Triton availability.
    • Added conditional logic to include BatchInvariantFXPass in the pass manager if batch-invariant mode is active and Triton is available.
    • Modified the conditions for adding fusion passes to ensure they are not applied when batch-invariant mode is enabled.
  • vllm_ascend/compilation/passes/batch_invariant_fx_pass.py
    • Added a new file defining BatchInvariantFXPass to replace aten matrix operations with batch-invariant Triton kernel implementations.
    • Implemented pattern matching and replacement for torch.mm, torch.matmul, torch.addmm, torch.bmm, and torch.nn.functional.linear.
  • vllm_ascend/patch/init.py
    • Added documentation for the new patch_npugraph_ex_triton.py.
  • vllm_ascend/patch/worker/init.py
    • Imported the new patch_npugraph_ex_triton module.
  • vllm_ascend/patch/worker/patch_npugraph_ex_triton.py
    • Added a new file to patch torchair.core._concrete_graph.ValuePack and related _unpack_meta and _unpack_npu functions.
    • Ensured correct handling of ValuePack in Triton scenarios for npugraph_ex backend.
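
The `VLLM_BATCH_INVARIANT` gating added in `ascend_config.py` can be sketched as follows (the function and key names here are illustrative, mirroring the behavior listed above rather than copying the real module):

```python
import os

FUSION_FLAGS = ("fuse_norm_quant", "fuse_qknorm_rope", "fuse_allreduce_rms")

def resolve_fusion_flags(config: dict) -> dict:
    """Disable fusion passes that conflict with batch-invariant mode."""
    if os.environ.get("VLLM_BATCH_INVARIANT", "0").lower() in ("1", "true"):
        # Batch-invariant kernels replace the ops these passes fuse, so
        # the fusions are forced off to avoid pattern conflicts.
        for key in FUSION_FLAGS:
            config[key] = False
    return config
```

With the variable set, all three fusion passes come back disabled regardless of their configured values, matching the compatibility constraint the changelog describes.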

@github-actions github-actions Bot added documentation Improvements or additions to documentation module:tests module:core labels Feb 10, 2026
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enables npugraph_ex by default, which is a significant performance-related change. The implementation also introduces a new 'batch-invariant' mode with corresponding graph passes. My main feedback is to address the code duplication in the implementation of the batch-invariant FX pass to improve long-term maintainability. I have also provided suggestions for the PR title and description to align with the repository's style guide.

Suggested PR Title:

[npugraph_ex][Feature] Enable npugraph_ex by default

Suggested PR Summary:

### What this PR does / why we need it?

This pull request enables the `npugraph_ex` backend by default to improve performance on Ascend NPUs, as proposed in the RFC.

It also introduces support for a "batch-invariant" mode, which can be enabled via the `VLLM_BATCH_INVARIANT` environment variable. When this mode is active, specific FX graph passes are applied to replace standard aten matrix operations with batch-invariant Triton kernels, making them compatible with `npugraph_ex` compilation. This also includes patches to `torchair` to support this new mode.

Fixes #6214

### Does this PR introduce _any_ user-facing change?

Yes. `npugraph_ex` is now enabled by default. Users can disable it by setting `enable: false` in the `npugraph_ex_config` section of the `additional_config`.

A new environment variable `VLLM_BATCH_INVARIANT` is introduced to enable the batch-invariant mode.

### How was this patch tested?

CI passed. The changes are covered by existing and new E2E tests (`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with `npugraph_ex` enabled and disabled, as well as with the new static kernel option.

Comment on lines +80 to +85
```python
# Apply batch-invariant FX pass if enabled
if vllm_is_batch_invariant():
    from vllm_ascend.compilation.passes.batch_invariant_fx_pass import (
        apply_batch_invariant_to_fx_graph,
    )
    graph = apply_batch_invariant_to_fx_graph(graph)
```
Contributor


Severity: high

This logic for applying the batch-invariant FX pass is duplicated. Another implementation exists in BatchInvariantFXPass (defined in vllm_ascend/compilation/passes/batch_invariant_fx_pass.py), which is used when npugraph_ex is disabled. This code duplication can lead to maintenance issues and potential inconsistencies.

To improve maintainability, please consider unifying these two implementations into one. Using BatchInvariantFXPass with its PatternMatcherPass would be a more robust and standard approach for both scenarios.
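
One way to realize the unification the review suggests is a single pass object that both code paths invoke. The sketch below swaps node targets directly (the class name and replacement mapping are hypothetical, and the real `BatchInvariantFXPass` uses pattern matching rather than this simplified target swap):

```python
import torch
import torch.fx as fx

def batch_invariant_mm(a, b):
    # Illustrative stand-in for a batch-invariant Triton kernel.
    return torch.mm(a, b)

class UnifiedBatchInvariantPass:
    """One pass object callable from both the npugraph_ex pass manager
    and the fallback path, avoiding duplicated rewrite logic."""

    def __init__(self, replacements):
        # replacements: {original op -> batch-invariant implementation}
        self.replacements = replacements

    def __call__(self, gm: fx.GraphModule) -> fx.GraphModule:
        for node in gm.graph.nodes:
            if node.op == "call_function" and node.target in self.replacements:
                node.target = self.replacements[node.target]
        gm.graph.lint()
        gm.recompile()
        return gm

class Toy(torch.nn.Module):
    def forward(self, a, b):
        return torch.mm(a, b)

gm = UnifiedBatchInvariantPass({torch.mm: batch_invariant_mm})(
    fx.symbolic_trace(Toy()))
```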

@huyq huyq force-pushed the npugraph_ex_switch_on branch 4 times, most recently from 2791b25 to 8547045 on February 11, 2026 01:27
Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
@huyq huyq force-pushed the npugraph_ex_switch_on branch from 8547045 to 78f94ec on February 11, 2026 02:13
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Feb 11, 2026
@wangxiyuan wangxiyuan merged commit a0315f6 into vllm-project:main Feb 12, 2026
60 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 12, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [Docs] Fix GLM-5 deploy command (vllm-project#6711)
  [npugraph_ex]enable npugraph_ex by default (vllm-project#6664)
  [doc]add GLM5.md (vllm-project#6709)
  [Model] GLM5 adaptation (vllm-project#6642)
  [Bugfix] Update target probs to target logits in rejection sample (vllm-project#6685)
  [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (vllm-project#5450)
  [CI]fix nightly multi node test error for wait for pod ready (vllm-project#6675)
  [main  to main] upgrade main 0210 (vllm-project#6673)
  [main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (vllm-project#6648)
  [Test][BugFix] Fix torch.rand usage in triton penalty test (vllm-project#6680)
  Add Worker Interface:check_health (vllm-project#6681)
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
@wangxiyuan wangxiyuan mentioned this pull request Feb 24, 2026
banxiaduhuo pushed a commit to banxiaduhuo/vllm-ascend that referenced this pull request Feb 26, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026

Labels

documentation (Improvements or additions to documentation), module:core, module:tests, ready (read for review), ready-for-test (start test by label for PR)


Development

Successfully merging this pull request may close these issues.

2 participants