Add AscendC triangular inverse #332

Merged
RuixuanZhang06 merged 7 commits into sgl-project:main from zouzias:anastasios_github_mr_tri_inv
Jan 22, 2026

Conversation

@zouzias
Contributor

@zouzias zouzias commented Jan 20, 2026

This MR contributes an AscendC triangular inverse kernel that implements a column sweep algorithm on the vector cores. The kernel is integrated with the chunk_gated_delta_rule_native method.
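For readers unfamiliar with the column sweep idea, here is a minimal NumPy sketch for an upper-triangular matrix. It is an illustration only: the function name `triu_inv_column_sweep` is hypothetical, and this is neither the AscendC kernel nor the `np_triu_inv_cs` reference used in the tests. It solves `U @ X = I` by starting from the identity and, for each column of `U` from last to first, scaling one row of `X` and eliminating that column's contribution from all rows above it.

```python
import numpy as np


def triu_inv_column_sweep(U):
    """Invert an upper-triangular matrix by column-oriented back substitution.

    Illustrative sketch only (hypothetical helper, not the PR's kernel).
    """
    U = np.asarray(U, dtype=np.float64)
    n = U.shape[0]
    X = np.eye(n)  # right-hand side, overwritten with the inverse
    for j in range(n - 1, -1, -1):
        # Row j of the solution is final once divided by the diagonal entry.
        X[j, :] /= U[j, j]
        # Sweep column j of U: remove its contribution from the rows above.
        X[:j, :] -= np.outer(U[:j, j], X[j, :])
    return X


# Small well-conditioned example.
rng = np.random.default_rng(0)
U = np.triu(rng.standard_normal((8, 8))) + 8.0 * np.eye(8)
X = triu_inv_column_sweep(U)
print(np.allclose(U @ X, np.eye(8)))
```

Each sweep step is a rank-1 update over whole rows, which is the kind of operation that maps naturally onto vector units.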

This is joint work with @asobczyk , @gioelegott and @learning-chip .

Users of chunk_gated_delta_rule_native can select the new kernel by setting tri_inv_fn = sgl_kernel_npu.fla.chunk.fast_inv_tril. The MR is backwards compatible for now.

The kernel torch.ops.npu.tri_inv is tested on Ascend A2 and 910B4 (x86_64).

Changelog:

  • The kernel is available as torch.ops.npu.tri_inv, supports fp16 and fp32, and is tested for matrix sizes 16, 32, 64, and 128.
  • The kernel is tested to return exactly the same output as the NumPy column sweep reference np_triu_inv_cs; see tests/python/sgl_kernel_npu/test_tri_inv_col_sweep.py.
  • The method chunk_gated_delta_rule_native is refactored so that callers can provide a custom triangular inverse kernel. Available options: inv_tril_inplace or torch.ops.npu.tri_inv (we plan to contribute further kernels).
  • The accuracy of chunk gated attention using the new triangular inverse method is tested against inv_tril_inplace in tests/python/sgl_kernel_npu/test_gated_delta_ascendc_tri_inv.py.
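The refactoring described in the changelog amounts to passing the inverse routine in as a callable. The sketch below illustrates that pattern with NumPy stand-ins only. All names in it (`toy_chunk_rule`, `default_tri_inv`, `solve_based_tri_inv`) are hypothetical, and the real signature of chunk_gated_delta_rule_native is not reproduced here. The input is assumed, for illustration, to be a unit lower-triangular matrix.

```python
import numpy as np


def default_tri_inv(mat):
    """Stand-in for the default inverse: a generic dense inverse."""
    return np.linalg.inv(mat)


def solve_based_tri_inv(mat):
    """Stand-in for a custom kernel: solve mat @ X = I instead of calling inv."""
    return np.linalg.solve(mat, np.eye(mat.shape[-1], dtype=mat.dtype))


def toy_chunk_rule(scores, tri_inv_fn=default_tri_inv):
    """Hypothetical caller that needs the inverse of a unit lower-triangular matrix.

    The inverse routine is injected, so existing callers keep the default
    and new callers can opt in to a different kernel.
    """
    n = scores.shape[-1]
    unit_lower = np.eye(n) + np.tril(scores, k=-1)
    return tri_inv_fn(unit_lower)


rng = np.random.default_rng(0)
scores = 0.1 * rng.standard_normal((16, 16))

baseline = toy_chunk_rule(scores)                                  # default path
custom = toy_chunk_rule(scores, tri_inv_fn=solve_based_tri_inv)    # opt-in path
print(np.allclose(baseline, custom))
```

Keeping the default argument is what makes such a change backwards compatible: callers that never pass tri_inv_fn see no difference.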

Let us know if you approve of merging such kernels into your repository. If so, we plan to contribute a few more AscendC kernels that offer additional performance improvements by utilizing the Cube cores.

Performance

The geometric-mean speed-up of the AscendC kernel over the Triton kernel is about 1.78x.

| batch_size | chunk_size | triton_time_us | ascendc_aiv_time_us | speed_up |
|---|---|---|---|---|
| 256 | 16 | 386 | 151 | 2.55629139072848 |
| 384 | 16 | 292 | 196 | 1.48979591836735 |
| 512 | 16 | 384 | 239 | 1.60669456066946 |
| 640 | 16 | 431 | 285 | 1.51228070175439 |
| 768 | 16 | 524 | 335 | 1.56417910447761 |
| 896 | 16 | 614 | 381 | 1.61154855643045 |
| 1024 | 16 | 663 | 427 | 1.55269320843091 |
| 1152 | 16 | 754 | 469 | 1.60767590618337 |
| 128 | 32 | 423 | 266 | 1.59022556390977 |
| 192 | 32 | 576 | 323 | 1.78328173374613 |
| 256 | 32 | 731 | 427 | 1.71194379391101 |
| 320 | 32 | 884 | 484 | 1.82644628099174 |
| 384 | 32 | 1082 | 592 | 1.82770270270270 |
| 448 | 32 | 1240 | 701 | 1.76890156918688 |
| 512 | 32 | 1394 | 764 | 1.82460732984293 |
| 576 | 32 | 1545 | 870 | 1.77586206896552 |
| 64 | 64 | 835 | 487 | 1.71457905544148 |
| 96 | 64 | 1233 | 690 | 1.78695652173913 |
| 128 | 64 | 1592 | 903 | 1.76301218161683 |
| 160 | 64 | 1943 | 908 | 2.13986784140969 |
| 192 | 64 | 2338 | 1121 | 2.08563782337199 |
| 224 | 64 | 2698 | 1333 | 2.02400600150038 |
| 256 | 64 | 3088 | 1545 | 1.99870550161812 |
| 288 | 64 | 3449 | 1761 | 1.95854628052243 |
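As a sanity check on the reported geomean, the snippet below recomputes the per-row speed-ups and their geometric mean from the timing columns of the table above. The timings are copied from the table; the variable names are my own, and this is not the script the authors used.

```python
import statistics

# (batch_size, chunk_size, triton_time_us, ascendc_aiv_time_us), copied from the table.
ROWS = [
    (256, 16, 386, 151), (384, 16, 292, 196), (512, 16, 384, 239),
    (640, 16, 431, 285), (768, 16, 524, 335), (896, 16, 614, 381),
    (1024, 16, 663, 427), (1152, 16, 754, 469),
    (128, 32, 423, 266), (192, 32, 576, 323), (256, 32, 731, 427),
    (320, 32, 884, 484), (384, 32, 1082, 592), (448, 32, 1240, 701),
    (512, 32, 1394, 764), (576, 32, 1545, 870),
    (64, 64, 835, 487), (96, 64, 1233, 690), (128, 64, 1592, 903),
    (160, 64, 1943, 908), (192, 64, 2338, 1121), (224, 64, 2698, 1333),
    (256, 64, 3088, 1545), (288, 64, 3449, 1761),
]

# Speed-up is the Triton time divided by the AscendC time for the same shape.
speedups = [triton_us / ascendc_us for _, _, triton_us, ascendc_us in ROWS]

# The geometric mean is the appropriate average for ratios.
geomean = statistics.geometric_mean(speedups)
print(f"geomean speed-up: {geomean:.2f}x over {len(speedups)} configurations")
```

The geometric mean is used rather than the arithmetic mean because the quantities being averaged are ratios, so a single large value such as the first row does not dominate the result.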

@gemini-code-assist
Contributor

Summary of Changes

Hello @zouzias, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of triangular matrix inversion operations on Ascend NPUs by introducing a new, highly optimized AscendC kernel. This kernel, which utilizes a column sweep algorithm on vector cores, is seamlessly integrated into the chunk_gated_delta_rule_native method, providing substantial speedups for relevant computations. The changes also improve the modularity of the chunk_gated_delta_rule_native function by allowing custom inverse kernel implementations, all while maintaining accuracy and backward compatibility.

Highlights

  • New AscendC Kernel for Triangular Inverse: Introduces a new AscendC kernel, torch.ops.npu.tri_inv, that implements a column sweep algorithm for triangular matrix inversion, optimized for Ascend NPU vector cores.
  • Performance Improvement: Achieves significant performance gains, with up to 2.66X speedup for matrix inverse operations (size 128) and up to 1.23X improvement for the chunk_gated_delta_rule_native method.
  • Expanded Support and Compatibility: The new kernel supports fp16 and fp32 data types and is tested for matrix sizes 16, 32, 64, 128. It is also designed to be backwards compatible with existing usage.
  • Refactored chunk_gated_delta_rule_native: The chunk_gated_delta_rule_native method has been refactored to allow users to provide custom triangular inverse kernels, enhancing its flexibility and extensibility.
  • Comprehensive Testing: Includes extensive tests to verify the accuracy of the new kernel against NumPy's column sweep method and np.linalg.inv, as well as its integration and accuracy within the chunk_gated_delta_rule_native context.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new AscendC kernel for triangular matrix inversion, which is then integrated into the chunk_gated_delta_rule_native method. The changes include the kernel implementation, host-side logic, PyTorch op registration, and comprehensive tests. The refactoring to allow a custom triangular inverse function is a good design choice.

My review focuses on correctness and code quality. I've found a potential bug in the host-side error handling for unsupported data types and some issues in the test logic that could lead to incorrect validation. I've also included some suggestions for code cleanup. Overall, this is a solid contribution with significant performance improvements.

zouzias and others added 4 commits January 20, 2026 11:57
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ping1jing2 ping1jing2 self-assigned this Jan 20, 2026
@VDV1985

VDV1985 commented Jan 21, 2026

@zouzias have you checked integration with SGLang? Are there any accuracy results for the full model like Qwen3-Next?

@zouzias zouzias changed the title Add an AscendC triangular inverse kernel Add AscendC triangular inverse kernel Jan 21, 2026
@zouzias zouzias changed the title Add AscendC triangular inverse kernel Add AscendC triangular inverse Jan 21, 2026
@RuixuanZhang06 RuixuanZhang06 merged commit 925c2d5 into sgl-project:main Jan 22, 2026
3 checks passed
@learning-chip

> Are there any accuracy results for the full model like Qwen3-Next?

Sure, we will add end-to-end tests in separate PRs, most likely to the main sglang repo.

@zouzias zouzias deleted the anastasios_github_mr_tri_inv branch January 24, 2026 09:03
Yael-X added a commit to Yael-X/sgl-kernel-npu that referenced this pull request Jan 26, 2026
* 'main' of https://github.com/sgl-project/sgl-kernel-npu: (24 commits)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
  Modify contribution guide (sgl-project#315)
  fix bmm transpose in cann 8.5 (sgl-project#316)
  fix little batchsize and int8 quant on ci (sgl-project#302)
  ...
zhuyutong332 added a commit to zhuyutong332/sgl-kernel-npu that referenced this pull request Jan 27, 2026
* upstream/main:
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
AndyKong2020 pushed a commit to AndyKong2020/sgl-kernel-npu that referenced this pull request Mar 24, 2026
* (tri_inv) ascendc triangular inverse column sweep