Add AscendC triangular inverse by zouzias · Pull Request #332 · sgl-project/sgl-kernel-npu

zouzias · 2026-01-20T10:52:29Z

This MR contributes an AscendC triangular inverse kernel that implements a column sweep algorithm on the vector cores. The kernel is integrated with the chunk_gated_delta_rule_native method.

This is joint work with @asobczyk , @gioelegott and @learning-chip .

Users of chunk_gated_delta_rule_native can opt in to the new kernel by setting tri_inv_fn = sgl_kernel_npu.fla.chunk.fast_inv_tril. The MR is backwards compatible for now.
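
The pluggable-inverse pattern can be illustrated without NPU hardware. The following is a minimal NumPy sketch under stated assumptions, not the actual chunk_gated_delta_rule_native code: `apply_inverse`, `inv_tril_reference`, and `inv_tril_forward_substitution` are hypothetical names invented for this example, and the only property taken from the PR is that the function passed as `tri_inv_fn` must return the matrix inverse.

```python
import numpy as np


def inv_tril_reference(L):
    """Default inverse: general-purpose dense inversion."""
    return np.linalg.inv(L)


def inv_tril_forward_substitution(L):
    """Alternative inverse: solve L @ X = I row by row (forward substitution)."""
    n = L.shape[0]
    X = np.zeros((n, n), dtype=np.float64)
    identity = np.eye(n)
    for i in range(n):
        # Row i of X depends only on the rows already computed above it.
        X[i] = (identity[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X


def apply_inverse(L, v, tri_inv_fn=inv_tril_reference):
    """Hypothetical caller: any tri_inv_fn that returns inv(L) can be plugged in."""
    return tri_inv_fn(L) @ v


rng = np.random.default_rng(0)
# Unit lower-triangular matrix with small off-diagonal entries (well conditioned).
L = np.tril(0.1 * rng.standard_normal((8, 8)), k=-1) + np.eye(8)
v = rng.standard_normal((8, 4))

out_default = apply_inverse(L, v)
out_custom = apply_inverse(L, v, tri_inv_fn=inv_tril_forward_substitution)
print(np.allclose(out_default, out_custom))
```

Keeping the inverse behind a single callable argument with a default is what lets existing callers stay unchanged while new kernels are swapped in.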

The kernel torch.ops.npu.tri_inv is tested on Ascend A2 and 910B4 (x86_64).

Changelog:

The kernel is available as torch.ops.npu.tri_inv, supports fp16 and fp32, and is tested for matrix sizes 16, 32, 64, and 128.
The kernel is tested and returns exactly the same output as NumPy's column sweep method np_triu_inv_cs; see tests/python/sgl_kernel_npu/test_tri_inv_col_sweep.py
The method chunk_gated_delta_rule_native is refactored so that its users can provide a custom triangular inverse kernel. Available options: inv_tril_inplace or torch.ops.npu.tri_inv (we plan to contribute further kernels).
The accuracy of chunk gated attention using the new triangular inverse method is tested against inv_tri_inplace in tests/python/sgl_kernel_npu/test_gated_delta_ascendc_tri_inv.py
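
For readers unfamiliar with the term, a column sweep inverts a triangular matrix by back substitution organised column by column: once row j of the result is known, column j of the input is swept out of all rows above it. The NumPy sketch below is an illustration only. It is neither the np_triu_inv_cs reference from the test file nor the AscendC kernel, and the name `triu_inv_column_sweep` is made up for this example.

```python
import numpy as np


def triu_inv_column_sweep(U):
    """Invert an upper-triangular matrix by solving U @ X = I, one column of U at a time."""
    U = np.asarray(U, dtype=np.float64)
    n = U.shape[0]
    X = np.eye(n)  # starts as the right-hand side I, ends as inv(U)
    for j in range(n - 1, -1, -1):
        # Row j of the solution is final once it is scaled by the pivot.
        X[j, :] /= U[j, j]
        # Sweep column j of U out of every row above row j.
        X[:j, :] -= np.outer(U[:j, j], X[j, :])
    return X


rng = np.random.default_rng(0)
n = 16
# Random upper-triangular matrix with a dominant diagonal so it is well conditioned.
U = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)
X = triu_inv_column_sweep(U)
print(np.allclose(U @ X, np.eye(n)))
```

Each sweep step is a scaled vector update across whole rows, which is the kind of operation that maps naturally onto vector units.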

Let us know whether you are open to merging such kernels into your repository. If so, we plan to contribute a few more AscendC kernels that offer additional performance improvements by utilizing the Cube cores.

Performance

The geometric-mean speed-up over the Triton baseline is about 1.78x.

batch_size	chunk_size	triton_time_us	ascendc_aiv_time_us	speed_up
256	16	386	151	2.55629139072848
384	16	292	196	1.48979591836735
512	16	384	239	1.60669456066946
640	16	431	285	1.51228070175439
768	16	524	335	1.56417910447761
896	16	614	381	1.61154855643045
1024	16	663	427	1.55269320843091
1152	16	754	469	1.60767590618337
128	32	423	266	1.59022556390977
192	32	576	323	1.78328173374613
256	32	731	427	1.71194379391101
320	32	884	484	1.82644628099174
384	32	1082	592	1.82770270270270
448	32	1240	701	1.76890156918688
512	32	1394	764	1.82460732984293
576	32	1545	870	1.77586206896552
64	64	835	487	1.71457905544148
96	64	1233	690	1.78695652173913
128	64	1592	903	1.76301218161683
160	64	1943	908	2.13986784140969
192	64	2338	1121	2.08563782337199
224	64	2698	1333	2.02400600150038
256	64	3088	1545	1.99870550161812
288	64	3449	1761	1.95854628052243
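
The geometric mean quoted above can be recomputed from the two timing columns of the table. The sketch below copies the table rows verbatim; the variable names and the per-chunk-size breakdown are additions made for illustration.

```python
import math
from collections import defaultdict

# (batch_size, chunk_size, triton_time_us, ascendc_aiv_time_us), copied from the table.
rows = [
    (256, 16, 386, 151), (384, 16, 292, 196), (512, 16, 384, 239), (640, 16, 431, 285),
    (768, 16, 524, 335), (896, 16, 614, 381), (1024, 16, 663, 427), (1152, 16, 754, 469),
    (128, 32, 423, 266), (192, 32, 576, 323), (256, 32, 731, 427), (320, 32, 884, 484),
    (384, 32, 1082, 592), (448, 32, 1240, 701), (512, 32, 1394, 764), (576, 32, 1545, 870),
    (64, 64, 835, 487), (96, 64, 1233, 690), (128, 64, 1592, 903), (160, 64, 1943, 908),
    (192, 64, 2338, 1121), (224, 64, 2698, 1333), (256, 64, 3088, 1545), (288, 64, 3449, 1761),
]


def geomean(values):
    """Geometric mean computed in log space for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Speed-up is the Triton time divided by the AscendC time for each row.
speedups = [triton / ascendc for _, _, triton, ascendc in rows]
overall = geomean(speedups)
print(f"overall geomean speed-up: {overall:.2f}")

# Break the same figure down by chunk size.
by_chunk = defaultdict(list)
for _, chunk, triton, ascendc in rows:
    by_chunk[chunk].append(triton / ascendc)
for chunk, values in sorted(by_chunk.items()):
    print(f"chunk_size={chunk}: geomean speed-up {geomean(values):.2f}")
```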

gemini-code-assist · 2026-01-20T10:52:58Z

Summary of Changes

Hello @zouzias, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of triangular matrix inversion operations on Ascend NPUs by introducing a new, highly optimized AscendC kernel. This kernel, which utilizes a column sweep algorithm on vector cores, is seamlessly integrated into the chunk_gated_delta_rule_native method, providing substantial speedups for relevant computations. The changes also improve the modularity of the chunk_gated_delta_rule_native function by allowing custom inverse kernel implementations, all while maintaining accuracy and backward compatibility.

Highlights

New AscendC Kernel for Triangular Inverse: Introduces a new AscendC kernel, torch.ops.npu.tri_inv, that implements a column sweep algorithm for triangular matrix inversion, optimized for Ascend NPU vector cores.
Performance Improvement: Achieves significant performance gains, with up to 2.66X speedup for matrix inverse operations (size 128) and up to 1.23X improvement for the chunk_gated_delta_rule_native method.
Expanded Support and Compatibility: The new kernel supports fp16 and fp32 data types and is tested for matrix sizes 16, 32, 64, 128. It is also designed to be backwards compatible with existing usage.
Refactored chunk_gated_delta_rule_native: The chunk_gated_delta_rule_native method has been refactored to allow users to provide custom triangular inverse kernels, enhancing its flexibility and extensibility.
Comprehensive Testing: Includes extensive tests to verify the accuracy of the new kernel against NumPy's column sweep method and np.linalg.inv, as well as its integration and accuracy within the chunk_gated_delta_rule_native context.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new AscendC kernel for triangular matrix inversion, which is then integrated into the chunk_gated_delta_rule_native method. The changes include the kernel implementation, host-side logic, PyTorch op registration, and comprehensive tests. The refactoring to allow a custom triangular inverse function is a good design choice.

My review focuses on correctness and code quality. I've found a potential bug in the host-side error handling for unsupported data types and some issues in the test logic that could lead to incorrect validation. I've also included some suggestions for code cleanup. Overall, this is a solid contribution with significant performance improvements.

csrc/tri_inv/op_host/tri_inv.cpp

tests/python/sgl_kernel_npu/test_triangular_inverse.py

csrc/tri_inv/README.md

csrc/tri_inv/op_host/tri_inv.cpp

csrc/tri_inv/op_kernel/kernel_tri_inv.h

include/sgl_kenel_npu_ops.h

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

VDV1985 · 2026-01-21T09:25:49Z

@zouzias have you checked integration with SGLang? Are there any accuracy results for the full model like Qwen3-Next?

python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py

csrc/pytorch_extensions.cpp

learning-chip · 2026-01-23T08:34:28Z

> Are there any accuracy results for the full model like Qwen3-Next?

Sure, we will add end-to-end tests in separate PRs, likely to the main sglang repo.


(tri_inv) ascendc triangular inverse column sweep

de2dfe9

gemini-code-assist bot reviewed Jan 20, 2026


zouzias and others added 4 commits January 20, 2026 11:57

Update csrc/tri_inv/README.md

e119a29

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

(tri_inv_col_sweep) throw runtime error for unknown dtype

b51bdaf

gemini review comments

f8c4bcd

Update include/sgl_kenel_npu_ops.h

f041660

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

ping1jing2 self-assigned this Jan 20, 2026

Napkin-AI reviewed Jan 21, 2026


python/sgl_kernel_npu/sgl_kernel_npu/fla/chunk.py (outdated, resolved)

tri_inv_fn must return the matrix inverse

da9420e

zouzias changed the title ~~Add an AscendC triangular inverse kernel~~ Add AscendC triangular inverse kernel Jan 21, 2026

zouzias changed the title ~~Add AscendC triangular inverse kernel~~ Add AscendC triangular inverse Jan 21, 2026

RuixuanZhang06 reviewed Jan 22, 2026


csrc/pytorch_extensions.cpp (outdated, resolved)

fix chunk.py issues

c7d4d88

RuixuanZhang06 approved these changes Jan 22, 2026


RuixuanZhang06 merged commit 925c2d5 into sgl-project:main Jan 22, 2026
3 checks passed

zouzias deleted the anastasios_github_mr_tri_inv branch January 24, 2026 09:03

AndyKong2020 pushed a commit to AndyKong2020/sgl-kernel-npu that referenced this pull request Mar 24, 2026

Add AscendC triangular inverse (sgl-project#332)

8f5026c

* (tri_inv) ascendc triangular inverse column sweep

RuixuanZhang06 merged 7 commits into sgl-project:main from zouzias:anastasios_github_mr_tri_inv


6 participants
