Add AscendC triangular inverse #332
Summary of Changes
This pull request improves the performance of triangular matrix inversion on Ascend NPUs by introducing a new AscendC kernel. The kernel uses a column sweep algorithm on vector cores.
Code Review
This pull request introduces a new AscendC kernel for triangular matrix inversion, which is then integrated into the chunk_gated_delta_rule_native method. The changes include the kernel implementation, host-side logic, PyTorch op registration, and comprehensive tests. The refactoring to allow a custom triangular inverse function is a good design choice.
My review focuses on correctness and code quality. I've found a potential bug in the host-side error handling for unsupported data types and some issues in the test logic that could lead to incorrect validation. I've also included some suggestions for code cleanup. Overall, this is a solid contribution with significant performance improvements.
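The refactoring noted in the review, where the caller supplies the triangular inverse routine, can be pictured with a small NumPy sketch. This is only an illustration of the callback pattern: `apply_inverse`, `default_tril_inv` and the `TriInvFn` alias are hypothetical names, and the actual signature of chunk_gated_delta_rule_native is not reproduced here.

```python
from typing import Callable

import numpy as np

# Hypothetical alias: any callable mapping a (batched) triangular matrix to its inverse.
TriInvFn = Callable[[np.ndarray], np.ndarray]


def default_tril_inv(mat: np.ndarray) -> np.ndarray:
    """Fallback: generic dense inverse, which also accepts batched (..., n, n) input."""
    return np.linalg.inv(mat)


def apply_inverse(
    mat: np.ndarray,
    rhs: np.ndarray,
    tri_inv_fn: TriInvFn = default_tril_inv,
) -> np.ndarray:
    """Compute inv(mat) @ rhs, delegating the inversion to a caller-supplied function."""
    inv = tri_inv_fn(mat)
    if inv.shape != mat.shape:
        raise ValueError("tri_inv_fn must return an array with the same shape as its input")
    return inv @ rhs
```

Because the default argument preserves the previous behaviour, existing callers are unaffected, which is one way to keep such a change backwards compatible.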
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@zouzias have you checked integration with SGLang? Are there any accuracy results for the full model like Qwen3-Next?
Sure, we will add e2e tests in separate PRs, likely to the main sglang repo.
* 'main' of https://github.com/sgl-project/sgl-kernel-npu: (24 commits)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
  Modify contribution guide (sgl-project#315)
  fix bmm transpose in cann 8.5 (sgl-project#316)
  fix little batchsize and int8 quant on ci (sgl-project#302)
  ...
* upstream/main:
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
* (tri_inv) ascendc triangular inverse column sweep
This MR contributes an AscendC triangular inverse kernel that implements a column sweep algorithm using vector cores. The kernel is integrated with the chunk_gated_delta_rule_native method.
This is joint work with @asobczyk, @gioelegott and @learning-chip.
Users of chunk_gated_delta_rule_native can use the new kernel by setting tri_inv_fn = sgl_kernel_npu.fla.chunk.fast_inv_tril. The MR is backwards compatible for now.
The kernel torch.ops.npu.tri_inv is tested on Ascend A2 and 910B4 (x86_64).
Changelog:
* torch.ops.npu.tri_inv supports fp16 and fp32 and is tested for matrix sizes 16, 32, 64, 128.
* np_triu_inv_cs, see tests/python/sgl_kernel_npu/test_tri_inv_col_sweep.py
* chunk_gated_delta_rule_native is refactored so that it allows its users to provide a custom triangular inverse kernel. Available options: inv_tril_inplace or torch.ops.npu.tri_inv (we plan to contribute further kernels).
* inv_tri_inplace in tests/python/sgl_kernel_npu/test_gated_delta_ascendc_tri_inv.py
Let us know whether you approve merging such kernels into your repository. If so, we plan to contribute a few more AscendC kernels that offer additional performance improvements by utilizing the Cube cores.
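For readers unfamiliar with the column sweep idea named in the description, the following NumPy sketch shows one textbook form of it: column-oriented back substitution applied to all columns of the identity at once. It assumes an upper-triangular, non-singular input and works in float64. The function name `triu_inv_column_sweep_sketch` is hypothetical; this is not the AscendC kernel, nor the np_triu_inv_cs reference used in the tests, and it ignores tiling, data layout and fp16 handling.

```python
import numpy as np


def triu_inv_column_sweep_sketch(upper: np.ndarray) -> np.ndarray:
    """Invert an upper-triangular matrix by solving U @ X = I with a column sweep."""
    u = np.asarray(upper, dtype=np.float64)
    n = u.shape[0]
    if u.shape != (n, n):
        raise ValueError("expected a square matrix")
    if np.any(np.diag(u) == 0.0):
        raise ValueError("matrix is singular (zero on the diagonal)")

    x = np.eye(n)  # right-hand side I, overwritten with the inverse
    # Sweep the columns of U from last to first.
    for j in range(n - 1, -1, -1):
        # Row j of the solution is final once it is scaled by the pivot.
        x[j, :] /= u[j, j]
        # Remove the contribution of column j from every row above it.
        x[:j, :] -= np.outer(u[:j, j], x[j, :])
    return x
```

Each step touches a whole column of U and whole rows of the solution, so the inner work is made of vector operations, which is presumably what makes this formulation a natural fit for vector cores. A lower-triangular variant would sweep the columns in the opposite order.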
Performance

The geometric-mean speed-up is about 1.78×.
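As a reminder of how a geometric-mean speed-up is typically computed from per-case timings, here is a minimal sketch. The helper names are hypothetical and the code contains no measurements from this PR; it only illustrates the formula, exp(mean(log(baseline_i / new_i))).

```python
import math
from typing import Sequence


def geomean(values: Sequence[float]) -> float:
    """Geometric mean of strictly positive values, computed in log space."""
    if len(values) == 0:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geomean_speedup(baseline_times: Sequence[float], new_times: Sequence[float]) -> float:
    """Geometric mean of the per-case ratios baseline_time / new_time."""
    if len(baseline_times) != len(new_times):
        raise ValueError("both timing lists must have the same length")
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    return geomean(ratios)
```

The geometric mean is the usual choice for averaging ratios because a 2× gain and a 2× loss cancel to exactly 1.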