
[Ascend]Support of piecewise graph compilation for prefill on NPU#12287

Merged
ispobock merged 44 commits into sgl-project:main from Vladimir221:vkh/piecewise_graph_npu_support
Dec 11, 2025

Conversation

@Vladimir221
Contributor

@Vladimir221 Vladimir221 commented Oct 28, 2025

Motivation

Compiling the model forward pass at prefill speeds up inference, as already shown in PR #10062, which enabled this feature for CUDA devices. The current PR enables the same feature for NPU devices.

Modifications

Added:

  • platform-based selection of the backend for PiecewiseCompileInterpreter
  • a piecewise prefill compilation backend for NPU
  • an implementation of weak_ref_tensor for NPU
  • platform-based selection of the weak_ref_tensor implementation
  • a check of the piecewise_cuda_graph_compiler option, since NPU only supports prefill compilation with the 'eager' backend
  • a test for the piecewise graph on NPU

Changed:

  • the device argument in the direct_register_custom_op function to PrivateUse1 when the platform is NPU
  • the device type of seq_lens_cpu, extend_seq_lens_cpu, extend_prefix_lens_cpu, and extend_logprob_start_lens_cpu in the warmup_and_capture and capture_one_batch_size methods of the PiecewiseCudaGraphRunner class, since these tensors must be allocated on the CPU
  • the _cache_loc_dtype method of the PiecewiseCudaGraphRunner class, since NPU supports the int32 type for out_cache_loc
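The platform-based selection described above can be sketched as follows. This is a minimal stand-in: the names make_backend, is_npu, CUDAPiecewiseBackend, and NPUPiecewiseBackend follow the PR text, but the classes here are stubs, not the real sglang implementations.

```python
# Minimal sketch of platform-based backend selection (names assumed from the
# PR; the classes below are stubs, not the real sglang implementations).

class CUDAPiecewiseBackend:
    """Stand-in for the existing CUDA piecewise compilation backend."""
    name = "cuda"

class NPUPiecewiseBackend(CUDAPiecewiseBackend):
    """Stand-in for the NPU backend; the PR derives it from the CUDA one
    to avoid duplicating the backend initialization."""
    name = "npu"

def is_npu() -> bool:
    # Stand-in for the real platform probe.
    return False

def make_backend():
    """Pick the piecewise-compilation backend for the current platform."""
    return NPUPiecewiseBackend if is_npu() else CUDAPiecewiseBackend
```

The same one-line dispatch pattern covers the weak_ref_tensor selection as well: probe the platform once, then hand back the matching implementation.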

Accuracy Tests

GSM 8K Llama-3.1-8B
Ascend 910B, tp-size=1, concurrency=128

# Without Piecewise Cuda Graph
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:07<00:00, 19.47it/s]
Accuracy: 0.760
Invalid: 0.001
Latency: 67.817 s
Output throughput: 1769.183 token/s
# With Piecewise Cuda Graph
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:56<00:00, 23.21it/s]
Accuracy: 0.759
Invalid: 0.001
Latency: 56.924 s
Output throughput: 2079.687 token/s
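For reference, the logs above imply roughly a 1.19x latency and 1.18x throughput improvement:

```python
# Speedup implied by the GSM 8K run above (values copied from the logs).
latency_base, latency_piecewise = 67.817, 56.924
thpt_base, thpt_piecewise = 1769.183, 2079.687

latency_speedup = latency_base / latency_piecewise   # ~1.19x
thpt_gain = thpt_piecewise / thpt_base               # ~1.18x
print(f"{latency_speedup:.2f}x latency, {thpt_gain:.2f}x throughput")
```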

Benchmarking and Profiling

Profiling

[profiling trace screenshot]

Checklist

Vladimir221 and others added 6 commits October 28, 2025 18:47
Added choosing of backend for PiecewiseCompileInterpreter based on platform
Added backend for piecewise prefill compilation for NPU
Added implementation of weak_ref_tensor for npu
Added choosing of weak_ref_tensor implementation based on platform
seq_lens_cpu and extend_seq_lens_cpu should be allocated on cpu
…king of piecewise_cuda_graph_compiler option, changed device arg to PrivateUse1 in direct_register_custom_op function if platform is NPU
@gemini-code-assist
Contributor

Summary of Changes

Hello @Vladimir221, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the model inference capabilities by extending piecewise graph compilation to NPU devices. The core objective is to accelerate the prefill phase of inference on NPU hardware, mirroring the performance gains observed on CUDA. This is achieved through the introduction of NPU-specific backend logic, optimized memory management using NPU-compatible weak tensor references, and precise device placement for CPU-bound tensors. Additionally, the changes include a mechanism to enforce the 'eager' compilation backend for NPU, ensuring compatibility with current NPU capabilities.

Highlights

  • NPU Piecewise Compilation Support: Introduces comprehensive support for piecewise graph compilation on NPU (Neural Processing Unit) devices, extending the performance benefits previously available only for CUDA.
  • Dynamic Backend Selection: Adds a new make_backend function that intelligently selects between CUDAPiecewiseBackend and the newly implemented NPUPiecewiseBackend based on the detected platform.
  • NPU-Specific Weak Reference Tensors: Implements weak_ref_tensor functionality tailored for NPU devices in both Python and C++, crucial for efficient memory management during graph capture on NPU.
  • CPU Tensor Allocation Correction: Ensures that specific tensors, seq_lens_cpu and extend_seq_lens_cpu, are explicitly allocated on the CPU within the graph runner to prevent device mismatches and ensure correct operation.
  • NPU Compilation Backend Enforcement: Enforces the use of the 'eager' backend for prefill graph compilation when operating on NPU devices, aligning with the currently supported compilation modes for this platform.
  • Custom Operation Device Handling: Modifies the direct_register_custom_op function to correctly register custom operations for 'PrivateUse1' (NPU) devices when the NPU platform is detected.
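The 'eager' enforcement mentioned in the highlights could look roughly like this; a hedged sketch in which the function name and error message are assumed, not the actual sglang code:

```python
# Sketch of the NPU 'eager'-only check (function name and message assumed).
def validate_piecewise_compiler(platform: str, compiler: str) -> str:
    """Reject unsupported piecewise compilers on NPU; pass everything else through."""
    if platform == "npu" and compiler != "eager":
        raise ValueError(
            "NPU only supports piecewise prefill compilation with the 'eager' backend"
        )
    return compiler
```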

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request extends piecewise graph compilation support to NPU devices, mirroring the existing functionality for CUDA to improve prefill inference performance. The implementation introduces an NPU-specific backend and a JIT-compiled weak_ref_tensor operator. The changes are well-structured, but I've identified a couple of issues in the new NPU backend where CUDA-specific code was left behind, and a minor improvement for the JIT compilation script. Addressing these points will enhance the correctness and maintainability of the NPU support.

@ping1jing2 ping1jing2 changed the title Support of piecewise graph compilation for prefill on NPU [Ascend]Support of piecewise graph compilation for prefill on NPU Oct 29, 2025
Contributor

@ssshinigami ssshinigami left a comment


LGTM

  try:
      my_lib.define(op_name + schema_str)
-     my_lib.impl(op_name, op_func, "CUDA")
+     my_lib.impl(op_name, op_func, "CUDA" if not is_npu() else "PrivateUse1")
Collaborator


What is this PrivateUse1 used for?

Contributor Author


PrivateUse1 is a reserved dispatch key provided by PyTorch for integrating a new backend that lives outside pytorch/pytorch and for dispatching PyTorch functionality to custom backend kernels. The backend for NPU operators is registered via this key (https://docs.pytorch.org/tutorials/advanced/privateuseone.html)
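To illustrate the idea, here is a toy dispatch table, not PyTorch's actual dispatcher: the essence is a mapping from (op, dispatch key) to a kernel, with NPU kernels living under the PrivateUse1 key.

```python
# Toy model of dispatch-key routing; PyTorch's real dispatcher is far more
# involved, but the (op, key) -> kernel mapping is the same idea.
registry = {}

def impl(op_name, fn, dispatch_key):
    """Register fn as the kernel for op_name under the given dispatch key."""
    registry[(op_name, dispatch_key)] = fn

def dispatch(op_name, dispatch_key, *args):
    """Look up and invoke the kernel registered for this (op, key) pair."""
    return registry[(op_name, dispatch_key)](*args)

impl("weak_ref_tensor", lambda t: t, "CUDA")
impl("weak_ref_tensor", lambda t: t, "PrivateUse1")  # NPU kernels go under PrivateUse1
```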

Collaborator


hi @Vladimir221, will CUDA and NPU devices exist in the same node? If not, you can register for CUDA/NPU at the same time

Contributor Author


hi @Vladimir221, will CUDA and NPU devices exist in the same node? If not, you can register for CUDA/NPU at the same time

Do you suggest registering implementations of the custom op functions for both dispatch keys and removing the if statement?

my_lib.impl(op_name, op_func, "CUDA")
my_lib.impl(op_name, op_func, "PrivateUse1")

Collaborator


Do you suggest registering implementations of the custom op functions for both dispatch keys and removing the if statement?

From my view, yes. It might save an if-else cost

@Oasis-Git
Collaborator

Oasis-Git commented Nov 3, 2025

@Vladimir221 Thanks for your contribution. LGTM. I am wondering whether we should add a related unit test?

@Vladimir221
Contributor Author

Vladimir221 commented Nov 6, 2025

@Vladimir221 Thanks for your contribution. LGTM. I am wondering whether we should add a related unit test?

@Oasis-Git Added a new test in the ascend directory

@ping1jing2 ping1jing2 marked this pull request as ready for review November 27, 2025 02:37
@ispobock
Collaborator

/tag-and-rerun-ci

Contributor

@ssshinigami ssshinigami left a comment


LGTM

@ping1jing2 ping1jing2 self-assigned this Nov 28, 2025
raise NotImplementedError("weak_ref_tensor is implemented only for CUDA and NPU.")


def weak_ref_tensors(
Collaborator


Contributor Author

@Vladimir221 Vladimir221 Dec 1, 2025


Yes, I can align it with this, but in that case we will get code duplication for the weak_ref_tensors function. Moreover, NPUPiecewiseBackend is based on CUDAPiecewiseBackend to avoid duplicating the backend class initialization, so if I import the CUDAPiecewiseBackend class in the npu_piecewise_backend.py file on a host machine that doesn't have the sgl_kernel package (only the sgl_kernel_npu package), an import error will occur. To unify it, I would need to remove the NPUPiecewiseBackend inheritance from CUDAPiecewiseBackend and duplicate the code from the CUDAPiecewiseBackend.__init__() method. If you think that is the more proper way, I can align the code with your suggestion

@ping1jing2
Collaborator

/tag-and-rerun-ci

@ping1jing2
Collaborator

/rerun-failed-ci

@ping1jing2
Collaborator

/rerun-failed-ci

@ispobock ispobock merged commit 27032ce into sgl-project:main Dec 11, 2025
307 of 331 checks passed
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026