[Ascend] Support of piecewise graph compilation for prefill on NPU #12287
ispobock merged 44 commits into sgl-project:main
Conversation
Added choosing of backend for PiecewiseCompileInterpreter based on platform
Added backend for piecewise prefill compilation for NPU
Added implementation of weak_ref_tensor for npu
Added choosing of weak_ref_tensor implementation based on platform
seq_lens_cpu and extend_seq_lens_cpu should be allocated on cpu
…king of piecewise_cuda_graph_compiler option, changed device arg to PrivateUse1 in direct_register_custom_op function if platform is NPU
Summary of Changes (Gemini Code Assist): This pull request extends piecewise graph compilation to NPU devices, with the core objective of accelerating the prefill phase of inference on NPU hardware, mirroring the performance gains observed on CUDA. This is achieved through NPU-specific backend logic, memory management using NPU-compatible weak tensor references, and explicit CPU placement for CPU-bound tensors. The changes also include a mechanism to enforce the 'eager' compilation backend for NPU, ensuring compatibility with current NPU capabilities.
Code Review
This pull request extends piecewise graph compilation support to NPU devices, mirroring the existing functionality for CUDA to improve prefill inference performance. The implementation introduces an NPU-specific backend and a JIT-compiled weak_ref_tensor operator. The changes are well-structured, but I've identified a couple of issues in the new NPU backend where CUDA-specific code was left behind, and a minor improvement for the JIT compilation script. Addressing these points will enhance the correctness and maintainability of the NPU support.
Apply comments
try:
    my_lib.define(op_name + schema_str)
-    my_lib.impl(op_name, op_func, "CUDA")
+    my_lib.impl(op_name, op_func, "CUDA" if not is_npu() else "PrivateUse1")
What is this PrivateUse1 used for?
PrivateUse1 is a PyTorch-provided reserved dispatch key for integrating a new backend that lives outside pytorch/pytorch and for dispatching PyTorch functionality to custom backend kernels. The backend for NPU operators is registered via this key (https://docs.pytorch.org/tutorials/advanced/privateuseone.html).
Hi @Vladimir221, will CUDA devices and NPU devices exist in the same node? If not, you can register for CUDA and NPU at the same time.
Do you suggest registering the implementations of custom op functions for both dispatch keys and removing the if statement?
my_lib.impl(op_name, op_func, "CUDA")
my_lib.impl(op_name, op_func, "PrivateUse1")
Do you suggest registering the implementations of custom op functions for both dispatch keys and removing the if statement?
From my view, yes. It might save the cost of an if-else check.
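The double-registration suggested above can be sketched with `torch.library`: one Python implementation is registered under both the `CUDA` and `PrivateUse1` dispatch keys, instead of branching on `is_npu()` at registration time. The namespace, op name, and trivial op body here are hypothetical, and a `CPU` registration is added only so the sketch runs on any machine.

```python
import torch

# Hypothetical namespace and op; "DEF" creates a new operator namespace.
lib = torch.library.Library("demo_ns", "DEF")
lib.define("identity_demo(Tensor x) -> Tensor")

def identity_demo_impl(x: torch.Tensor) -> torch.Tensor:
    # Trivial body for illustration only.
    return x.clone()

# Register the same kernel for CUDA and for PrivateUse1 (the reserved key
# used by out-of-tree backends such as NPU), avoiding a platform branch.
# The CPU registration is only here to make the sketch runnable anywhere.
for key in ("CUDA", "PrivateUse1", "CPU"):
    lib.impl("identity_demo", identity_demo_impl, key)

t = torch.arange(4)
out = torch.ops.demo_ns.identity_demo(t)
```

Registering a kernel for a dispatch key succeeds even when no device of that kind is present; dispatch only fails if the op is actually called with a tensor on an unregistered backend.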
@Vladimir221 Thanks for your contribution. LGTM. I am wondering whether we should add a related unit test?
Update imports according to new files hierarchy
Added test for prefill piecewise graph compilation on NPU
@Oasis-Git Added a new test into the ascend directory
/tag-and-rerun-ci
Updated run_bench_one_batch function to make it more universal
raise NotImplementedError("weak_ref_tensor is implemented only for CUDA and NPU.")

def weak_ref_tensors(
Yes, I can align it with this, but in that case we get code duplication for the weak_ref_tensors function. Moreover, NPUPiecewiseBackend inherits from CUDAPiecewiseBackend to avoid duplicating the backend class initialization, so the CUDAPiecewiseBackend class is imported in the npu_piecewise_backend.py file; if the host machine doesn't have the sgl_kernel package (only the sgl_kernel_npu package), an import error will occur. To unify this I would need to remove the NPUPiecewiseBackend inheritance from CUDAPiecewiseBackend and duplicate the code from CUDAPiecewiseBackend.__init__(). If you think that is the more proper way, I can align the code with your suggestion.
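The shared-helper shape discussed in the reply above can be sketched as follows. Everything here is illustrative: the real per-tensor `weak_ref_tensor` kernels live in sgl_kernel / sgl_kernel_npu and alias the raw data pointer without holding a strong reference to the storage, which a pure-Python stand-in cannot replicate, so a plain view is used instead.

```python
from typing import Any

import torch

def weak_ref_tensor_fallback(t: torch.Tensor) -> torch.Tensor:
    # Stand-in for the platform kernel: a view shares storage with the
    # original tensor. The real op would alias the data pointer without
    # keeping the storage alive.
    return t.view(t.shape)

def weak_ref_tensors(obj: Any) -> Any:
    # Shared container-walking helper: recurse through lists/tuples and
    # weak-ref each tensor, so only the per-tensor op needs a
    # platform-specific implementation.
    if isinstance(obj, torch.Tensor):
        return weak_ref_tensor_fallback(obj)
    if isinstance(obj, (list, tuple)):
        return type(obj)(weak_ref_tensors(x) for x in obj)
    return obj

t = torch.zeros(3)
refs = weak_ref_tensors([t, (t, t)])
```

With this split, the container logic is written once and only `weak_ref_tensor_fallback` would be swapped per platform, which is the duplication trade-off the reply describes.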
/tag-and-rerun-ci

/rerun-failed-ci

/rerun-failed-ci
…l-project#12287) Co-authored-by: ronnie_zheng <zl19940307@163.com>
Motivation
Compilation of the model forward pass at prefill speeds up inference time, as already shown in PR #10062, which enabled this feature for CUDA devices; the current PR enables the feature for NPU devices.
Modifications
Added:
- Choosing of backend for PiecewiseCompileInterpreter based on platform
- Backend for piecewise prefill compilation for NPU
- Implementation of weak_ref_tensor for NPU
- Choosing of weak_ref_tensor implementation based on platform
- Checking of the piecewise_cuda_graph_compiler option

Changed:
- Device arg to PrivateUse1 in the direct_register_custom_op function if the platform is NPU
- seq_lens_cpu, extend_seq_lens_cpu, extend_prefix_lens_cpu, extend_logprob_start_lens_cpu should be allocated on CPU, so changed the device type for these tensors in the warmup_and_capture and capture_one_batch_size methods of the PiecewiseCudaGraphRunner class
- _cache_loc_dtype method of the PiecewiseCudaGraphRunner class: NPU supports the int32 type for out_cache_loc

Accuracy Tests
GSM 8K Llama-3.1-8B
Ascend 910B, tp-size=1, concurrency=128
Benchmarking and Profiling
Profiling
Checklist