[NPU] perf update with kvcache nz & w4a8 quant#14423
iforgetmyname merged 6 commits into sgl-project:main
Conversation
Summary of Changes: Hello @liupeng374, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request integrates support for an 'FIA NZ' optimization specifically tailored for Ascend NPUs. The primary goal is to enhance the performance and efficiency of attention operations by modifying KV cache management and Rotary Positional Embedding (RoPE) application within the DeepseekV2 attention mechanism. This is achieved through the introduction of a new cache mode and a refactored RoPE calculation that leverages precomputed values and a unified NPU kernel.
Code Review
This pull request adds support for the NZ format on Ascend NPUs, controlled by the SGLANG_USE_FIA_NZ environment variable. The changes primarily affect attention preprocessing and rotary embeddings to leverage NPU-specific optimizations. My review focuses on ensuring these changes are correctly implemented without introducing regressions. I've identified a critical issue in rotary_embedding.py that could break other models, a likely typo in mla_preprocess.py that could cause runtime errors, and some code duplication that could be refactored for better maintainability. Addressing these points will improve the robustness and quality of the code.
```diff
         )  # (B*S, N, 1, D)
-        cache_mode = "PA_BNSD"
+        cache_mode = ("PA_NZ" if _use_fia_nz else "PA_BNSD",)
```
The `cache_mode` is being assigned a tuple because of the trailing comma. This is likely a typo, as the `torch.ops.npu.npu_kv_rmsnorm_rope_cache` operator probably expects a string, as in other parts of the code. This could cause a runtime error or unexpected behavior.
```diff
-        cache_mode = ("PA_NZ" if _use_fia_nz else "PA_BNSD",)
+        cache_mode = "PA_NZ" if _use_fia_nz else "PA_BNSD"
```
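To make the pitfall concrete, a minimal standalone snippet (independent of the NPU operator) showing how the trailing comma changes the type:

```python
_use_fia_nz = True

# A trailing comma after a parenthesized expression builds a one-element tuple,
# not a string:
cache_mode = ("PA_NZ" if _use_fia_nz else "PA_BNSD",)
print(type(cache_mode).__name__)  # tuple

# Dropping the comma (the parentheses become unnecessary) yields the intended string:
cache_mode = "PA_NZ" if _use_fia_nz else "PA_BNSD"
print(cache_mode)  # PA_NZ
```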
```python
        emb = torch.cat((freqs, freqs), dim=-1)
        self.cos_cached_total = torch.cos(emb) * self.mscale
        self.sin_cached_total = torch.sin(emb) * self.mscale
        return cache

    def get_cos_cached_total(self):
        return self.cos_cached_total

    def get_sin_cached_total(self):
        return self.sin_cached_total

    def get_cos_sin_cache(
        self, positions, dtype, offsets: Optional[torch.Tensor] = None
    ):
        self.cos_cached = (
            self.cos_cached_total[
                torch.add(positions, offsets) if offsets is not None else positions
            ]
            .unsqueeze(-2)
            .unsqueeze(-2)
            .to(dtype)
        )
        self.sin_cached = (
            self.sin_cached_total[
                torch.add(positions, offsets) if offsets is not None else positions
            ]
            .unsqueeze(-2)
            .unsqueeze(-2)
            .to(dtype)
        )
        cos = self.cos_cached.to(positions.device)
        sin = self.sin_cached.to(positions.device)
        return cos, sin
```
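The lookup above is a table gather: each query position (optionally shifted by an offset) selects one precomputed row of cos/sin values. A dependency-free miniature sketch of that indexing pattern — the table sizes, frequencies, and names here are illustrative stand-ins, not the real cache:

```python
import math

# Hypothetical miniature stand-in for cos_cached_total: one row per position,
# with the frequency halves duplicated as in torch.cat((freqs, freqs), dim=-1).
dim, max_pos = 4, 8
inv_freq = [1.0 / (10000 ** (i / dim)) for i in range(0, dim, 2)]
cos_total = [[math.cos(p * f) for f in inv_freq] * 2 for p in range(max_pos)]

# Mirrors cos_cached_total[positions + offsets]: offsets shift each query
# position before the row gather.
positions = [0, 1, 2]
offsets = [3, 3, 3]
idx = [p + o for p, o in zip(positions, offsets)]
cos = [cos_total[i] for i in idx]
```

In the real method the gathered rows are then reshaped with two `unsqueeze(-2)` calls and cast to the target dtype before being handed to the NPU kernel.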
The new logic to compute cos_cached_total and sin_cached_total in _compute_cos_sin_cache, and the new get_cos_sin_cache method, depend on self.mscale. This attribute is not present in the base RotaryEmbedding class or some of its subclasses (e.g., LinearScalingRotaryEmbedding), which will cause an AttributeError for models using them.
This logic appears to be specific to DeepseekScalingRotaryEmbedding. To fix this and prevent breaking other models, please:
- Revert the changes in `RotaryEmbedding._compute_cos_sin_cache`.
- Move the logic for computing `self.cos_cached_total` and `self.sin_cached_total` into `DeepseekScalingRotaryEmbedding._compute_cos_sin_cache`.
- Move the new methods (`get_cos_cached_total`, `get_sin_cached_total`, and `get_cos_sin_cache`) from `RotaryEmbedding` to `DeepseekScalingRotaryEmbedding`.
Here's a suggested implementation for DeepseekScalingRotaryEmbedding._compute_cos_sin_cache:
```python
def _compute_cos_sin_cache(self) -> torch.Tensor:
    inv_freq = self._compute_inv_freq(self.scaling_factor)
    t = torch.arange(
        self.max_position_embeddings * self.scaling_factor,
        device=self.device,
        dtype=torch.float32,
    )
    freqs = torch.einsum("i,j -> ij", t, inv_freq)
    cos = freqs.cos() * self.mscale
    sin = freqs.sin() * self.mscale
    cache = torch.cat((cos, sin), dim=-1)
    emb = torch.cat((freqs, freqs), dim=-1)
    self.cos_cached_total = torch.cos(emb) * self.mscale
    self.sin_cached_total = torch.sin(emb) * self.mscale
    return cache
```

```python
from sglang.srt.hardware_backend.npu.utils import npu_format_cast
from sglang.srt.utils import get_bool_env_var

_use_fia_nz = get_bool_env_var("SGLANG_USE_FIA_NZ")
```
The _use_fia_nz flag is also defined in python/sglang/srt/hardware_backend/npu/modules/deepseek_v2_attention_mla_npu.py. To avoid code duplication and improve maintainability, consider defining this flag once in a shared utility module (e.g., sglang.srt.hardware_backend.npu.utils) and importing it where needed.
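One way to deduplicate is to evaluate the flag once in a shared module and import it everywhere else. A minimal sketch — the module path and the exact truthy-value set of `get_bool_env_var` are assumptions here, not the real sglang API:

```python
import os

# Sketch of a shared helper mirroring get_bool_env_var semantics
# (the accepted truthy values are an assumption, not sglang's actual set).
def get_bool_env_var(name: str, default: str = "false") -> bool:
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

# Evaluated once at import time; consumers would then do something like:
#   from sglang.srt.hardware_backend.npu.utils import USE_FIA_NZ
USE_FIA_NZ = get_bool_env_var("SGLANG_USE_FIA_NZ")
```

Centralizing the flag also means a single place to change if the variable is later renamed or given a default.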
```python
from sglang.srt.models.deepseek_v2 import DeepseekV2AttentionMLA
from sglang.srt.utils import BumpAllocator

_use_fia_nz = get_bool_env_var("SGLANG_USE_FIA_NZ")
```
Force-pushed 8fe44cc to 0f17578
/tag-and-rerun-ci
Force-pushed d415a2c to 869df3c

Force-pushed 0123a0a to 4a3e989
/rerun-failed-ci
…n_eagle3_npu

Merge 'main' of https://github.com/sgl-project/sglang (25 commits):
- [NPU] perf update with kvcache nz & w4a8 quant (sgl-project#14423)
- [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks (sgl-project#15027)
- Fix GLM-4.6 tool calls don't support streaming output for arguments i… (sgl-project#13989)
- feature: adding nightly wheel workflow and indexer (sgl-project#14924)
- [diffusion] feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization (sgl-project#14659)
- [Fix] Disable trtllm moe backend for draft model for a qucik fix (sgl-project#15002)
- [diffusion] fix: use NDRotaryEmbedding in flux_2 (sgl-project#15034)
- Mistral Large 3 NVFP4 support (sgl-project#14485)
- call check_quantized_moe_compatibility after initialize (sgl-project#13876)
- Add sgl_router_attempt_http_responses_total for single attempt information (sgl-project#15037)
- Add error code in prometheus metrics and add X-SMG-Error-Code header (sgl-project#15036)
- Provide more fine grained error reason for reqwest error (sgl-project#15032)
- Tiny change http router response format to unify (sgl-project#15031)
- Tiny unify grpc existing error responses into new format (sgl-project#15030)
- Add `code` field and unify error responses for router (sgl-project#15028)
- Super tiny remove unused log_request (sgl-project#15035)
- Fix decode OOM caused by retraction (sgl-project#14939)
- [CI]Add gb200 runner back (sgl-project#15024)
- Add a special label for b200 CI runner that can run kernel tests (sgl-project#15033)
- Fix regression caused by fa3 block_table (sgl-project#15009)
- ...

Conflicts: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Motivation
1. Use the NZ format for the KV cache; this accelerates the FIA operator.
2. Use per-channel quantization for MoE w4a8.
3. Accelerate MHA preprocessing during prefill with the npu_interleave_rope operator.
4. Bugfix for num_token_non_padded_cpu.
Modifications
Use `export SGLANG_USE_FIA_NZ=1` to enable FIA NZ. This feature must be turned on together with MLAPO via `export SGLANG_NPU_USE_MLAPO=1`.

Accuracy Tests
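For reference, a sketch of enabling the pair of flags before launching the server (the two variable names come from this PR; the launch command itself is omitted since it depends on your deployment):

```shell
# Enable the NZ-format KV cache path; per this PR it must be
# paired with the MLAPO flag.
export SGLANG_USE_FIA_NZ=1
export SGLANG_NPU_USE_MLAPO=1
```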
Benchmarking and Profiling
FIA NZ speeds up TPOT by about 2 ms, and the prefill optimization improves performance by more than 10%.
Checklist