CANN: Add support for Qwen35 ops #21204
Conversation
Thanks for your contribution! Overall it looks fine, but please make sure all precision tests pass.

Additionally, it would be better to optimize one operator at a time.

Thank you for your review. The reason is that the CANN backend lacked support for several operators needed by Qwen3.5. To ensure high-performance inference for Qwen3.5, all the missing and optimizable operators need to be added.
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on the target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus

Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions are now checked inside each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace the 5-kernel sequence (LogSoftmax→Mul→ReduceSum×2→Muls) with a single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on the norm result (was clamping the wrong tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate the per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor the helper into a gather_batched lambda with the batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor the helper into a scatter_batched lambda with the batch loop inlined
- OUT_PROD: replace the O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with a per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)

Bug fixes:
- COUNT_EQUAL: use a non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore the node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679); a minimal illustration follows after this list
- graph cache op_params matching: compare the full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache
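To make the graph-cache point concrete, here is a minimal, self-contained C++ illustration (hypothetical struct layout and simplified fields, not the actual llama.cpp code) of why the cache key must include tensor types: F16 and BF16 share the same element size, so shape and stride comparisons alone cannot distinguish them.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical, simplified stand-in for the cached node properties; the real
// ggml_graph_node_properties in ggml-cann.cpp has more fields.
struct node_props {
    int64_t ne[4];       // logical dimensions
    size_t  nb[4];       // byte strides; nb[0] == 2 for both F16 and BF16
    int     node_type;   // e.g. GGML_TYPE_F16 vs GGML_TYPE_BF16
    int     src_type[2]; // types of the source tensors (simplified to two here)
};

// Without the type comparisons, an F16 graph and a BF16 graph with identical
// shapes and strides would be treated as the same cached graph.
static bool has_matching_properties(const node_props & a, const node_props & b) {
    for (int i = 0; i < 4; ++i) {
        if (a.ne[i] != b.ne[i] || a.nb[i] != b.nb[i]) {
            return false;
        }
    }
    if (a.node_type != b.node_type) {
        return false;
    }
    for (int i = 0; i < 2; ++i) {
        if (a.src_type[i] != b.src_type[i]) {
            return false;
        }
    }
    return true;
}
```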
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds missing operator support and targeted performance improvements for the CANN (Ascend NPU) backend to better run Qwen 3.5 graphs while fixing several correctness issues.
Changes:
- Implements additional GGML ops/unary ops (e.g., SET, CUMSUM, TRI, DIAG, SOLVE_TRI, SOFTPLUS) and wires them into dispatch/support checks.
- Introduces fused/optimized kernels for GLU variants and cross-entropy; refactors GET_ROWS/SET_ROWS and other hot paths.
- Fixes correctness issues in cache matching and COUNT_EQUAL, and adds device-side tensor memset.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ggml/src/ggml-cann/ggml-cann.cpp | Hooks new ops into compute dispatch and backend capability checks; adds buffer-level memset_tensor. |
| ggml/src/ggml-cann/aclnn_ops.h | Adds ACLNN includes and declares new CANN op entry points. |
| ggml/src/ggml-cann/aclnn_ops.cpp | Implements new ops (SwiGLU/GeGLU/SET/CUMSUM/TRI/DIAG/SOLVE_TRI/SOFTPLUS) and multiple performance/correctness refactors. |
```cpp
// Fused GeGLU using aclnnGeGluV3: splits input along ne[0] (CANN last dim),
// activates the LEFT half with GELU, multiplies by right half.
// approximate: 0=tanh, 1=none(erf). activateLeft=true matches GGML convention.
```
The approximate comment is internally inconsistent (“1=none(erf)”) and doesn’t match the call sites in ggml-cann.cpp which document approximate=1 → erf. Please update the comment to clearly reflect the actual mapping expected by aclnnGeGluV3 (e.g., “0=tanh, 1=erf/exact”) to avoid future misuse.
Suggested change:
```diff
-// approximate: 0=tanh, 1=none(erf). activateLeft=true matches GGML convention.
+// approximate: 0=tanh, 1=erf/exact. activateLeft=true matches GGML convention.
```
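For reference, a hedged sketch (a hypothetical helper, not the actual call-site code in ggml-cann.cpp) of the mapping the corrected comment describes, using ggml's ggml_glu_op enum:

```cpp
#include "ggml.h"

// Hypothetical helper: translate the GGML GLU variant into the 'approximate'
// argument expected by aclnnGeGluV3 (0 = tanh approximation, 1 = exact/erf GELU).
static int64_t geglu_approximate_mode(enum ggml_glu_op op) {
    switch (op) {
        case GGML_GLU_OP_GEGLU:     return 0; // tanh-approximated GELU
        case GGML_GLU_OP_GEGLU_ERF: return 1; // exact (erf-based) GELU
        default: GGML_ABORT("GLU op not handled by the fused GeGLU path");
    }
}
```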
```cpp
// src0 GGML: [2*ne0, ne1, ne2, ne3] → 3D view [2*ne0, ne1, ne2*ne3]
// CANN reversed: [ne2*ne3, ne1, 2*ne0], split along CANN dim 2 (last).
int64_t ne0_x2 = src0->ne[0];
int64_t ne1 = src0->ne[1];
int64_t ne23 = src0->ne[2] * src0->ne[3];
int64_t src3d_ne[] = { ne0_x2, ne1, ne23 };
size_t src3d_nb[] = { (size_t)src0->nb[0], (size_t)src0->nb[1], (size_t)src0->nb[2] };
acl_tensor_ptr acl_src = ggml_cann_create_tensor(src0->data, ggml_cann_type_mapping(src0->type),
                                                 elem_size, src3d_ne, src3d_nb, 3);
```
ne2 and ne3 are collapsed into a single ne23 dimension, but the 3D view uses src0->nb[2] as the stride for that collapsed dimension without asserting that dim3 is contiguous with dim2 (i.e., src0->nb[3] == src0->nb[2] * src0->ne[2]). If src0 is not contiguous across dims 2/3, this view will read incorrect data. Suggest adding contiguity assertions (similar to pad_reflect_1d) for both src0 and dst, or falling back to a path that operates on per-(i2,i3) slices (or a contiguous copy) when the layout isn’t compatible.
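A minimal sketch of the suggested guard (an assumed helper, not code from the PR), using only ggml's public tensor fields:

```cpp
#include "ggml.h"

// The collapsed [ne0, ne1, ne2*ne3] view is only valid when dim 3 is tightly
// packed against dim 2, i.e. nb[3] == nb[2] * ne[2].
static bool dims_2_3_collapsible(const struct ggml_tensor * t) {
    return t->nb[3] == t->nb[2] * (size_t) t->ne[2];
}

// Usage at the top of the fused GLU path (sketch):
//   GGML_ASSERT(dims_2_3_collapsible(src0));
//   GGML_ASSERT(dims_2_3_collapsible(dst));
```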
```cpp
size_t nb1 = ((int32_t *) dst->op_params)[0];
size_t nb2 = ((int32_t *) dst->op_params)[1];
size_t nb3 = ((int32_t *) dst->op_params)[2];
size_t offset = ((int32_t *) dst->op_params)[3];
bool inplace = (bool) ((int32_t *) dst->op_params)[4];

size_t param_nb[] = { ggml_element_size(src0), nb1, nb2, nb3 };

// Create a view of dst at the target offset with src1's dimensions
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst, src1->ne, param_nb, GGML_MAX_DIMS, ACL_FORMAT_ND, offset);
```
This reads byte strides/offset from dst->op_params as int32_t and stores them in size_t. If the op params represent byte sizes (which can exceed 2 GiB on large tensors), the int32_t read can truncate/overflow and produce an invalid view. Also, param_nb[0] is derived from src0 element size even though the view is on dst’s buffer; this can encode incorrect strides if types ever differ. Recommended: parse these fields using the actual width used by ggml for this op’s params (e.g., int64_t or size_t if available), and prefer ggml_element_size(dst) (or validate type equality) for param_nb[0].
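A hedged sketch of the second point (deriving the innermost stride from dst and making the implicit type assumption explicit); this is illustrative only, not the PR's code:

```cpp
#include "ggml.h"

// Illustrative only: build the GGML_OP_SET view strides from dst rather than
// src0, and fail loudly if the types ever diverge (so element sizes match).
static void set_view_strides(const struct ggml_tensor * src0,
                             const struct ggml_tensor * dst,
                             size_t nb1, size_t nb2, size_t nb3,
                             size_t param_nb[4]) {
    GGML_ASSERT(src0->type == dst->type); // strides below are in dst's element size
    param_nb[0] = ggml_element_size(dst);
    param_nb[1] = nb1;
    param_nb[2] = nb2;
    param_nb[3] = nb3;
}
```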
```cpp
// Copy src vector onto the diagonal of dst via strided views.
// src viewed as [N, n_batch], contiguous strides.
int64_t ne_vec[2] = { N, n_batch };
size_t nb_src_vec[2] = { nb_f32, N * nb_f32 };
// dst diagonal view: stride (N+1)*4 steps along the diagonal.
size_t nb_dst_diag[2] = { (N + 1) * nb_f32, N * N * nb_f32 };

acl_tensor_ptr acl_src_vec = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne_vec, nb_src_vec, 2);
acl_tensor_ptr acl_dst_diag = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne_vec, nb_dst_diag, 2);
```
ggml_cann_diag hard-codes contiguous strides for src (nb_src_vec) and assumes a specific dense layout for dst’s diagonal view. If src is not contiguous (or dst has non-standard strides/views), these views will be incorrect and can write/read wrong elements. Suggest either (a) asserting the required layout (e.g., ggml_is_contiguous(src) and appropriate dst->nb[] relationships / dst->nb[3] == dst->nb[2] * dst->ne[2] when collapsing batches), or (b) constructing the views from the actual src->nb[] / dst->nb[] with the correct diagonal stride based on the real layout.
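As a sketch of option (b), the diagonal view can be derived from dst's real strides instead of a hard-coded dense layout (an assumed helper, not the PR's code):

```cpp
#include "ggml.h"

// Element i of the diagonal of matrix b lives at dst[i, i, b], so stepping along
// the diagonal advances by nb[0] + nb[1] bytes; stepping between matrices in the
// collapsed batch advances by nb[2] (valid only when nb[3] == nb[2] * ne[2]).
static void diag_view_strides(const struct ggml_tensor * dst, size_t nb_out[2]) {
    GGML_ASSERT(dst->nb[3] == dst->nb[2] * (size_t) dst->ne[2]); // batch dims tightly packed
    nb_out[0] = dst->nb[0] + dst->nb[1]; // step along the diagonal
    nb_out[1] = dst->nb[2];              // step between matrices in the batch
}
```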
```cpp
const int64_t S = src->ne[0];
const int64_t n_batch = src->ne[2] * src->ne[3];
const size_t nb_f32 = sizeof(float);

int64_t ne3d[3] = { S, S, n_batch };
size_t nb3d[3] = { nb_f32, S * nb_f32, S * S * nb_f32 };

const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);

acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
```
ggml_cann_tri constructs a collapsed 3D view with fully hard-coded contiguous strides (nb3d). This will be wrong if src/dst aren’t dense contiguous (or if batch dims aren’t laid out contiguously such that nb[3] == nb[2] * ne[2]). Since this function operates directly on raw pointers rather than ggml_cann_create_tensor(src) / (...dst), it should either enforce the required contiguity assumptions via GGML_ASSERT(...) (including batch-collapse contiguity) or build ne/nb from the actual tensor strides and only collapse when it’s valid.
Suggested change:
```diff
-const int64_t S = src->ne[0];
-const int64_t n_batch = src->ne[2] * src->ne[3];
-const size_t nb_f32 = sizeof(float);
-int64_t ne3d[3] = { S, S, n_batch };
-size_t nb3d[3] = { nb_f32, S * nb_f32, S * S * nb_f32 };
-const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);
-acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
-acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
+const size_t nb_f32 = sizeof(float);
+GGML_ASSERT(src->type == GGML_TYPE_F32);
+GGML_ASSERT(dst->type == GGML_TYPE_F32);
+GGML_ASSERT(src->ne[0] == src->ne[1]);
+GGML_ASSERT(dst->ne[0] == dst->ne[1]);
+GGML_ASSERT(src->ne[0] == dst->ne[0]);
+GGML_ASSERT(src->ne[1] == dst->ne[1]);
+GGML_ASSERT(src->ne[2] == dst->ne[2]);
+GGML_ASSERT(src->ne[3] == dst->ne[3]);
+// This path collapses the last two dimensions into a single batch dimension.
+// That is only valid for dense contiguous tensors with tightly packed batch
+// dimensions.
+GGML_ASSERT(src->nb[0] == nb_f32);
+GGML_ASSERT(src->nb[1] == (size_t) src->ne[0] * src->nb[0]);
+GGML_ASSERT(src->nb[2] == (size_t) src->ne[1] * src->nb[1]);
+GGML_ASSERT(src->nb[3] == (size_t) src->ne[2] * src->nb[2]);
+GGML_ASSERT(dst->nb[0] == nb_f32);
+GGML_ASSERT(dst->nb[1] == (size_t) dst->ne[0] * dst->nb[0]);
+GGML_ASSERT(dst->nb[2] == (size_t) dst->ne[1] * dst->nb[1]);
+GGML_ASSERT(dst->nb[3] == (size_t) dst->ne[2] * dst->nb[2]);
+const int64_t S = src->ne[0];
+const int64_t n_batch = src->ne[2] * src->ne[3];
+int64_t ne3d[3] = { S, S, n_batch };
+size_t src_nb3d[3] = { src->nb[0], src->nb[1], src->nb[2] };
+size_t dst_nb3d[3] = { dst->nb[0], dst->nb[1], dst->nb[2] };
+const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);
+acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, src_nb3d, 3);
+acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, dst_nb3d, 3);
```
```cpp
#include "ggml-impl.h"
#include "ggml.h"


```
There is an extra blank line between includes (line 28). Consider removing it to keep include blocks consistent.
Yes, you're right. But I didn't find a way to manage the cache key. Do you know of any llama.cpp API for managing cache keys? I thought it was an interface available only in CANN.
@KokerZhou Sorry, I misunderstood at first. The revised version, using multi-threaded inference (one thread per device), involves many modifications; I think this feature can be considered for later implementation. The best approach would be to add device information to the cache key, but I haven't found a suitable interface for that.
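For illustration only, one shape that idea could take (a hypothetical, locally defined key; there is no existing llama.cpp interface for this): fold the current device id into whatever key the ACL graph cache matches on.

```cpp
#include <cstdint>

// Hypothetical cache key sketch; the field names and hashing scheme are assumptions.
struct cann_graph_cache_key {
    int32_t  device_id;   // e.g. obtained via aclrtGetDevice at graph-capture time
    uint64_t props_hash;  // hash over node shapes, strides, types, and op params

    bool operator==(const cann_graph_cache_key & other) const {
        return device_id == other.device_id && props_hash == other.props_hash;
    }
};
```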
@ggerganov Friendly ping! Looking forward to any feedback when you have a chance. Thanks!
@hipudding The multithread issue has been fixed by an upstream commit; no further action is required.

Overview
This PR adds support for several missing operators in the CANN (Ascend NPU) backend for Qwen3.5, together with targeted optimizations and bug fixes.
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on the target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus

Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions are now checked inside each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace the 5-kernel sequence (LogSoftmax→Mul→ReduceSum×2→Muls) with a single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on the norm result (was clamping the wrong tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate the per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor the helper into a gather_batched lambda with the batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor the helper into a scatter_batched lambda with the batch loop inlined
- OUT_PROD: replace the O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with a per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)

Bug fixes:
- COUNT_EQUAL: use a non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore the node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679)
- graph cache op_params matching: compare the full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache
Requirements
YES — AI tools (Claude code) were used in an assistive capacity only. Specifically, AI was used to analyze problems, suggest implementation approaches, and provide explanations of relevant CANN/ACL APIs. All code was written, reviewed line-by-line, and validated by the human contributor, who takes full responsibility for the correctness and design of the changes.
Known issue: due to kernel caching, in models like Qwen3.5 where the layers are not all identical, a "kernel cache not found" error occurs during multi-device inference. Workaround: disable kernel caching by setting the environment variable ACLNN_CACHE_LIMIT=0.