CANN: Add support for Qwen35 ops #21204
Conversation
Thanks for your contribution! Overall it looks fine, but please make sure all precision tests pass.

Additionally, it would be better to optimize one operator at a time.

Thank you for your review. The reason is that the CANN backend lacked support for several operators needed by Qwen3.5. To ensure high-performance inference for Qwen3.5, all the missing and optimizable operators need to be added.
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on the target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus

Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions are now checked inside each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace the 5-kernel sequence (LogSoftmax→Mul→ReduceSum×2→Muls) with a single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on the norm result (was clamping the wrong tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate the per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor the helper into a gather_batched lambda with the batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor the helper into a scatter_batched lambda with the batch loop inlined
- OUT_PROD: replace the O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with a per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)

Bug fixes:
- COUNT_EQUAL: use a non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore the node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679); a minimal illustration follows after this list
- graph cache op_params matching: compare the full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache
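To make the graph-cache point concrete, here is a minimal, self-contained C++ illustration (hypothetical struct layout and simplified fields, not the actual llama.cpp code) of why the cache key must include tensor types: F16 and BF16 share the same element size, so shape and stride comparisons alone cannot distinguish them.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical, simplified stand-in for the cached node properties; the real
// ggml_graph_node_properties in ggml-cann.cpp has more fields.
struct node_props {
    int64_t ne[4];       // logical dimensions
    size_t  nb[4];       // byte strides; nb[0] == 2 for both F16 and BF16
    int     node_type;   // e.g. GGML_TYPE_F16 vs GGML_TYPE_BF16
    int     src_type[2]; // types of the source tensors (simplified to two here)
};

// Without the type comparisons, an F16 graph and a BF16 graph with identical
// shapes and strides would be treated as the same cached graph.
static bool has_matching_properties(const node_props & a, const node_props & b) {
    for (int i = 0; i < 4; ++i) {
        if (a.ne[i] != b.ne[i] || a.nb[i] != b.nb[i]) {
            return false;
        }
    }
    if (a.node_type != b.node_type) {
        return false;
    }
    for (int i = 0; i < 2; ++i) {
        if (a.src_type[i] != b.src_type[i]) {
            return false;
        }
    }
    return true;
}
```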
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds missing operator support and targeted performance improvements for the CANN (Ascend NPU) backend to better run Qwen 3.5 graphs while fixing several correctness issues.
Changes:
- Implements additional GGML ops/unary ops (e.g., SET, CUMSUM, TRI, DIAG, SOLVE_TRI, SOFTPLUS) and wires them into dispatch/support checks.
- Introduces fused/optimized kernels for GLU variants and cross-entropy; refactors GET_ROWS/SET_ROWS and other hot paths.
- Fixes correctness issues in cache matching and COUNT_EQUAL, and adds device-side tensor memset.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ggml/src/ggml-cann/ggml-cann.cpp | Hooks new ops into compute dispatch and backend capability checks; adds buffer-level memset_tensor. |
| ggml/src/ggml-cann/aclnn_ops.h | Adds ACLNN includes and declares new CANN op entry points. |
| ggml/src/ggml-cann/aclnn_ops.cpp | Implements new ops (SwiGLU/GeGLU/SET/CUMSUM/TRI/DIAG/SOLVE_TRI/SOFTPLUS) and multiple performance/correctness refactors. |
```cpp
// Fused GeGLU using aclnnGeGluV3: splits input along ne[0] (CANN last dim),
// activates the LEFT half with GELU, multiplies by right half.
// approximate: 0=tanh, 1=none(erf). activateLeft=true matches GGML convention.
```
The approximate comment is internally inconsistent (“1=none(erf)”) and doesn’t match the call sites in ggml-cann.cpp which document approximate=1 → erf. Please update the comment to clearly reflect the actual mapping expected by aclnnGeGluV3 (e.g., “0=tanh, 1=erf/exact”) to avoid future misuse.
Suggested change:
```diff
-// approximate: 0=tanh, 1=none(erf). activateLeft=true matches GGML convention.
+// approximate: 0=tanh, 1=erf/exact. activateLeft=true matches GGML convention.
```
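For reference, a hedged sketch (a hypothetical helper, not the actual call-site code in ggml-cann.cpp) of the mapping the corrected comment describes, using ggml's ggml_glu_op enum:

```cpp
#include "ggml.h"

// Hypothetical helper: translate the GGML GLU variant into the 'approximate'
// argument expected by aclnnGeGluV3 (0 = tanh approximation, 1 = exact/erf GELU).
static int64_t geglu_approximate_mode(enum ggml_glu_op op) {
    switch (op) {
        case GGML_GLU_OP_GEGLU:     return 0; // tanh-approximated GELU
        case GGML_GLU_OP_GEGLU_ERF: return 1; // exact (erf-based) GELU
        default: GGML_ABORT("GLU op not handled by the fused GeGLU path");
    }
}
```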
```cpp
// src0 GGML: [2*ne0, ne1, ne2, ne3] → 3D view [2*ne0, ne1, ne2*ne3]
// CANN reversed: [ne2*ne3, ne1, 2*ne0], split along CANN dim 2 (last).
int64_t ne0_x2 = src0->ne[0];
int64_t ne1 = src0->ne[1];
int64_t ne23 = src0->ne[2] * src0->ne[3];
int64_t src3d_ne[] = { ne0_x2, ne1, ne23 };
size_t src3d_nb[] = { (size_t)src0->nb[0], (size_t)src0->nb[1], (size_t)src0->nb[2] };
acl_tensor_ptr acl_src = ggml_cann_create_tensor(src0->data, ggml_cann_type_mapping(src0->type),
                                                 elem_size, src3d_ne, src3d_nb, 3);
```
ne2 and ne3 are collapsed into a single ne23 dimension, but the 3D view uses src0->nb[2] as the stride for that collapsed dimension without asserting that dim3 is contiguous with dim2 (i.e., src0->nb[3] == src0->nb[2] * src0->ne[2]). If src0 is not contiguous across dims 2/3, this view will read incorrect data. Suggest adding contiguity assertions (similar to pad_reflect_1d) for both src0 and dst, or falling back to a path that operates on per-(i2,i3) slices (or a contiguous copy) when the layout isn’t compatible.
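A minimal sketch of the suggested guard (an assumed helper, not code from the PR), using only ggml's public tensor fields:

```cpp
#include "ggml.h"

// The collapsed [ne0, ne1, ne2*ne3] view is only valid when dim 3 is tightly
// packed against dim 2, i.e. nb[3] == nb[2] * ne[2].
static bool dims_2_3_collapsible(const struct ggml_tensor * t) {
    return t->nb[3] == t->nb[2] * (size_t) t->ne[2];
}

// Usage at the top of the fused GLU path (sketch):
//   GGML_ASSERT(dims_2_3_collapsible(src0));
//   GGML_ASSERT(dims_2_3_collapsible(dst));
```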
```cpp
size_t nb1 = ((int32_t *) dst->op_params)[0];
size_t nb2 = ((int32_t *) dst->op_params)[1];
size_t nb3 = ((int32_t *) dst->op_params)[2];
size_t offset = ((int32_t *) dst->op_params)[3];
bool inplace = (bool) ((int32_t *) dst->op_params)[4];

size_t param_nb[] = { ggml_element_size(src0), nb1, nb2, nb3 };

// Create a view of dst at the target offset with src1's dimensions
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst, src1->ne, param_nb, GGML_MAX_DIMS, ACL_FORMAT_ND, offset);
```
This reads byte strides/offset from dst->op_params as int32_t and stores them in size_t. If the op params represent byte sizes (which can exceed 2 GiB on large tensors), the int32_t read can truncate/overflow and produce an invalid view. Also, param_nb[0] is derived from src0 element size even though the view is on dst’s buffer; this can encode incorrect strides if types ever differ. Recommended: parse these fields using the actual width used by ggml for this op’s params (e.g., int64_t or size_t if available), and prefer ggml_element_size(dst) (or validate type equality) for param_nb[0].
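A hedged sketch of the second point (deriving the innermost stride from dst and making the implicit type assumption explicit); this is illustrative only, not the PR's code:

```cpp
#include "ggml.h"

// Illustrative only: build the GGML_OP_SET view strides from dst rather than
// src0, and fail loudly if the types ever diverge (so element sizes match).
static void set_view_strides(const struct ggml_tensor * src0,
                             const struct ggml_tensor * dst,
                             size_t nb1, size_t nb2, size_t nb3,
                             size_t param_nb[4]) {
    GGML_ASSERT(src0->type == dst->type); // strides below are in dst's element size
    param_nb[0] = ggml_element_size(dst);
    param_nb[1] = nb1;
    param_nb[2] = nb2;
    param_nb[3] = nb3;
}
```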
```cpp
// Copy src vector onto the diagonal of dst via strided views.
// src viewed as [N, n_batch], contiguous strides.
int64_t ne_vec[2] = { N, n_batch };
size_t nb_src_vec[2] = { nb_f32, N * nb_f32 };
// dst diagonal view: stride (N+1)*4 steps along the diagonal.
size_t nb_dst_diag[2] = { (N + 1) * nb_f32, N * N * nb_f32 };

acl_tensor_ptr acl_src_vec = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne_vec, nb_src_vec, 2);
acl_tensor_ptr acl_dst_diag = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne_vec, nb_dst_diag, 2);
```
ggml_cann_diag hard-codes contiguous strides for src (nb_src_vec) and assumes a specific dense layout for dst’s diagonal view. If src is not contiguous (or dst has non-standard strides/views), these views will be incorrect and can write/read wrong elements. Suggest either (a) asserting the required layout (e.g., ggml_is_contiguous(src) and appropriate dst->nb[] relationships / dst->nb[3] == dst->nb[2] * dst->ne[2] when collapsing batches), or (b) constructing the views from the actual src->nb[] / dst->nb[] with the correct diagonal stride based on the real layout.
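As a sketch of option (b), the diagonal view can be derived from dst's real strides instead of a hard-coded dense layout (an assumed helper, not the PR's code):

```cpp
#include "ggml.h"

// Element i of the diagonal of matrix b lives at dst[i, i, b], so stepping along
// the diagonal advances by nb[0] + nb[1] bytes; stepping between matrices in the
// collapsed batch advances by nb[2] (valid only when nb[3] == nb[2] * ne[2]).
static void diag_view_strides(const struct ggml_tensor * dst, size_t nb_out[2]) {
    GGML_ASSERT(dst->nb[3] == dst->nb[2] * (size_t) dst->ne[2]); // batch dims tightly packed
    nb_out[0] = dst->nb[0] + dst->nb[1]; // step along the diagonal
    nb_out[1] = dst->nb[2];              // step between matrices in the batch
}
```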
```cpp
const int64_t S = src->ne[0];
const int64_t n_batch = src->ne[2] * src->ne[3];
const size_t nb_f32 = sizeof(float);

int64_t ne3d[3] = { S, S, n_batch };
size_t nb3d[3] = { nb_f32, S * nb_f32, S * S * nb_f32 };

const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);

acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
```
ggml_cann_tri constructs a collapsed 3D view with fully hard-coded contiguous strides (nb3d). This will be wrong if src/dst aren’t dense contiguous (or if batch dims aren’t laid out contiguously such that nb[3] == nb[2] * ne[2]). Since this function operates directly on raw pointers rather than ggml_cann_create_tensor(src) / (...dst), it should either enforce the required contiguity assumptions via GGML_ASSERT(...) (including batch-collapse contiguity) or build ne/nb from the actual tensor strides and only collapse when it’s valid.
Suggested change:
```diff
-const int64_t S = src->ne[0];
-const int64_t n_batch = src->ne[2] * src->ne[3];
-const size_t nb_f32 = sizeof(float);
-int64_t ne3d[3] = { S, S, n_batch };
-size_t nb3d[3] = { nb_f32, S * nb_f32, S * S * nb_f32 };
-const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);
-acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
-acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
+const size_t nb_f32 = sizeof(float);
+GGML_ASSERT(src->type == GGML_TYPE_F32);
+GGML_ASSERT(dst->type == GGML_TYPE_F32);
+GGML_ASSERT(src->ne[0] == src->ne[1]);
+GGML_ASSERT(dst->ne[0] == dst->ne[1]);
+GGML_ASSERT(src->ne[0] == dst->ne[0]);
+GGML_ASSERT(src->ne[1] == dst->ne[1]);
+GGML_ASSERT(src->ne[2] == dst->ne[2]);
+GGML_ASSERT(src->ne[3] == dst->ne[3]);
+// This path collapses the last two dimensions into a single batch dimension.
+// That is only valid for dense contiguous tensors with tightly packed batch
+// dimensions.
+GGML_ASSERT(src->nb[0] == nb_f32);
+GGML_ASSERT(src->nb[1] == (size_t) src->ne[0] * src->nb[0]);
+GGML_ASSERT(src->nb[2] == (size_t) src->ne[1] * src->nb[1]);
+GGML_ASSERT(src->nb[3] == (size_t) src->ne[2] * src->nb[2]);
+GGML_ASSERT(dst->nb[0] == nb_f32);
+GGML_ASSERT(dst->nb[1] == (size_t) dst->ne[0] * dst->nb[0]);
+GGML_ASSERT(dst->nb[2] == (size_t) dst->ne[1] * dst->nb[1]);
+GGML_ASSERT(dst->nb[3] == (size_t) dst->ne[2] * dst->nb[2]);
+const int64_t S = src->ne[0];
+const int64_t n_batch = src->ne[2] * src->ne[3];
+int64_t ne3d[3] = { S, S, n_batch };
+size_t src_nb3d[3] = { src->nb[0], src->nb[1], src->nb[2] };
+size_t dst_nb3d[3] = { dst->nb[0], dst->nb[1], dst->nb[2] };
+const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);
+acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, src_nb3d, 3);
+acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, dst_nb3d, 3);
```
```cpp
#include "ggml-impl.h"
#include "ggml.h"


```
There is an extra blank line between includes (line 28). Consider removing it to keep include blocks consistent.
Yes, you're right. But I didn't find a way to manage the cache key. Do you know of any llama.cpp API for managing cache keys? I thought it was an interface available only in CANN.
@KokerZhou Sorry, I misunderstood at first. The revised version, using multi-threaded inference (one thread per device), involves many modifications; I think this feature can be considered for later implementation. The best approach would be to add device information to the cache key, but I haven't found a suitable interface for that.
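For illustration only, one shape that idea could take (a hypothetical, locally defined key; there is no existing llama.cpp interface for this): fold the current device id into whatever key the ACL graph cache matches on.

```cpp
#include <cstdint>

// Hypothetical cache key sketch; the field names and hashing scheme are assumptions.
struct cann_graph_cache_key {
    int32_t  device_id;   // e.g. obtained via aclrtGetDevice at graph-capture time
    uint64_t props_hash;  // hash over node shapes, strides, types, and op params

    bool operator==(const cann_graph_cache_key & other) const {
        return device_id == other.device_id && props_hash == other.props_hash;
    }
};
```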
@ggerganov Friendly ping! Looking forward to any feedback when you have a chance. Thanks!
@hipudding The multithread issue has been fixed by an upstream commit; no further action is required.

Overview
This PR adds support for several missing operators in the CANN (Ascend NPU) backend for Qwen3.5, together with targeted optimizations and bug fixes.
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on the target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus

Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions are now checked inside each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace the 5-kernel sequence (LogSoftmax→Mul→ReduceSum×2→Muls) with a single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on the norm result (was clamping the wrong tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate the per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor the helper into a gather_batched lambda with the batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor the helper into a scatter_batched lambda with the batch loop inlined
- OUT_PROD: replace the O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with a per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)

Bug fixes:
- COUNT_EQUAL: use a non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore the node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679)
- graph cache op_params matching: compare the full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache
Requirements
YES — AI tools (Claude code) were used in an assistive capacity only. Specifically, AI was used to analyze problems, suggest implementation approaches, and provide explanations of relevant CANN/ACL APIs. All code was written, reviewed line-by-line, and validated by the human contributor, who takes full responsibility for the correctness and design of the changes.
Known issue: due to kernel caching, in models like Qwen3.5 where the layers are not all identical, a "kernel cache not found" error occurs during multi-device inference. Workaround: disable kernel caching by setting the environment variable ACLNN_CACHE_LIMIT=0.