
CANN: Add support for Qwen35 ops #21204

Merged: ggerganov merged 1 commit into ggml-org:master from hipudding:qwen35_op on Apr 28, 2026

Conversation

@hipudding (Contributor) commented Mar 31, 2026

Overview

This PR adds support for several operators that were missing from the CANN (Ascend NPU) backend and are needed by Qwen3.5.

New operators:

  • GGML_OP_SET: implement via aclnnInplaceCopy on target region
  • GGML_OP_CUMSUM: implement via aclnnCumsum
  • GGML_OP_FILL: implement via aclnnInplaceFillScalar
  • GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
  • GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via
    aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
    (see the mapping sketch after this list)
  • GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
  • GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus
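
For illustration, here is a minimal self-contained sketch of the TRI mapping above; the enum and helper names are hypothetical stand-ins, not the backend's actual code:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical mirror of the four GGML triangular variants.
enum class tri_type { lower, lower_diag, upper_diag, upper };

struct tri_call {
    bool    use_tril;   // true -> aclnnTril, false -> aclnnTriu
    int64_t diagonal;   // diagonal offset passed to the kernel
};

// Map each variant to a kernel and offset as the PR describes:
// aclnnTril(-1/0) for the lower variants, aclnnTriu(0/1) for the upper
// ones; the "diag" variants keep the main diagonal, the others drop it.
tri_call map_tri(tri_type t) {
    switch (t) {
        case tri_type::lower:      return { true,  -1 };  // strictly below the diagonal
        case tri_type::lower_diag: return { true,   0 };  // below, incl. the diagonal
        case tri_type::upper_diag: return { false,  0 };  // above, incl. the diagonal
        case tri_type::upper:      return { false,  1 };  // strictly above the diagonal
    }
    throw std::invalid_argument("unknown tri_type");
}
```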

Optimizations:

  • GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu /
    aclnnGeGluV3 when applicable; fallback conditions now checked inside
    each function rather than at the call site
  • CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→
    ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call
  • L2_NORM: fix in-place ClampMin on norm result (was clamping wrong
    tensor); add eps clamping before division to avoid divide-by-zero
  • PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call
    ReflectionPad1d once on the full 4-D view; remove redundant nb copies
  • GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor
    helper into gather_batched lambda with batch loop inlined
  • SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice;
    refactor helper into scatter_batched lambda with batch loop inlined
  • OUT_PROD: replace O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with
    per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast
    batch dims where ne02/ne03 may differ from ne2/ne3 (see the
    batch-loop sketch after this list)
  • backend memset_tensor: implement via aclrtMemset (was NULL)
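
The strided-broadcast handling in the OUT_PROD item can be pictured with this standalone sketch; the function shape is illustrative, and the modular index mapping is an assumption about how the broadcast is resolved, not the backend's actual code:

```cpp
#include <cstdint>
#include <functional>

// Illustrative batch loop: one matmul per output slice (i2, i3). When the
// input batch dims ne02/ne03 are smaller than the output's ne2/ne3, the
// same input slice is reused; modular indexing stands in for however the
// backend resolves the broadcast.
void out_prod_batched(int64_t ne2, int64_t ne3, int64_t ne02, int64_t ne03,
                      const std::function<void(int64_t, int64_t, int64_t, int64_t)> & matmul_slice) {
    for (int64_t i3 = 0; i3 < ne3; ++i3) {
        for (int64_t i2 = 0; i2 < ne2; ++i2) {
            const int64_t i02 = i2 % ne02;   // wrap onto input batch dim 2
            const int64_t i03 = i3 % ne03;   // wrap onto input batch dim 3
            matmul_slice(i02, i03, i2, i3);  // dst[.., i2, i3] = src0 @ src1^T
        }
    }
}
```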

Bug fixes:

  • COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary
    buffer instead of InplaceEqTensor, avoiding corruption of src0
  • ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[]
    fields in ggml_graph_node_properties; has_matching_properties() was
    missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to
    incorrectly share cached graphs and produce wrong results (ERR≈679)
  • graph cache op_params matching: compare full GGML_MAX_OP_PARAMS
    bytes so that ops differing only in parameters are not incorrectly
    replayed from cache (see the matching sketch after this list)
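
A minimal sketch of the corrected cache matching described in the last two items; the struct and constants are hypothetical stand-ins for ggml_graph_node_properties, GGML_MAX_SRC, and GGML_MAX_OP_PARAMS:

```cpp
#include <cstdint>
#include <cstring>

constexpr int MAX_SRC       = 10;  // stand-in for GGML_MAX_SRC
constexpr int MAX_OP_PARAMS = 64;  // stand-in for GGML_MAX_OP_PARAMS (bytes)

struct node_properties {
    int     op;                        // ggml op id
    int     node_type;                 // restored field: destination tensor type
    int     src_type[MAX_SRC];         // restored field: source tensor types
    uint8_t op_params[MAX_OP_PARAMS];  // raw op parameter bytes
};

bool has_matching_properties(const node_properties & a, const node_properties & b) {
    // type checks: F16 and BF16 share nb[0] == 2, so byte strides alone
    // cannot tell them apart; the types themselves must be compared
    if (a.op != b.op || a.node_type != b.node_type) {
        return false;
    }
    for (int i = 0; i < MAX_SRC; ++i) {
        if (a.src_type[i] != b.src_type[i]) {
            return false;
        }
    }
    // compare the full op_params blob so ops differing only in their
    // parameters are never replayed from a stale cached graph
    return std::memcmp(a.op_params, b.op_params, MAX_OP_PARAMS) == 0;
}
```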

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES — AI tools (Claude Code) were used in an assistive capacity only. Specifically, AI was used to analyze problems, suggest implementation approaches, and provide explanations of relevant CANN/ACL APIs. All code was written, reviewed line-by-line, and validated by the human contributor, who takes full responsibility for the correctness and design of the changes.

Known issue: due to kernel caching, models like Qwen3.5, whose layers are not all identical, can hit a "kernel cache not found" error during multi-device inference. Workaround: disable kernel caching by setting the environment variable ACLNN_CACHE_LIMIT=0.

@github-actions bot added the labels testing (Everything test related), ggml (changes relating to the ggml tensor library for machine learning), and Ascend NPU (issues specific to Ascend NPUs) Mar 31, 2026
@hipudding hipudding changed the title Qwen35 op CANN: Add suport for Qwen35 ops Mar 31, 2026
@hipudding hipudding self-assigned this Apr 16, 2026
@hipudding hipudding changed the title CANN: Add suport for Qwen35 ops CANN: Add support for Qwen35 ops Apr 16, 2026
@noemotiovon (Collaborator) commented:

Thanks for your contribution! Overall it looks fine, but please make sure all precision tests pass.

@noemotiovon (Collaborator) commented:

Additionally, it would be better to optimize one operator at a time.

@hipudding (Contributor, Author) commented:

> Additionally, it would be better to optimize one operator at a time.

Thank you for your review. The CANN backend lacked support for several operators required by Qwen3.5; to ensure high-performance inference, it is necessary to add all of the missing and optimizable operators together.

@hipudding (Contributor, Author) commented:

> Thanks for your contribution! Overall it looks fine, but please make sure all precision tests pass.

[screenshot: precision test results]

@hipudding hipudding marked this pull request as ready for review April 24, 2026 06:47
@hipudding hipudding requested a review from a team as a code owner April 24, 2026 06:47
Copilot AI review requested due to automatic review settings April 24, 2026 06:47
Copilot AI left a comment:

Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Adds missing operator support and targeted performance improvements for the CANN (Ascend NPU) backend to better run Qwen 3.5 graphs while fixing several correctness issues.

Changes:

  • Implements additional GGML ops/unary ops (e.g., SET, CUMSUM, TRI, DIAG, SOLVE_TRI, SOFTPLUS) and wires them into dispatch/support checks.
  • Introduces fused/optimized kernels for GLU variants and cross-entropy; refactors GET_ROWS/SET_ROWS and other hot paths.
  • Fixes correctness issues in cache matching and COUNT_EQUAL, and adds device-side tensor memset.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-cann/ggml-cann.cpp | Hooks new ops into compute dispatch and backend capability checks; adds buffer-level memset_tensor. |
| ggml/src/ggml-cann/aclnn_ops.h | Adds ACLNN includes and declares new CANN op entry points. |
| ggml/src/ggml-cann/aclnn_ops.cpp | Implements new ops (SwiGLU/GeGLU/SET/CUMSUM/TRI/DIAG/SOLVE_TRI/SOFTPLUS) and multiple performance/correctness refactors. |



// Fused GeGLU using aclnnGeGluV3: splits input along ne[0] (CANN last dim),
// activates the LEFT half with GELU, multiplies by right half.
// approximate: 0=tanh, 1=none(erf). activateLeft=true matches GGML convention.
Copilot AI commented Apr 24, 2026:

The approximate comment is internally inconsistent (“1=none(erf)”) and doesn’t match the call sites in ggml-cann.cpp which document approximate=1 → erf. Please update the comment to clearly reflect the actual mapping expected by aclnnGeGluV3 (e.g., “0=tanh, 1=erf/exact”) to avoid future misuse.

Suggested change:
-// approximate: 0=tanh, 1=none(erf). activateLeft=true matches GGML convention.
+// approximate: 0=tanh, 1=erf/exact. activateLeft=true matches GGML convention.
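
For context, a scalar reference of the GeGLU semantics the excerpt describes (the erf/exact variant), assuming the left half of the input is the activated one as stated above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Split an input of length 2*n into halves, GELU-activate the left half
// using the exact (erf-based) form, and gate it with the right half.
std::vector<float> geglu_erf_ref(const std::vector<float> & x) {
    const std::size_t n = x.size() / 2;
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float a = x[i];       // left half: activated
        const float b = x[i + n];   // right half: gate
        const float gelu = 0.5f * a * (1.0f + std::erf(a / std::sqrt(2.0f)));
        out[i] = gelu * b;
    }
    return out;
}
```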

Comment on lines +193 to +201
// src0 GGML: [2*ne0, ne1, ne2, ne3] → 3D view [2*ne0, ne1, ne2*ne3]
// CANN reversed: [ne2*ne3, ne1, 2*ne0], split along CANN dim 2 (last).
int64_t ne0_x2 = src0->ne[0];
int64_t ne1 = src0->ne[1];
int64_t ne23 = src0->ne[2] * src0->ne[3];
int64_t src3d_ne[] = { ne0_x2, ne1, ne23 };
size_t src3d_nb[] = { (size_t)src0->nb[0], (size_t)src0->nb[1], (size_t)src0->nb[2] };
acl_tensor_ptr acl_src = ggml_cann_create_tensor(src0->data, ggml_cann_type_mapping(src0->type),
elem_size, src3d_ne, src3d_nb, 3);
Copilot AI commented Apr 24, 2026:

ne2 and ne3 are collapsed into a single ne23 dimension, but the 3D view uses src0->nb[2] as the stride for that collapsed dimension without asserting that dim3 is contiguous with dim2 (i.e., src0->nb[3] == src0->nb[2] * src0->ne[2]). If src0 is not contiguous across dims 2/3, this view will read incorrect data. Suggest adding contiguity assertions (similar to pad_reflect_1d) for both src0 and dst, or falling back to a path that operates on per-(i2,i3) slices (or a contiguous copy) when the layout isn’t compatible.
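
A sketch of the check this comment proposes, as a hypothetical helper: dims 2 and 3 may be collapsed only when dim 3 is tightly packed after dim 2:

```cpp
#include <cstddef>
#include <cstdint>

// Collapsing dims 2 and 3 into one batch dim is only valid when dim 3's
// byte stride equals nb[2] * ne[2], i.e. the two dims are contiguous.
inline bool can_collapse_dims_2_3(const int64_t ne[4], const std::size_t nb[4]) {
    return nb[3] == nb[2] * (std::size_t) ne[2];
}
```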

Comment on lines +685 to +694
size_t nb1 = ((int32_t *) dst->op_params)[0];
size_t nb2 = ((int32_t *) dst->op_params)[1];
size_t nb3 = ((int32_t *) dst->op_params)[2];
size_t offset = ((int32_t *) dst->op_params)[3];
bool inplace = (bool) ((int32_t *) dst->op_params)[4];

size_t param_nb[] = { ggml_element_size(src0), nb1, nb2, nb3 };

// Create a view of dst at the target offset with src1's dimensions
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst, src1->ne, param_nb, GGML_MAX_DIMS, ACL_FORMAT_ND, offset);
Copilot AI commented Apr 24, 2026:

This reads byte strides/offset from dst->op_params as int32_t and stores them in size_t. If the op params represent byte sizes (which can exceed 2 GiB on large tensors), the int32_t read can truncate/overflow and produce an invalid view. Also, param_nb[0] is derived from src0 element size even though the view is on dst’s buffer; this can encode incorrect strides if types ever differ. Recommended: parse these fields using the actual width used by ggml for this op’s params (e.g., int64_t or size_t if available), and prefer ggml_element_size(dst) (or validate type equality) for param_nb[0].
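
To make the truncation hazard concrete, a standalone sketch (not backend code) of what happens to a byte stride above 2 GiB after an int32_t round-trip; the exact wrapped value is platform-dependent:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const size_t  nb1       = 3ULL << 30;      // a 3 GiB byte stride
    const int32_t stored    = (int32_t) nb1;   // what an int32 op-param slot would hold
    const size_t  read_back = (size_t) stored; // value the reader recovers
    // on typical platforms the round-trip wraps, so read_back != nb1
    std::printf("original %zu, after int32 round-trip %zu\n", nb1, read_back);
    return 0;
}
```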

Comment on lines +820 to +828
// Copy src vector onto the diagonal of dst via strided views.
// src viewed as [N, n_batch], contiguous strides.
int64_t ne_vec[2] = { N, n_batch };
size_t nb_src_vec[2] = { nb_f32, N * nb_f32 };
// dst diagonal view: stride (N+1)*4 steps along the diagonal.
size_t nb_dst_diag[2] = { (N + 1) * nb_f32, N * N * nb_f32 };

acl_tensor_ptr acl_src_vec = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne_vec, nb_src_vec, 2);
acl_tensor_ptr acl_dst_diag = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne_vec, nb_dst_diag, 2);
Copilot AI commented Apr 24, 2026:

ggml_cann_diag hard-codes contiguous strides for src (nb_src_vec) and assumes a specific dense layout for dst’s diagonal view. If src is not contiguous (or dst has non-standard strides/views), these views will be incorrect and can write/read wrong elements. Suggest either (a) asserting the required layout (e.g., ggml_is_contiguous(src) and appropriate dst->nb[] relationships / dst->nb[3] == dst->nb[2] * dst->ne[2] when collapsing batches), or (b) constructing the views from the actual src->nb[] / dst->nb[] with the correct diagonal stride based on the real layout.
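
For reference, the diagonal-stride arithmetic behind these views, as a tiny standalone sketch:

```cpp
#include <cstddef>

// In a dense row-major N x N float matrix, element (i, i) sits i*(N+1)
// elements past the base, so a 1-D view with an element stride of N+1
// (byte stride (N+1)*sizeof(float)) walks the main diagonal.
inline float * diag_elem(float * base, std::size_t N, std::size_t i) {
    return base + i * (N + 1);
}
```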

Comment on lines +844 to +854
const int64_t S = src->ne[0];
const int64_t n_batch = src->ne[2] * src->ne[3];
const size_t nb_f32 = sizeof(float);

int64_t ne3d[3] = { S, S, n_batch };
size_t nb3d[3] = { nb_f32, S * nb_f32, S * S * nb_f32 };

const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);

acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
Copilot AI commented Apr 24, 2026:

ggml_cann_tri constructs a collapsed 3D view with fully hard-coded contiguous strides (nb3d). This will be wrong if src/dst aren’t dense contiguous (or if batch dims aren’t laid out contiguously such that nb[3] == nb[2] * ne[2]). Since this function operates directly on raw pointers rather than ggml_cann_create_tensor(src) / (...dst), it should either enforce the required contiguity assumptions via GGML_ASSERT(...) (including batch-collapse contiguity) or build ne/nb from the actual tensor strides and only collapse when it’s valid.

Suggested change:
-const int64_t S = src->ne[0];
-const int64_t n_batch = src->ne[2] * src->ne[3];
-const size_t nb_f32 = sizeof(float);
-int64_t ne3d[3] = { S, S, n_batch };
-size_t nb3d[3] = { nb_f32, S * nb_f32, S * S * nb_f32 };
-const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);
-acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
-acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, nb3d, 3);
+const size_t nb_f32 = sizeof(float);
+GGML_ASSERT(src->type == GGML_TYPE_F32);
+GGML_ASSERT(dst->type == GGML_TYPE_F32);
+GGML_ASSERT(src->ne[0] == src->ne[1]);
+GGML_ASSERT(dst->ne[0] == dst->ne[1]);
+GGML_ASSERT(src->ne[0] == dst->ne[0]);
+GGML_ASSERT(src->ne[1] == dst->ne[1]);
+GGML_ASSERT(src->ne[2] == dst->ne[2]);
+GGML_ASSERT(src->ne[3] == dst->ne[3]);
+// This path collapses the last two dimensions into a single batch dimension.
+// That is only valid for dense contiguous tensors with tightly packed batch
+// dimensions.
+GGML_ASSERT(src->nb[0] == nb_f32);
+GGML_ASSERT(src->nb[1] == (size_t) src->ne[0] * src->nb[0]);
+GGML_ASSERT(src->nb[2] == (size_t) src->ne[1] * src->nb[1]);
+GGML_ASSERT(src->nb[3] == (size_t) src->ne[2] * src->nb[2]);
+GGML_ASSERT(dst->nb[0] == nb_f32);
+GGML_ASSERT(dst->nb[1] == (size_t) dst->ne[0] * dst->nb[0]);
+GGML_ASSERT(dst->nb[2] == (size_t) dst->ne[1] * dst->nb[1]);
+GGML_ASSERT(dst->nb[3] == (size_t) dst->ne[2] * dst->nb[2]);
+const int64_t S = src->ne[0];
+const int64_t n_batch = src->ne[2] * src->ne[3];
+int64_t ne3d[3] = { S, S, n_batch };
+size_t src_nb3d[3] = { src->nb[0], src->nb[1], src->nb[2] };
+size_t dst_nb3d[3] = { dst->nb[0], dst->nb[1], dst->nb[2] };
+const ggml_tri_type ttype = (ggml_tri_type) ggml_get_op_params_i32(dst, 0);
+acl_tensor_ptr acl_src = ggml_cann_create_tensor(src->data, ACL_FLOAT, nb_f32, ne3d, src_nb3d, 3);
+acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst->data, ACL_FLOAT, nb_f32, ne3d, dst_nb3d, 3);

#include "ggml-impl.h"
#include "ggml.h"


Copilot AI commented Apr 24, 2026:

There is an extra blank line between includes (line 28). Consider removing it to keep include blocks consistent.

Suggested change: remove the extra blank line between the includes.
@hipudding (Contributor, Author) commented:

> Known issue: Due to kernel caching, in models like qwen3.5 where each layer is not entirely identical, a "kernel cache not found" error occurs during multi-device inference. Workaround: Disable kernel caching by setting environment variables. ACLNN_CACHE_LIMIT=0

> So the proper fix should be to use separate threads for each device so the cache is managed per device and invalidated correctly, rather than disabling kernel caching globally as a workaround?

Yes, you're right. But I didn't find a way to manage the cache key. Do you know of any API for managing the cache key from llama.cpp? I thought that interface exists only in CANN.

@hipudding (Contributor, Author) commented:

@KokerZhou Sorry, I misunderstood at first. The suggested approach, multi-threaded inference with one thread per device, involves many modifications; I think this feature can be considered for later implementation. The best approach would be to add device information to the cache key, but I haven't found a suitable interface for that.

@hipudding (Contributor, Author) commented:

@ggerganov Friendly ping! Looking forward to any feedback when you have a chance. Thanks!

@ggerganov ggerganov merged commit c3e08f4 into ggml-org:master Apr 28, 2026
85 of 88 checks passed
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
@KokerZhou (Contributor) commented:

@hipudding The multithread issue has been fixed upstream in commit f37e88e:
https://gitcode.com/cann/opbase/tree/f37e88e9525fac93214acd6fc652d9c69e191e02

No further action is required.
