Merged
Changes from all commits (79 commits)
a09c2d6
tests : add --list-ops and --show-coverage options (llama/15745)
danbev Sep 5, 2025
dd8b264
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (llama/15802)
JohannesGaessler Sep 5, 2025
e53d186
ggml-cpu: drop support for nnpa intrinsics (llama/15821)
taronaeo Sep 6, 2025
ef3a67b
ggml-cpu: document use of "free" memory [no ci] (llama/15834)
JohannesGaessler Sep 6, 2025
ff35061
kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16…
chaxu01 Sep 6, 2025
664f01c
CUDA: faster tile FA (Pascal/AMD), headsize 256 (llama/15769)
JohannesGaessler Sep 6, 2025
dec6751
ggml WebGPU: remove userdata from request adapter callback (llama/15527)
danbev Sep 7, 2025
d2069c7
vulkan: Use larger loads in scalar/coopmat1 matmul (llama/15729)
jeffbolznv Sep 7, 2025
f85374e
vulkan: Support pad_ext (llama/15794)
jeffbolznv Sep 7, 2025
70e41e2
ggml-cpu: clean up s390x SIMD (llama/15855)
taronaeo Sep 7, 2025
5af0743
vulkan: support im2col_3d (llama/15795)
jeffbolznv Sep 7, 2025
b90d6a1
CANN: Stream sync between devices for acl_graph (llama/15809)
noemotiovon Sep 8, 2025
f8f13b3
tests: large sizes for get_rows (llama/15687)
jeffbolznv Sep 8, 2025
0c6f5cb
CUDA: non-contiguous src0 not supported for PAD (llama/15869)
CISC Sep 8, 2025
a767255
ggml: allow casting between f32 and i32 (llama/15783)
ngxson Sep 8, 2025
c186b92
sync : llama.cpp
ggerganov Sep 20, 2025
a6b91e6
metal : refactor + optimize (llama/15857)
ggerganov Sep 20, 2025
0103cf1
sync : llama.cpp
ggerganov Sep 20, 2025
3261bcd
cuda : fix supports_op condition for get_rows when number of blocks i…
ggerganov Sep 8, 2025
d096a7d
CUDA: generate_cu_files.py - add missing mxfp4 (llama/15880)
am17an Sep 8, 2025
fe9600d
vulkan: sort graph to allow more parallel execution (llama/15850)
jeffbolznv Sep 8, 2025
373c68f
CUDA: fix GET_ROWS for large tensors (llama/15882)
JohannesGaessler Sep 9, 2025
8c6bbdc
CUDA: Add mul_mat_id support for the mmf kernel (llama/15767)
am17an Sep 9, 2025
e7772a1
Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs …
lksj92hs Sep 9, 2025
07ec4c6
HIP: use v_dot2_f32_f16 instruction for FA (llama/15884)
JohannesGaessler Sep 9, 2025
8967731
vulkan: Fix OOB accesses in soft_max_back (llama/15861)
jeffbolznv Sep 9, 2025
bd5192a
vulkan: throw the oom error instead of no memory type found (llama/15…
0cc4m Sep 9, 2025
b15febc
CANN: implement LRU cache for ACL graphs (llama/15814)
noemotiovon Sep 10, 2025
09dbed5
CANN: Add ROPE sin/cos cache for reuse (llama/15912)
noemotiovon Sep 10, 2025
870f627
tests : filter out no-ops from coverage report (llama/15900)
danbev Sep 10, 2025
98252d7
sync : llama.cpp
ggerganov Sep 20, 2025
fb58c0a
metal : make the backend async (llama/15906)
ggerganov Sep 20, 2025
8530081
sync : llama.cpp
ggerganov Sep 20, 2025
2b327c3
ggml-cpu : fix padding in ggml_timestep_embedding (llama/15917)
danbev Sep 10, 2025
47c331f
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (l…
ORippler Sep 10, 2025
de3540b
CANN: Disable acl_graph for prefill stage (llama/15933)
hipudding Sep 11, 2025
bc3642a
kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed (llama/15614)
chaxu01 Sep 11, 2025
6334d15
ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (llama/15922)
danbev Sep 11, 2025
9af5e5e
CUDA: larger SRAM reads for tile FA, AMD FP16 dot (llama/15927)
JohannesGaessler Sep 11, 2025
a5408a4
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (llama/1…
slaren Sep 11, 2025
f2775a6
Revert "sycl: add usage of enqueue_functions extension (llama/14244)"…
NeoZhangJianyu Sep 12, 2025
89270ee
vulkan: Make device memory check more portable (llama/15939)
mbaudier Sep 12, 2025
c6a354f
Vulkan iGPU device selection overhaul and PCI ID API support (llama/1…
0cc4m Sep 12, 2025
8db1968
ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration and incorr…
taronaeo Sep 12, 2025
f4d6821
metal : fix memory leaks (llama/15962)
ggerganov Sep 13, 2025
6e2e961
metal : allow ops to run concurrently (llama/15929)
ggerganov Sep 13, 2025
c9a3556
metal : refactor kernel loading (llama/15964)
ggerganov Sep 13, 2025
e0b7d3f
vulkan: initialize vulkan-hpp to allow using extension function point…
jeffbolznv Sep 13, 2025
5d4df7c
vulkan: fix failing dequant shaders (llama/15862)
jeffbolznv Sep 13, 2025
a00b972
ggml-zdnn: rm user mapped buffers (llama/15965)
taronaeo Sep 14, 2025
34dbac9
metal : fix kernel requirements (llama/15983)
ggerganov Sep 14, 2025
7f52926
Vulkan: Clean up mul_mm shader (llama/15987)
0cc4m Sep 14, 2025
7568eb5
metal : remove memory pools (llama/15966)
ggerganov Sep 14, 2025
bbdf894
CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (llama/15926)
am17an Sep 15, 2025
f99ae6f
SYCL: Add COUNT_EQUAL operator support (llama/15991)
yael-works Sep 15, 2025
4a99e6b
CUDA: fix im2col_3d to respect non-contiguous inputs (views) (llama/1…
jakekarnes42 Sep 15, 2025
cc7ca51
ggml : fix padding in timestep embedding kernels (llama/15932)
danbev Sep 16, 2025
5cfc2b6
CANN: Optimize ggml_cann_set_device (llama/15935)
noemotiovon Sep 17, 2025
d80e7b9
vulkan: automatically remove unsupported devices (llama/15976)
netrunnereve Sep 17, 2025
f8c2feb
CUDA: fix FA occupancy, optimize tile kernel (llama/15982)
JohannesGaessler Sep 17, 2025
1a91fb9
sync : llama.cpp
ggerganov Sep 20, 2025
eb24610
metal : refactor + optimize v2 (llama/15995)
ggerganov Sep 20, 2025
483e0f2
GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llam…
reeselevine Sep 17, 2025
098944f
CANN: Remove print (llama/16044)
noemotiovon Sep 18, 2025
1c1edee
metal : handle nil cv during pipeline creation (llama/16065)
ggerganov Sep 18, 2025
ad834c3
metal : avoid call free for non-owned buffer (llama/16067)
jhen0409 Sep 18, 2025
2465bd4
metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057)
ggerganov Sep 18, 2025
15c7f02
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060)
CISC Sep 18, 2025
d2af55f
metal : use function constants for mul_mv_ext kernels (llama/16074)
ggerganov Sep 18, 2025
044daef
CUDA: fix compilation on CC 6.0 (llama/16091)
JohannesGaessler Sep 18, 2025
3a9d858
CUDA: Optimize PAD_REFLECT_1D (llama/15957)
bugparty Sep 18, 2025
2798b6c
rename optimize_graph to graph_optimize (llama/16082)
jeffbolznv Sep 18, 2025
543c31d
opencl: optimize mxfp4 kernels (llama/16037)
shawngu-quic Sep 18, 2025
527276a
cmake : fix static linking for OpenMP on Unix-like systems (llama/16031)
angt Sep 18, 2025
f9cf621
ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049)
angt Sep 18, 2025
7f7a412
ggml : refactor forward_dup for cpu backend (llama/16062)
ngxson Sep 19, 2025
96a655f
vulkan: use vec dot for matrix matrix multiplications (llama/16056)
0cc4m Sep 20, 2025
a78e889
sync : llama.cpp
ggerganov Sep 20, 2025
ec27e70
tests : adjust to new timestep_embedding operator
ggerganov Sep 20, 2025
1 change: 0 additions & 1 deletion CMakeLists.txt
@@ -170,7 +170,6 @@ option(GGML_RV_ZVFH "ggml: enable riscv zvfh" ON)
option(GGML_RV_ZICBOP "ggml: enable riscv zicbop" ON)
option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
option(GGML_VXE "ggml: enable vxe" ON)
option(GGML_NNPA "ggml: enable nnpa" OFF) # temp disabled by default, see: https://github.com/ggml-org/llama.cpp/issues/14877

option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
12 changes: 12 additions & 0 deletions include/ggml-backend.h
@@ -132,6 +132,8 @@ extern "C" {
GGML_BACKEND_DEVICE_TYPE_CPU,
// GPU device using dedicated memory
GGML_BACKEND_DEVICE_TYPE_GPU,
// integrated GPU device using host memory
GGML_BACKEND_DEVICE_TYPE_IGPU,
// accelerator devices intended to be used together with the CPU backend (e.g. BLAS or AMX)
GGML_BACKEND_DEVICE_TYPE_ACCEL
};
@@ -150,11 +152,21 @@ extern "C" {

// all the device properties
struct ggml_backend_dev_props {
// device name
const char * name;
// device description
const char * description;
// device free memory in bytes
size_t memory_free;
// device total memory in bytes
size_t memory_total;
// device type
enum ggml_backend_dev_type type;
// device id
// for PCI devices, this should be the PCI bus id formatted as "domain:bus:device.function" (e.g. "0000:01:00.0")
// if the id is unknown, this should be NULL
const char * device_id;
// device capabilities
struct ggml_backend_dev_caps caps;
};

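Usage note (not part of the diff): a minimal sketch of how application code could consume the extended ggml_backend_dev_props fields and the new GGML_BACKEND_DEVICE_TYPE_IGPU value. It assumes the existing enumeration helpers ggml_backend_dev_count / ggml_backend_dev_get / ggml_backend_dev_get_props from ggml-backend.h.

// Minimal sketch: list all registered devices and print the new fields.
#include <cstdio>
#include "ggml-backend.h"

static void list_devices(void) {
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);

        struct ggml_backend_dev_props props;
        ggml_backend_dev_get_props(dev, &props);

        const char * type = "other";
        switch (props.type) {
            case GGML_BACKEND_DEVICE_TYPE_CPU:   type = "CPU";   break;
            case GGML_BACKEND_DEVICE_TYPE_GPU:   type = "GPU";   break;
            case GGML_BACKEND_DEVICE_TYPE_IGPU:  type = "iGPU";  break; // new in this sync
            case GGML_BACKEND_DEVICE_TYPE_ACCEL: type = "ACCEL"; break;
        }

        // device_id may be NULL when the PCI id is unknown
        printf("%s (%s): %s, free/total = %zu/%zu bytes\n",
               props.name, type,
               props.device_id ? props.device_id : "no device id",
               props.memory_free, props.memory_total);
    }
}
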
2 changes: 1 addition & 1 deletion include/ggml-cpu.h
@@ -101,7 +101,6 @@ extern "C" {
GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
GGML_BACKEND_API int ggml_cpu_has_vsx (void);
GGML_BACKEND_API int ggml_cpu_has_vxe (void);
GGML_BACKEND_API int ggml_cpu_has_nnpa (void);
GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
GGML_BACKEND_API int ggml_cpu_has_llamafile (void);

@@ -135,6 +134,7 @@ extern "C" {
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_cpu_reg(void);

GGML_BACKEND_API void ggml_cpu_fp32_to_fp32(const float *, float *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_i32 (const float *, int32_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_fp16(const float *, ggml_fp16_t *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp16_to_fp32(const ggml_fp16_t *, float *, int64_t);
GGML_BACKEND_API void ggml_cpu_fp32_to_bf16(const float *, ggml_bf16_t *, int64_t);
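A tiny sketch of the newly exported conversion helper; the buffers and sizes are illustrative, and per the ggml_cast note added in include/ggml.h the fractional part is discarded.

// Minimal sketch: convert a small float buffer to int32 with the new helper.
#include <cstdio>
#include <cstdint>
#include "ggml-cpu.h"

int main() {
    const float src[4] = { 0.0f, 1.5f, -2.25f, 3.0f };
    int32_t     dst[4] = { 0 };

    // newly exported in this sync
    ggml_cpu_fp32_to_i32(src, dst, 4);

    for (int i = 0; i < 4; ++i) {
        printf("%f -> %d\n", src[i], dst[i]);
    }
    return 0;
}
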
7 changes: 1 addition & 6 deletions include/ggml-metal.h
@@ -39,18 +39,13 @@ extern "C" {
// user-code should use only these functions
//

// TODO: remove in the future
GGML_BACKEND_API ggml_backend_t ggml_backend_metal_init(void);

GGML_BACKEND_API bool ggml_backend_is_metal(ggml_backend_t backend);

GGML_DEPRECATED(
GGML_BACKEND_API ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t size, size_t max_size),
"obsoleted by the new device interface - https://github.com/ggml-org/llama.cpp/pull/9713");

GGML_BACKEND_API void ggml_backend_metal_set_abort_callback(ggml_backend_t backend, ggml_abort_callback abort_callback, void * user_data);

GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void);

// helper to check if the device supports a specific family
// ideally, the user code should be doing these checks
// ref: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
2 changes: 0 additions & 2 deletions include/ggml-zdnn.h
@@ -7,8 +7,6 @@
extern "C" {
#endif

GGML_BACKEND_API ggml_backend_t ggml_backend_zdnn_init(void);

GGML_BACKEND_API ggml_backend_reg_t ggml_backend_zdnn_reg(void);

#ifdef __cplusplus
15 changes: 10 additions & 5 deletions include/ggml.h
@@ -284,19 +284,19 @@ __host__ __device__ constexpr inline void ggml_unused_vars_impl(Args&&...) noexc
// GGML_TENSOR_LOCALS(size_t, nb1, src1, nb);
//
#define GGML_TENSOR_LOCALS_1(type, prefix, pointer, array) \
const type prefix##0 = (pointer)->array[0]; \
const type prefix##0 = (pointer) ? (pointer)->array[0] : 0; \
GGML_UNUSED(prefix##0);
#define GGML_TENSOR_LOCALS_2(type, prefix, pointer, array) \
GGML_TENSOR_LOCALS_1 (type, prefix, pointer, array) \
const type prefix##1 = (pointer)->array[1]; \
const type prefix##1 = (pointer) ? (pointer)->array[1] : 0; \
GGML_UNUSED(prefix##1);
#define GGML_TENSOR_LOCALS_3(type, prefix, pointer, array) \
GGML_TENSOR_LOCALS_2 (type, prefix, pointer, array) \
const type prefix##2 = (pointer)->array[2]; \
const type prefix##2 = (pointer) ? (pointer)->array[2] : 0; \
GGML_UNUSED(prefix##2);
#define GGML_TENSOR_LOCALS(type, prefix, pointer, array) \
GGML_TENSOR_LOCALS_3 (type, prefix, pointer, array) \
const type prefix##3 = (pointer)->array[3]; \
const type prefix##3 = (pointer) ? (pointer)->array[3] : 0; \
GGML_UNUSED(prefix##3);

#define GGML_TENSOR_UNARY_OP_LOCALS \
@@ -1404,6 +1404,7 @@ extern "C" {
struct ggml_tensor * a,
struct ggml_tensor * b);

// note: casting from f32 to i32 will discard the fractional part
GGML_API struct ggml_tensor * ggml_cast(
struct ggml_context * ctx,
struct ggml_tensor * a,
@@ -1528,7 +1529,11 @@ extern "C" {
struct ggml_context * ctx,
struct ggml_tensor * a);

// supports 3D: a->ne[2] == b->ne[1]
// supports 4D a:
// a [n_embd, ne1, ne2, ne3]
// b I32 [n_rows, ne2, ne3, 1]
//
// return [n_embd, n_rows, ne2, ne3]
GGML_API struct ggml_tensor * ggml_get_rows(
struct ggml_context * ctx,
struct ggml_tensor * a, // data
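A hedged sketch of the two API notes above: ggml_cast between F32 and I32, and the 4D form of ggml_get_rows. The context size and the CPU compute call via ggml_graph_compute_with_ctx are assumptions for the sketch, not part of the diff.

// Minimal sketch: cast F32 "indices" to I32 and use them with ggml_get_rows.
#include "ggml.h"
#include "ggml-cpu.h"

int main() {
    ggml_init_params params = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ NULL, /*.no_alloc =*/ false };
    ggml_context * ctx = ggml_init(params);

    // a: [n_embd = 8, 16 rows]
    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 16);
    // float row indices, cast to I32 (fractional part is discarded)
    ggml_tensor * idx_f32 = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_tensor * idx_i32 = ggml_cast(ctx, idx_f32, GGML_TYPE_I32);

    // result: [n_embd = 8, 4 rows]
    ggml_tensor * rows = ggml_get_rows(ctx, a, idx_i32);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, rows);

    // ... fill a and idx_f32 here, then:
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 1);

    ggml_free(ctx);
    return 0;
}
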
2 changes: 1 addition & 1 deletion scripts/sync-llama.last
@@ -1 +1 @@
a81283820a466f2ace06ce4d4bc9512761f9365f
7f766929ca8e8e01dcceb1c526ee584f7e5e1408
3 changes: 3 additions & 0 deletions src/CMakeLists.txt
@@ -114,6 +114,9 @@ message(STATUS "GGML_SYSTEM_ARCH: ${GGML_SYSTEM_ARCH}")

if (NOT MSVC)
if (GGML_STATIC)
if (UNIX AND NOT APPLE)
set(CMAKE_FIND_LIBRARY_SUFFIXES ".a;.so")
endif()
add_link_options(-static)
if (MINGW)
add_link_options(-static-libgcc -static-libstdc++)
5 changes: 4 additions & 1 deletion src/ggml-backend-impl.h
@@ -8,7 +8,7 @@
extern "C" {
#endif

#define GGML_BACKEND_API_VERSION 1
#define GGML_BACKEND_API_VERSION 2

//
// Backend buffer type
@@ -114,6 +114,9 @@ extern "C" {
void (*event_record)(ggml_backend_t backend, ggml_backend_event_t event);
// wait for an event on on a different stream
void (*event_wait) (ggml_backend_t backend, ggml_backend_event_t event);

// (optional) sort/optimize the nodes in the graph
void (*graph_optimize) (ggml_backend_t backend, struct ggml_cgraph * cgraph);
};

struct ggml_backend {
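The new optional graph_optimize hook is what bumps GGML_BACKEND_API_VERSION to 2. A backend that wants to reorder nodes would wire it up roughly as below; the function body is a placeholder assumption, the real strategy is backend-specific (e.g. the Vulkan backend sorts the graph to expose more parallelism, llama/15850).

// Minimal sketch of a backend opting into the new optional hook.
#include "ggml-backend-impl.h"

static void my_backend_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    // hypothetical: reorder cgraph->nodes in place without changing results,
    // e.g. to group independent ops so they can run concurrently
    (void) backend;
    (void) cgraph;
}

// In the backend's ggml_backend_i table (see the ggml-blas.cpp hunk below,
// where the field is simply set to NULL):
//     /* .event_record   = */ ...,
//     /* .event_wait     = */ ...,
//     /* .graph_optimize = */ my_backend_graph_optimize,   // or NULL if unused
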
5 changes: 2 additions & 3 deletions src/ggml-backend-reg.cpp
@@ -400,9 +400,8 @@ ggml_backend_t ggml_backend_init_by_type(enum ggml_backend_dev_type type, const

ggml_backend_t ggml_backend_init_best(void) {
ggml_backend_dev_t dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_GPU);
if (!dev) {
dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
}
dev = dev ? dev : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_IGPU);
dev = dev ? dev : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
if (!dev) {
return nullptr;
}
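With the change above, ggml_backend_init_best falls back GPU, then iGPU, then CPU; nothing changes on the caller side. A minimal usage sketch:

#include <cstdio>
#include "ggml-backend.h"

int main() {
    // picks a dedicated GPU first, then an integrated GPU, then the CPU backend
    ggml_backend_t backend = ggml_backend_init_best();
    if (backend == NULL) {
        return 1;
    }
    printf("using backend: %s\n", ggml_backend_name(backend));
    // ... build and compute graphs ...
    ggml_backend_free(backend);
    return 0;
}
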
11 changes: 11 additions & 0 deletions src/ggml-backend.cpp
@@ -463,6 +463,13 @@ void ggml_backend_event_wait(ggml_backend_t backend, ggml_backend_event_t event)
backend->iface.event_wait(backend, event);
}

static void ggml_backend_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
GGML_ASSERT(backend);
if (backend->iface.graph_optimize != NULL) {
backend->iface.graph_optimize(backend, cgraph);
}
}

// Backend device

const char * ggml_backend_dev_name(ggml_backend_dev_t device) {
@@ -1298,6 +1305,10 @@ void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct ggml_cgra
struct ggml_backend_sched_split * split = &sched->splits[i];
split->graph = ggml_graph_view(graph, split->i_start, split->i_end);

// Optimize this split of the graph. This needs to happen before we make graph_copy,
// so they are in sync.
ggml_backend_graph_optimize(sched->backends[split->backend_id], &split->graph);

// add inputs to the graph copy so that they are allocated by ggml-alloc at the start of the split
for (int j = 0; j < split->n_inputs; j++) {
assert(graph_copy->size > (graph_copy->n_nodes + 1));
1 change: 1 addition & 0 deletions src/ggml-blas/ggml-blas.cpp
@@ -270,6 +270,7 @@ static struct ggml_backend_i blas_backend_i = {
/* .graph_compute = */ ggml_backend_blas_graph_compute,
/* .event_record = */ NULL,
/* .event_wait = */ NULL,
/* .graph_optimize = */ NULL,
};

static ggml_guid_t ggml_backend_blas_guid(void) {
62 changes: 38 additions & 24 deletions src/ggml-cann/aclnn_ops.cpp
@@ -2268,26 +2268,30 @@ static void aclnn_index_fill_tensor(ggml_backend_cann_context& ctx,
* stream, and persistent buffers for rope init/cache.
* @param dst The destination ggml_tensor whose computation
* depends on the RoPE values (usually Qcur/Kcur).
* @param sin_tensor_buffer Pre-allocated buffer for storing repeated sin values.
* @param cos_tensor_buffer Pre-allocated buffer for storing repeated cos values.
* @param theta_scale Scalar exponent base for computing theta scale values.
* @param freq_scale Frequency scaling factor, applied to theta scale.
* @param attn_factor Attention scaling factor, applied to sin/cos.
* @param is_neox Whether to use Neox-style repeat strategy
* (dim expansion vs repeat_interleave).
*/
static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
void* sin_tensor_buffer, void* cos_tensor_buffer,
float* corr_dims, float ext_factor,
float theta_scale, float freq_scale,
float attn_factor, bool is_neox) {
// int sin/cos cache, cache has different repeat method depond on
// @param.is_neox

ggml_tensor* src0 = dst->src[0]; // input
ggml_tensor* src1 = dst->src[1]; // position
ggml_tensor* src2 = dst->src[2]; // freq_factors

if(src2 == nullptr && ctx.rope_cache.cached
&& ctx.rope_cache.ext_factor == ext_factor
&& ctx.rope_cache.theta_scale == theta_scale
&& ctx.rope_cache.freq_scale == freq_scale
&& ctx.rope_cache.attn_factor == attn_factor
&& ctx.rope_cache.is_neox == is_neox) {
// use cache.
return;
}

int64_t theta_scale_length = src0->ne[0] / 2;
int64_t theta_scale_ne[] = {theta_scale_length, 1, 1, 1};
size_t theta_scale_nb[] = {sizeof(float), sizeof(float), sizeof(float),
@@ -2316,8 +2320,6 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
ctx.rope_cache.freq_scale != freq_scale) {

ctx.rope_cache.theta_scale_length = theta_scale_length;
ctx.rope_cache.theta_scale = theta_scale;
ctx.rope_cache.freq_scale = freq_scale;

if (ctx.rope_cache.theta_scale_cache != nullptr) {
ACL_CHECK(aclrtFree(ctx.rope_cache.theta_scale_cache));
@@ -2342,7 +2344,7 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
// return MIN(1, MAX(0, y)) - 1;
yarn_ramp_allocator.alloc(theta_scale_length * sizeof(float));
void* yarn_ramp_buffer = yarn_ramp_allocator.get();
acl_yarn_ramp_tensor = ggml_cann_create_tensor(yarn_ramp_buffer, ACL_FLOAT, sizeof(float_t),
acl_yarn_ramp_tensor = ggml_cann_create_tensor(yarn_ramp_buffer, ACL_FLOAT, sizeof(float),
theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
float zero_value = 0, one_value = 1;
float denom_safe_value = MAX(0.001f, corr_dims[1] - corr_dims[0]);
@@ -2411,6 +2413,20 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
ggml_cann_release_resources(ctx, acl_freq_factors_tensor, acl_freq_fac_res_tensor);
}

// init sin_repeat && cos_repeat, only to accelerate first layer on each device
if (position_length > ctx.rope_cache.position_length) {
ctx.rope_cache.position_length = position_length;
if (ctx.rope_cache.sin_cache != nullptr) {
ACL_CHECK(aclrtFree(ctx.rope_cache.sin_cache));
}
if (ctx.rope_cache.cos_cache != nullptr) {
ACL_CHECK(aclrtFree(ctx.rope_cache.cos_cache));
}
int64_t repeat_theta_length = theta_scale_length * position_length * 2;
ACL_CHECK(aclrtMalloc(&ctx.rope_cache.sin_cache, repeat_theta_length * sizeof(float), ACL_MEM_MALLOC_HUGE_FIRST));
ACL_CHECK(aclrtMalloc(&ctx.rope_cache.cos_cache, repeat_theta_length * sizeof(float), ACL_MEM_MALLOC_HUGE_FIRST));
}

// position
aclTensor* acl_position_tensor = ggml_cann_create_tensor(
src1->data, ggml_cann_type_mapping(src1->type),
@@ -2462,10 +2478,10 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
sin_reshape_nb[i] = sin_reshape_nb[i - 1] * sin_reshape_ne[i - 1];
}
aclTensor* acl_sin_repeat_tensor =
ggml_cann_create_tensor(sin_tensor_buffer, ACL_FLOAT, sizeof(float),
ggml_cann_create_tensor(ctx.rope_cache.sin_cache, ACL_FLOAT, sizeof(float),
sin_reshape_ne, sin_reshape_nb, GGML_MAX_DIMS);
aclTensor* acl_cos_repeat_tensor =
ggml_cann_create_tensor(cos_tensor_buffer, ACL_FLOAT, sizeof(float),
ggml_cann_create_tensor(ctx.rope_cache.cos_cache, ACL_FLOAT, sizeof(float),
sin_reshape_ne, sin_reshape_nb, GGML_MAX_DIMS);

// repeat
@@ -2483,6 +2499,14 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
num_repeats, output_size);
}

// Other layers use cache except first layer.
ctx.rope_cache.cached = true;
ctx.rope_cache.ext_factor = ext_factor;
ctx.rope_cache.theta_scale = theta_scale;
ctx.rope_cache.freq_scale = freq_scale;
ctx.rope_cache.attn_factor = attn_factor;
ctx.rope_cache.is_neox = is_neox;

ggml_cann_release_resources(ctx, acl_theta_scale_tensor, acl_position_tensor,
acl_theta_tensor, acl_sin_tensor, acl_sin_repeat_tensor, acl_cos_tensor,
acl_cos_repeat_tensor);
Expand All @@ -2504,10 +2528,7 @@ aclnnStatus aclnnRotaryPositionEmbedding(void* workspace,
#endif

void ggml_cann_rope(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
// TODO: use ascendc
// Only test with LLAMA model.
ggml_tensor* src0 = dst->src[0]; // input
ggml_tensor* src1 = dst->src[1];

// param
float freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow;
@@ -2538,15 +2559,8 @@ void ggml_cann_rope(ggml_backend_cann_context& ctx, ggml_tensor* dst) {

const bool is_neox = mode & GGML_ROPE_TYPE_NEOX;

// sin/cos tensor length.
int64_t repeat_theta_length = src0->ne[0] * src1->ne[0];
ggml_cann_pool_alloc sin_tensor_allocator(ctx.pool(), repeat_theta_length * sizeof(float));
ggml_cann_pool_alloc cos_tensor_allocator(ctx.pool(), repeat_theta_length * sizeof(float));
void *sin_tensor_buffer = sin_tensor_allocator.get();
void *cos_tensor_buffer = cos_tensor_allocator.get();

// init ctx.rope_cos/rope_sin cache
aclnn_cache_init(ctx, dst, sin_tensor_buffer, cos_tensor_buffer, corr_dims, ext_factor,
aclnn_cache_init(ctx, dst, corr_dims, ext_factor,
theta_scale, freq_scale, attn_factor, is_neox);

int64_t sin_reshape_ne[4] = {ne00, 1, ne02, 1};
@@ -2556,10 +2570,10 @@ void ggml_cann_rope(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
sin_reshape_nb[i] = sin_reshape_nb[i - 1] * sin_reshape_ne[i - 1];
}
aclTensor* acl_sin_reshape_tensor =
ggml_cann_create_tensor(sin_tensor_buffer, ACL_FLOAT, sizeof(float),
ggml_cann_create_tensor(ctx.rope_cache.sin_cache, ACL_FLOAT, sizeof(float),
sin_reshape_ne, sin_reshape_nb, GGML_MAX_DIMS);
aclTensor* acl_cos_reshape_tensor =
ggml_cann_create_tensor(cos_tensor_buffer, ACL_FLOAT, sizeof(float),
ggml_cann_create_tensor(ctx.rope_cache.cos_cache, ACL_FLOAT, sizeof(float),
sin_reshape_ne, sin_reshape_nb, GGML_MAX_DIMS);

aclTensor* acl_src = ggml_cann_create_tensor(src0);
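For readability, the parameter-keyed caching pattern the CANN hunks above introduce (compute sin/cos once, reuse them while the RoPE parameters stay the same, reallocate only when the position length grows) can be restated as a standalone sketch. The struct and field names mirror ctx.rope_cache but are illustrative, not the actual ggml-cann types.

#include <cstdint>

// Illustrative restatement of the rope sin/cos cache guard from aclnn_cache_init.
struct rope_cache_sketch {
    bool    cached          = false;
    float   ext_factor      = 0.0f;
    float   theta_scale     = 0.0f;
    float   freq_scale      = 0.0f;
    float   attn_factor     = 0.0f;
    bool    is_neox         = false;
    int64_t position_length = 0;
    void *  sin_cache       = nullptr;
    void *  cos_cache       = nullptr;
};

static bool can_reuse(const rope_cache_sketch & c, bool has_freq_factors,
                      float ext_factor, float theta_scale, float freq_scale,
                      float attn_factor, bool is_neox) {
    // freq_factors (src2) force a recompute; otherwise all parameters must match
    return !has_freq_factors && c.cached &&
           c.ext_factor  == ext_factor  &&
           c.theta_scale == theta_scale &&
           c.freq_scale  == freq_scale  &&
           c.attn_factor == attn_factor &&
           c.is_neox     == is_neox;
}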