sync : ggml #2528

ggerganov · 2024-10-31T20:32:49Z

No description provided.

This commit removes the buffer_id field from the leaf_alloc struct. The motivation is that this field is only written to and never read or used, as far as I can tell. Each tensor_alloc has a buffer_id field, which is what prompted me to look more closely at what the buffer_id in leaf_alloc was used for.

* Single allocation of encode_async block with non-ARC capture in ggml-metal.m
* Moving Block_release to the deallocation code
* Release encode block when re-setting encoding buffer count if needed
* Update ggml/src/ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <[email protected]>

* ggml : add metal backend registry / device ggml-ci
* metal : fix names [no ci]
* metal : global registry and device instances ggml-ci
* cont : alternative initialization of global objects ggml-ci
* llama : adapt to backend changes ggml-ci
* fixes
* metal : fix indent
* metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci
* fix merge
* metal : avoid unnecessary singleton accesses ggml-ci
* metal : minor fix [no ci]
* metal : g_state -> g_ggml_ctx_dev_main [no ci]
* metal : avoid reference of device context in the backend context ggml-ci
* metal : minor [no ci]
* metal : fix maxTransferRate check
* metal : remove transfer rate stuff

---------

Co-authored-by: slaren <[email protected]>

* docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android

…a/9752) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers

* ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it

* mtgpu: add docker image support Signed-off-by: Xiaodong Ye <[email protected]> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>

* rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server

* ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print

* Vectorize load instructions in dmmv f16 CUDA kernel

  Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu

  Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>

Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.

…/9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS
* fix: switch to `posix_memalign` to keep existing `free()` usages work
* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS
* style: formatting
* fix: move const outside of `#ifndef`
* style: formatting
* fix: unused var
* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`
* fix: unused var
* fix: page align to `GGUF_DEFAULT_ALIGNMENT`
* fix: page align to `TENSOR_ALIGNMENT`
* fix: convert `TENSOR_ALIGNMENT` to a macro
* fix: increase page size to `32` on iOS
* fix: iOS page size
* fix: `hbw_posix_memalign` alignment

* vulkan : add backend registry / device interfaces * llama : print devices used on model load

add intel amx isa detection
add vnni kernel for gemv cases
add vnni and amx kernel support for block_q8_0
code cleanup
fix packing B issue
enable openmp
fine tune amx kernel
switch to aten parallel pattern
add error message for nested parallelism
code cleanup
add f16 support in ggml-amx
add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS
update CMakeList
update README
fix some compilation warning
fix compiler warning when amx is not enabled
minor change
ggml-ci
move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp
ggml-ci
update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16
ggml-ci
add amx as an ggml-backend
update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h
minor change
update CMakeLists.txt
minor change
apply weight prepacking in set_tensor method in ggml-backend
fix compile error
ggml-ci
minor change
ggml-ci
update CMakeLists.txt
ggml-ci
add march dependency
minor change
ggml-ci
change ggml_backend_buffer_is_host to return false for amx backend
ggml-ci
fix supports_op
use device reg for AMX backend
ggml-ci
minor change
ggml-ci
minor change
fix rebase
set .buffer_from_host_ptr to be false for AMX backend

* implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp

* rpc : refactor backend Use structs for RPC request/response messages * rpc : refactor server

Co-authored-by: arthw <[email protected]>

ggml-ci

* [CANN] Adapt to dynamically loadable backends mechanism * Fix the Bug: inference running result is garbled in debug running model for LM models whose type is Q4_0 class * Handle the review comments of this pull request

* add pool_2d Signed-off-by: Junhee Yoo <[email protected]>
* fix im2col and add unittest for N>=1024 Signed-off-by: Junhee Yoo <[email protected]>
* add tests for N % 1024 != 0 Signed-off-by: Junhee Yoo <[email protected]>
* remove trailing whitespaces Signed-off-by: Junhee Yoo <[email protected]>
* apply suggestions Signed-off-by: Junhee Yoo <[email protected]>
* apply more optimization - original IM2COL kernel + _ext with MIN() Signed-off-by: Junhee Yoo <[email protected]>
* apply review: change kernel name of pool_2d Signed-off-by: Junhee Yoo <[email protected]>
* apply review Signed-off-by: Junhee Yoo <[email protected]>
* fix more formatting and enhance readability Signed-off-by: Junhee Yoo <[email protected]>

---------

Signed-off-by: Junhee Yoo <[email protected]>

Co-authored-by: bssrdf <[email protected]>

* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code

* metal : support permuted matrix multiplications ggml-ci * cont : use nb01 directly for row steps ggml-ci * cont : add comments [no ci] * metal : minor refactor * metal : minor

slaren · 2024-10-31T21:00:11Z

src/whisper.cpp

+    // TODO: temporary call to force backend registry initialization
+    WHISPER_LOG_INFO("%s: backends   = %zu\n", __func__, ggml_backend_reg_count());


Out of curiosity, is this a workaround for an initialization issue in the metal backend, or something related to the backend registry?

I think all backends would currently fail because during loading the model we reference the backend buffer type:

whisper.cpp/src/whisper.cpp

Lines 1802 to 1807 in bc763c1

    // allocate tensors in the backend buffers
    model.buffer = ggml_backend_alloc_ctx_tensors_from_buft(model.ctx, whisper_default_buffer_type(wctx.params));
    if (!model.buffer) {
        WHISPER_LOG_ERROR("%s: failed to allocate memory for the model\n", __func__);
        return false;
    }

But at that point the ggml_backend_registry has not been initialized yet; that only happens when the first whisper_state is created:

whisper.cpp/src/whisper.cpp

Lines 3326 to 3334 in bc763c1

    struct whisper_state * whisper_init_state(whisper_context * ctx) {
        whisper_state * state = new whisper_state;

        state->backends = whisper_backend_init(ctx->params);
        if (state->backends.empty()) {
            WHISPER_LOG_ERROR("%s: whisper_backend_init() failed\n", __func__);
            whisper_free_state(state);
            return nullptr;
        }

I think after whisper_default_buffer_type() is reimplemented to use the backend registry, the problem will disappear and this workaround can be removed.
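To illustrate the ordering problem described above, here is a minimal sketch. It assumes a registry that is constructed lazily on first access and sets up device state as a side effect. Every name in it (`toy_registry`, `toy_get_registry`, `toy_backend_reg_count`, `toy_load_model`, `toy_device_ready`) is hypothetical and is not part of ggml or whisper.cpp; the actual initialization logic in ggml may well differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// All names below are hypothetical stand-ins, not ggml or whisper.cpp APIs.

// Device state that only becomes valid once the registry has been constructed.
bool toy_device_ready = false;

struct toy_registry {
    std::vector<std::string> names;

    toy_registry() {
        names.push_back("toy-cpu");
        toy_device_ready = true; // side effect of registry construction
    }
};

// Lazily constructed on the first call (function-local static).
toy_registry & toy_get_registry() {
    static toy_registry reg;
    return reg;
}

// Asking for the number of registered backends forces the registry
// to be constructed, which is what makes it usable as a workaround.
std::size_t toy_backend_reg_count() {
    return toy_get_registry().names.size();
}

// Model loading that depends on the device state but never touches the registry.
bool toy_load_model() {
    return toy_device_ready;
}
```

In this sketch, `toy_load_model()` fails if it runs before anything has touched the registry, and succeeds once `toy_backend_reg_count()` has been called. Routing the load path through the registry itself would remove the need for the extra call.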

danbev and others added 30 commits October 31, 2024 22:18

Update building for Android (llama/9672)

456016b

* docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android

ggml : add backend registry / device interfaces to BLAS backend (llam…

e9ed1a6

…a/9752) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers

musa: add docker image support (llama/9685)

e493b68

* mtgpu: add docker image support Signed-off-by: Xiaodong Ye <[email protected]> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>

rpc : add backend registry / device interfaces (llama/9812)

5dde62c

* rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server

ggml : move more prints to the ggml log system (llama/9839)

dbb264b

* ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print

Fix cann compilation error (llama/9891)

eed9509

Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.

CUDA: fix 1D im2col, add tests (ggml/993)

3c1d3e4

fix: allocating CPU buffer with size 0 (llama/9917)

3017ef0

vulkan : add backend registry / device interfaces (llama/9721)

e2a2660

* vulkan : add backend registry / device interfaces * llama : print devices used on model load

Add SYCL Backend registry, device and Event Interfaces (llama/9705)

71d0e18

* implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp

rpc : backend refactoring (llama/9912)

e6d7dbc

* rpc : refactor backend Use structs for RPC request/response messages * rpc : refactor server

fix mul_mat_vec_q and *_vec_q error (llama/9939)

cd24c26

Co-authored-by: arthw <[email protected]>

rpc : pack only RPC structs (llama/9959)

1d0c577

ggml : add asserts for type conversion in fattn kernels (llama/9971)

9b3a2da

ggml-ci

Adapt to dynamically loadable backends mechanism (llama/9970)

db26898

* [CANN] Adapt to dynamically loadable backends mechanism * Fix the Bug: inference running result is garbled in debug running model for LM models whose type is Q4_0 class * Handle the review comments of this pull request

increase cuda_cpy block size (ggml/996)

d7ea6cf

Co-authored-by: bssrdf <[email protected]>

CUDA: fix MMQ for non-contiguous src0, add tests (llama/10021)

1ac1152

* CUDA: fix MMQ for non-contiguous src0, add tests * revise test code

CUDA: fix insufficient buffer clearing for MMQ (llama/10032)

8a81531

metal : support permuted matrix multiplications (llama/10033)

3481ab5

* metal : support permuted matrix multiplications ggml-ci * cont : use nb01 directly for row steps ggml-ci * cont : add comments [no ci] * metal : minor refactor * metal : minor

ggml : add AMX backend (llama/8998)

cc28564

sync : ggml

9c7653c

talk-llama : sync llama.cpp

d17cdb3

whisper : backend registry init before model load

bc763c1

slaren reviewed Oct 31, 2024


ggerganov merged commit 0377596 into master Nov 1, 2024
87 of 89 checks passed

ggerganov deleted the sync branch November 1, 2024 08:19
