sync : ggml #2528
Conversation
This commit removes the `buffer_id` field from the `leaf_alloc` struct. The motivation is that, as far as I can tell, this field is only ever written to and never read or used. Each `tensor_alloc` has its own `buffer_id` field, and this is what caused me to look into this more closely, to understand what the `buffer_id` in `leaf_alloc` was used for.
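For context, a minimal sketch of the two structs involved (layout paraphrased from ggml-alloc.c, not verbatim):

```cpp
// sketch of the structs in ggml-alloc.c (paraphrased, not verbatim)
struct tensor_alloc {
    int    buffer_id;    // kept: consulted when the tensor is allocated
    size_t offset;
    size_t size_max;
};

struct leaf_alloc {
    // int buffer_id;    // removed by this commit: it was only ever
    //                   // written, never read back
    struct tensor_alloc leaf;  // already carries its own buffer_id
};
```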
* Single allocation of encode_async block with non-ARC capture in ggml-metal.m
* Moving Block_release to the deallocation code
* Release encode block when re-setting encoding buffer count if needed
* Update ggml/src/ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* ggml : add metal backend registry / device ggml-ci
* metal : fix names [no ci]
* metal : global registry and device instances ggml-ci
* cont : alternative initialization of global objects ggml-ci
* llama : adapt to backend changes ggml-ci
* fixes
* metal : fix indent
* metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci
* fix merge
* metal : avoid unnecessary singleton accesses ggml-ci
* metal : minor fix [no ci]
* metal : g_state -> g_ggml_ctx_dev_main [no ci]
* metal : avoid reference of device context in the backend context ggml-ci
* metal : minor [no ci]
* metal : fix maxTransferRate check
* metal : remove transfer rate stuff

---------

Co-authored-by: slaren <[email protected]>
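To illustrate what the registry/device interfaces make possible, here is a small sketch against the public ggml-backend API (not code from this PR): enumerating every device known to the global registry.

```cpp
#include "ggml-backend.h"
#include <cstdio>

int main() {
    // walk all devices registered by the loaded backends
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s (%s)\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```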
* docs : clarify building Android on Termux
* docs : update building Android on Termux
* docs : add cross-compiling for Android
* cmake : link dl explicitly for Android
…a/9752)

* ggml : add backend registry / device interfaces to BLAS backend
* fix mmap usage when using host buffers
* ggml : do not use BLAS with types without to_float
* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies
* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits (it's not really internal if everybody uses it)
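A hedged sketch of how the renamed accessor ties into the BLAS change, assuming the post-rename pointer-returning signature:

```cpp
#include "ggml.h"

// decide whether a type can go through the BLAS path:
// types without a to_float conversion are skipped
bool can_use_blas_for(enum ggml_type type) {
    const struct ggml_type_traits * traits = ggml_get_type_traits(type);
    return traits->to_float != NULL;
}
```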
* mtgpu: add docker image support
* mtgpu: enable docker workflow

---------

Signed-off-by: Xiaodong Ye <[email protected]>
* rpc : add backend registry / device interfaces
* llama : add llama_supports_rpc API
* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
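The new capability check lets callers test for RPC support at runtime; a minimal usage sketch:

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    if (llama_supports_rpc()) {
        printf("this build can offload work to an RPC server\n");
    } else {
        printf("RPC backend not compiled in\n");
    }
    return 0;
}
```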
* ggml : move more prints to the ggml log system
* show BLAS OpenMP warnings in all builds using debug print
* Vectorize load instructions in dmmv f16 CUDA kernel

  Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu

---------

Co-authored-by: Johannes Gäßler <[email protected]>
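The principle, sketched in plain C++ rather than the actual CUDA kernel (illustrative only): replace several narrow loads with one wider load so the memory system issues fewer, larger transactions.

```cpp
#include <cstdint>
#include <cstring>

// scalar: two separate 16-bit loads
static inline void load_scalar(const uint16_t * x, uint16_t & a, uint16_t & b) {
    a = x[0];
    b = x[1];
}

// "vectorized": one 32-bit load carrying both fp16 bit patterns
static inline void load_vector(const uint16_t * x, uint16_t & a, uint16_t & b) {
    uint32_t packed;
    memcpy(&packed, x, sizeof(packed)); // single wider load
    a = (uint16_t)(packed & 0xFFFF);
    b = (uint16_t)(packed >> 16);
}
```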
Fix CANN compilation error after merging llama.cpp's support for dynamically loadable backends.
…/9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS
* fix: switch to `posix_memalign` to keep existing `free()` usages working
* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS
* style: formatting
* fix: move const outside of `#ifndef`
* style: formatting
* fix: unused var
* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`
* fix: unused var
* fix: page align to `GGUF_DEFAULT_ALIGNMENT`
* fix: page align to `TENSOR_ALIGNMENT`
* fix: convert `TENSOR_ALIGNMENT` to a macro
* fix: increase page size to `32` on iOS
* fix: iOS page size
* fix: `hbw_posix_memalign` alignment
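A simplified sketch of the allocation approach described above (function name hypothetical; the real implementation also covers `vm_allocate` on macOS and `hbw_posix_memalign`). The point of `posix_memalign` is that the returned pointer stays compatible with plain `free()`.

```cpp
#include <cstddef>
#include <cstdlib>

// hypothetical sketch: aligned allocation that existing free() calls
// can still release, which is why posix_memalign is used instead of
// malloc plus manual pointer adjustment
static void * aligned_malloc_sketch(size_t size, size_t alignment) {
    void * ptr = nullptr;
    // alignment must be a power of two and a multiple of sizeof(void *)
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr; // release with free()
}
```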
* vulkan : add backend registry / device interfaces
* llama : print devices used on model load
add intel amx isa detection
add vnni kernel for gemv cases
add vnni and amx kernel support for block_q8_0
code cleanup
fix packing B issue
enable openmp
fine tune amx kernel
switch to aten parallel pattern
add error message for nested parallelism
code cleanup
add f16 support in ggml-amx
add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS
update CMakeList
update README
fix some compilation warning
fix compiler warning when amx is not enabled
minor change ggml-ci
move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci
update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci
add amx as an ggml-backend
update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h
minor change
update CMakeLists.txt
minor change
apply weight prepacking in set_tensor method in ggml-backend
fix compile error ggml-ci
minor change ggml-ci
update CMakeLists.txt ggml-ci
add march dependency
minor change ggml-ci
change ggml_backend_buffer_is_host to return false for amx backend ggml-ci
fix supports_op
use device reg for AMX backend ggml-ci
minor change ggml-ci
minor change
fix rebase
set .buffer_from_host_ptr to be false for AMX backend
* implemented missing SYCL event APIs
* sycl : Added device and backend reg interfaces
* Restructured ggml-sycl.cpp
* rpc : refactor backend

  Use structs for RPC request/response messages

* rpc : refactor server
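A hypothetical example of the struct-per-message style this refactor adopts (names and fields assumed for illustration, not taken from the code): both sides agree on a fixed layout per RPC command instead of serializing fields ad hoc.

```cpp
#include <cstdint>

// hypothetical message pair for a buffer-allocation RPC command
struct rpc_msg_alloc_buffer_req {
    uint64_t size;        // requested buffer size in bytes
};

struct rpc_msg_alloc_buffer_rsp {
    uint64_t remote_ptr;  // handle to the buffer on the server
    uint64_t remote_size; // actual allocated size
};
```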
Co-authored-by: arthw <[email protected]>
* [CANN] Adapt to dynamically loadable backends mechanism
* Fix bug: inference output is garbled in debug builds for LM models whose type is the Q4_0 class
* Handle the review comments of this pull request
* add pool_2d
* fix im2col and add unittest for N>=1024
* add tests for N % 1024 != 0
* remove trailing whitespaces
* apply suggestions
* apply more optimization - original IM2COL kernel + _ext with MIN()
* apply review: change kernel name of pool_2d
* apply review
* fix more formatting and enhance readability

---------

Signed-off-by: Junhee Yoo <[email protected]>
Co-authored-by: bssrdf <[email protected]>
* CUDA: fix MMQ for non-contiguous src0, add tests
* revise test code
* metal : support permuted matrix multiplications ggml-ci
* cont : use nb01 directly for row steps ggml-ci
* cont : add comments [no ci]
* metal : minor refactor
* metal : minor
```cpp
// TODO: temporary call to force backend registry initialization
WHISPER_LOG_INFO("%s: backends = %zu\n", __func__, ggml_backend_reg_count());
```
Out of curiosity, is this a workaround for an initialization issue in the metal backend, or something related to the backend registry?
I think all backends would currently fail, because while loading the model we reference the backend buffer type:
Lines 1802 to 1807 in bc763c1
```cpp
// allocate tensors in the backend buffers
model.buffer = ggml_backend_alloc_ctx_tensors_from_buft(model.ctx, whisper_default_buffer_type(wctx.params));
if (!model.buffer) {
    WHISPER_LOG_ERROR("%s: failed to allocate memory for the model\n", __func__);
    return false;
}
```
But the `ggml_backend_registry` hasn't been initialized yet before creating the first `whisper_state`:
Lines 3326 to 3334 in bc763c1
```cpp
struct whisper_state * whisper_init_state(whisper_context * ctx) {
    whisper_state * state = new whisper_state;

    state->backends = whisper_backend_init(ctx->params);
    if (state->backends.empty()) {
        WHISPER_LOG_ERROR("%s: whisper_backend_init() failed\n", __func__);
        whisper_free_state(state);
        return nullptr;
    }
```
I think after `whisper_default_buffer_type()` is reimplemented to use the backend registry, the problem will disappear and this workaround can be removed.