speech: persistent VkPipelineCache + qvac-speech-ggml hybrid packaging #7
Merged
…INE_CACHE_DIR

Adds an opt-in persistent shader cache to ggml-vulkan. Enabled only when the caller sets GGML_VK_PIPELINE_CACHE_DIR to a non-empty path; when unset or empty behaviour is byte-identical to upstream ggml-vulkan.

No auto-discovery of $XDG_CACHE_HOME or $HOME. ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface.

When enabled, createPipelineCache is seeded from the path at init and getPipelineCacheData is written back from ggml_vk_cleanup() (not ~vk_device_struct which is unreliable at process exit due to shared_ptr ref cycles). File keyed on vendorID/deviceID/driverVersion; Vulkan validates the blob header and silently ignores stale data if the shader bundle or driver changed. Atomic save via tmp+rename.

Recovers ~91% of the cold->warm shader-compile gap on the first warm run on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers).

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-1).

Co-authored-by: Cursor <cursoragent@cursor.com>
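The opt-in gate and the device-keyed file name from this commit message can be sketched as below. This is a minimal illustration, not the patch itself: `device_identity`, `cache_file_name`, `resolve_cache_path`, and the `ggml-vk-` file-name scheme are hypothetical names chosen for the example; only the environment variable, the three key fields, and the `.pcache` extension come from the PR text.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical stand-in for the three device properties the cache file is keyed on.
struct device_identity {
    uint32_t vendor_id;
    uint32_t device_id;
    uint32_t driver_version;
};

// Hypothetical naming scheme: hex-encoded key fields plus the .pcache extension.
inline std::string cache_file_name(const device_identity & id) {
    std::ostringstream name;
    name << std::hex << "ggml-vk-" << id.vendor_id << '-' << id.device_id
         << '-' << id.driver_version << ".pcache";
    return name.str();
}

// Returns a cache path only when the directory value is non-empty.
// A null or empty value means "caching disabled", matching the opt-in contract.
inline std::optional<std::filesystem::path>
resolve_cache_path(const char * dir, const device_identity & id) {
    if (dir == nullptr || dir[0] == '\0') {
        return std::nullopt;
    }
    return std::filesystem::path(dir) / cache_file_name(id);
}
```

A caller would pass `std::getenv("GGML_VK_PIPELINE_CACHE_DIR")` as `dir` once at device init and keep the result, so the environment is never re-read later.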
Stacks on the previous patch. Writes back the on-disk pipeline-cache blob after every ggml_vk_load_shaders compile batch instead of only at ggml_vk_cleanup() time, so a process killed mid-graph (SIGKILL, abort, OS shutdown) doesn't lose the freshly compiled pipelines.

Adds pipeline_cache_last_size book-keeping so warm runs short-circuit the disk write: the eager path only flushes when the cache actually grew (blob.size() > last_size), and the cleanup path skips when size matches last_size. This avoided a +90 ms WALL regression measured during dev when the flush was unconditional.

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-2).

Co-authored-by: Cursor <cursoragent@cursor.com>
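The two flush conditions are easy to mix up, so here is a small sketch of the bookkeeping as described. The struct and function names are hypothetical; only `pipeline_cache_last_size` and the two comparisons come from the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical holder for the bookkeeping field named in the commit message.
struct pipeline_cache_state {
    std::size_t pipeline_cache_last_size = 0; // size of the blob at the last flush
};

// Eager path (after a shader compile batch): write only if the blob grew.
inline bool should_flush_eager(const pipeline_cache_state & s, std::size_t blob_size) {
    return blob_size > s.pipeline_cache_last_size;
}

// Cleanup path: skip only when the size matches the last flush exactly.
inline bool should_flush_cleanup(const pipeline_cache_state & s, std::size_t blob_size) {
    return blob_size != s.pipeline_cache_last_size;
}

// Record the size after a successful write so warm runs short-circuit.
inline void record_flush(pipeline_cache_state & s, std::size_t blob_size) {
    s.pipeline_cache_last_size = blob_size;
}
```

On a warm run where every pipeline is a cache hit, the blob size stays equal to the recorded size, so both predicates return false and no disk write happens.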
…1-30)

Cherry-pick of 512e177 from the 2026-01-30 branch, with the lib filename prefix swapped from qvac-diffusion- to qvac-speech-:

- drop the GGML_BACKEND_DL requires BUILD_SHARED_LIBS check; static ggml core now coexists with MODULE GPU backends when GGML_CPU_STATIC=ON.
- ggml_add_backend_library skips MODULE for ggml-cpu-* when GGML_CPU_STATIC, so CPU stays in the core .a and only Vulkan/OpenCL/CUDA become .so.
- ggml/ggml-base get POSITION_INDEPENDENT_CODE=ON when GGML_BACKEND_DL is set, so MODULE backends can link the static core.
- ggml gets GGML_USE_CPU compile-define when GGML_CPU_STATIC.
- backend_filename_prefix() defaults to libqvac-speech-ggml- (matches the GGML_LIB_OUTPUT_PREFIX default on this branch).
- ggml-config.cmake.in handles the hybrid mode: exports the static CPU variant target while leaving GPU backends to ggml_backend_load_best at runtime.
- ggml_backend_opencl_init keeps the speech-branch's null-device guard (drop-clean fallback when ggml-opencl rejects all visible devices).
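The prefix selection in `backend_filename_prefix()` can be pictured roughly as follows. This is a sketch under assumptions, not the function's actual body: the `_sketch` names are hypothetical, and the `lib` prefix and `.so` suffix cover only the Linux/Android naming mentioned in this PR.

```cpp
#include <cassert>
#include <string>

// Sketch: the GGML_BACKEND_DL_PROJECT_PREFIX macro wins when the build defines it;
// otherwise the speech-branch default is used.
inline std::string backend_filename_prefix_sketch() {
#ifdef GGML_BACKEND_DL_PROJECT_PREFIX
    return std::string("lib") + GGML_BACKEND_DL_PROJECT_PREFIX + "ggml-";
#else
    return "libqvac-speech-ggml-";
#endif
}

// Hypothetical helper: the file name the loader would look for, for one backend.
inline std::string backend_filename_sketch(const std::string & backend) {
    return backend_filename_prefix_sketch() + backend + ".so";
}
```

Compiled without the macro, `backend_filename_sketch("vulkan")` yields `libqvac-speech-ggml-vulkan.so`, the file name the MODULE build is expected to produce on this branch.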
…orrect overwrite)

Two fixes on top of QVAC-17872 round-1/round-2:

1. Replace C `std::rename` / `std::remove` with `std::filesystem::rename` / `std::filesystem::remove` at both flush sites (cleanup-time flush in ggml_vk_save_pipeline_cache, and the eager flush in ggml_vk_load_shaders). The C runtime `rename` on Windows fails when the destination already exists (per MSDN), which meant the second-and-later saves of the .pcache blob would silently fail and the on-disk cache would never advance past its initial size on Windows. std::filesystem::rename has POSIX overwrite semantics on every platform we target.
2. Build pipeline_cache_path with std::filesystem::path joining instead of `dir + "/" + fname` string concatenation. Avoids mixed-separator surprises if a caller passes a backslash-terminated dir on Windows.

Behaviour-equivalent to round-2 on Linux/macOS; Windows now actually persists subsequent flushes instead of dropping them.

Co-authored-by: Cursor <cursoragent@cursor.com>
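A tmp + rename save built on `std::filesystem` might look like the sketch below. It illustrates the technique rather than reproducing the patch: `save_blob_atomically` and the `.tmp` suffix are hypothetical, and the blob is a plain byte vector standing in for the data returned by `getPipelineCacheData`.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

// Writes the blob to a sibling temp file, then renames it over the destination.
// Readers see either the old file or the complete new one, never a partial write.
inline bool save_blob_atomically(const std::filesystem::path & dest,
                                 const std::vector<uint8_t> & blob) {
    std::filesystem::path tmp = dest;
    tmp += ".tmp"; // hypothetical temp-file suffix

    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) {
            return false;
        }
        out.write(reinterpret_cast<const char *>(blob.data()),
                  static_cast<std::streamsize>(blob.size()));
        out.flush();
        if (!out) {
            out.close();
            std::error_code ignored;
            std::filesystem::remove(tmp, ignored);
            return false;
        }
    } // stream closed here, before the rename

    // Unlike C rename on Windows, this replaces an existing destination file.
    std::error_code ec;
    std::filesystem::rename(tmp, dest, ec);
    if (ec) {
        std::error_code ignored;
        std::filesystem::remove(tmp, ignored);
        return false;
    }
    return true;
}
```

Calling it twice on the same destination exercises the overwrite case that the C `rename` dropped on Windows: the second call should leave the file at the new size.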
jpgaribotti
approved these changes
May 7, 2026
gianni-cor
approved these changes
May 7, 2026
10 tasks
Summary
Sync the `speech` branch with the chatterbox.cpp Vulkan optimizations and finalize the `qvac-speech-ggml-*` packaging convention agreed in `#qvac` (per-addon ggml prefix; speech-branch fork is the speech build).

Five commits, +343/-40, covering:

- … `GGML_LIB_OUTPUT_PREFIX` → `GGML_BACKEND_DL_PROJECT_PREFIX`).
- `VkPipelineCache` opt-in via `GGML_VK_PIPELINE_CACHE_DIR`, plus a crash-safe eager flush, recovering ~91% of the cold→warm shader-compile gap on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers). Tracks JIRA `QVAC-17872`.
- … `2026-01-30` branch, adapted from `qvac-diffusion-` to `qvac-speech-`. Lets the speech ggml ship as a static CPU core with `MODULE` GPU backends so an `.aar`/`.apk`/loose-DLL drop can `dlopen` Vulkan/OpenCL/CUDA at runtime without forcing the whole core dynamic.
- … `std::rename` to `std::filesystem::rename`, which has POSIX overwrite semantics on Windows. Prevents the `.pcache` blob from getting frozen at its first-write size on the second and later saves on Windows.

Commits
- `build: add GGML_LIB_OUTPUT_PREFIX option for per-consumer lib filename prefix`
  Adds the `GGML_LIB_OUTPUT_PREFIX` cache var. When set, ggml libs land on disk as `lib<prefix>ggml-*.{a,so,dll}` and `target_compile_definitions(ggml-base PRIVATE GGML_BACKEND_DL_PROJECT_PREFIX="<prefix>")` is propagated so the runtime loader looks for the same names. CMake target names and the `find_package(ggml CONFIG)` package name are intentionally unchanged.
- `vulkan: persistent VkPipelineCache, explicit opt-in via GGML_VK_PIPELINE_CACHE_DIR` (QVAC-17872 round 1)
  Opt-in only; behaviour is byte-identical to upstream when `GGML_VK_PIPELINE_CACHE_DIR` is unset/empty. Cache file keyed on `vendorID/deviceID/driverVersion`. Save happens from `ggml_vk_cleanup()` (not `~vk_device_struct`, which is unreliable at process exit because pipelines hold `shared_ptr<vk_device_struct>` ref cycles). Atomic save via tmp + rename.
- `vulkan: crash-safe eager pipeline-cache flush` (QVAC-17872 round 2)
  Flushes after every `ggml_vk_load_shaders` compile batch when the cache grew, so a process killed mid-graph doesn't lose freshly compiled pipelines. `pipeline_cache_last_size` book-keeping short-circuits both the eager flush and the cleanup-time flush on warm runs (cache-hit only). Without the short-circuit the unconditional flush regressed warm-run wall by ~90 ms on the chatterbox.cpp benchmark.
- `cmake: support qvac hybrid backend packaging (cherry-pick from 2026-01-30)`
  Cherry-pick of `512e1773` from the `2026-01-30` branch with the prefix swapped from `qvac-diffusion-` to `qvac-speech-`. Adds `GGML_CPU_STATIC` (CPU stays in the core `.a`, only GPU backends become `MODULE` shared libs), drops the `GGML_BACKEND_DL requires BUILD_SHARED_LIBS` check, makes ggml/ggml-base PIC under `GGML_BACKEND_DL`, and adds the Android filename-only `dlopen` fallback for flattened native dirs. Also pulls in the upstream-style unique_ptr conversion in `ggml_backend_vk_reg_get_device` (memory-leak fix; see design notes).
- `vulkan: use std::filesystem for pipeline-cache path/rename (Windows-correct overwrite)`
  Replaces the C `std::rename` / `std::remove` with `std::filesystem::rename` / `std::filesystem::remove` at both flush sites and builds `pipeline_cache_path` via `std::filesystem::path` joining. C `rename` on Windows fails if the destination exists, which meant the second-and-later flushes silently dropped on Windows.

Design notes (preempting common review questions)
These are deliberate choices that look unusual at first glance; calling them out so a re-review doesn't re-litigate them.

Why is `qvac-speech-ggml-` hardcoded as the no-macro fallback in `backend_filename_prefix()`?

Per the team agreement in `#qvac` (Gianfranco / Juan A.), each `*.cpp` addon that vendors its own ggml fork carries its own filename prefix to avoid `dlopen` filename collisions when multiple ggml versions coexist in one process / `.aar` / `.apk`:

- `fabric/ggml` → `libqvac-ggml-*`
- `whispercpp/ggml` (this branch) → `libqvac-speech-ggml-*`
- `diffusion/ggml` → `libqvac-diffusion-ggml-*`

The speech branch is not meant to be used as a generic upstream-equivalent ggml; it is the speech build. Hardcoding the `qvac-speech-ggml-` filename in the no-macro fallback closes a real footgun: a downstream that builds the speech branch with `-DGGML_LIB_OUTPUT_PREFIX=` (empty) but doesn't define `GGML_BACKEND_DL_PROJECT_PREFIX` would otherwise produce `libggml-*.so` files but a loader hunting for `libqvac-speech-ggml-*.so`. Aligning both defaults to `qvac-speech-` makes the branch internally consistent.

The `GGML_BACKEND_DL_PROJECT_PREFIX` macro path is preserved so any downstream that does want to override it (e.g. a future addon vendoring this branch under a different prefix) still can.

Why isn't `GGML_LIB_OUTPUT_PREFIX` baked into `ggml-config.cmake.in`?

Intentional: `find_package(ggml CONFIG)` consumers set it on their own side before `find_package`. The `find_library(NAMES "${GGML_LIB_OUTPUT_PREFIX}ggml" ggml ...)` form gives a clean fallback to the bare name for unprefixed builds. We preferred that over `@PACKAGE_GGML_LIB_OUTPUT_PREFIX@` substitution because:

- … `GGML_BACKEND_DL_PROJECT_PREFIX` for a custom build, naming their own portfile artefacts). Putting one half of the contract in the package config and the other on the consumer side fragmented the convention.
- … `qvac-speech-ggml`-style port names, so the package config never sees a "wrong" prefix in practice.
- `GGML_MAX_NAME` pass-through in `ggml-config.cmake.in` follows the same opt-in shape; same rationale, kept symmetric.

Why is the unique_ptr refactor in `ggml_backend_vk_reg_get_device` bundled in the cmake commit?

It's a clean cherry-pick of the same upstream change from the `2026-01-30` branch and was part of `512e1773`, not a separate commit there. The new code is a real memory-leak fix: the previous `devices.push_back(new ggml_backend_device {...})` and `new ggml_backend_vk_device_context` raw allocations were never freed (the static `std::vector<ggml_backend_dev_t>` held only raw pointers and is never torn down). Splitting it out from this PR would mean a follow-up cherry-pick that diverges the file from `2026-01-30` for no benefit. Calling it out here so it's not lost.

Why is the OpenCL `qvac-parakeet patch:` comment removed?

The comment described why `ggml_backend_opencl_init` returns `nullptr` instead of asserting on a zero-device list, but the behavior (the null-device guard) is preserved. The cherry-pick from `2026-01-30` additionally hardens `ggml_backend_opencl_reg_device_get` to also return `nullptr` when no devices exist, instead of asserting; with the guard now in two places, the original single-site comment was stale. The functional contract (`ggml_backend_opencl_init` may return `nullptr` and callers must fall back to CPU) is unchanged.

Why are the two pipeline-cache flush paths near-duplicates?

The cleanup-time flush in `ggml_vk_save_pipeline_cache` and the eager flush in `ggml_vk_load_shaders` look similar but differ in one important way: the eager path requires growth (`blob.size() > pipeline_cache_last_size`) before writing, whereas the cleanup path writes when size differs at all. Folding them into a single `save(require_growth)` helper saves ~10 lines but couples two call sites with subtly different correctness invariants. Left as-is for now; happy to refactor in a follow-up if a third call site shows up.

Why is `getenv("GGML_VK_PIPELINE_CACHE_DIR")` safe?

Called once per device init, in `ggml_vk_get_device`, before any worker threads exist. The result is captured into `device->pipeline_cache_path` and never re-read. No thread-safety concern.

Why no auto-discovery of `$XDG_CACHE_HOME` / `$HOME`?

ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface. Apps that want default-on caching can set the env var in their bootstrap.
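For an application that does want default-on caching, the bootstrap step could look like the sketch below. `enable_vk_pipeline_cache` is a hypothetical helper, not part of ggml; the only name taken from this PR is the environment variable. It assumes POSIX `setenv` or, on Windows, `_putenv_s`.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdlib.h> // setenv (POSIX) / _putenv_s (Windows)
#include <string>

// Sets GGML_VK_PIPELINE_CACHE_DIR for this process unless it is already set.
// Returns true when the variable ends up set, false on an empty dir or failure.
inline bool enable_vk_pipeline_cache(const std::string & cache_dir) {
    if (cache_dir.empty()) {
        return false; // empty means "disabled"; leave the environment untouched
    }
    const char * existing = std::getenv("GGML_VK_PIPELINE_CACHE_DIR");
    if (existing != nullptr && existing[0] != '\0') {
        return true; // respect a value the user already provided
    }
#ifdef _WIN32
    return _putenv_s("GGML_VK_PIPELINE_CACHE_DIR", cache_dir.c_str()) == 0;
#else
    return setenv("GGML_VK_PIPELINE_CACHE_DIR", cache_dir.c_str(), 1) == 0;
#endif
}
```

Since the variable is read once at device init, the helper would need to run early in `main`, before the Vulkan backend is initialised and before any threads are started (modifying the environment is not thread-safe).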
Test plan
- `chatterbox.cpp` benchmark, env var set → ~91% of cold compile gap recovered. Env var unset → byte-identical to upstream timing.
- … `.pcache` writes, `device->pipeline_cache == VK_NULL_HANDLE`, `createComputePipeline` takes `VK_NULL_HANDLE`).
- … `.pcache` size grows across multiple eager flushes and the cleanup flush in the same process; without it, second flush silently drops.
- `MODULE` build links, `dlopen` finds `libqvac-speech-ggml-vulkan.so` from the flattened APK native dir via the new filename-only fallback.
- `ggml_backend_opencl_init()` returns `nullptr`, `ggml_backend_opencl_reg_device_get()` returns `nullptr` instead of asserting; caller falls back to CPU.
- `qvac-speech-ggml`: `find_package(ggml CONFIG)` resolves `libqvac-speech-ggml.a` / `libqvac-speech-ggml-base.a` and the `MODULE` GPU backends.
- … `GGML_LIB_OUTPUT_PREFIX=qvac-speech-` (default on this branch) and with `GGML_LIB_OUTPUT_PREFIX=` (explicit empty).