QVAC-19254 tts-cpp: Supertonic + Chatterbox/S3Gen GPU sched for Adreno OpenCL #36
…atterbox/S3Gen)
Route Supertonic and Chatterbox/S3Gen GPU graphs through ggml_backend_sched so ops the GPU backend cannot run (CONV_TRANSPOSE_1D in the HiFT vocoder; the CPU-only GGML_OP_CUSTOM kernels in the Supertonic vector estimator/vocoder) are routed to CPU instead of asserting.
Capability-gate the Chatterbox HiFT scheduler: a backend that runs every op in the graph (Metal, CUDA, CPU) computes directly on the primary backend; only a backend missing an op (Adreno OpenCL / Vulkan) uses the [GPU,CPU] scheduler. The gate queries ggml_backend_supports_op per node, so it is generic and does not regress iOS Metal (which supports CONV_TRANSPOSE_1D natively and otherwise aborts in the scheduler's graph-split).
Gate Android GPU selection to Qualcomm Adreno: other Android GPU vendors are unvalidated and at least one (ARM Mali) aborts the host process uncatchably from graph compute, so non-Adreno devices fall through to CPU. parse_adreno_version handles the OpenCL device-name string (e.g. 'OpenCL 3.0 Adreno(TM) 740') by scanning every marker for the real model number.
Also expose the pre-existing S3Gen mel/encoder/CFM intermediate dump via the --dump-mel-path CLI flag.
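The capability gate described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: `Op`, `Backend` and this standalone `primary_runs_all` function are hypothetical stand-ins, and the real gate is described as calling ggml_backend_supports_op on each graph node.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-ins for ggml ops and backends (assumption for this
// sketch; the real check queries ggml_backend_supports_op per node).
enum class Op { MatMul, Add, ConvTranspose1D, Custom };

struct Backend {
    std::set<Op> supported;  // ops this backend can execute
    bool supports_op(Op op) const { return supported.count(op) != 0; }
};

// True when the primary backend can run every node in the graph, so the
// graph can be computed directly without a [GPU, CPU] scheduler.
bool primary_runs_all(const Backend& primary, const std::vector<Op>& graph) {
    for (Op op : graph) {
        if (!primary.supports_op(op)) {
            return false;  // at least one op must be routed elsewhere
        }
    }
    return true;
}
```

A caller would take the direct path when the function returns true and fall back to the scheduler otherwise, which is why a backend with full op coverage never touches the scheduler.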
Updated run after resolving comments from @freddy311082: https://github.com/tetherto/qvac/actions/runs/26874476789/job/79260173753?pr=2320
78ddab9
Potential regression in Before this change,
const bool build_graph = true;
...
if (primary_runs_all) {
    hift_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
    ggml_gallocr_reserve(hift_allocr, gf);
    ggml_gallocr_alloc_graph(hift_allocr, gf);
} else {
    s3gen_sched_alloc(m, gf);
}
...
if (hift_allocr) ggml_gallocr_free(hift_allocr);
The rebuild is justified for the scheduler path because
Co-authored-by: pratiknarola-t <pratiknarola-t@users.noreply.github.com>
freddy311082
left a comment
Thanks. The scheduler-based routing and the per-node ggml_backend_supports_op capability gate on the HiFT path look solid, and the latest commit addresses the HiFT graph-cache concern. Two points before I re-approve: (1) the is_qualcomm_adreno AND condition appears to break Adreno selection via Vulkan on Android; (2) the Supertonic graph caches still rebuild unconditionally on every call, which is the same regression we just fixed for HiFT, but the fix has not been applied here. Details inline.
…wlisted
The AND required both 'adreno' and 'qualcomm' in the device name/desc, but ggml-vulkan reports deviceName 'Adreno (TM) 740' (no 'qualcomm') with name 'Vulkan0', so an Adreno selected via Vulkan failed the gate and fell back to CPU. Matching 'adreno' alone is sufficient: it appears in both the OpenCL ('QUALCOMM Adreno(TM)') and Vulkan ('Adreno (TM) 740') strings. Reverts 78ddab9.
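A minimal sketch of the 'adreno'-only match described above. The helper name `contains_ci` mirrors the one quoted later in the review, but its body here is an assumption, and `is_adreno_device` is a hypothetical name for illustration rather than the project's function.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case-insensitive substring test (body is an assumption for this sketch).
bool contains_ci(const std::string& haystack, const std::string& needle) {
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(),
                          [](unsigned char a, unsigned char b) {
                              return std::tolower(a) == std::tolower(b);
                          });
    // An empty needle is treated as always contained.
    return it != haystack.end() || needle.empty();
}

// Hypothetical gate: accept the device when either string mentions
// "adreno", so both the OpenCL and the Vulkan naming styles pass.
bool is_adreno_device(const std::string& name, const std::string& desc) {
    return contains_ci(name, "adreno") || contains_ci(desc, "adreno");
}
```

With the device strings quoted in the commit message, both `("Vulkan0", "Adreno (TM) 740")` and a name containing `"QUALCOMM Adreno(TM)"` pass, while a device whose strings mention neither falls through to CPU.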
Mirror the HiFT graph-cache fix (83a9a38) in the Supertonic vector estimator and vocoder. run_text_attention_cache, run_group_graph_cache, run_res_style_qkv_cache, run_tail_graph_cache and the vocoder forward rebuilt the graph and re-reserved via the scheduler on every denoise step. Each builder now reuses its shape-keyed graph when the cache already holds one built on the direct path, and each runner takes the direct gallocr + primary-backend compute when ggml_backend_supports_op covers every node, falling back to the scheduler only when an op must route to CPU. The scheduler path leaves the cached allocr null so it keeps rebuilding (its alloc_graph mutates node->src[]); the direct path reuses graph + allocr across steps. Output is bit-identical (allocation-only change).
@freddy311082 resolved the comments. Updated CI run: https://github.com/tetherto/qvac/actions/runs/26934665625/job/79462807049?pr=2320
One additional issue in the latest direct-path Supertonic cache changes: the new Example from
if (direct) {
    if (!cache.allocr) {
        cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
        ggml_gallocr_reserve(cache.allocr, cache.gf);
    }
    ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
}
Before this PR, the Supertonic cache builders checked Can we restore the old failure checks around each new direct-path
The direct-path graph-cache reuse added in 174f47d calls ggml_gallocr_new and ggml_gallocr_reserve without checking failure before ggml_gallocr_alloc_graph, so an allocation failure would proceed with a null or unreserved allocator instead of throwing. The scheduler fallback (supertonic_sched_alloc) already throws; the new direct path did not. Add the null/reserve checks the rest of the Supertonic code already uses (e.g. supertonic_text_encoder.cpp) at all five direct-path sites: run_text_attention_cache, run_group_graph_cache, run_res_style_qkv_cache, run_tail_graph_cache and supertonic_vocoder_forward_ggml. The ggml_gallocr_alloc_graph call is left unchecked to match that idiom. Allocation-only: full synth output is bit-identical before and after.
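The check-before-use idiom this commit describes could look roughly like the sketch below. Every type and function here (`Allocr`, `Graph`, `Cache`, `allocr_new`, `allocr_reserve`, `allocr_free`, `ensure_direct_allocr`) is a hypothetical stand-in so the example compiles on its own; it only models the null and reserve checks named in the commit message and makes no claim about how the ggml allocator reports failure.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for the allocator and graph (assumptions).
struct Allocr { bool reserved = false; };
struct Graph { bool too_big = false; };

Allocr* allocr_new() { return new Allocr{}; }
bool allocr_reserve(Allocr* a, const Graph& g) {
    if (g.too_big) return false;  // simulated reservation failure
    a->reserved = true;
    return true;
}
void allocr_free(Allocr* a) { delete a; }

struct Cache {
    Allocr* allocr = nullptr;  // reused across denoise steps once set
    Graph gf;
};

// Create and reserve the cached allocator once. Throw instead of
// continuing with a null or unreserved allocator.
void ensure_direct_allocr(Cache& cache) {
    if (cache.allocr) return;  // already built on an earlier step
    Allocr* a = allocr_new();
    if (!a) throw std::runtime_error("allocator creation failed");
    if (!allocr_reserve(a, cache.gf)) {
        allocr_free(a);
        throw std::runtime_error("allocator reserve failed");
    }
    cache.allocr = a;  // publish only after both checks pass
}
```

Publishing the pointer last means a failed reserve leaves the cache empty, so the next call retries instead of reusing a half-initialised allocator.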
ogad-tether
left a comment
Reviewed the 6 PR commits (5205428e..e049b7a5) against the diff base. Design is sound, smoke tests cover the intended hardware paths (Adreno OpenCL/Vulkan + macOS Metal), and the capability-gated sched-vs-direct routing keeps the Metal/CUDA/iOS-Metal fast paths off the scheduler — that's the right shape.
Requesting changes for items A–B below; items C–G are non-blocking follow-ups.
Must address before merge
A. Lazy s3gen_sched_alloc creation is implicitly single-threaded — tts-cpp/src/chatterbox_tts.cpp:117
if (!m.sched) {
    ggml_backend_buffer_set_usage(m.buffer_w, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
    ...
    m.sched = ggml_backend_sched_new(...);
}
The comment ("reached only from run_hift_decode on the synthesis thread") makes this a load-bearing convention with no machine-checked enforcement. If anyone later wires batched/parallel synthesis (the streaming path in chatterbox_cli.cpp:2068 looks adjacent), two threads racing through this if (!m.sched) block will leak a sched and double-mark buffer_w USAGE. Please guard with std::once_flag + std::call_once, or at minimum a thread-id assertion. The cost is one atomic load on the hot path.
B. QVAC_VERBOSE env toggle is unrelated to Adreno GPU support — tts-cpp/src/supertonic_engine.cpp:139
if (!load_supertonic_gguf(opts.model_gguf_path, model, opts.n_gpu_layers,
                          std::getenv("QVAC_VERBOSE") != nullptr)) {
This changes Supertonic load-time logging behavior unconditionally and is not mentioned in the PR description's Hygiene section. Please either pull it into a separate PR or call it out explicitly. Drive-by behavior changes in GPU-routing PRs are hard to bisect later.
Follow-ups (non-blocking)
C. Scheduler path defeats cache.allocr graph reuse — tts-cpp/src/chatterbox_tts.cpp:1968, supertonic_vector_estimator.cpp:642,820,1050,1281, supertonic_vocoder.cpp:369
The chosen invalidation predicate (cache.allocr == nullptr ⇒ rebuild) means that every time the scheduler path is taken, build_*_cache frees and reconstructs the graph from scratch. For an 8-step CFM run on Adreno-OpenCL where multiple group/res-style/tail caches all go through sched, that's many ggml_new_graph + tensor allocations per generation. The comment justifies it correctly (ggml_backend_sched_alloc_graph mutates node->src[]), but the standard llama.cpp pattern is to snapshot node->src[] at build time and restore before each sched_alloc_graph call so the cache survives. RTF 37.6 reported in the PR is acceptable for bring-up; worth a tracked follow-up for the next perf pass.
D. is_qualcomm_adreno is broader than its name suggests — tts-cpp/src/backend_selection.cpp:270
return contains_ci(name, "adreno") || contains_ci(desc, "adreno") ||
       contains_ci(name, "qualcomm") || contains_ci(desc, "qualcomm");
After commit 91f24b93 this matches any Qualcomm-labelled device, not just Adreno. The justification (Vulkan deviceName sometimes lacks "Adreno") is sound today, but if Qualcomm ever exposes a non-Adreno GPU/accelerator over Vulkan (Hexagon as Vulkan compute is rumored), it would auto-pass the Android gate. A small Android device-list test fixture plus a tighter (adreno-in-either) OR (qualcomm-AND-not-explicitly-other) rule would future-proof this.
E. cpu_backend freed after primary backend — tts-cpp/src/chatterbox_tts.cpp:235, supertonic_gguf.cpp:430
Order is sched → buffer_w → ctx_w → backend → cpu_backend. The sched holds refs to both backends so freeing it first is correct, but separating the two backend frees by buffer_w/ctx_w is unusual — easier to reason about if both backends are freed together right after the sched. Cosmetic only; current order doesn't actually break anything.
F. parse_adreno_version hardcodes the X-series as 800 — tts-cpp/src/backend_selection.cpp:233
For any "Adreno X<digit>..." pattern the function returns 800. When Qualcomm ships X2 the comparison adreno_v >= 700 keeps working, but tier ordering between X1 and X2 becomes meaningless. Worth a TODO comment so it surfaces when X2 lands.
G. supertonic sched graph_size=8192 vs s3gen sched graph_size=131072 — tts-cpp/src/supertonic_gguf.cpp:396, chatterbox_tts.cpp:130
Asymmetric magic numbers with no inline justification. I confirmed they each match the largest graph routed through their respective sched (8192 covers all multi-cache trace_proj graphs ≤ 2048 nodes; 131072 matches the HiFT graph's own capacity at chatterbox_tts.cpp:1921). Worth a one-line comment so the next reader doesn't have to grep the codebase to verify.
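The snapshot-and-restore idea raised in item C can be sketched in isolation. `Node` and `SrcSnapshot` are hypothetical stand-ins, and whether restoring the inputs alone is enough for a graph to be reused with ggml_backend_sched is an assumption that would need checking against the scheduler's actual behaviour.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node that keeps its inputs in a
// fixed-size src array (assumption for this sketch).
struct Node {
    std::array<Node*, 4> src{};
};

// Snapshot of every node's inputs, taken once when the graph is built.
struct SrcSnapshot {
    std::vector<std::array<Node*, 4>> saved;

    void capture(const std::vector<Node*>& nodes) {
        saved.clear();
        for (const Node* n : nodes) saved.push_back(n->src);
    }

    // Undo any src rewiring done by an earlier allocation pass.
    // Returns false if the graph no longer matches the snapshot's size.
    bool restore(const std::vector<Node*>& nodes) const {
        if (nodes.size() != saved.size()) return false;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            nodes[i]->src = saved[i];
        }
        return true;
    }
};
```

In this scheme a cache would call `capture` right after building the graph and `restore` before each scheduler allocation, trading a small copy per step for skipping the full rebuild.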
Notes
- Verified the giant vector_loop_one_graph_cache (MAX_NODES = 8192 * total_steps + 256 ≈ 65k) is not routed through the supertonic sched; it's the one-graph GPU fast path that intentionally uses direct gallocr because it builds without GGML_OP_CUSTOM ops on non-CPU backends. So the 8192 sched capacity is sufficient for the actually-sched-routed graphs.
- parse_adreno_version fix is correct for the documented "QUALCOMM Adreno(TM) (OpenCL 3.0 Adreno(TM) 740)" input (verified by hand-tracing both markers).
- --dump-mel-path plumbing is clean (header field already exists, consumers at chatterbox_tts.cpp:2463/2509/2707/2784 already use it).
- USAGE_WEIGHTS lazy marking in s3gen_sched_alloc is idempotent, so the multi-thread concern above is bounded to the sched leak + a re-mark (not a corruption). Still please guard it.
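For reference, the marker scan that was hand-traced for the combined OpenCL string can be sketched as below. `adreno_model_from_name` is a hypothetical helper written for this illustration, not the project's parse_adreno_version, and it does not cover the X-series naming discussed in item F.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the largest number >= 100 that follows any
// "adreno" marker in the device name, or 0 if there is none.
int adreno_model_from_name(const std::string& device_name) {
    std::string lower;
    for (unsigned char c : device_name) {
        lower += static_cast<char>(std::tolower(c));
    }

    const std::string marker = "adreno";
    int best = 0;
    for (std::size_t pos = lower.find(marker); pos != std::string::npos;
         pos = lower.find(marker, pos + marker.size())) {
        std::size_t i = pos + marker.size();
        // Skip ahead to the first digit after this marker.
        while (i < lower.size() &&
               !std::isdigit(static_cast<unsigned char>(lower[i]))) {
            ++i;
        }
        // Parse one digit run, capped at six digits to avoid overflow.
        int value = 0;
        int digits = 0;
        while (i < lower.size() && digits < 6 &&
               std::isdigit(static_cast<unsigned char>(lower[i]))) {
            value = value * 10 + (lower[i] - '0');
            ++i;
            ++digits;
        }
        // "3" from "OpenCL 3.0" is rejected by the >= 100 threshold.
        if (value >= 100 && value > best) best = value;
    }
    return best;
}
```

On the combined string the first marker reaches the 3 in "OpenCL 3.0", which the threshold discards, and the second marker reaches 740, which is kept.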
…ng, regex parse)
- s3gen_sched_alloc (item A, blocking): guard the lazy sched/cpu_backend creation with std::call_once so a future parallel/batched-synthesis caller can't race two scheds into existence (leaking one) or double-mark buffer_w USAGE. The once_flag is held via unique_ptr so model_ctx stays move-constructible (it is moved into the process-wide cache).
- supertonic_engine (item B, blocking): revert the QVAC_VERBOSE drive-by back to false (+ drop the now-unused <cstdlib>); unrelated Supertonic load-time logging shouldn't ride this GPU PR.
- backend_selection parse_adreno_version: switch to the same regex as parakeet (PR #38) so the two are identical; validated against 20 device strings incl. the combined OpenCL '(OpenCL 3.0 Adreno(TM) 740)' -> 740.
ccfdf17
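The std::call_once guard with a heap-held flag, as described in this commit, can be illustrated with a small standalone sketch. `Sched`, `ModelCtx` and `ensure_sched` are hypothetical names for this example; only the std::once_flag, std::call_once and std::unique_ptr usage is standard library behaviour.

```cpp
#include <memory>
#include <mutex>
#include <type_traits>

// Hypothetical stand-in for the scheduler handle.
struct Sched { int id; };

struct ModelCtx {
    std::unique_ptr<Sched> sched;
    // std::once_flag is neither copyable nor movable, so it is held
    // through a unique_ptr to keep ModelCtx move-constructible.
    std::unique_ptr<std::once_flag> sched_once =
        std::make_unique<std::once_flag>();
    int sched_creations = 0;  // counts how often the init body ran
};

static_assert(std::is_move_constructible<ModelCtx>::value,
              "ModelCtx must stay movable");

// Lazily create the scheduler exactly once, even with concurrent callers.
// Must not be called on a moved-from ModelCtx (its flag pointer is null).
Sched& ensure_sched(ModelCtx& m) {
    std::call_once(*m.sched_once, [&m] {
        m.sched = std::make_unique<Sched>(Sched{1});
        ++m.sched_creations;
    });
    return *m.sched;
}
```

Because the flag travels with the object when it is moved, a context that was initialised before being moved into a cache does not run the creation body again afterwards.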
QVAC-19254 — TTS Adreno GPU support
Enables Chatterbox + Supertonic TTS on Adreno (OpenCL / Vulkan) by routing GPU-unsupported ops to CPU via ggml_backend_sched, with a supporting backend-tiering fix for Adreno's OpenCL device-string format.
Commits
- ggml_backend_sched (tts-cpp/src/supertonic_*.{cpp,h}). Routes the CPU-only GGML_OP_CUSTOM kernels (depthwise/pointwise Conv1D, LayerNorm, dense matmul) to CPU via sched; everything else runs on the GPU primary. Lifts the prior "GPU rejected because customs are CPU-only" guard. Verified corr ≈ 0.998 vs CPU on Adreno 740 (Vulkan) and macOS (Metal).
- backend_selection: parse_adreno_version handles the OpenCL device string (tts-cpp/src/backend_selection.cpp). The OpenCL string is "QUALCOMM Adreno(TM) (OpenCL 3.0 Adreno(TM) 740)"; parsing only the first "Adreno" marker yielded 3 (from "OpenCL 3.0") and mis-tiered the GPU below Vulkan. The fix scans every marker and keeps the largest ≥ 100 (3-digit model). Recovers Adreno 740.
- CONV_TRANSPOSE_1D to CPU via ggml_backend_sched (tts-cpp/src/chatterbox_tts.cpp). The HiFT vocoder uses CONV_TRANSPOSE_1D, which neither ggml-opencl nor ggml-vulkan supports yet. The sched routes that op to CPU while keeping the rest on GPU. Includes the USAGE_WEIGHTS marking + per-call graph rebuild required by sched's GPU↔CPU copy machinery (mutates node->src[]).
- --dump-mel-path CLI flag (tts-cpp/src/chatterbox_cli.cpp). Wires the CLI through to the existing opts.dump_mel_path field (the npy dump hooks are already on master), so a debug user can compare CPU vs GPU intermediates via --dump-mel-path /path/to/prefix.
Verification
On-device smoke against the just-synced qvac-ext-ggml/speech (ggml v0.10.2) + the matching Adreno OpenCL/Vulkan PRs (the QVAC-19253 ggml-vulkan PR + the QVAC-19254 ggml-opencl kernels PR):
Hygiene
- QVAC-#### ticket refs + internal hypothesis-log IDs (H016/H017).
- model_ctx / s3gen_sched_alloc blocks were compressed from 8/6 lines to 5/2 while preserving the essential SIGSEGV-prevention + threading-race rationale.
- dump_mel_path field declaration).