
QVAC-19254 tts-cpp: Supertonic + Chatterbox/S3Gen GPU sched for Adreno OpenCL #36

Merged: pratiknarola-t merged 7 commits into master from QVAC-19254-tts-adreno-gpu on Jun 4, 2026

@pratiknarola-t


QVAC-19254 — TTS Adreno GPU support

Supersedes #35: the original PR was auto-closed when its branch was renamed to the correct ticket (QVAC-19213-tts-adreno-gpu → QVAC-19254-tts-adreno-gpu). Same signed commit (5205428e), identical content.

Enables Chatterbox + Supertonic TTS on Adreno (OpenCL / Vulkan) by routing GPU-unsupported ops to CPU via ggml_backend_sched, with a supporting backend-tiering fix for Adreno's OpenCL device-string format.

Commits

  1. Supertonic GPU correctness via ggml_backend_sched (tts-cpp/src/supertonic_*.{cpp,h}). Routes the CPU-only GGML_OP_CUSTOM kernels (depthwise/pointwise Conv1D, LayerNorm, dense matmul) to CPU via sched; everything else runs on the GPU primary. Lifts the prior "GPU rejected because customs are CPU-only" guard. Verified corr ≈ 0.998 vs CPU on Adreno 740 (Vulkan) and macOS (Metal).
  2. backend_selection: parse_adreno_version handles the OpenCL device string (tts-cpp/src/backend_selection.cpp). The OpenCL string is "QUALCOMM Adreno(TM) (OpenCL 3.0 Adreno(TM) 740)". Parsing only the first "Adreno" marker yielded 3 (from "OpenCL 3.0") and mis-tiered the GPU below Vulkan. The fix scans every marker and keeps the largest value ≥ 100 (a 3-digit model number), which recovers Adreno 740.
  3. Route S3Gen CONV_TRANSPOSE_1D to CPU via ggml_backend_sched (tts-cpp/src/chatterbox_tts.cpp). The HiFT vocoder uses CONV_TRANSPOSE_1D, which neither ggml-opencl nor ggml-vulkan supports yet. The sched routes that op to CPU while keeping the rest on GPU. Includes the USAGE_WEIGHTS marking + per-call graph rebuild required by sched's GPU↔CPU copy machinery (which mutates node->src[]).
  4. --dump-mel-path CLI flag (tts-cpp/src/chatterbox_cli.cpp). Wires the CLI through to the existing opts.dump_mel_path field (the npy dump hooks are already on master), so a debug user can compare CPU vs GPU intermediates via --dump-mel-path /path/to/prefix.
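
The marker scan described in commit 2 can be sketched roughly as follows. This is an illustration only, not the code in backend_selection.cpp: the function name `parse_adreno_model_sketch`, the case-sensitive "Adreno" marker, and the digit cap are assumptions; only the behaviour (check every marker, keep the largest value ≥ 100) comes from the description above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: returns the largest number >= 100 that follows any
// "Adreno" marker in the device string, or 0 if no such number is found.
int parse_adreno_model_sketch(const std::string & device) {
    static const std::string marker = "Adreno";
    int best = 0;
    std::size_t pos = 0;
    while ((pos = device.find(marker, pos)) != std::string::npos) {
        std::size_t i = pos + marker.size();
        // Skip to the first digit after this marker, e.g. past "(TM) ".
        while (i < device.size() &&
               !std::isdigit(static_cast<unsigned char>(device[i]))) {
            ++i;
        }
        // Read the digit run; the cap keeps the value far from int overflow.
        int value = 0;
        while (i < device.size() && value < 100000 &&
               std::isdigit(static_cast<unsigned char>(device[i]))) {
            value = value * 10 + (device[i] - '0');
            ++i;
        }
        // "OpenCL 3.0" yields 3 here and is rejected by the >= 100 filter.
        if (value >= 100 && value > best) best = value;
        pos += marker.size();
    }
    return best;
}
```

With the OpenCL string quoted above, the first marker yields 3 (rejected) and the second yields 740, so the sketch returns 740.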

Verification

On-device smoke against the just-synced qvac-ext-ggml/speech (ggml v0.10.2) + the matching Adreno OpenCL/Vulkan PRs (the QVAC-19253 ggml-vulkan PR + the QVAC-19254 ggml-opencl kernels PR):

| Smoke | Result |
| --- | --- |
| Chatterbox-OpenCL | ✅ EXIT=0, 3.44 s WAV, RTF 37.6 (consistent with prior baseline) |
| Supertonic-OpenCL | ✅ EXIT=0, 3.57 s WAV |
| Supertonic-Vulkan | ✅ EXIT=0, 3.57 s WAV; Adreno 740 detected, Qualcomm-gated guards active, no crashes |

Hygiene

  • All source comments scrubbed of QVAC-#### ticket refs + internal hypothesis-log IDs (H016/H017).
  • The verbose model_ctx / s3gen_sched_alloc blocks were compressed from 8/6 lines to 5/2 while preserving the essential SIGSEGV-prevention + threading-race rationale.
  • Diff confirms only comments changed in the cleanup (apart from the one trailing-comment edit on the dump_mel_path field declaration).

…atterbox/S3Gen)

Route Supertonic and Chatterbox/S3Gen GPU graphs through ggml_backend_sched so ops the GPU backend cannot run (CONV_TRANSPOSE_1D in the HiFT vocoder; the CPU-only GGML_OP_CUSTOM kernels in the Supertonic vector estimator/vocoder) are routed to CPU instead of asserting.

Capability-gate the Chatterbox HiFT scheduler: a backend that runs every op in the graph (Metal, CUDA, CPU) computes directly on the primary backend; only a backend missing an op (Adreno OpenCL / Vulkan) uses the [GPU,CPU] scheduler. The gate queries ggml_backend_supports_op per node, so it is generic and does not regress iOS Metal (which supports CONV_TRANSPOSE_1D natively and otherwise aborts in the scheduler's graph-split).
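
A rough sketch of the per-node capability gate, using stand-in types so it compiles without ggml. `OpNodeSketch` and `primary_runs_all_sketch` are hypothetical names; in the real code the predicate would be a call to ggml_backend_supports_op on the primary backend for each graph node.

```cpp
#include <functional>
#include <vector>

// Stand-in for a graph node; the real code inspects ggml tensors.
struct OpNodeSketch {
    int op;
};

// Returns true when the primary backend can run every node, so the graph can
// be computed directly instead of going through the [GPU, CPU] scheduler.
bool primary_runs_all_sketch(
        const std::vector<OpNodeSketch> & graph,
        const std::function<bool(const OpNodeSketch &)> & supports_op) {
    for (const OpNodeSketch & node : graph) {
        if (!supports_op(node)) {
            return false;  // at least one op would have to be routed to CPU
        }
    }
    return true;
}
```

Because the decision is derived from the graph itself rather than from a backend name, a backend that gains support for the missing op would take the direct path without further changes.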

Gate Android GPU selection to Qualcomm Adreno: other Android GPU vendors are unvalidated and at least one (ARM Mali) aborts the host process uncatchably from graph compute, so non-Adreno devices fall through to CPU. parse_adreno_version handles the OpenCL device-name string (e.g. 'OpenCL 3.0 Adreno(TM) 740') by scanning every marker for the real model number.

Also expose the pre-existing S3Gen mel/encoder/CFM intermediate dump via the --dump-mel-path CLI flag.
ishanvohra2 previously approved these changes Jun 3, 2026
freddy311082 previously approved these changes Jun 3, 2026
@pratiknarola-t pratiknarola-t dismissed stale reviews from freddy311082 and ishanvohra2 via 78ddab9 June 3, 2026 12:49
@gianni-cor


Potential regression in tts-cpp/src/chatterbox_tts.cpp: HiFT graph/allocr caching is disabled even for backends that never use the scheduler.

Before this change, g_hift_graph_cache was keyed by (T_mel, T_stft) and reused the graph plus reserved cache.allocr on same-shape calls. In the PR, build_graph is forced to true, and the direct path (primary_runs_all, e.g. CPU/Metal/CUDA) creates a local hift_allocr, computes, then frees it every call:

const bool build_graph = true;
...
if (primary_runs_all) {
    hift_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(m.backend));
    ggml_gallocr_reserve(hift_allocr, gf);
    ggml_gallocr_alloc_graph(hift_allocr, gf);
} else {
    s3gen_sched_alloc(m, gf);
}
...
if (hift_allocr) ggml_gallocr_free(hift_allocr);

The rebuild is justified for the scheduler path because ggml_backend_sched_alloc_graph() mutates node->src[], but for primary_runs_all backends this looks like it drops the existing same-shape HiFT cache and reintroduces graph build/reserve/free overhead on every decode. Can we preserve the cached direct path and only force rebuilds when the scheduler is actually required?
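
One way to picture the requested behaviour is the sketch below, which reuses a shape-keyed cache only on the direct path and rebuilds whenever the scheduler is involved. All names (`HiftCacheSketch`, `prepare_hift_sketch`) are hypothetical and the flags stand in for the real graph and allocator objects.

```cpp
// Stand-in for the HiFT graph cache; real code would hold a ggml graph and
// an allocator handle instead of these flags.
struct HiftCacheSketch {
    int  t_mel      = -1;
    int  t_stft     = -1;
    bool has_graph  = false;
    bool has_allocr = false;  // only set on the direct (no-scheduler) path
    int  builds     = 0;      // counts rebuilds, for illustration only
};

// Returns true when the graph had to be (re)built for this call.
bool prepare_hift_sketch(HiftCacheSketch & cache, int t_mel, int t_stft,
                         bool primary_runs_all) {
    const bool same_shape =
        cache.has_graph && cache.t_mel == t_mel && cache.t_stft == t_stft;
    if (primary_runs_all && same_shape && cache.has_allocr) {
        return false;  // direct path: reuse cached graph + reserved allocator
    }
    cache.t_mel      = t_mel;
    cache.t_stft     = t_stft;
    cache.has_graph  = true;
    cache.has_allocr = primary_runs_all;  // scheduler path leaves this unset
    ++cache.builds;
    return true;
}
```

In this sketch a same-shape call on a direct backend skips the rebuild, while the scheduler path rebuilds on every call because it never records a reusable allocator.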

Co-authored-by: pratiknarola-t <pratiknarola-t@users.noreply.github.com>
gianni-cor previously approved these changes Jun 3, 2026

@freddy311082 freddy311082 left a comment


Thanks. The scheduler-based routing and the per-node ggml_backend_supports_op capability gate on the HiFT path look solid, and the latest commit addresses the HiFT graph-cache concern. Two points before I re-approve: (1) the AND condition in is_qualcomm_adreno appears to break Adreno selection via Vulkan on Android; (2) the Supertonic graph caches still rebuild unconditionally on every call, which is the same regression we just fixed for HiFT but have not yet fixed here. Details inline.

Comment thread tts-cpp/src/backend_selection.cpp Outdated
Comment thread tts-cpp/src/supertonic_vector_estimator.cpp
…wlisted

The AND required both 'adreno' and 'qualcomm' in the device name/desc, but ggml-vulkan reports deviceName 'Adreno (TM) 740' (no 'qualcomm') with name 'Vulkan0', so an Adreno selected via Vulkan failed the gate and fell back to CPU. Matching 'adreno' alone is sufficient: it appears in both the OpenCL ('QUALCOMM Adreno(TM)') and Vulkan ('Adreno (TM) 740') strings. Reverts 78ddab9.
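
The "adreno alone" match described in this commit message could look roughly like the sketch below. `contains_ci_sketch` and `is_adreno_sketch` are hypothetical names for illustration; the actual helper and gate in backend_selection.cpp may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case-insensitive substring test (ASCII only).
bool contains_ci_sketch(const std::string & haystack, const std::string & needle) {
    if (needle.empty()) return true;
    auto it = std::search(
        haystack.begin(), haystack.end(), needle.begin(), needle.end(),
        [](char a, char b) {
            return std::tolower(static_cast<unsigned char>(a)) ==
                   std::tolower(static_cast<unsigned char>(b));
        });
    return it != haystack.end();
}

// Accepts a device when "adreno" appears in either its name or description.
bool is_adreno_sketch(const std::string & name, const std::string & desc) {
    return contains_ci_sketch(name, "adreno") || contains_ci_sketch(desc, "adreno");
}
```

Both strings quoted above pass this check: the Vulkan pair ("Vulkan0", "Adreno (TM) 740") matches through the description, and the OpenCL "QUALCOMM Adreno(TM)" string matches directly.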
Mirror the HiFT graph-cache fix (83a9a38) in the Supertonic vector estimator and vocoder. run_text_attention_cache, run_group_graph_cache, run_res_style_qkv_cache, run_tail_graph_cache and the vocoder forward rebuilt the graph and re-reserved via the scheduler on every denoise step. Each builder now reuses its shape-keyed graph when the cache already holds one built on the direct path, and each runner takes the direct gallocr + primary-backend compute when ggml_backend_supports_op covers every node, falling back to the scheduler only when an op must route to CPU. The scheduler path leaves the cached allocr null so it keeps rebuilding (its alloc_graph mutates node->src[]); the direct path reuses graph + allocr across steps. Output is bit-identical (allocation-only change).

@gianni-cor


One additional issue in the latest direct-path Supertonic cache changes: the new ggml_gallocr_new / ggml_gallocr_reserve blocks no longer check failures before calling ggml_gallocr_alloc_graph.

Example from tts-cpp/src/supertonic_vector_estimator.cpp:

if (direct) {
    if (!cache.allocr) {
        cache.allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));
        ggml_gallocr_reserve(cache.allocr, cache.gf);
    }
    ggml_gallocr_alloc_graph(cache.allocr, cache.gf);
}

Before this PR, the Supertonic cache builders checked ggml_gallocr_new and ggml_gallocr_reserve and threw clear errors on failure. The new direct-path pattern is repeated across the vector caches and supertonic_vocoder.cpp; if allocation returns null or reserve fails, the next ggml_gallocr_alloc_graph can crash or proceed with an invalid/unreserved allocator instead of failing cleanly.

Can we restore the old failure checks around each new direct-path ggml_gallocr_new / ggml_gallocr_reserve block?

The direct-path graph-cache reuse added in 174f47d calls ggml_gallocr_new
and ggml_gallocr_reserve without checking failure before
ggml_gallocr_alloc_graph, so an allocation failure would proceed with a
null or unreserved allocator instead of throwing. The scheduler fallback
(supertonic_sched_alloc) already throws; the new direct path did not.

Add the null/reserve checks the rest of the Supertonic code already uses
(e.g. supertonic_text_encoder.cpp) at all five direct-path sites:
run_text_attention_cache, run_group_graph_cache, run_res_style_qkv_cache,
run_tail_graph_cache and supertonic_vocoder_forward_ggml. The
ggml_gallocr_alloc_graph call is left unchecked to match that idiom.

Allocation-only: full synth output is bit-identical before and after.
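
The restored checks follow a create/reserve/throw shape that can be sketched independently of ggml. Everything here is a stand-in: `AllocrSketch` and `ensure_reserved_sketch` are hypothetical, and the three callables represent allocator creation, reservation, and release in the real code.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

struct AllocrSketch {};  // stand-in for the real allocator handle

// Creates and reserves the cached allocator once; throws instead of letting
// a null or unreserved allocator reach the graph-allocation call.
void ensure_reserved_sketch(AllocrSketch *& cached,
                            const std::function<AllocrSketch *()> & create,
                            const std::function<bool(AllocrSketch *)> & reserve,
                            const std::function<void(AllocrSketch *)> & destroy,
                            const std::string & site) {
    if (cached) return;  // already created and reserved by an earlier call
    AllocrSketch * fresh = create();
    if (!fresh) {
        throw std::runtime_error(site + ": allocator creation failed");
    }
    if (!reserve(fresh)) {
        destroy(fresh);  // do not keep an unreserved allocator in the cache
        throw std::runtime_error(site + ": allocator reserve failed");
    }
    cached = fresh;
}
```

Assigning to the cache only after both steps succeed is a choice made for this sketch: a later call then retries from scratch rather than finding a half-initialised allocator.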

@ogad-tether ogad-tether left a comment


Reviewed the 6 PR commits (5205428e..e049b7a5) against the diff base. Design is sound, smoke tests cover the intended hardware paths (Adreno OpenCL/Vulkan + macOS Metal), and the capability-gated sched-vs-direct routing keeps the Metal/CUDA/iOS-Metal fast paths off the scheduler — that's the right shape.

Requesting changes for items A–B below; items C–G are non-blocking follow-ups.

Must address before merge

A. Lazy s3gen_sched_alloc creation is implicitly single-threaded (tts-cpp/src/chatterbox_tts.cpp:117)

if (!m.sched) {
    ggml_backend_buffer_set_usage(m.buffer_w, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
    ...
    m.sched = ggml_backend_sched_new(...);
}

The comment ("reached only from run_hift_decode on the synthesis thread") makes this a load-bearing convention with no machine-checked enforcement. If anyone later wires batched/parallel synthesis (the streaming path in chatterbox_cli.cpp:2068 looks adjacent), two threads racing through this if (!m.sched) block will leak a sched and double-mark buffer_w USAGE. Please guard with std::once_flag + std::call_once, or at minimum a thread-id assertion. The cost is one atomic load on the hot path.

B. QVAC_VERBOSE env toggle is unrelated to Adreno GPU support (tts-cpp/src/supertonic_engine.cpp:139)

if (!load_supertonic_gguf(opts.model_gguf_path, model, opts.n_gpu_layers,
                          std::getenv("QVAC_VERBOSE") != nullptr)) {

This changes Supertonic load-time logging behavior unconditionally and is not mentioned in the PR description's Hygiene section. Please either pull it into a separate PR or call it out explicitly — drive-by behavior changes in GPU-routing PRs are hard to bisect later.

Follow-ups (non-blocking)

C. Scheduler path defeats cache.allocr graph reuse (tts-cpp/src/chatterbox_tts.cpp:1968, supertonic_vector_estimator.cpp:642,820,1050,1281, supertonic_vocoder.cpp:369)

The chosen invalidation predicate (cache.allocr == nullptr ⇒ rebuild) means that every time the scheduler path is taken, build_*_cache frees and reconstructs the graph from scratch. For an 8-step CFM run on Adreno-OpenCL where multiple group/res-style/tail caches all go through sched, that's many ggml_new_graph + tensor allocations per generation. The comment justifies it correctly (ggml_backend_sched_alloc_graph mutates node->src[]), but the standard llama.cpp pattern is to snapshot node->src[] at build time and restore before each sched_alloc_graph call so the cache survives. RTF 37.6 reported in the PR is acceptable for bring-up; worth a tracked follow-up for the next perf pass.
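
The snapshot-and-restore idea mentioned in this item can be illustrated with stand-in types. This sketch is not validated against ggml: `GraphNodeSketch`, `kMaxSrcSketch`, and both helpers are hypothetical, and a real scheduler may modify more than the source pointers.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxSrcSketch = 4;  // hypothetical per-node source limit

// Stand-in for a graph node whose source pointers may be rewritten when the
// graph is allocated through a scheduler.
struct GraphNodeSketch {
    std::array<GraphNodeSketch *, kMaxSrcSketch> src{};
};

using SrcSnapshotSketch =
    std::vector<std::array<GraphNodeSketch *, kMaxSrcSketch>>;

// Records every node's source pointers right after the graph is built.
SrcSnapshotSketch snapshot_srcs_sketch(const std::vector<GraphNodeSketch *> & nodes) {
    SrcSnapshotSketch snap;
    snap.reserve(nodes.size());
    for (const GraphNodeSketch * node : nodes) {
        snap.push_back(node->src);
    }
    return snap;
}

// Puts the recorded pointers back before the cached graph is allocated again.
void restore_srcs_sketch(const std::vector<GraphNodeSketch *> & nodes,
                         const SrcSnapshotSketch & snap) {
    for (std::size_t i = 0; i < nodes.size() && i < snap.size(); ++i) {
        nodes[i]->src = snap[i];
    }
}
```

The trade-off is one pointer-array copy per node per call in exchange for skipping the graph rebuild; whether that is a net win on the scheduler path would need measuring.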

D. is_qualcomm_adreno is broader than its name suggests (tts-cpp/src/backend_selection.cpp:270)

return contains_ci(name, "adreno") || contains_ci(desc, "adreno") ||
       contains_ci(name, "qualcomm") || contains_ci(desc, "qualcomm");

After commit 91f24b93 this matches any Qualcomm-labelled device, not just Adreno. The justification (Vulkan deviceName sometimes lacks "Adreno") is sound today, but if Qualcomm ever exposes a non-Adreno GPU/accelerator over Vulkan (Hexagon as Vulkan compute is rumored), it would auto-pass the Android gate. A small Android device-list test fixture + tighter (adreno-in-either) OR (qualcomm-AND-not-explicitly-other) rule would future-proof this.

E. cpu_backend freed after primary backend (tts-cpp/src/chatterbox_tts.cpp:235, supertonic_gguf.cpp:430)

Order is sched → buffer_w → ctx_w → backend → cpu_backend. The sched holds refs to both backends so freeing it first is correct, but separating the two backend frees by buffer_w/ctx_w is unusual — easier to reason about if both backends are freed together right after the sched. Cosmetic only; current order doesn't actually break anything.

F. parse_adreno_version hardcodes the X-series as 800 (tts-cpp/src/backend_selection.cpp:233)

For any "Adreno X<digit>..." pattern the function returns 800. When Qualcomm ships X2 the comparison adreno_v >= 700 keeps working, but tier ordering between X1 and X2 becomes meaningless. Worth a TODO comment so it surfaces when X2 lands.

G. supertonic sched graph_size=8192 vs s3gen sched graph_size=131072 (tts-cpp/src/supertonic_gguf.cpp:396, chatterbox_tts.cpp:130)

Asymmetric magic numbers with no inline justification. I confirmed they each match the largest graph routed through their respective sched (8192 covers all multi-cache trace_proj graphs ≤ 2048 nodes; 131072 matches the HiFT graph's own capacity at chatterbox_tts.cpp:1921). Worth a one-line comment so the next reader doesn't have to grep the codebase to verify.


Notes

  • Verified the giant vector_loop_one_graph_cache (MAX_NODES = 8192 * total_steps + 256 ≈ 65k) is not routed through the supertonic sched — it's the one-graph GPU fast path that intentionally uses direct gallocr because it builds without GGML_OP_CUSTOM ops on non-CPU backends. So the 8192 sched capacity is sufficient for the actually-sched-routed graphs.
  • parse_adreno_version fix is correct for the documented "QUALCOMM Adreno(TM) (OpenCL 3.0 Adreno(TM) 740)" input (verified by hand-tracing both markers).
  • --dump-mel-path plumbing is clean (header field already exists, consumers at chatterbox_tts.cpp:2463/2509/2707/2784 already use it).
  • USAGE_WEIGHTS lazy marking in s3gen_sched_alloc is idempotent, so the multi-thread concern above is bounded to the sched leak + a re-mark (not a corruption). Still please guard it.

freddy311082 previously approved these changes Jun 4, 2026
gianni-cor previously approved these changes Jun 4, 2026
…ng, regex parse)

- s3gen_sched_alloc (item A, blocking): guard the lazy sched/cpu_backend
  creation with std::call_once so a future parallel/batched-synthesis caller
  can't race two scheds into existence (leaking one) or double-mark buffer_w
  USAGE. The once_flag is held via unique_ptr so model_ctx stays
  move-constructible (it is moved into the process-wide cache).
- supertonic_engine (item B, blocking): revert the QVAC_VERBOSE drive-by back
  to false (+ drop the now-unused <cstdlib>); unrelated Supertonic load-time
  logging shouldn't ride this GPU PR.
- backend_selection parse_adreno_version: switch to the same regex as parakeet
  (PR #38) so the two are identical; validated against 20 device strings incl.
  the combined OpenCL '(OpenCL 3.0 Adreno(TM) 740)' -> 740.
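
The item A fix can be pictured with the following sketch: a std::once_flag held through a unique_ptr guards lazy creation while the owning struct stays move-constructible. `SchedSketch`, `ModelCtxSketch`, and `ensure_sched_sketch` are hypothetical stand-ins, not the types in chatterbox_tts.cpp.

```cpp
#include <memory>
#include <mutex>
#include <type_traits>

struct SchedSketch {};  // stand-in for the scheduler handle

struct ModelCtxSketch {
    // std::once_flag is neither copyable nor movable; holding it through a
    // unique_ptr keeps the enclosing struct move-constructible.
    std::unique_ptr<std::once_flag> sched_once = std::make_unique<std::once_flag>();
    std::unique_ptr<SchedSketch>    sched;
    int init_count = 0;  // illustration only
};

static_assert(std::is_move_constructible<ModelCtxSketch>::value,
              "context must stay movable into a cache");

// Creates the scheduler exactly once, even if several threads call this
// concurrently on the same context.
SchedSketch & ensure_sched_sketch(ModelCtxSketch & m) {
    std::call_once(*m.sched_once, [&m] {
        m.sched = std::make_unique<SchedSketch>();
        ++m.init_count;
    });
    return *m.sched;
}
```

A moved-from context in this sketch holds a null flag pointer and must not be passed to `ensure_sched_sketch`; the moved-to context keeps the already-triggered flag, so creation is not repeated.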
@pratiknarola-t pratiknarola-t dismissed stale reviews from gianni-cor and freddy311082 via ccfdf17 June 4, 2026 12:42

@pratiknarola-t pratiknarola-t merged commit a34cb6d into master Jun 4, 2026
123 of 134 checks passed

6 participants