QVAC-20556 feat[api]: enable Android GPU for Parakeet (overlay; CI validation) [DO-NOT-MERGE]#2577

Open
pratiknarola-t wants to merge 1 commit into main from qvac-20556-parakeet-android-gpu

@pratiknarola-t

Contributor

⚠️ DO-NOT-MERGE — measurement vehicle

Overlay-only PR (ticket QVAC-20556) to get an empirical AWS Device Farm signal on whether the latest speech stack drives Parakeet on Android GPUs (Pixel 9 / Mali + S25 Ultra / Adreno 830). This is the inverse of the CPU-only workaround in #2525 — please don't merge over it.

Add the verified label to fire the device-farm leg.

What this changes

packages/transcription-parakeet/:

  • ParakeetModel::load — remove the #ifdef __ANDROID__ guard that forced useGPU=false (kept the n_gpu_layers logic + the GPU-init→CPU fallback warning).
  • CMakeLists.txt — widen the Android backend-staging glob from libqvac-speech-ggml-cpu-*.so to libqvac-speech-ggml-*.so so the vulkan/opencl MODULE libs ship in the prebuild (reverses the [0.7.2] CPU-only packaging); refresh the now-stale "intentionally CPU-only" comments.
  • gpu-smoke.test.js — drop the four Android early-pass skips so the strict assertGpuBackend (backendDevice=1, backendId Vulkan/OpenCL) runs on device.
  • In-package vcpkg overlay ports: ggml-speech@44fd4817 (speech HEAD) + parakeet-cpp@ed749556 (whisper.cpp master), wired via overlay-ports in vcpkg-configuration.json. Registry baseline and registry version>= pins are unchanged; the registry PR is deferred until the device-farm result is understood.
  • vcpkg.json — bump parakeet-cpp version>= to the overlay version-date.

Local device finding (Adreno 740 / iQOO 11, TDT q4_0)

Run directly against this branch's prebuild on a physically-attached Adreno 740:

| Path | Backend | Result |
|---|---|---|
| CPU (useGPU=false) | CPU (id 0) | ✅ correct transcript |
| GPU, engine default | OpenCL (id 4, auto-selected on Adreno>700) | SIGABRT: ggml_backend_opencl_graph_compute: op not supported joint.token_argmax (ARGMAX) → GGML_ASSERT |
| GPU, OpenCL withheld | Vulkan (id 3) | ⚠️ runs, but transcript degraded vs CPU (dropped words) and ~2× slower |

So on the Adreno the engine picks OpenCL, whose backend lacks ARGMAX and aborts in graph-compute instead of falling back to CPU. The Vulkan path (the one ggml-speech@8bf760f4 reported byte-identical on this exact device) is not what the engine selects, and even when forced it no longer reproduces the byte-identical result on the current 44fd4817/ed749556 stack.
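
The selection behaviour observed above can be summarised in a few lines. This is a sketch inferred from the reported results, not the engine's actual policy code: `GpuInfo`, `AvailableModules`, and `selectBackend` are hypothetical names, and representing the GPU as a vendor string plus a numeric model is an assumption. The backend ids (CPU 0, Vulkan 3, OpenCL 4) are the ones reported in the table.

```cpp
#include <string>

// Backend ids as reported in the local device finding.
enum BackendId { kCpu = 0, kVulkan = 3, kOpenCL = 4 };

// Hypothetical description of the detected GPU.
struct GpuInfo {
    std::string vendor;  // e.g. "Adreno" or "Mali"
    int model = 0;       // e.g. 740 or 830
};

// Which backend MODULE libs were found in the prebuild.
struct AvailableModules {
    bool opencl = false;
    bool vulkan = false;
};

// Assumed policy: Adreno > 700 prefers OpenCL when its module is present;
// any other GPU, or an Adreno with OpenCL withheld, uses Vulkan; with no
// GPU module at all the engine stays on CPU.
BackendId selectBackend(bool useGPU, const GpuInfo& gpu,
                        const AvailableModules& mods) {
    if (!useGPU) {
        return kCpu;
    }
    if (gpu.vendor == "Adreno" && gpu.model > 700 && mods.opencl) {
        return kOpenCL;
    }
    if (mods.vulkan) {
        return kVulkan;
    }
    return kCpu;
}
```

Under this reading, withholding the OpenCL module is what moves the Adreno 740 onto the Vulkan row of the table.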

Expectation for the device-farm run: the Adreno (S25) leg likely hits the same OpenCL ARGMAX abort (which can SIGABRT the Bare worklet and take down subsequent tests, cf. #2525); the Mali (Pixel 9) leg exercises the Vulkan path.

Note (pre-existing, out of scope)

While bringing this up on a local device, found that the addon's BACKENDS_SUBDIR compile-definition is PRIVATE on the bare-module target but ParakeetModel.cpp compiles into parakeet_model_core, so the subdir isn't appended to a host-provided default backendsDir. The device-farm/APK passes an explicit flat nativeLibraryDir, so CI is unaffected — but a host relying on the __dirname/prebuilds default would not find the backend .so. Filed mentally as a follow-up; not touched here.
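
To make the failure mode concrete, here is a small model of the directory resolution. It is only a sketch: `resolveBackendsDir` is a hypothetical helper, the compile definition is modelled as an optional string (empty when the translation unit does not see BACKENDS_SUBDIR), and the example paths in the usage note are invented for illustration.

```cpp
#include <filesystem>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// explicitDir    : directory passed in by the host (e.g. a flat nativeLibraryDir)
// hostDefault    : default backendsDir supplied by the host
// backendsSubdir : value of the BACKENDS_SUBDIR definition, or nullopt when
//                  the translation unit was compiled without it
fs::path resolveBackendsDir(const std::optional<fs::path>& explicitDir,
                            const fs::path& hostDefault,
                            const std::optional<std::string>& backendsSubdir) {
    if (explicitDir) {
        return *explicitDir;  // explicit path wins; the subdir is irrelevant
    }
    if (backendsSubdir) {
        return hostDefault / *backendsSubdir;  // intended default behaviour
    }
    return hostDefault;  // definition not visible: subdir is never appended
}
```

With a default of `/app/prebuilds` and a subdir of `android-arm64`, the second branch yields `/app/prebuilds/android-arm64`, while the third yields `/app/prebuilds`, which matches the symptom described above: the backend .so is looked up one level too high.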

Refs

…lidation)

DO-NOT-MERGE — overlay-only PR to get an empirical AWS Device Farm signal on
whether the latest speech stack drives Parakeet on Android GPUs (Pixel 9/Mali +
S25/Adreno 830). This is the inverse of the CPU-only workaround in #2525.

Changes (packages/transcription-parakeet):
- ParakeetModel::load — remove the __ANDROID__ guard that forced useGPU=false.
- CMakeLists — widen the Android backend-staging glob from
  libqvac-speech-ggml-cpu-*.so to libqvac-speech-ggml-*.so so the Vulkan/OpenCL
  MODULE libs ship in the prebuild (reverses the [0.7.2] CPU-only packaging);
  refresh the now-stale "intentionally CPU-only" comments.
- gpu-smoke.test.js — drop the four Android early-pass skips so the strict
  assertGpuBackend (backendDevice=1, backendId Vulkan/OpenCL) runs on device.
- vcpkg overlay ports (in-package) — ggml-speech@44fd4817 (speech HEAD) +
  parakeet-cpp@ed749556 (whisper.cpp master), wired via the overlay-ports entry
  in vcpkg-configuration.json. Registry baseline and registry version>= pins are
  unchanged; the registry PR is deferred.
- vcpkg.json — bump parakeet-cpp version>= to the overlay version-date.

Local device finding (Adreno 740 / iQOO 11), TDT q4_0, recorded for reviewers:
- CPU: correct transcript, backendDevice=0.
- GPU OpenCL (engine auto-selects this on Adreno>700): aborts in graph-compute —
  "op not supported joint.token_argmax (ARGMAX)" -> GGML_ASSERT (SIGABRT).
- GPU Vulkan (forced by withholding the OpenCL module): runs (backendId=3) but
  output is degraded vs CPU (dropped words) and ~2x slower; NOT the byte-identical
  result ggml-speech 8bf760f4 reported.

Expect the Device Farm Adreno (S25) leg to hit the OpenCL ARGMAX abort and the
Mali leg to exercise the Vulkan path.

Do not merge: this is a measurement vehicle.
@pratiknarola-t

Contributor Author

Local Adreno 740 (iQOO 11) matrix — refined

Ran each model type directly against this branch's prebuild on a physically-attached Adreno 740. On Adreno the engine auto-selects OpenCL (policy: Adreno>700 → OpenCL). Results:

Model CPU OpenCL (GPU, auto) Vulkan (GPU, OpenCL withheld)
TDT (q4_0) ✅ correct SIGABRT: ggml_backend_opencl_graph_compute: op not supported joint.token_argmax (ARGMAX) → GGML_ASSERT ⚠️ runs (backendId=3) but transcript degraded vs CPU + ~2× slower
EOU (q4_0) ✅ correct (95 tokens)
Sortformer (q8_0) ✅ correct (speaker labels)
CTC n/a on mobile n/a

Takeaway: the GPU blocker is narrow — TDT's joint.token_argmax (ARGMAX) is not implemented in the ggml OpenCL backend, and supports_op/graph-compute aborts instead of falling back to CPU. EOU and Sortformer run fine on OpenCL. The Vulkan path supports the op (no crash) but is degraded/slower on this device, and is not what the engine selects on Adreno anyway.

Implications for the Device Farm run:

  • Adreno (S25/830) leg: EOU + Sortformer GPU should pass; the TDT GPU smoke will likely SIGABRT (and a Bare-worklet abort can cascade to later tests).
  • Mali (Pixel 9) leg: exercises the Vulkan path (no OpenCL on non-Adreno) — separate unknown.

Fix directions (follow-up, not in this PR): implement ARGMAX in ggml-opencl, OR make ggml-opencl supports_op return false for ARGMAX so it routes to CPU, OR have parakeet-cpp keep the TDT joint argmax on CPU.
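
The second fix direction (have the GPU backend decline ARGMAX so the op routes to CPU) can be illustrated with a toy scheduler. This is a sketch of the idea only and does not reproduce ggml's real backend or scheduler API: `Op`, `Backend`, `openclLikeSupports`, `cpuSupports`, and `assignNodes` are all hypothetical.

```cpp
#include <string>
#include <vector>

// Tiny stand-in for graph op kinds.
enum class Op { MatMul, Softmax, ArgMax };

// Hypothetical backend: a name plus a supports-op predicate.
struct Backend {
    std::string name;
    bool (*supportsOp)(Op);
};

// A GPU backend that honestly reports it cannot run ARGMAX.
inline bool openclLikeSupports(Op op) { return op != Op::ArgMax; }

// The CPU backend accepts every op and acts as the fallback.
inline bool cpuSupports(Op) { return true; }

// Place each node on the first backend (in priority order) that claims
// support. Because the check happens before execution, an unsupported op
// is routed elsewhere instead of aborting during graph compute.
std::vector<std::string> assignNodes(const std::vector<Op>& graph,
                                     const std::vector<Backend>& backends) {
    std::vector<std::string> placement;
    for (Op op : graph) {
        std::string chosen = "unassigned";
        for (const Backend& b : backends) {
            if (b.supportsOp(op)) {
                chosen = b.name;
                break;
            }
        }
        placement.push_back(chosen);
    }
    return placement;
}
```

With backends ordered GPU first and CPU last, a graph of MatMul, Softmax, ArgMax is placed on gpu, gpu, cpu. The trade-off of this direction is an extra device-to-host hop for the argmax input, whereas implementing ARGMAX in the OpenCL backend would keep the whole joint on the GPU.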

Separately, a pre-existing latent bug surfaced during bring-up: the addon's BACKENDS_SUBDIR compile-def is PRIVATE on the bare-module target while ParakeetModel.cpp compiles into parakeet_model_core, so the subdir isn't appended to a host-provided default backendsDir (__dirname/prebuilds). The device-farm APK passes an explicit flat nativeLibraryDir, so CI is unaffected — but a host relying on the default would not find the backend .so.

@github-actions

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Mobile integration tests — @qvac/transcription-parakeet (Android)

Result: failed

| metric | value |
|---|---|
| Devices passed | 0 |
| Devices failed | 2 |
| Test cases total | 6 |
| Test cases passed | 4 |
| Test cases failed | 2 |
| Test cases skipped | 0 |

View workflow run

@github-actions

Contributor

Mobile integration tests — @qvac/transcription-parakeet (iOS)

Result: passed

| metric | value |
|---|---|
| Devices passed | 2 |
| Devices failed | 0 |
| Test cases total | 6 |
| Test cases passed | 6 |
| Test cases failed | 0 |
| Test cases skipped | 0 |

View workflow run

Labels

verified Authorize secrets / label-gate in PR workflows
