Skip to content

feat(android): enable Adreno large buffer for A7X/A8X GPUs#699

Merged
a-ghorbani merged 1 commit into
mainfrom
feature/TASK-20260421-1416
May 12, 2026
Merged

feat(android): enable Adreno large buffer for A7X/A8X GPUs#699
a-ghorbani merged 1 commit into
mainfrom
feature/TASK-20260421-1416

Conversation

@a-ghorbani
Copy link
Copy Markdown
Owner

Summary

Sets LM_GGML_OPENCL_ADRENO_USE_LARGE_BUFFER=1 before SoLoader.init in MainApplication.kt so llama.rn's OpenCL backend enables Qualcomm's cl_qcom_large_buffer extension on Adreno A7X/A8X GPUs.

This lifts the per-allocation CL_DEVICE_MAX_MEM_ALLOC_SIZE cap (~1 GB on most Qualcomm drivers), letting 7B+ models and long-context KV caches stay on GPU instead of failing allocation on flagship Snapdragon devices.

Closes #657

Why

  • Standard OpenCL caps a single buffer allocation at CL_DEVICE_MAX_MEM_ALLOC_SIZE. On Adreno drivers this is typically ~1 GB even on phones with 12–24 GB of RAM.
  • Large model weights or long-context KV caches can exceed this cap → allocation fails → model fails to load or falls back to CPU.
  • Upstream llama.cpp added the opt-in fix in ggml-org/llama.cpp#20997, synced into llama.rn at b8547 (2026-03-27), present in our pinned llama.rn@0.12.0-rc.9.

Scope

  • Android only. iOS uses Metal — unaffected.
  • Non-Adreno Android devices: no-op. The native layer in ggml-opencl.cpp gates on gpu_family == ADRENO && cl_qcom_large_buffer available. Mali / Xclipse / CPU-only paths never see the flag take effect.
  • Graceful fallback on older Adreno drivers that lack the extension — native code clears the flag and logs it.
  • No Kotlin-side device detection was added on purpose: native self-gating is authoritative, so a Kotlin-side check would be redundant and a second place to keep in sync.

Test plan

  • Android release build (./gradlew assembleRelease) — BUILD SUCCESSFUL locally
  • Lint, TypeCheck, Jest (2011 tests) — all green, no regressions
  • Manual verification (requested from @BlindDeveloper or anyone with Adreno A7X/A8X hardware):
    • On a Snapdragon 8 Gen 2/3 or 8 Elite device, install this build
    • Load a 7B+ GGUF (e.g. 8B Q4_K_M) or a smaller model with long context
    • Filter logcat: adb logcat -s ggml-opencl
    • Look for lm_ggml_opencl: Adreno large buffer enabled
    • If you see Adreno large buffer requested but not supported by driver instead, that is the expected graceful fallback on older A7X drivers — not a bug
  • Non-Adreno Android (Mali / Xclipse / CPU-only): confirm no regression, no new log spam

No automated unit/E2E test added — the change is a pre-init env-var side-effect with no JS/TS surface, and we have no Adreno A7X/A8X hardware in the E2E pipeline. A Jest/Robolectric test asserting Os.setenv was called would test the Kotlin stdlib, not our behaviour.

🤖 Generated by PocketPal Dev Team

@BlindDeveloper
Copy link
Copy Markdown
Contributor

@a-ghorbani
Please include these changes in the beta version which is available on Google Play.

@a-ghorbani a-ghorbani marked this pull request as ready for review April 22, 2026 08:40
@a-ghorbani
Copy link
Copy Markdown
Owner Author

@a-ghorbani Please include these changes in the beta version which is available on Google Play.

This is an APK build of this PR: https://github.com/a-ghorbani/pocketpal-ai/actions/runs/24722913168/artifacts/6555291913
Would you be happy to test on your Adreno device and let us know if it works?

@BlindDeveloper
Copy link
Copy Markdown
Contributor

@a-ghorbani
Page not found

@a-ghorbani a-ghorbani closed this Apr 22, 2026
@a-ghorbani a-ghorbani reopened this Apr 22, 2026
@BlindDeveloper
Copy link
Copy Markdown
Contributor

@a-ghorbani
It performs with the same speed and stability regardless of whether large buffer support is enabled or disabled.
I suspect it was designed that way; the difference would likely only be noticeable on flagship devices.

Set LM_GGML_OPENCL_ADRENO_USE_LARGE_BUFFER=1 before SoLoader.init so the
llama.rn OpenCL backend enables cl_qcom_large_buffer on supported Adreno
devices. Non-Adreno devices and drivers without the extension no-op.

Closes #657.
Upstream: ggml-org/llama.cpp#20997
@a-ghorbani a-ghorbani force-pushed the feature/TASK-20260421-1416 branch from 8f3d6ee to 3c40306 Compare May 11, 2026 21:28
@a-ghorbani
Copy link
Copy Markdown
Owner Author

Bench verification — ready to ship

Validated on physical Adreno hardware (POCO Myron / SD 8 Elite, Adreno 840 A8X; Samsung S23 / SD 8 Gen 2, Adreno 740 A7X). 75 cells attempted across 5 runs.

Heads-up on the bench-harness large_buffer_enabled field

It reports false on every OpenCL row in the captured reports — measurement artifact, not a real bug. The lm_ggml_opencl: Adreno large buffer enabled line fires inside build_backend_ctx() during JNI_OnLoad, before the bench's JS-bridged native-log handler is installed. It routes to stderr → /dev/null on Android.

Confirmed directly with a diagnostic build that patched ggml-opencl.cpp to call __android_log_print next to the env-var read:

PocketPalEnv: onCreate after setenv: Os.getenv=1
PocketPalDiag: build_backend_ctx: getenv(LM_GGML_OPENCL_ADRENO_USE_LARGE_BUFFER)=1 gpu_family=ADRENO has_large_buffer=1

So the env var IS visible to native libc when ggml-opencl reads it, and the large-buffer code path IS active on Adreno A8X. Verification fell back to outcome-based signals (cell pass/fail, pp/tg deltas vs the PR713 baseline) instead of trusting the field.

Myron (Adreno 840 / A8X) — wins

Three large gemma-4-e2b GPU quants show consistent +11–14% pp vs PR713:

Cell PR713 pp/tg PR699 pp/tg Δpp Δtg
gemma-4 q4_K_M 216.2 / 13.8 243.2 / 14.7 +12.4% +6.3%
gemma-4 q5_K_M 197.4 / 14.2 219.7 / 14.3 +11.3% +0.7%
gemma-4 q6_K 219.6 / 14.3 249.9 / 15.0 +13.8% +4.8%

Smoke regression (18 cells, cpu+gpu × 3 small models × 3 quants): 18/18 ok.

Samsung S23 (Adreno 740 / A7X) — 1 cell recovered

Cell PR713 status PR699 status
phi-4-mini q4_0 (gpu) crashed ok ✅ (pp=117.6, tg=9.4)

Smoke (18 cells): 18/18 ok.

The remaining 14 / 15 documented PR713-baseline GPU failures on S23 (gemma-4 × 8 quants, phi-3.5 q6_K + q8_0, phi-4 q4_K_M..q8_0) still crash. These appear to be a separate Adreno 740 pipeline bug, not the 1 GB per-allocation cap — out of scope for this PR.

Caveats / what looked like regressions but weren't

  • Myron gemma-4 q8_0 (gpu) first run: −35% pp. Cause: thermal — device had been running for an hour. Cool-device retry: pp=297.8, tg=18.2 → within ±5% of PR713. No regression.
  • S23 phi-4-mini q3_K_M (gpu) first attempt crashed; retry passed cleanly at pp=22.8, tg=4.9 (+23% / +39% vs PR713). Transient.
  • Small-quant qwen3.5-0.8b GPU cells show 10–23% tg dips on both devices, all measured during the same warm-device window as gemma-4 q8_0. Likely the same thermal pattern.

Recommendation

Safe to merge. The wins on large gemma-4 quants (Myron) and the one S23 recovery are real; the apparent regressions traced to thermal / transient issues.

Two follow-ups worth filing separately:

  1. Bench-harness large_buffer_enabled measurement gap — move toggleNativeLog(true) to bench-screen mount (or even app startup), or expose adreno_use_large_buffer as a structured field via llama.rn's JSI API so this signal stops depending on log-capture timing.
  2. S23 Adreno 740 large-model GPU crashes (gemma-4-e2b all quants, phi-3.5 q6_K+, phi-4 q4_K_M+) — not the 1 GB cap, looks like a separate Adreno 740 driver / pipeline issue.

Raw data

Full per-cell JSONs + logs archived at aghorbani@192.168.0.92:~/bench-bundle/baseline/PR699/ (95-line reports/PR699-summary.md has the long-form breakdown).

Generated by PocketPal Dev Team

@a-ghorbani a-ghorbani merged commit 9c00a08 into main May 12, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feat]: Large buffer support for Adreno gpus

2 participants