chore(deps): upgrade llama.rn to 0.12.0 (#722)
Bumps llama.rn from 0.12.0-rc.9 to 0.12.0 stable. Same shape as PR #689 (rc.5 → rc.8), PR #664 (rc.2 → rc.3), and PR #608 (0.11.0 → 0.11.3): exactly three files touched, no app-side code changes.

What ships with 0.12.0

llama.cpp bump b8827 → b9084 (257 commits, 2026-04-17 → 2026-05-09). Highlights relevant to PocketPal:

OpenCL / Adreno (Android GPU):
- opencl: add iq4_nl support (#22272, b8935) — addresses #721
- opencl: Adreno optimization for MoE — MxFP4 (#22301)
- opencl: q4_0 MoE GEMM for Adreno (#22731)
- opencl: refactor Adreno q4_0 (#22335)
- ggml: use CL_DEVICE_GLOBAL_MEM_SIZE for memory-fit estimate (#22688)

Hexagon HTP (Snapdragon NPU):
- hexagon: HMX flash attention (#22347)
- hexagon: process M-tail rows on HMX instead of HVX (#22724)
- hexagon: HTP kernel for GGML_OP_GATED_DELTA_NET (#22837)
- hexagon: L2 norm (#22816), DAIG (#22195), FILL (#22198), SOLVE_TRI (#21974) ops
- hexagon: non-contiguous row tensor support for unary ops (#22574)
- hexagon: configurable vmem and buffer size (#22487)
- hexagon: bump HMX frequency to max corner (#22334)

Metal (iOS):
- metal: optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)
- metal: workaround macOS GPU interactivity watchdog (#22216)
- metal: fix event synchronization (#22260)
- metal: print GPU description (#22318)

Memory / robustness:
- llama: add option to save memory in device buffers (#22679)
- model: don't crash on unsupported architecture (#22742)
- llama: add missing call to ggml_backend_load_all() (#22752)
- fix recurrent state serialization for partial reads/writes (#22362)
- llama: fix device state save/load (#22805)
- fix type casting for unaccounted memory calculation (#22424)

New model architectures:
- Mimo v2.5 (#22493)
- MiniCPM-V 4.6 (mtmd, #22529)
- Sarashina2.2-vision-3b (#22103)
- Reka Edge 2603 (mtmd, #21616)
- HunyuanVL update (#22037)
- Granite Speech 4.0-1b (mtmd, #22101)
- Gemma4 family: detection, parsing, NVFP4 variant
- Nemotron Nano 3 Omni convert (#22481)

Tool-calling / reasoning:
- chat: parallel_tool_calls default by model capability (#22217)
- common/autoparser: newline handling / forced tool-call fixes (#22654)
- common/autoparser: allow space after tool call (#22073)
- chat: fix handling of space in reasoning markers (#22353)
- common: re-arm reasoning budget after DONE on new <think> (#22323)
- common: don't pass prompt tokens to reasoning budget sampler (#22488)

Tokenizer:
- fix GLM-DSA crash in llama-tokenize when using vocab_only (#22102)

llama.rn fix included in 0.12.0:
- fix(cpp, jsi): avoid blocking ui during backend init (mybigday/llama.rn@09e69c2) — backend init and device probing moved off the JS thread, reducing UI jank during first initContext.

Changes

- package.json — pin bumped to 0.12.0
- yarn.lock — only the llama.rn@… entry resolved to the stable tarball; no unrelated drift
- ios/Podfile.lock — llama-rn (0.12.0) with refreshed checksum 3abe0ea5… → 06746f84…; no other pods drift

Verification (NATIVE_CHANGES=YES)

- pod install — clean
- iOS Release build (yarn ios:build:e2e) — PASS
- Android Release build (./gradlew assembleRelease) — PASS
- E2E quick-smoke on iPhone 17 Pro Simulator — 1/1 PASS
- Lint — 0 errors (4 pre-existing warnings)
- TypeCheck — PASS
- Jest — 159 suites / 2222 passed / 2 skipped / 0 failed

Story: TASK-20260512-0948
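For context on where the llama.rn fix called out above lands in app code, here is a minimal sketch of a first context init using llama.rn's documented initLlama API; the model path and parameter values are placeholders, and PocketPal's actual wrapper around this call is not shown.

```ts
// Minimal sketch (assumptions: llama.rn's documented initLlama API;
// model path and parameter values are placeholders, not PocketPal's real settings).
import {initLlama} from 'llama.rn';

export async function initFirstContext(modelPath: string) {
  // With 0.12.0, backend init and device probing run off the JS thread,
  // so this first call should no longer cause UI jank while backends load.
  const context = await initLlama({
    model: modelPath, // e.g. a local GGUF file path
    n_ctx: 2048, // context length
    n_gpu_layers: 99, // offload layers to the GPU backend where supported
  });
  return context;
}
```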
Force-pushed from 066fc98 to 61f0d0b.
Bench results — mapped to the llama.cpp features called out in the PR

Ran the standard bench matrix on both devices (Myron and S23), PR-713 baseline APK vs the PR-722 APK.
Hexagon HMX flash attention (#22347)

The biggest measured win in the PR. Lights up most clearly on small, attention-heavy models with natively HTP-supported quants.

Myron — HTP v81 — 12 / 12 cells faster, no regressions:
S23 — HTP v73 — 6 / 12 clear wins, rest flat, no regressions:
LFM2.5 (SSM hybrid) shows less benefit — expected, SSM has less attention compute for HMX FA to accelerate. q4_K_M / q6_K on S23 hex are flat because they aren't natively HTP-supported; they fall back through a non-FA dequantize path.

Adreno q4_0 GEMM + MoE optimizations (#22335, #22301, #22731)

Myron — Adreno 840 — 13 cells:
S23 — Adreno 740 — 11 cells:
qwen3-1.7b q4_0 GPU is the cleanest q4_0-GEMM win (+20.3 % pp / +53.3 % tg on S23, +21.6 % tg on Myron). The qwen3.5-0.8b q4_0 / q4_K_M cells on S23 GPU show a small pp drop (−7 % to −13 %) — worth a follow-up read; absolute numbers are very low on that pair (~60 pp tps), so it may be measurement noise, but flagging it.

OpenCL IQ4_NL (#22272, the #721 ask)

Ad-hoc run with 3 models × IQ4_NL × opencl. All 6 PR-722 cells loaded cleanly.

S23 — Adreno 740 — same APK pair (PR-713 baseline vs PR-722), same IQ4_NL GGUFs, same device:
So PR-713 already had a basic OpenCL IQ4_NL path (the kernels existed pre-b8935), but tg was around 5 t/s — barely usable — and phi-4-mini crashed at model load. PR-722's upstream sync lands the improved kernels: tg roughly triples / doubles, pp doubles on the larger model, and phi-4-mini now runs cleanly. Total memory unchanged.

Myron — Adreno 840 — PR-722 only (didn't re-run on the PR-713 APK to avoid the HyperOS install dance):
Note: even on PR-722, IQ4_NL prefill is still ~30–50 % slower than Q4_0 on Adreno 840 — expected since IQ4_NL is a more complex quant. The wins are kernel correctness, stability on large models, and the tg jump. This closes the verification ask in #721.

Metal (iOS) — not measured
Bonus: CPU wins (not predicted by the PR body)

Myron — 12 / 12 cells faster, +13 % to +48 % pp, no regressions:
S23 — 4 cells +10 % to +17 % pp, rest flat, no regressions:
Most likely a knock-on from the broader llama.cpp upstream sync rather than any of the named Adreno/Hexagon work.

Coverage gaps
Full raw JSON reports live on the bench host.
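For anyone re-deriving the percentage deltas quoted above from those raw reports, a rough TypeScript sketch follows; the report shape (one record per cell with pp_tps / tg_tps fields) and the file names are assumptions, not the bench host's actual schema.

```ts
// Rough sketch for re-deriving the pp/tg percentage deltas quoted above.
// Assumption: each raw report is a JSON array of cells shaped like BenchCell;
// the actual schema and file names on the bench host may differ.
import {readFileSync} from 'fs';

interface BenchCell {
  model: string; // e.g. "qwen3-1.7b"
  quant: string; // e.g. "q4_0"
  backend: string; // e.g. "opencl" | "hexagon" | "cpu"
  pp_tps: number; // prefill tokens/sec
  tg_tps: number; // generation tokens/sec
}

// +20.3 % pp means (after - before) / before = 0.203
const pct = (before: number, after: number) => ((after - before) / before) * 100;

const baseline: BenchCell[] = JSON.parse(readFileSync('pr-713.json', 'utf8'));
const candidate: BenchCell[] = JSON.parse(readFileSync('pr-722.json', 'utf8'));

for (const b of baseline) {
  const c = candidate.find(
    x => x.model === b.model && x.quant === b.quant && x.backend === b.backend,
  );
  if (!c) continue;
  console.log(
    `${b.model} ${b.quant} ${b.backend}: ` +
      `pp ${pct(b.pp_tps, c.pp_tps).toFixed(1)}%  tg ${pct(b.tg_tps, c.tg_tps).toFixed(1)}%`,
  );
}
```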
Summary
Routine native-dependency version bump:
llama.rn 0.12.0-rc.9 → 0.12.0 (stable). Same shape as PR #689 (rc.5 → rc.8), PR #664 (rc.2 → rc.3), and PR #608 (0.11.0 → 0.11.3): exactly three files touched and no app-side code changes.
What ships with 0.12.0
llama.cpp bump b8827 → b9084 (257 commits, 2026-04-17 → 2026-05-09). Highlights relevant to PocketPal:
- Metal (iOS): GGML_OP_MUL_MAT Tensor API optimization, macOS GPU watchdog workaround, event-sync fix.
- Memory / robustness: llama: add option to save memory in device buffers (#22679), graceful error on unsupported architecture, recurrent state serialization fix.
- llama.rn: fix(cpp, jsi): avoid blocking ui during backend init (mybigday/llama.rn@09e69c2) — backend init + device probing moved off the JS thread, reducing UI jank during first initContext.

Changes
- package.json — pin bumped to 0.12.0
- yarn.lock — only the llama.rn@… entry resolved to the stable tarball; no unrelated drift
- ios/Podfile.lock — llama-rn (0.12.0) with refreshed checksum 3abe0ea5… → 06746f84…; no other pods drift

Verification (NATIVE_CHANGES=YES)
- pod install — clean (Podfile.lock committed)
- iOS Release build (yarn ios:build:e2e) — PASS (.app artefact in worktree)
- Android Release build (./gradlew assembleRelease) — PASS (app-prod-release.apk in worktree)
- E2E quick-smoke on iPhone 17 Pro Simulator — 1/1 PASS (smollm2-135m loaded + streamed tokens through the new native bridge), e2e/reports/2026-05-12T08-02-52-111/summary.json

Story: TASK-20260512-0948
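As a rough illustration of what the quick-smoke exercise covers (load a small model, stream a few tokens through the native bridge), here is a sketch using llama.rn's completion API; the prompt, model path, and pass condition are illustrative placeholders, not the actual E2E test code.

```ts
// Illustrative sketch of the "load + stream tokens" smoke path.
// Assumptions: llama.rn's documented completion API; model path, prompt and
// the token-count check are placeholders, not the real E2E test.
import {initLlama} from 'llama.rn';

export async function smokeStream(modelPath: string): Promise<boolean> {
  const context = await initLlama({model: modelPath, n_ctx: 512});
  let streamed = 0;

  await context.completion(
    {prompt: 'Hello', n_predict: 16},
    () => {
      // Partial-completion callback: fires once per generated token.
      streamed += 1;
    },
  );

  await context.release();
  return streamed > 0; // smoke passes if at least one token streamed back
}
```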
Generated by PocketPal Dev Team