
chore(deps): upgrade llama.rn to 0.12.0 #722

Merged: a-ghorbani merged 1 commit into main from feature/TASK-20260512-0948 on May 12, 2026

Conversation


a-ghorbani (Owner) commented May 12, 2026

Summary

Routine native-dependency version bump: llama.rn 0.12.0-rc.9 → 0.12.0 (stable).

Same shape as PR #689 (rc.5 → rc.8), PR #664 (rc.2 → rc.3), and PR #608 (0.11.0 → 0.11.3): exactly three files touched and no app-side code changes.

What ships with 0.12.0

  • llama.cpp bump b8827 → b9084 (257 commits, 2026-04-17 → 2026-05-09). Highlights relevant to PocketPal:
  • fix(cpp, jsi): avoid blocking ui during backend init (mybigday/llama.rn@09e69c2) — backend init + device probing moved off the JS thread, reducing UI jank during first initContext.

Changes

  • package.json — pin bumped to 0.12.0
  • yarn.lock — only the llama.rn@… entry resolved to the stable tarball; no unrelated drift
  • ios/Podfile.lock — llama-rn (0.12.0) with refreshed checksum 3abe0ea5… → 06746f84…; no other pods drift

Verification (NATIVE_CHANGES=YES)

  • pod install — clean (Podfile.lock committed)
  • iOS Release build (yarn ios:build:e2e) — PASS (.app artefact in worktree)
  • Android Release build (./gradlew assembleRelease) — PASS (app-prod-release.apk in worktree)
  • E2E quick-smoke on iPhone 17 Pro Simulator — 1/1 PASS (smollm2-135m loaded + streamed tokens through the new native bridge), e2e/reports/2026-05-12T08-02-52-111/summary.json
  • Lint — 0 errors (4 pre-existing warnings)
  • TypeCheck — PASS
  • Jest — 159 suites / 2222 passed / 2 skipped / 0 failed; coverage 70.62% statements, 70.67% lines

Story: TASK-20260512-0948

Generated by PocketPal Dev Team

Bumps llama.rn from 0.12.0-rc.9 to 0.12.0 stable. Same shape as PR #689
(rc.5 → rc.8), PR #664 (rc.2 → rc.3), and PR #608 (0.11.0 → 0.11.3):
exactly three files touched, no app-side code changes.

What ships with 0.12.0:

llama.cpp bump b8827 → b9084 (257 commits, 2026-04-17 → 2026-05-09).
Highlights relevant to PocketPal:

OpenCL / Adreno (Android GPU):
  - opencl: add iq4_nl support (#22272, b8935) — addresses #721
  - opencl: Adreno optimization for MoE — MxFP4 (#22301)
  - opencl: q4_0 MoE GEMM for Adreno (#22731)
  - opencl: refactor Adreno q4_0 (#22335)
  - ggml: use CL_DEVICE_GLOBAL_MEM_SIZE for memory-fit estimate (#22688)

Hexagon HTP (Snapdragon NPU):
  - hexagon: HMX flash attention (#22347)
  - hexagon: process M-tail rows on HMX instead of HVX (#22724)
  - hexagon: HTP kernel for GGML_OP_GATED_DELTA_NET (#22837)
  - hexagon: L2 norm (#22816), DAIG (#22195), FILL (#22198),
            SOLVE_TRI (#21974) ops
  - hexagon: non-contiguous row tensor support for unary ops (#22574)
  - hexagon: configurable vmem and buffer size (#22487)
  - hexagon: bump HMX frequency to max corner (#22334)

Metal (iOS):
  - metal: optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)
  - metal: workaround macOS GPU interactivity watchdog (#22216)
  - metal: fix event synchronization (#22260)
  - metal: print GPU description (#22318)

Memory / robustness:
  - llama: add option to save memory in device buffers (#22679)
  - model: don't crash on unsupported architecture (#22742)
  - llama: add missing call to ggml_backend_load_all() (#22752)
  - fix recurrent state serialization for partial reads/writes (#22362)
  - llama: fix device state save/load (#22805)
  - fix type casting for unaccounted memory calculation (#22424)

New model architectures:
  - Mimo v2.5 (#22493)
  - MiniCPM-V 4.6 (mtmd, #22529)
  - Sarashina2.2-vision-3b (#22103)
  - Reka Edge 2603 (mtmd, #21616)
  - HunyuanVL update (#22037)
  - Granite Speech 4.0-1b (mtmd, #22101)
  - Gemma4 family: detection, parsing, NVFP4 variant
  - Nemotron Nano 3 Omni convert (#22481)

Tool-calling / reasoning:
  - chat: parallel_tool_calls default by model capability (#22217)
  - common/autoparser: newline handling / forced tool-call fixes (#22654)
  - common/autoparser: allow space after tool call (#22073)
  - chat: fix handling of space in reasoning markers (#22353)
  - common: re-arm reasoning budget after DONE on new <think> (#22323)
  - common: don't pass prompt tokens to reasoning budget sampler (#22488)

Tokenizer:
  - fix GLM-DSA crash in llama-tokenize when using vocab_only (#22102)

llama.rn fix included in 0.12.0:
  fix(cpp, jsi): avoid blocking ui during backend init
  (mybigday/llama.rn@09e69c2) — backend init and device probing moved
  off the JS thread, reducing UI jank during first initContext.

Changes:
  - package.json — pin bumped to 0.12.0
  - yarn.lock — only the llama.rn@… entry resolved to the stable
    tarball; no unrelated drift
  - ios/Podfile.lock — llama-rn (0.12.0) with refreshed checksum
    3abe0ea5… → 06746f84…; no other pods drift

Verification (NATIVE_CHANGES=YES):
  - pod install — clean
  - iOS Release build (yarn ios:build:e2e) — PASS
  - Android Release build (./gradlew assembleRelease) — PASS
  - E2E quick-smoke on iPhone 17 Pro Simulator — 1/1 PASS
  - Lint — 0 errors (4 pre-existing warnings)
  - TypeCheck — PASS
  - Jest — 159 suites / 2222 passed / 2 skipped / 0 failed

Story: TASK-20260512-0948
a-ghorbani force-pushed the feature/TASK-20260512-0948 branch from 066fc98 to 61f0d0b on May 12, 2026 at 08:49

a-ghorbani commented May 12, 2026

Bench results — mapped to the llama.cpp features called out in the PR

Ran the standard smoke + focused matrices on three test phones, then re-ran cpu+hex with flash_attn=on (the cpu+hex baseline setting) for an apples-to-apples comparison against baseline/PR713. The first sweep used the app-default flash_attn=off, which mismatched the baseline on cpu+hex; all numbers below are from the matched-settings rerun. GPU was already matched (flash=off on both sides).

  • Bench params: pp=256, tg=64, pl=1, nr=3, inter_cell_settle_ms=30000
  • APK md5: 0e5ba5ae8b7ed1fe9dcffa028efe6090
  • Devices: poco-myron (SD 8 Elite, Adreno 840, HTP v81), samsung-s23 (SD 8 Gen 2, Adreno 740, HTP v73), poco-x7-klee (MediaTek, cpu-only)

Memory metric: the table's "total mem" column is llama.cpp's own memory_buffers.total_mib (weights + KV cache + compute buffer, summed from the loader log) — that's the authoritative per-cell footprint. An earlier draft of this comment used peak_memory_mb (sampled process RSS), which is a sweep-overlapping running peak and gave misleading −57 % to +41 % swings. Using total_mib, every cell across cpu / hex / opencl on both devices is within ±1.1 % of baseline — i.e. footprint is effectively unchanged, as expected for a version bump.
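The footprint comparison above can be sketched as a tiny helper. This is illustrative only, not the actual bench-harness code; the only field name taken from the source is llama.cpp's `memory_buffers.total_mib`.

```python
# Hypothetical sketch of the per-cell footprint check described above.
# total_mib is llama.cpp's memory_buffers.total_mib (weights + KV cache +
# compute buffer); the helper itself is invented for illustration.

def footprint_delta_pct(base_mib: float, pr_mib: float) -> float:
    """Percentage change in total_mib between baseline and PR builds."""
    return (pr_mib - base_mib) / base_mib * 100.0

# Example cell from the Myron hex table below: 1426 MB -> 1432 MB
delta = footprint_delta_pct(1426, 1432)
assert abs(delta) <= 1.1  # inside the +/-1.1 % envelope quoted above
```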

Hexagon HMX flash attention (#22347)

The biggest measured win in the PR. Lights up most clearly on small attention-heavy models with natively-HTP-supported quants.

Myron — HTP v81 — 12 / 12 cells faster, no regressions:

| model/quant | pp base → PR-722 (Δ) | tg base → PR-722 (Δ) | total mem base → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3.5-0.8b/q4_0 | 236.4 → 569.4 (+140.9 %) | 10.4 → 16.1 (+54.3 %) | 1426 → 1432 MB (+0.4 %) |
| qwen3.5-0.8b/q4_k_m | 178.7 → 328.1 (+83.6 %) | 10.0 → 14.9 (+49.4 %) | 1467 → 1474 MB (+0.4 %) |
| qwen3.5-0.8b/q6_k | 172.0 → 327.6 (+90.4 %) | 9.8 → 14.5 (+47.8 %) | 1572 → 1578 MB (+0.4 %) |
| qwen3.5-0.8b/q8_0 | 238.7 → 574.3 (+140.6 %) | 10.1 → 15.5 (+53.2 %) | 1771 → 1777 MB (+0.4 %) |
| qwen3-1.7b/q4_0 | 330.0 → 544.9 (+65.1 %) | 30.6 → 31.2 (+2.0 %) | 1984 → 1978 MB (−0.3 %) |
| qwen3-1.7b/q4_k_m | 171.3 → 200.0 (+16.8 %) | 23.6 → 24.5 (+3.7 %) | 1840 → 1860 MB (+1.1 %) |
| qwen3-1.7b/q6_k | 116.5 → 144.9 (+24.3 %) | 18.8 → 19.1 (+1.9 %) | 2213 → 2233 MB (+0.9 %) |
| qwen3-1.7b/q8_0 | 345.9 → 517.1 (+49.5 %) | 23.1 → 22.5 (−2.6 %) | 2874 → 2868 MB (−0.2 %) |
| lfm2.5-1.2b/q4_0 | 629.1 → 908.6 (+44.4 %) | 47.2 → 49.8 (+5.6 %) | 1262 → 1270 MB (+0.6 %) |
| lfm2.5-1.2b/q4_k_m | 261.1 → 298.8 (+14.5 %) | 38.6 → 40.1 (+4.0 %) | 1152 → 1154 MB (+0.2 %) |
| lfm2.5-1.2b/q6_k | 179.9 → 212.2 (+17.9 %) | 30.1 → 31.3 (+4.0 %) | 1373 → 1375 MB (+0.2 %) |
| lfm2.5-1.2b/q8_0 | 612.0 → 829.0 (+35.5 %) | 33.0 → 35.3 (+6.9 %) | 1818 → 1826 MB (+0.4 %) |

S23 — HTP v73 — 6 / 12 clear wins, rest flat, no regressions:

| model/quant | pp base → PR-722 (Δ) | tg base → PR-722 (Δ) | total mem base → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3.5-0.8b/q4_0 | 128.9 → 226.3 (+75.5 %) | 5.7 → 7.2 (+24.7 %) | 1232 → 1231 MB (−0.1 %) |
| qwen3.5-0.8b/q4_k_m | 83.2 → 138.1 (+66.0 %) | 5.6 → 7.2 (+29.1 %) | 1274 → 1271 MB (−0.2 %) |
| qwen3.5-0.8b/q6_k | 85.5 → 140.3 (+64.0 %) | 5.4 → 6.9 (+26.5 %) | 1379 → 1376 MB (−0.2 %) |
| qwen3.5-0.8b/q8_0 | 129.4 → 261.2 (+101.8 %) | 5.7 → 7.3 (+27.3 %) | 1577 → 1576 MB (−0.1 %) |
| qwen3-1.7b/q4_0 | 242.4 → 326.3 (+34.7 %) | 20.4 → 21.5 (+5.3 %) | 1744 → 1744 MB (±0 %) |
| qwen3-1.7b/q4_k_m | 76.6 → 83.1 (+8.5 %) | 17.9 → 17.1 (−4.5 %) | 1760 → 1760 MB (±0 %) |
| qwen3-1.7b/q6_k | 52.4 → 56.5 (+7.7 %) | 11.6 → 11.4 (−1.4 %) | 2133 → 2133 MB (±0 %) |
| qwen3-1.7b/q8_0 | 272.3 → 336.1 (+23.4 %) | 17.2 → 17.1 (−0.4 %) | 2634 → 2634 MB (±0 %) |
| lfm2.5-1.2b/q4_0 | 501.9 → 520.8 (+3.8 %) | 37.0 → 37.1 (+0.4 %) | 994 → 994 MB (±0 %) |
| lfm2.5-1.2b/q4_k_m | 125.4 → 125.0 (−0.3 %) | 29.8 → 31.0 (+4.0 %) | 998 → 998 MB (±0 %) |
| lfm2.5-1.2b/q6_k | 92.3 → 94.0 (+1.8 %) | 19.0 → 20.2 (+6.5 %) | 1219 → 1219 MB (±0 %) |
| lfm2.5-1.2b/q8_0 | 441.1 → 469.9 (+6.5 %) | 30.9 → 30.6 (−1.1 %) | 1550 → 1550 MB (±0 %) |

LFM2.5 (SSM hybrid) shows less benefit — expected, since SSM layers have less attention compute for HMX FA to accelerate. q4_K_M / q6_K on S23 hex are flat because those quants aren't natively HTP-supported; they fall back through a non-FA dequantize path.

Adreno q4_0 GEMM + MoE optimizations (#22335, #22301, #22731)

Myron — Adreno 840 — 13 cells:

| model/quant | pp base → PR-722 (Δ) | tg base → PR-722 (Δ) | total mem base → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3-1.7b/q4_0 | 586.7 → 643.4 (+9.7 %) | 28.7 → 34.9 (+21.6 %) | 1710 → 1710 MB (±0 %) |
| qwen3-1.7b/q4_k_m | 370.5 → 395.3 (+6.7 %) | 23.3 → 21.9 (−6.1 %) | 1758 → 1758 MB (±0 %) |
| qwen3-1.7b/q6_k | 360.5 → 385.6 (+7.0 %) | 23.6 → 21.8 (−7.6 %) | 2131 → 2131 MB (±0 %) |
| qwen3-1.7b/q8_0 | 507.4 → 548.2 (+8.0 %) | 26.6 → 27.4 (+2.9 %) | 2600 → 2600 MB (±0 %) |
| qwen3.5-0.8b/q4_0 | 826.2 → 878.0 (+6.3 %) | 39.9 → 42.7 (+6.8 %) | 1199 → 1199 MB (±0 %) |
| qwen3.5-0.8b/q4_k_m | 645.7 → 702.1 (+8.7 %) | 35.4 → 41.2 (+16.4 %) | 1241 → 1241 MB (±0 %) |
| qwen3.5-0.8b/q6_k | 686.8 → 714.0 (+4.0 %) | 36.7 → 41.1 (+12.0 %) | 1346 → 1346 MB (±0 %) |
| qwen3.5-0.8b/q8_0 | 815.1 → 816.4 (+0.2 %) | 38.0 → 38.4 (+1.2 %) | 1544 → 1544 MB (±0 %) |
| lfm2.5-1.2b/q4_0 | 904.4 → 989.0 (+9.4 %) | 55.3 → 61.4 (+11.1 %) | 954 → 954 MB (±0 %) |
| lfm2.5-1.2b/q4_k_m | 543.3 → 601.7 (+10.7 %) | 49.2 → 54.1 (+9.8 %) | 988 → 988 MB (±0 %) |
| lfm2.5-1.2b/q6_k | 539.0 → 589.6 (+9.4 %) | 44.2 → 48.0 (+8.7 %) | 1209 → 1209 MB (±0 %) |
| lfm2.5-1.2b/q8_0 | 752.9 → 832.2 (+10.5 %) | 39.0 → 40.7 (+4.3 %) | 1510 → 1510 MB (±0 %) |
| phi-4-mini/q4_0 | 292.4 → 311.6 (+6.6 %) | 11.4 → 9.5 (−16.9 %) | 3375 → 3375 MB (±0 %) |

S23 — Adreno 740 — 11 cells:

| model/quant | pp base → PR-722 (Δ) | tg base → PR-722 (Δ) | total mem base → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3-1.7b/q4_0 | 245.6 → 295.5 (+20.3 %) | 12.1 → 18.5 (+53.3 %) | 1710 → 1710 MB (±0 %) |
| qwen3-1.7b/q4_k_m | 185.6 → 199.0 (+7.2 %) | 14.4 → 16.4 (+13.5 %) | 1758 → 1758 MB (±0 %) |
| qwen3-1.7b/q6_k | 151.1 → 156.1 (+3.3 %) | 10.7 → 14.3 (+33.7 %) | 2131 → 2131 MB (±0 %) |
| qwen3.5-0.8b/q4_0 | 63.6 → 55.5 (−12.7 %) | 20.9 → 20.0 (−4.1 %) | 1199 → 1199 MB (±0 %) |
| qwen3.5-0.8b/q4_k_m | 61.8 → 57.4 (−7.1 %) | 21.1 → 19.8 (−6.5 %) | 1241 → 1241 MB (±0 %) |
| qwen3.5-0.8b/q6_k | 58.3 → 60.2 (+3.3 %) | 19.2 → 20.2 (+5.2 %) | 1346 → 1346 MB (±0 %) |
| qwen3.5-0.8b/q8_0 | 60.1 → 63.2 (+5.1 %) | 18.6 → 20.1 (+8.2 %) | 1544 → 1544 MB (±0 %) |
| lfm2.5-1.2b/q4_0 | 470.3 → 472.2 (+0.4 %) | 39.2 → 39.4 (+0.7 %) | 954 → 954 MB (±0 %) |
| lfm2.5-1.2b/q4_k_m | 298.1 → 300.8 (+0.9 %) | 31.4 → 31.5 (+0.1 %) | 988 → 988 MB (±0 %) |
| lfm2.5-1.2b/q6_k | 248.7 → 264.9 (+6.5 %) | 29.1 → 30.2 (+3.7 %) | 1209 → 1209 MB (±0 %) |
| lfm2.5-1.2b/q8_0 | 490.2 → 490.4 (±0 %) | 30.8 → 30.3 (−1.4 %) | 1510 → 1510 MB (±0 %) |

qwen3-1.7b q4_0 GPU is the cleanest q4_0-GEMM win (+20.3 % pp / +53.3 % tg on S23, +21.6 % tg on Myron). The qwen3.5-0.8b q4_0/q4_K_M cells on S23 GPU show a small pp drop (−7 % to −13 %), worth a follow-up read; absolute numbers are very low on that pair (~60 pp t/s), so it may be measurement noise, but flagging it anyway.
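The follow-up flag can be made concrete with a small heuristic. The thresholds here (100 t/s noise floor, 5 % drop) are invented for the sketch; the actual call in this comment was made by eye.

```python
# Illustrative regression-flagging heuristic for the pp deltas above.
# The noise floor and drop threshold are made up for this sketch.

def flag_pp(base_pp: float, pr_pp: float,
            noise_floor_tps: float = 100.0, drop_pct: float = 5.0) -> str:
    delta = (pr_pp - base_pp) / base_pp * 100.0
    if delta >= -drop_pct:
        return "ok"
    # At low absolute throughput, nr=3 run-to-run noise could explain the drop.
    return "follow-up (possible noise)" if base_pp < noise_floor_tps else "regression"

print(flag_pp(63.6, 55.5))    # S23 GPU qwen3.5-0.8b/q4_0 cell -> follow-up (possible noise)
print(flag_pp(245.6, 295.5))  # S23 GPU qwen3-1.7b/q4_0 cell (a win) -> ok
```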

OpenCL IQ4_NL (#22272, the #721 ask)

Ad-hoc sweep: 3 models × IQ4_NL × opencl. All 6 PR-722 cells loaded with effective_backend=opencl and all layers offloaded.

S23 — Adreno 740 — same APK pair (PR-713 baseline vs PR-722), same IQ4_NL GGUFs, same device:

| model/quant | pp PR-713 → PR-722 (Δ) | tg PR-713 → PR-722 (Δ) | total mem PR-713 → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3.5-0.8b/iq4_nl | 51.9 → 61.0 (+17.5 %) | 5.4 → 20.2 (+273.7 %) | 1220 → 1199 MB (−1.7 %) |
| qwen3-1.7b/iq4_nl | 76.6 → 168.1 (+119.6 %) | 5.5 → 11.9 (+116.4 %) | 1740 → 1708 MB (−1.8 %) |
| phi-4-mini/iq4_nl | crashed on load | crashed on load | — → 3369 MB |

So PR-713 already had a basic OpenCL IQ4_NL path (the kernels existed pre-b8935), but tg was around 5 t/s — barely usable — and phi-4-mini crashed at model load. PR-722's upstream sync lands the improved kernels: tg roughly quadruples on the 0.8b and doubles on the 1.7b, pp more than doubles on the larger model, and phi-4-mini now runs cleanly. Total memory is unchanged.
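The headline deltas can be recomputed from the rounded per-cell values in the table; they land within a few tenths of the table's figures, which come from unrounded data.

```python
# Re-deriving the IQ4_NL deltas from the rounded table values above.

def pct_change(base: float, new: float) -> float:
    return (new - base) / base * 100.0

# qwen3.5-0.8b/iq4_nl tg: 5.4 -> 20.2 t/s (table: +273.7 %)
print(round(pct_change(5.4, 20.2), 1))
# qwen3-1.7b/iq4_nl pp: 76.6 -> 168.1 t/s (table: +119.6 %)
print(round(pct_change(76.6, 168.1), 1))
```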

Myron — Adreno 840 — PR-722 only (didn't re-run on PR-713 APK to avoid the HyperOS install dance):

| model/quant | pp | tg | total mem |
| --- | --- | --- | --- |
| qwen3.5-0.8b/iq4_nl | 604.2 | 34.6 | 1199 MB |
| qwen3-1.7b/iq4_nl | 325.2 | 19.9 | 1708 MB |
| phi-4-mini/iq4_nl | 156.9 | 8.8 | 3369 MB |

Note: even on PR-722, IQ4_NL prefill is still ~30–50 % slower than Q4_0 on Adreno 840 — expected since IQ4_NL is a more complex quant. The wins are kernel correctness, stability on large models, and the tg jump.
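The ~30–50 % figure checks out against the two quant pairs measured on Adreno 840 (q4_0 pp from the Myron GPU table, iq4_nl pp from the ad-hoc table; the helper itself is illustrative, not harness code).

```python
# Checking the "IQ4_NL prefill ~30-50 % slower than Q4_0" claim from the
# Adreno 840 numbers above. Invented helper, for illustration only.

def prefill_slowdown_pct(q4_0_pp: float, iq4_nl_pp: float) -> float:
    return (1.0 - iq4_nl_pp / q4_0_pp) * 100.0

print(round(prefill_slowdown_pct(878.0, 604.2), 1))  # qwen3.5-0.8b: ~31 %
print(round(prefill_slowdown_pct(311.6, 156.9), 1))  # phi-4-mini:   ~50 %
```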

This closes the verification ask in #721.

Metal (iOS) — not measured

d1649047 (MUL_MAT Tensor API opt) only runs on Apple. Not in the Android bench rig — would need a separate iOS bench run.

Bonus: CPU wins (not predicted by the PR body)

Myron — 12 / 12 cells faster, +13 % to +48 % pp, no regressions:

| model/quant | pp base → PR-722 (Δ) | tg base → PR-722 (Δ) | total mem base → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3.5-0.8b/q4_0 | 401.8 → 539.6 (+34.3 %) | 62.7 → 67.6 (+7.9 %) | 1191 → 1191 MB (±0 %) |
| qwen3.5-0.8b/q4_k_m | 285.9 → 384.3 (+34.4 %) | 55.9 → 59.5 (+6.4 %) | 1233 → 1233 MB (±0 %) |
| qwen3.5-0.8b/q6_k | 262.8 → 363.5 (+38.3 %) | 50.0 → 52.2 (+4.3 %) | 1338 → 1338 MB (±0 %) |
| qwen3.5-0.8b/q8_0 | 421.8 → 569.6 (+35.1 %) | 53.0 → 53.9 (+1.6 %) | 1536 → 1536 MB (±0 %) |
| qwen3-1.7b/q4_0 | 204.8 → 302.4 (+47.6 %) | 37.5 → 38.6 (+2.9 %) | 1698 → 1698 MB (±0 %) |
| qwen3-1.7b/q4_k_m | 175.9 → 199.9 (+13.7 %) | 34.2 → 34.8 (+1.6 %) | 1746 → 1746 MB (±0 %) |
| qwen3-1.7b/q6_k | 111.6 → 143.1 (+28.2 %) | 24.7 → 26.3 (+6.1 %) | 2119 → 2119 MB (±0 %) |
| qwen3-1.7b/q8_0 | 236.8 → 308.1 (+30.1 %) | 28.2 → 28.2 (±0 %) | 2588 → 2588 MB (±0 %) |
| lfm2.5-1.2b/q4_0 | 356.2 → 484.2 (+35.9 %) | 63.8 → 67.8 (+6.3 %) | 926 → 926 MB (±0 %) |
| lfm2.5-1.2b/q4_k_m | 225.5 → 307.6 (+36.4 %) | 58.8 → 59.8 (+1.6 %) | 960 → 960 MB (±0 %) |
| lfm2.5-1.2b/q6_k | 155.5 → 216.2 (+39.0 %) | 38.7 → 42.2 (+9.0 %) | 1181 → 1181 MB (±0 %) |
| lfm2.5-1.2b/q8_0 | 370.1 → 499.3 (+34.9 %) | 44.0 → 44.9 (+2.1 %) | 1482 → 1482 MB (±0 %) |

S23 — 4 cells +10 % to +17 % pp, rest flat, no regressions:

| model/quant | pp base → PR-722 (Δ) | tg base → PR-722 (Δ) | total mem base → PR-722 (Δ) |
| --- | --- | --- | --- |
| qwen3.5-0.8b/q4_0 | 179.1 → 209.6 (+17.0 %) | 24.9 → 28.3 (+13.4 %) | 1191 → 1191 MB (±0 %) |
| qwen3.5-0.8b/q4_k_m | 140.3 → 151.8 (+8.2 %) | 23.9 → 25.7 (+7.4 %) | 1233 → 1233 MB (±0 %) |
| qwen3.5-0.8b/q6_k | 133.7 → 151.2 (+13.0 %) | 21.6 → 22.9 (+6.3 %) | 1338 → 1338 MB (±0 %) |
| qwen3.5-0.8b/q8_0 | 187.8 → 206.0 (+9.7 %) | 23.6 → 24.9 (+5.3 %) | 1536 → 1536 MB (±0 %) |
| qwen3-1.7b/q4_0 | 103.9 → 118.5 (+14.1 %) | 17.2 → 18.3 (+6.1 %) | 1698 → 1698 MB (±0 %) |
| qwen3-1.7b/q4_k_m | 79.6 → 79.4 (−0.3 %) | 16.0 → 16.1 (+1.1 %) | 1746 → 1746 MB (±0 %) |
| qwen3-1.7b/q6_k | 51.9 → 56.7 (+9.2 %) | 10.4 → 11.0 (+6.1 %) | 2119 → 2119 MB (±0 %) |
| qwen3-1.7b/q8_0 | 107.5 → 107.5 (±0 %) | 13.4 → 14.2 (+5.9 %) | 2588 → 2588 MB (±0 %) |
| lfm2.5-1.2b/q4_0 | 160.8 → 176.6 (+9.8 %) | 26.6 → 26.8 (+1.0 %) | 926 → 926 MB (±0 %) |
| lfm2.5-1.2b/q4_k_m | 125.4 → 130.8 (+4.4 %) | 26.1 → 26.9 (+3.2 %) | 960 → 960 MB (±0 %) |
| lfm2.5-1.2b/q6_k | 85.1 → 86.8 (+2.0 %) | 18.0 → 18.5 (+3.2 %) | 1181 → 1181 MB (±0 %) |
| lfm2.5-1.2b/q8_0 | 167.9 → 184.9 (+10.2 %) | 20.7 → 21.7 (+4.6 %) | 1753 → 1753 MB (±0 %) |

Most likely a knock-on from the broader llama.cpp upstream sync rather than any of the named Adreno/Hexagon work.

Coverage gaps

  • Larger models (phi-4-mini, gemma-4-e2b, etc.) — focused sweep crashed mid-run on Myron + S23 around 30–40 cells in (an Adreno/Hexagon harness limit, not a regression). A per-cell isolation rerun is needed for signal on those.
  • No regressions detected on any hex / cpu cell. One small GPU regression on S23 qwen3.5-0.8b q4_0/q4_K_M (−7 % to −13 % pp at very low absolute pp ~60 tps) — flagged for follow-up.

Full raw JSON reports live on the bench host: aghorbani@192.168.0.92:~/bench-bundle/pr-722/reports/ (SUMMARY.md, adhoc-cpu-hex-flash-on-divergence.md, per-device *-{smoke,focused,adhoc-cpu-hex-on}.json).

Generated by PocketPal Dev Team

a-ghorbani marked this pull request as ready for review on May 12, 2026 at 14:06
a-ghorbani merged commit d4130b4 into main on May 12, 2026 (4 checks passed)


Successfully merging this pull request may close these issues.

[General]: Update llamacpp because starting from version b8935 Opencl supports IQ4_NL which is such a big development for android users
