chore(deps): upgrade llama.rn to 0.12.4 by a-ghorbani · Pull Request #743 · a-ghorbani/pocketpal-ai

a-ghorbani · 2026-05-25T14:02:11Z

Summary

Bumps llama.rn 0.12.3 → 0.12.4. Dependency-only upgrade (package.json + lockfiles). Native iOS + Android builds pass; targeted Jest suites green. Memory profile re-verification on iPhone 13 Pro + Pixel 9 is PENDING: it must be run by a human via the memory-profile skill on physical devices before merge (see Verification below).

Effective llama.cpp range covered by this upgrade: b9254 → b9309 (55 commits) plus one llama.rn-side packaging change.

Changes

package.json — "llama.rn": "0.12.3" → "llama.rn": "0.12.4"
yarn.lock — regenerated (llama.rn block only, 4+/4-)
ios/Podfile.lock — llama-rn (0.12.3) → llama-rn (0.12.4), checksum 2bb735f3… → 94584e50…

3 files, 7 insertions / 7 deletions. No consumer code touched; the llama.rn Jest mock surface is version-agnostic.
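A dependency-only bump like this is easy to sanity-check by confirming that the npm version and the CocoaPods version agree. The following is a minimal sketch, not part of this PR: the function name is hypothetical, it assumes `llama.rn` sits under `dependencies` in package.json, and it works on file contents passed in as strings rather than reading the repo.

```python
import json
import re


def llama_rn_versions(package_json_text: str, podfile_lock_text: str) -> tuple[str, str]:
    """Return (npm_version, pod_version) for llama.rn from the two file contents.

    Assumption: llama.rn is listed under "dependencies" in package.json.
    """
    npm_version = json.loads(package_json_text)["dependencies"]["llama.rn"]
    # Podfile.lock lists resolved pods as "  - llama-rn (0.12.4)" under PODS:.
    # Requiring a leading digit skips "(from `...`)" entries in other sections.
    match = re.search(r"^\s*- llama-rn \((\d[^)]*)\)", podfile_lock_text, re.MULTILINE)
    if match is None:
        raise ValueError("llama-rn pod not found in Podfile.lock text")
    return npm_version, match.group(1)


# Illustrative inputs that mirror the versions named in this PR.
package_json = '{"dependencies": {"llama.rn": "0.12.4"}}'
podfile_lock = "PODS:\n  - llama-rn (0.12.4):\n    - React-Core\n"

npm_version, pod_version = llama_rn_versions(package_json, podfile_lock)
print("in sync" if npm_version == pod_version else "MISMATCH")
```

A mismatch here would usually mean `pod install` was not re-run after the package.json change.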

llama.cpp / llama.rn changelog (PocketPal-relevant)

Scoped to items that touch surfaces PocketPal actually ships. Dropped: server, CUDA, SYCL, Vulkan, WebGPU, ZenDNN, web UI, cmake/ci/docs, perplexity overflow fixes.

llama.rn sync points

iOS: embed ggml-metal source into framework binary (mybigday/llama.rn#349). Packaging change; ggml-metal .metal source now ships inside the framework binary. Transparent to PocketPal's build (verified during iOS Release build below).
0.12.4 syncs to llama.cpp b9297 (mybigday/llama.rn#347)
0.12.4 syncs to llama.cpp b9309 (mybigday/llama.rn#351)

Verification

yarn install clean — yarn.lock change scoped to llama.rn block (4+/4-)
pod install clean — Podfile.lock change scoped to llama-rn pod + checksum
Targeted Jest suites pass — 32/32 suites, 749/749 tests pass (Node 22.21.0)
yarn ios:build:release succeeds (~171s, Build Succeeded, PocketPal.app produced; no new ggml-metal/metallib warnings — llama.rn#349 packaging change transparent)
yarn build:android:release succeeds (~3m 58s, BUILD SUCCESSFUL, app-prod-release.aab ~100 MB produced)
PENDING: Memory profile re-verified on iPhone 13 Pro vs baseline (model qwen3-1.7b); to be run by a human via the memory-profile skill on a physical device
PENDING: Memory profile re-verified on Pixel 9 vs baseline (model qwen3-1.7b); to be run by a human via the memory-profile skill on a physical device

Draft until both memory-profile runs complete and report PASS (regression threshold: >10% AND >200 MB, per e2e/scripts/memory-profile.sh convention).

Risk

Dependency-only; mock surface (loadLlamaModelInfo, LlamaContext, completion, bench, getFormattedChat, initMultimodal) is unchanged. No wrapper or version-string edits required — confirms the quick classification held. Pattern follows prior llama.rn upgrade PRs: #722 (0.12.0 stable), #728 (0.12.1), #740 (0.12.3).

Generated by PocketPal Dev Team

a-ghorbani · 2026-05-25T14:03:42Z

Memory profile — iPhone 13 Pro (`agh`)

Model: qwen3-1.7b · Baseline: b4d08b6 → Current: dbb073e · Threshold: regression if >10% AND >200 MB

Checkpoint	Baseline	Current	Δ	Δ %
app_launch	91.8 MB	80.8 MB	−11.1 MB	−12.0%
models_screen	91.3 MB	82.3 MB	−9.0 MB	−9.9%
chat_screen	93.2 MB	84.7 MB	−8.5 MB	−9.1%
model_loaded	2148.4 MB	2158.1 MB	+9.7 MB	+0.5%
chat_active	2143.9 MB	2145.2 MB	+1.3 MB	+0.1%
post_chat_idle	2140.6 MB	2147.7 MB	+7.1 MB	+0.3%
model_unloaded	140.2 MB	140.3 MB	+0.1 MB	+0.1%
Peak	2148.4 MB	2158.1 MB	+9.7 MB	+0.5%

Verdict: ✅ PASS. Peak +0.5% (+9.7 MB on a ~2.1 GB base), well within threshold. The UI checkpoints are lower than baseline (−8.5 to −11.1 MB across the launch, models, and chat screens). The +9.7 MB at model_loaded is consistent with the framework-size increase from embedding ggml-metal source into the iOS framework binary (llama.rn #349).

Pixel 9 run pending; will follow up.


a-ghorbani · 2026-05-25T15:08:04Z

Memory profile — Pixel 9 (`pixel-9-real`)

Model: qwen3-1.7b · Baseline: b4d08b6 → Current: dbb073e (e2e APK from run 26404581116) · Threshold: regression if >10% AND >200 MB

Checkpoint	Baseline	Current	Δ	Δ %
app_launch	225.6 MB	272.3 MB	+46.8 MB	+20.7%
models_screen	228.4 MB	253.3 MB	+25.0 MB	+10.9%
chat_screen	216.9 MB	259.1 MB	+42.3 MB	+19.5%
model_loaded	1732.6 MB	1806.5 MB	+73.9 MB	+4.3%
chat_active	1809.7 MB	1860.4 MB	+50.7 MB	+2.8%
post_chat_idle	1810.6 MB	1859.9 MB	+49.2 MB	+2.7%
model_unloaded	345.5 MB	389.9 MB	+44.4 MB	+12.8%
Peak	1810.6 MB	1860.4 MB	+49.8 MB	+2.8%

Verdict: ✅ PASS. Peak +2.8% (+49.8 MB), well below the regression threshold (>10% AND >200 MB). The Δ% on the low-memory UI checkpoints (launch / models / chat / unloaded) exceeds 10%, but the absolute swing is only ~25–47 MB on a ~217–346 MB base. That is normal Android jitter at this size, not a real regression, and the AND-condition gate correctly says PASS. On the loaded checkpoints (model_loaded / chat_active / post_chat_idle), where the working set is dominated by the model itself, the deltas tighten to +2.7–4.3% (+49–74 MB), consistent with the iPhone result above (+0.5% peak). No memory regression introduced by the llama.rn 0.12.3 → 0.12.4 bump on Android.
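The AND-condition gate can be sketched as follows. This is an illustration of the rule as stated in this thread (>10% AND >200 MB), not the actual logic of e2e/scripts/memory-profile.sh; the function name and the per-checkpoint application are assumptions. The numbers are the Pixel 9 checkpoints from the table above.

```python
def is_regression(baseline_mb: float, current_mb: float,
                  pct_threshold: float = 10.0,
                  abs_threshold_mb: float = 200.0) -> bool:
    """Flag a regression only if BOTH the relative and absolute limits are exceeded."""
    delta_mb = current_mb - baseline_mb
    delta_pct = 100.0 * delta_mb / baseline_mb
    return delta_pct > pct_threshold and delta_mb > abs_threshold_mb


# (baseline MB, current MB) per checkpoint, Pixel 9, qwen3-1.7b.
pixel9 = {
    "app_launch": (225.6, 272.3),
    "models_screen": (228.4, 253.3),
    "chat_screen": (216.9, 259.1),
    "model_loaded": (1732.6, 1806.5),
    "chat_active": (1809.7, 1860.4),
    "post_chat_idle": (1810.6, 1859.9),
    "model_unloaded": (345.5, 389.9),
}

verdicts = {name: is_regression(b, c) for name, (b, c) in pixel9.items()}
print("FAIL" if any(verdicts.values()) else "PASS")  # prints "PASS"
```

Requiring both conditions keeps small-base checkpoints such as app_launch (+20.7% but only about +47 MB) from tripping the gate, while a large absolute jump on a loaded checkpoint would still need to exceed 10% to count.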


a-ghorbani · 2026-05-25T16:11:08Z

PR-743 (llama.rn 0.12.3 → 0.12.4) — bench results

Smoke + focused matrix, 3 devices, matched-settings (cpu+hex flash_attn=on, gpu flash_attn=off), pp=256 tg=64 pl=1 nr=3 inter_cell_settle_ms=30000. APK from run 26404581116. Compared against PR-740 (immediately-preceding llama.rn bump 0.12.1→0.12.3).

For routine llama.rn-bump PRs we only compare against the prior bench (PR-740 here), not the PR-713 baseline. The standardised structure mirrors PR-740's own comment: TL;DR → coverage → summary medians → representative-cell absolutes → key findings → caveats → recommendation.

Coverage

device	smoke	focused	notes
poco-myron (SD8 Elite, Adreno 840, HTP v81)	27/27	60/60	full matrix, no crashes
samsung-s23 (SD8 Gen 2, Adreno 740, HTP v73)	27/27	42/60	focused-gpu died at `phi-4-mini/q4_0` (cell 13/20) — same GPU pipeline crash as PR-740. focused-hex died at `phi-4-mini/q6_k` (cell 30/40) — new this PR but only loses cells 30–40, so the 14 matched-vs-PR-740 hex cells are unaffected
poco-x7-klee (MT6899, cpu only)	9/9	15/20	app crash at `phi-4-mini/q6_k` (cell 15/20). gemma-4-e2b expected to OOM at load (~7.5 GiB RAM). Same coverage as PR-740
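For reference, the lost-cell counts follow directly from the focused column of the coverage table. A small sketch (the `Coverage` class is hypothetical, introduced only for this tally):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    done: int      # cells that completed
    planned: int   # cells in the matrix

    @property
    def lost(self) -> int:
        return self.planned - self.done


# Focused-matrix coverage copied from the table above.
focused = {
    "poco-myron": Coverage(60, 60),
    "samsung-s23": Coverage(42, 60),
    "poco-x7-klee": Coverage(15, 20),
}

lost = {device: cov.lost for device, cov in focused.items()}
for device, n in lost.items():
    print(f"{device}: {n} focused cells lost")
```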

vs PR-740

Summary — median Δ per (device, backend)

Each Δ is the median of per-cell percent changes. Absolutes aren't aggregated here because mixing model/quant cells would mix workloads; see the representative-cell table for real tok/s.
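A minimal sketch of that aggregation, using made-up cells rather than real bench data (the function names and the three sample cells are hypothetical). It also shows why absolutes are not pooled: one fast cell would dominate the result.

```python
from statistics import median


def pct_change(current: float, previous: float) -> float:
    """Percent change of current relative to previous."""
    return 100.0 * (current - previous) / previous


def median_delta(cells: list[tuple[float, float]]) -> float:
    """Median of per-cell percent changes; cells are (current, previous) tok/s pairs."""
    return median(pct_change(cur, prev) for cur, prev in cells)


# Hypothetical (current, previous) tok/s for three different model/quant cells.
cells = [(110.0, 100.0), (52.0, 50.0), (600.0, 500.0)]

per_cell_median = median_delta(cells)
# Pooling absolutes instead lets the 600 tok/s cell dominate the figure.
pooled = pct_change(sum(c for c, _ in cells), sum(p for _, p in cells))

print(f"median of per-cell deltas: {per_cell_median:+.1f}%")
print(f"delta of pooled tok/s:     {pooled:+.1f}%")
```

With these sample cells the per-cell deltas are +10%, +4%, and +20%, so the median is +10%, while the pooled figure is pulled toward the fastest cell.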

smoke

device	backend	n	Δpp	Δtg	Δtotal_mib
poco-myron	cpu	9	+13.5%	+0.3%	+0.0%
poco-myron	gpu	9	+0.5%	-0.4%	+0.0%
poco-myron	hexagon	9	+16.4%	+4.2%	+0.0%
samsung-s23	cpu	9	-1.1%	-2.7%	+0.0%
samsung-s23	gpu	9	+9.1%	+2.9%	+0.0%
samsung-s23	hexagon	9	+1.4%	-2.4%	+0.0%
poco-x7-klee	cpu	9	+2.6%	-1.1%	+0.0%

focused

device	backend	n	Δpp	Δtg	Δtotal_mib
poco-myron	cpu	20	-1.3%	+0.7%	+0.0%
poco-myron	gpu	20	+2.3%	+3.4%	+0.0%
poco-myron	hexagon	20	+3.4%	+3.5%	+0.0%
samsung-s23	cpu	14	+0.9%	+4.3%	+0.0%
samsung-s23	gpu	13	+0.0%	-2.6%	+0.0%
samsung-s23	hexagon	14	+5.5%	+0.6%	+0.0%
poco-x7-klee	cpu	15	+0.5%	-0.1%	+0.0%

Representative cell — `qwen3-1.7b/q4_0` vs PR-740

Single fixed cell so the absolutes are real tok/s, not mixed workloads. Picked because it runs on all 3 devices, all 3 backends, both smoke and focused matrices.

smoke

device	backend	pp PR-743	pp PR-740	Δpp	tg PR-743	tg PR-740	Δtg
poco-myron	cpu	297.8	252.7	+17.8%	38.6	38.1	+1.4%
poco-myron	gpu	588.9	586.4	+0.4%	28.0	28.7	-2.5%
poco-myron	hexagon	890.2	727.2	+22.4%	30.9	31.6	-2.3%
samsung-s23	cpu	109.7	112.0	-2.1%	17.7	17.6	+0.5%
samsung-s23	gpu	253.8	228.6	+11.0%	14.3	12.4	+14.9%
samsung-s23	hexagon	515.6	461.7	+11.7%	21.2	21.1	+0.5%
poco-x7-klee	cpu	171.2	165.8	+3.2%	21.3	21.9	-2.6%

focused

device	backend	pp PR-743	pp PR-740	Δpp	tg PR-743	tg PR-740	Δtg
poco-myron	cpu	205.6	212.1	-3.1%	38.7	38.0	+1.8%
poco-myron	gpu	540.8	536.7	+0.8%	27.1	27.1	+0.3%
poco-myron	hexagon	831.1	720.0	+15.4%	32.5	30.4	+6.6%
samsung-s23	cpu	113.0	110.2	+2.5%	18.3	17.1	+7.3%
samsung-s23	gpu	252.2	264.3	-4.6%	15.8	16.6	-5.2%
samsung-s23	hexagon	487.4	459.9	+6.0%	19.9	20.9	-4.7%
poco-x7-klee	cpu	143.2	144.5	-0.9%	22.0	22.3	-1.3%
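The Δ columns can be spot-checked from the absolutes. The sketch below recomputes three pp deltas from the tables above; the helper name is hypothetical. Because the table shows tok/s rounded to one decimal, a recomputed value can differ from the published Δ by a few tenths of a percent on small numbers (the tg columns especially), presumably because the published Δ was computed from unrounded values.

```python
def pct_change(current: float, previous: float) -> float:
    """Percent change of current relative to previous."""
    return 100.0 * (current - previous) / previous


# (tier, device, backend, pp PR-743, pp PR-740) copied from the tables above.
rows = [
    ("smoke", "poco-myron", "hexagon", 890.2, 727.2),
    ("focused", "poco-myron", "hexagon", 831.1, 720.0),
    ("smoke", "samsung-s23", "gpu", 253.8, 228.6),
]

recomputed = [round(pct_change(cur, prev), 1) for *_, cur, prev in rows]
for (tier, device, backend, _, _), delta in zip(rows, recomputed):
    print(f"{tier:8s} {device:12s} {backend:8s} Δpp {delta:+.1f}%")
```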

Key findings

Hex on HTP v81 (Myron) recovers the PR-740 regression. Smoke median +16.4% pp, focused median +3.4% pp; representative-cell pp jumps +22.4% smoke / +15.4% focused on qwen3-1.7b/q4_0. PR-740's headline finding was a -14% pp walkback on Myron hex; PR-743 takes most of that back, putting Myron hex at roughly +83% pp vs the PR-713 baseline (PR-740 was at +66–77%). Natural candidates: HMX quantized matmul rework (ggml#23368) and the repl optimization in flash-attn softmax (ggml#23455) — both PocketPal-relevant Hexagon items listed in the PR body.
S23 GPU smoke is up +9.1% pp / +2.9% tg (median), with the representative cell at +11.0% pp / +14.9% tg; the focused tier does not repeat this (median +0.0% pp / -2.6% tg, representative cell -4.6% pp / -5.2% tg). Likely causes of the smoke uplift: the OpenCL batch profiling speedup (ggml#23495) and/or backend init refactor (ggml#23318). S23 hex is also +11.7% pp on the smoke representative cell (+6.0% focused). Myron GPU stays flat: same workload but a different SoC; the new Adreno MoE generalisation (ggml#23449) is a MoE-only path and our matrix has no MoE models, so no win is expected there.
CPU on Myron is asymmetric: +13.5% smoke / -1.3% focused. This matches the thermal pattern we've documented on this device: smoke runs cool, focused runs warm (it runs second). PR-740's CPU regression was also worse on focused than smoke. Net read: CPU code itself is approximately flat vs PR-740 (the smoke uplift is largely thermal-favourable). Klee CPU and S23 CPU are flat-to-marginal across both tiers.
Memory: total_mib unchanged across the board (all backends, all devices, 0.0% median). No memory regression from the bump on the Android bench, consistent with the +0.5% iOS / +2.8% Android peaks reported separately in the memory-profile comments.
No real backend fallbacks. Every gpu cell reports effective_backend=opencl (label-only rename, same Adreno path) — identical pattern to PR-728 and PR-740. No silent CPU fallbacks anywhere.

Caveats

Thermal: Myron CPU smoke +13.5% is partly thermal (cool device); the focused -1.3% is the steady-state read. Take CPU deltas with that in mind.
Coverage gaps: S23 lost 18 focused cells (gpu died at phi-4-mini/q4_0 cell 13/20 = same crash mode as PR-740; hex died at phi-4-mini/q6_k cell 30/40 — new this PR but the matched-vs-PR-740 hex count is unchanged at 14, so the deltas above are computed on the same cell set). Klee lost 5 focused cells (phi-4-mini/q6_k crash + all 4 gemma-4-e2b cells expected to OOM at 7.5 GiB RAM).
MTP / multimodal not exercised: PR-743 includes several MTP refinements and mtmd changes (DeepSeek-OCR, HunyuanOCR→HunyuanVL merge, WAV MIME). None are measured by this single-prompt non-speculative text bench.
Apple-only items not measured here: the iOS-side packaging change (ggml-metal embedded in framework) and the Metal concat kernel optimization were exercised via the iOS memory-profile run (separate comment above, peak +0.5%).
total_mib is from log_signals.memory_buffers (per AGENTS.md §7). peak_memory_mb is omitted from the summary tables because it's a noisy process-RSS sample.

Recommendation

✅ Safe to merge from a perf standpoint. PR-743 is a clear net win on Hexagon (Myron especially) and on S23 GPU in the smoke tier, flat-to-slightly-positive everywhere else, with zero memory regression. The Hexagon perf walkback that PR-740 introduced on HTP v81 is mostly recovered here. The only crash not already seen in PR-740 is the S23 hex crash at phi-4-mini/q6_k; it doesn't affect comparable cells, so it is worth a follow-up if it reproduces on subsequent runs but is not a blocker for this PR.

Reports on bench host

~/bench-bundle/bench-results/pr-743/reports/SUMMARY.md
~/bench-bundle/bench-results/pr-743/reports/divergence-vs-pr-740.md (full per-cell tables vs PR-740)
Raw per-backend reports under ~/bench-bundle/bench-results/pr-743/reports/poco-myron-{smoke,focused}-{on,off}.json etc.

chore(deps): upgrade llama.rn to 0.12.4

dbb073e

a-ghorbani marked this pull request as ready for review May 25, 2026 15:03

a-ghorbani merged commit 2a88442 into main May 25, 2026
5 checks passed

a-ghorbani deleted the feature/TASK-20260525-1530 branch May 25, 2026 15:23

