
chore(deps): upgrade llama.rn to 0.12.4 #743

Merged: a-ghorbani merged 1 commit into main from feature/TASK-20260525-1530 on May 25, 2026

@a-ghorbani
Owner

Summary

Bumps llama.rn 0.12.3 → 0.12.4. Dependency-only upgrade (package.json + lockfiles). Native iOS + Android builds pass; targeted Jest suites green. Memory profile re-verification on iPhone 13 Pro + Pixel 9 is PENDING (must be run by human via memory-profile skill on physical devices before merge — see Verification below).

Effective llama.cpp range covered by this upgrade: b9254 → b9309 (55 commits) plus one llama.rn-side packaging change.

Changes

  • package.json: "llama.rn": "0.12.3" → "llama.rn": "0.12.4"
  • yarn.lock — regenerated (llama.rn block only, 4+/4-)
  • ios/Podfile.lock: llama-rn (0.12.3) → llama-rn (0.12.4), checksum 2bb735f3… → 94584e50…

3 files, 7 insertions / 7 deletions. No consumer code touched; the llama.rn Jest mock surface is version-agnostic.
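
As a rough illustration of the "dependency-only" claim, the declared versions in package.json and Podfile.lock can be cross-checked mechanically. This is a minimal sketch, not part of the PR: the helper names and the inline sample file contents are hypothetical, and it assumes llama.rn sits under `dependencies` in package.json and that Podfile.lock lists pods as `- name (version)`.

```python
import json
import re


def package_json_version(text, dep="llama.rn"):
    """Return the version string declared for `dep` under "dependencies"."""
    return json.loads(text)["dependencies"][dep]


def podfile_lock_version(text, pod="llama-rn"):
    """Return the resolved pod version, or None if the pod is not listed.

    The version must start with a digit, so entries such as
    "- llama-rn (from `../node_modules/llama.rn`)" are ignored.
    """
    pattern = r"^\s*-\s+" + re.escape(pod) + r"\s+\((\d[^)]*)\)"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Hypothetical, heavily trimmed file contents for illustration only.
package_json = '{"dependencies": {"llama.rn": "0.12.4"}}'
podfile_lock = "PODS:\n  - llama-rn (0.12.4):\n    - React-Core\n"

versions = {
    package_json_version(package_json),
    podfile_lock_version(podfile_lock),
}
in_sync = versions == {"0.12.4"}
print("in sync:", in_sync)
```

A check like this only confirms that the two files agree on the version; it does not replace the `yarn install` and `pod install` runs listed under Verification.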

llama.cpp / llama.rn changelog (PocketPal-relevant)

Scoped to items that touch surfaces PocketPal actually ships. Dropped: server, CUDA, SYCL, Vulkan, WebGPU, ZenDNN, web UI, cmake/ci/docs, perplexity overflow fixes.

Hexagon NPU (Snapdragon)

OpenCL / Adreno

Metal (Apple)

Correctness / crash fixes

Speculative decoding (MTP) — refinements

Multimodal

Vocab / tokenizer

llama.rn sync points

Verification

  • yarn install clean — yarn.lock change scoped to llama.rn block (4+/4-)
  • pod install clean — Podfile.lock change scoped to llama-rn pod + checksum
  • Targeted Jest suites pass — 32/32 suites, 749/749 tests pass (Node 22.21.0)
  • yarn ios:build:release succeeds (~171s, Build Succeeded, PocketPal.app produced; no new ggml-metal/metallib warnings — llama.rn#349 packaging change transparent)
  • yarn build:android:release succeeds (~3m 58s, BUILD SUCCESSFUL, app-prod-release.aab ~100 MB produced)
  • PENDING — Memory profile re-verified on iPhone 13 Pro vs baseline (model qwen3-1.7b) — to be run by human via memory-profile skill on physical device
  • PENDING — Memory profile re-verified on Pixel 9 vs baseline (model qwen3-1.7b) — to be run by human via memory-profile skill on physical device

Draft until both memory-profile runs complete and report PASS (regression threshold: >10% AND >200 MB, per e2e/scripts/memory-profile.sh convention).

Risk

Dependency-only; mock surface (loadLlamaModelInfo, LlamaContext, completion, bench, getFormattedChat, initMultimodal) is unchanged. No wrapper or version-string edits required — confirms the quick classification held. Pattern follows prior llama.rn upgrade PRs: #722 (0.12.0 stable), #728 (0.12.1), #740 (0.12.3).

Generated by PocketPal Dev Team

@a-ghorbani
Owner Author

Memory profile — iPhone 13 Pro (agh)

Model: qwen3-1.7b · Baseline: b4d08b6 → Current: dbb073e · Threshold: regression if >10% AND >200 MB

| Checkpoint | Baseline | Current | Δ | Δ % |
|---|---|---|---|---|
| app_launch | 91.8 MB | 80.8 MB | −11.1 MB | −12.0% |
| models_screen | 91.3 MB | 82.3 MB | −9.0 MB | −9.9% |
| chat_screen | 93.2 MB | 84.7 MB | −8.5 MB | −9.1% |
| model_loaded | 2148.4 MB | 2158.1 MB | +9.7 MB | +0.5% |
| chat_active | 2143.9 MB | 2145.2 MB | +1.3 MB | +0.1% |
| post_chat_idle | 2140.6 MB | 2147.7 MB | +7.1 MB | +0.3% |
| model_unloaded | 140.2 MB | 140.3 MB | +0.1 MB | +0.1% |
| Peak | 2148.4 MB | 2158.1 MB | +9.7 MB | +0.5% |

Verdict: ✅ PASS — peak +0.5% (+9.7 MB on a 2 GB base, well within threshold). UI checkpoints actually lower than baseline (−8 to −11 MB across launch/models/chat screens). The +9.7 MB at model_loaded is consistent with the framework-size increase from embedding ggml-metal source into the iOS framework binary (llama.rn #349).

Pixel 9 run pending; will follow up.



a-ghorbani marked this pull request as ready for review on May 25, 2026 15:03
@a-ghorbani
Owner Author

Memory profile — Pixel 9 (pixel-9-real)

Model: qwen3-1.7b · Baseline: b4d08b6 → Current: dbb073e (e2e APK from run 26404581116) · Threshold: regression if >10% AND >200 MB

| Checkpoint | Baseline | Current | Δ | Δ % |
|---|---|---|---|---|
| app_launch | 225.6 MB | 272.3 MB | +46.8 MB | +20.7% |
| models_screen | 228.4 MB | 253.3 MB | +25.0 MB | +10.9% |
| chat_screen | 216.9 MB | 259.1 MB | +42.3 MB | +19.5% |
| model_loaded | 1732.6 MB | 1806.5 MB | +73.9 MB | +4.3% |
| chat_active | 1809.7 MB | 1860.4 MB | +50.7 MB | +2.8% |
| post_chat_idle | 1810.6 MB | 1859.9 MB | +49.2 MB | +2.7% |
| model_unloaded | 345.5 MB | 389.9 MB | +44.4 MB | +12.8% |
| Peak | 1810.6 MB | 1860.4 MB | +49.8 MB | +2.8% |

Verdict: ✅ PASS. Peak is +2.8% (+49.8 MB), well within the >10% AND >200 MB regression threshold. The Δ% on the low-memory checkpoints (launch / models / chat / unloaded) exceeds 10%, but the absolute swing is only ~25–47 MB on baselines of roughly 217–346 MB. That is normal Android jitter at this size, not a real regression, and the AND-condition gate correctly says PASS. On the loaded checkpoints (model_loaded / chat_active / post_chat_idle), where the working set is dominated by the model itself, the deltas tighten to +2.7–4.3% / +50–74 MB, consistent with the iPhone result above (+0.5% peak). No memory regression is introduced by the llama.rn 0.12.3 → 0.12.4 bump on Android.
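
The AND-condition gate can be sketched as follows. This is an illustration only: the function name and default limits are assumptions taken from the threshold stated in this comment, and the actual logic in e2e/scripts/memory-profile.sh may differ. The sample values are Pixel 9 checkpoints from this comment.

```python
def is_regression(baseline_mb, current_mb, pct_limit=10.0, abs_limit_mb=200.0):
    """Flag a regression only when BOTH the relative and absolute limits are exceeded."""
    delta_mb = current_mb - baseline_mb
    delta_pct = 100.0 * delta_mb / baseline_mb
    return delta_pct > pct_limit and delta_mb > abs_limit_mb


# (baseline MB, current MB) pairs copied from the Pixel 9 table.
checkpoints = {
    "app_launch": (225.6, 272.3),      # +20.7%, but only about +47 MB
    "model_unloaded": (345.5, 389.9),  # +12.8%, but only about +44 MB
    "peak": (1810.6, 1860.4),          # +2.8%, about +50 MB
}

verdicts = {name: is_regression(b, c) for name, (b, c) in checkpoints.items()}
for name, flagged in verdicts.items():
    print(f"{name}: {'REGRESSION' if flagged else 'PASS'}")
```

Requiring both conditions keeps small-footprint checkpoints from tripping the gate on percentage alone, while a large absolute jump on a multi-GB working set still needs to exceed 10% to count.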



a-ghorbani merged commit 2a88442 into main on May 25, 2026
5 checks passed
a-ghorbani deleted the feature/TASK-20260525-1530 branch on May 25, 2026 15:23
@a-ghorbani
Owner Author

PR-743 (llama.rn 0.12.3 → 0.12.4) — bench results

Smoke + focused matrix, 3 devices, matched-settings (cpu+hex flash_attn=on, gpu flash_attn=off), pp=256 tg=64 pl=1 nr=3 inter_cell_settle_ms=30000. APK from run 26404581116. Compared against PR-740 (immediately-preceding llama.rn bump 0.12.1→0.12.3).

For routine llama.rn-bump PRs we only compare against the prior bench (PR-740 here), not the PR-713 baseline. The standardised structure mirrors PR-740's own comment: TL;DR → coverage → summary medians → representative-cell absolutes → key findings → caveats → recommendation.

Coverage

| device | smoke | focused | notes |
|---|---|---|---|
| poco-myron (SD8 Elite, Adreno 840, HTP v81) | 27/27 | 60/60 | full matrix, no crashes |
| samsung-s23 (SD8 Gen 2, Adreno 740, HTP v73) | 27/27 | 42/60 | focused-gpu died at phi-4-mini/q4_0 (cell 13/20), the same GPU pipeline crash as PR-740. focused-hex died at phi-4-mini/q6_k (cell 30/40); this is new in this PR but only loses cells 30–40, so the 14 matched-vs-PR-740 hex cells are unaffected |
| poco-x7-klee (MT6899, cpu only) | 9/9 | 15/20 | app crash at phi-4-mini/q6_k (cell 15/20). gemma-4-e2b expected to OOM at load (~7.5 GiB RAM). Same coverage as PR-740 |

vs PR-740

Summary — median Δ per (device, backend)

Each Δ is the median of per-cell percent changes. Absolutes aren't aggregated here because mixing model/quant cells would mix workloads; see the representative-cell table for real tok/s.
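
A minimal sketch of this aggregation, assuming each cell is a (PR-743, PR-740) pair of throughput values. The function names and sample numbers are hypothetical; they are chosen to show why the median of per-cell changes can differ from a change computed on pooled absolutes.

```python
from statistics import median


def pct_change(current, previous):
    """Percent change of `current` relative to `previous`."""
    return 100.0 * (current - previous) / previous


def median_delta(cells):
    """Median of per-cell percent changes; `cells` holds (current, previous) pairs."""
    return median(pct_change(cur, prev) for cur, prev in cells)


# Hypothetical cells with very different absolute throughput (tok/s).
cells = [(110.0, 100.0), (210.0, 200.0), (900.0, 1000.0)]

per_cell = median_delta(cells)  # median of +10%, +5%, -10%
pooled = pct_change(sum(c for c, _ in cells), sum(p for _, p in cells))

print(f"median of per-cell deltas: {per_cell:+.1f}%")
print(f"delta of pooled absolutes: {pooled:+.1f}%")
```

In this made-up example the single high-throughput cell dominates the pooled figure and flips its sign, which is the workload-mixing problem the per-cell median avoids.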

smoke

| device | backend | n | Δpp | Δtg | Δtotal_mib |
|---|---|---|---|---|---|
| poco-myron | cpu | 9 | +13.5% | +0.3% | +0.0% |
| poco-myron | gpu | 9 | +0.5% | -0.4% | +0.0% |
| poco-myron | hexagon | 9 | +16.4% | +4.2% | +0.0% |
| samsung-s23 | cpu | 9 | -1.1% | -2.7% | +0.0% |
| samsung-s23 | gpu | 9 | +9.1% | +2.9% | +0.0% |
| samsung-s23 | hexagon | 9 | +1.4% | -2.4% | +0.0% |
| poco-x7-klee | cpu | 9 | +2.6% | -1.1% | +0.0% |

focused

| device | backend | n | Δpp | Δtg | Δtotal_mib |
|---|---|---|---|---|---|
| poco-myron | cpu | 20 | -1.3% | +0.7% | +0.0% |
| poco-myron | gpu | 20 | +2.3% | +3.4% | +0.0% |
| poco-myron | hexagon | 20 | +3.4% | +3.5% | +0.0% |
| samsung-s23 | cpu | 14 | +0.9% | +4.3% | +0.0% |
| samsung-s23 | gpu | 13 | +0.0% | -2.6% | +0.0% |
| samsung-s23 | hexagon | 14 | +5.5% | +0.6% | +0.0% |
| poco-x7-klee | cpu | 15 | +0.5% | -0.1% | +0.0% |

Representative cell — qwen3-1.7b/q4_0 vs PR-740

Single fixed cell so the absolutes are real tok/s, not mixed workloads. Picked because it runs on all 3 devices, all 3 backends, both smoke and focused matrices.

smoke

| device | backend | pp PR-743 (tok/s) | pp PR-740 (tok/s) | Δpp | tg PR-743 (tok/s) | tg PR-740 (tok/s) | Δtg |
|---|---|---|---|---|---|---|---|
| poco-myron | cpu | 297.8 | 252.7 | +17.8% | 38.6 | 38.1 | +1.4% |
| poco-myron | gpu | 588.9 | 586.4 | +0.4% | 28.0 | 28.7 | -2.5% |
| poco-myron | hexagon | 890.2 | 727.2 | +22.4% | 30.9 | 31.6 | -2.3% |
| samsung-s23 | cpu | 109.7 | 112.0 | -2.1% | 17.7 | 17.6 | +0.5% |
| samsung-s23 | gpu | 253.8 | 228.6 | +11.0% | 14.3 | 12.4 | +14.9% |
| samsung-s23 | hexagon | 515.6 | 461.7 | +11.7% | 21.2 | 21.1 | +0.5% |
| poco-x7-klee | cpu | 171.2 | 165.8 | +3.2% | 21.3 | 21.9 | -2.6% |

focused

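| device | backend | pp PR-743 (tok/s) | pp PR-740 (tok/s) | Δpp | tg PR-743 (tok/s) | tg PR-740 (tok/s) | Δtg |
|---|---|---|---|---|---|---|---|
| poco-myron | cpu | 205.6 | 212.1 | -3.1% | 38.7 | 38.0 | +1.8% |
| poco-myron | gpu | 540.8 | 536.7 | +0.8% | 27.1 | 27.1 | +0.3% |
| poco-myron | hexagon | 831.1 | 720.0 | +15.4% | 32.5 | 30.4 | +6.6% |
| samsung-s23 | cpu | 113.0 | 110.2 | +2.5% | 18.3 | 17.1 | +7.3% |
| samsung-s23 | gpu | 252.2 | 264.3 | -4.6% | 15.8 | 16.6 | -5.2% |
| samsung-s23 | hexagon | 487.4 | 459.9 | +6.0% | 19.9 | 20.9 | -4.7% |
| poco-x7-klee | cpu | 143.2 | 144.5 | -0.9% | 22.0 | 22.3 | -1.3% |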

Key findings

  1. Hex on HTP v81 (Myron) recovers the PR-740 regression. Smoke median +16.4% pp, focused median +3.4% pp; representative-cell pp jumps +22.4% smoke / +15.4% focused on qwen3-1.7b/q4_0. PR-740's headline finding was a -14% pp walkback on Myron hex; PR-743 takes most of that back, putting Myron hex at roughly +83% pp vs the PR-713 baseline (PR-740 was at +66–77%). Natural candidates: HMX quantized matmul rework (ggml#23368) and the repl optimization in flash-attn softmax (ggml#23455) — both PocketPal-relevant Hexagon items listed in the PR body.

  2. S23 GPU smoke +9.1% pp / +2.9% tg, with the representative cell at +11.0% pp / +14.9% tg. Likely the OpenCL batch profiling speedup (ggml#23495) and/or backend init refactor (ggml#23318). S23 hex also +11.7% pp on the representative cell. Myron GPU stays flat — same workload but different SoC; the new Adreno MoE generalisation (ggml#23449) is a MoE-only path and our matrix has no MoE models, so no win expected there.

  3. CPU on Myron is asymmetric: +13.5% smoke / -1.3% focused. This matches the thermal pattern we've documented on this device: smoke runs cool, focused runs warm (it runs second). PR-740's CPU regression was also worse on focused than smoke. Net read: CPU code itself is approximately flat vs PR-740 (the smoke uplift is largely thermal-favourable). Klee CPU and S23 CPU are flat-to-marginal across both tiers.

  4. Memory: total_mib unchanged across the board (all backends, all devices, 0.0% median). No memory regression from the bump on the Android bench, consistent with the +0.5% iOS / +2.8% Android peaks reported separately in the memory-profile comments.

  5. No real backend fallbacks. Every gpu cell reports effective_backend=opencl (label-only rename, same Adreno path) — identical pattern to PR-728 and PR-740. No silent CPU fallbacks anywhere.
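
Finding 5 amounts to comparing the requested backend against effective_backend while tolerating the gpu/opencl label rename. A minimal sketch, assuming each bench cell is available as a dict: the `backend` key, the alias table, and the sample records are hypothetical, and only `effective_backend` is taken from the report.

```python
# Labels treated as the same physical path (assumption based on the
# "label-only rename" note in finding 5).
BACKEND_ALIASES = {"gpu": {"gpu", "opencl"}}


def find_fallbacks(cells):
    """Return the cells whose effective backend is not the requested one."""
    fallbacks = []
    for cell in cells:
        requested = cell["backend"]
        accepted = BACKEND_ALIASES.get(requested, {requested})
        if cell["effective_backend"] not in accepted:
            fallbacks.append(cell)
    return fallbacks


# Hypothetical records; the last one simulates a silent CPU fallback.
sample_cells = [
    {"model": "qwen3-1.7b/q4_0", "backend": "gpu", "effective_backend": "opencl"},
    {"model": "qwen3-1.7b/q4_0", "backend": "hexagon", "effective_backend": "hexagon"},
    {"model": "qwen3-1.7b/q4_0", "backend": "gpu", "effective_backend": "cpu"},
]

for cell in find_fallbacks(sample_cells):
    print(f"fallback: {cell['model']} requested={cell['backend']} "
          f"effective={cell['effective_backend']}")
```

An empty result on real data would correspond to the "no silent CPU fallbacks" statement above.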

Caveats

  • Thermal: Myron CPU smoke +13.5% is partly thermal (cool device); the focused -1.3% is the steady-state read. Take CPU deltas with that in mind.
  • Coverage gaps: S23 lost 18 focused cells (gpu died at phi-4-mini/q4_0 cell 13/20 = same crash mode as PR-740; hex died at phi-4-mini/q6_k cell 30/40 — new this PR but the matched-vs-PR-740 hex count is unchanged at 14, so the deltas above are computed on the same cell set). Klee lost 5 focused cells (phi-4-mini/q6_k crash + all 4 gemma-4-e2b cells expected to OOM at 7.5 GiB RAM).
  • MTP / multimodal not exercised: PR-743 includes several MTP refinements and mtmd changes (DeepSeek-OCR, HunyuanOCR→HunyuanVL merge, WAV MIME). None are measured by this single-prompt non-speculative text bench.
  • Apple-only items not measured here: the iOS-side packaging change (ggml-metal embedded in framework) and the Metal concat kernel optimization were exercised via the iOS memory-profile run (separate comment above, peak +0.5%).
  • total_mib is from log_signals.memory_buffers (per AGENTS.md §7). peak_memory_mb is omitted from the summary tables because it's a noisy process-RSS sample.
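
The matched-cell comparison mentioned under coverage gaps can be sketched as a key intersection. This is an assumption about the approach, not the bench tooling's actual code: the dict layout is hypothetical, and the phi-4-mini value is a made-up placeholder for a cell that exists in only one run. The qwen3-1.7b/q4_0 values are the S23 hexagon focused pp figures from the representative-cell table.

```python
def matched_deltas(current, previous):
    """Percent change per cell, restricted to cells present in both runs.

    `current` and `previous` map (model, quant) -> throughput in tok/s.
    Returns (deltas, dropped), where `dropped` lists cells seen in only one run.
    """
    shared = current.keys() & previous.keys()
    dropped = sorted(current.keys() ^ previous.keys())
    deltas = {
        key: 100.0 * (current[key] - previous[key]) / previous[key]
        for key in sorted(shared)
    }
    return deltas, dropped


current_run = {("qwen3-1.7b", "q4_0"): 487.4}
previous_run = {
    ("qwen3-1.7b", "q4_0"): 459.9,
    ("phi-4-mini", "q6_k"): 100.0,  # placeholder value, not a measured result
}

deltas, dropped = matched_deltas(current_run, previous_run)
for key, delta in deltas.items():
    print(f"{key[0]}/{key[1]}: {delta:+.1f}%")
print("excluded from comparison:", dropped)
```

Restricting to the shared key set keeps a crashed or missing cell from silently shifting the median in either direction.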

Recommendation

Safe to merge from a perf standpoint. PR-743 is a clear net win on Hexagon (Myron especially) and S23 GPU, flat-to-slightly-positive everywhere else, with zero memory regression. The Hexagon perf walkback that PR-740 introduced on HTP v81 is mostly recovered here. There is one new crash mode vs PR-740: S23 hex at phi-4-mini/q6_k. It doesn't affect comparable cells, so it is not a blocker for this PR, but it is worth a follow-up if it reproduces on subsequent runs.


Reports on bench host
  • ~/bench-bundle/bench-results/pr-743/reports/SUMMARY.md
  • ~/bench-bundle/bench-results/pr-743/reports/divergence-vs-pr-740.md (full per-cell tables vs PR-740)
  • Raw per-backend reports under ~/bench-bundle/bench-results/pr-743/reports/poco-myron-{smoke,focused}-{on,off}.json etc.
