chore(deps): upgrade llama.rn to 0.12.3 #740

Merged: a-ghorbani merged 1 commit into main from feature/TASK-20260524-2036 on May 25, 2026

@a-ghorbani a-ghorbani commented May 24, 2026

Summary

Bumps llama.rn 0.12.1 → 0.12.3. Dependency-only upgrade (package.json + lockfiles). Native iOS + Android builds pass; targeted Jest suites green. Memory profile re-verification on iPhone 13 Pro + Pixel 9 is PENDING (must be run by human via memory-profile skill on physical devices before merge — see Verification below).

Effective llama.cpp range covered by this upgrade: b9204 → b9254 (50 commits) plus llama.rn-level additions.

Changes

  • package.json: "llama.rn": "0.12.1" → "llama.rn": "0.12.3"
  • yarn.lock: regenerated (llama.rn block only, 4+/4-)
  • ios/Podfile.lock: llama-rn (0.12.1) → llama-rn (0.12.3), checksum 30cce807… → 2bb735f3…

3 files, 7 insertions / 7 deletions. No consumer code touched; the llama.rn Jest mock surface is version-agnostic.

llama.cpp / llama.rn changelog (PocketPal-relevant)

Scoped to items that touch surfaces PocketPal actually ships. Dropped: server, web-UI, CUDA, SYCL, WebGPU, conversion-only items.

Speculative decoding (MTP)

Hexagon NPU (Snapdragon)

OpenCL / Adreno

Metal (Apple)

Multimodal

Core model

llama.rn sync points

Verification

  • yarn install clean — yarn.lock change scoped to llama.rn block (4+/4-)
  • pod install clean — Podfile.lock change scoped to llama-rn pod + checksum
  • Targeted Jest suites pass — 10/10 suites, 246/246 tests pass (Node 22.21.0)
  • yarn ios:build:release succeeds (~182s, Build Succeeded, PocketPal.app produced)
  • yarn build:android:release succeeds (~4m, BUILD SUCCESSFUL, app-prod-release.aab ~100 MB produced)
  • PENDING — Memory profile re-verified on iPhone 13 Pro vs baseline (model qwen3-1.7b) — to be run by human via memory-profile skill on physical device
  • PENDING — Memory profile re-verified on Pixel 9 vs baseline (model qwen3-1.7b) — to be run by human via memory-profile skill on physical device

Draft until both memory-profile runs complete and report PASS (regression threshold: >10% AND >200 MB, per e2e/scripts/memory-profile.sh convention).

Risk

Dependency-only; mock surface (loadLlamaModelInfo, LlamaContext, completion, bench, getFormattedChat, initMultimodal) is unchanged. No LlamaContextWrapper.mm or src/utils/*Versions.ts edits required — confirms the quick classification held. Pattern follows prior llama.rn upgrade PRs: #722 (0.12.0 stable), #728 (0.12.1).

Story: TASK-20260524-2036

Generated by PocketPal Dev Team

a-ghorbani (Owner, Author) commented

Memory profile — iPhone 13 Pro (agh)

Model: qwen3-1.7b · iOS: 26.3 reported (device on 26.5) · Baseline: b4d08b6 · Current: c0ad170 · Threshold: regression if >10% AND >200 MB

| Checkpoint | Baseline | Current | Δ | Δ % |
| --- | --- | --- | --- | --- |
| app_launch | 91.8 MB | 83.1 MB | −8.8 MB | −9.6% |
| models_screen | 91.3 MB | 84.1 MB | −7.2 MB | −7.9% |
| chat_screen | 93.2 MB | 86.5 MB | −6.7 MB | −7.2% |
| model_loaded | 2148.4 MB | 2148.8 MB | +0.4 MB | +0.0% |
| chat_active | 2143.9 MB | 2141.8 MB | −2.2 MB | −0.1% |
| post_chat_idle | 2140.6 MB | 2133.8 MB | −6.8 MB | −0.3% |
| model_unloaded | 140.2 MB | 133.4 MB | −6.8 MB | −4.8% |
| Peak | 2148.4 MB | 2148.8 MB | +0.4 MB | +0.0% |

Verdict: ✅ PASS — peak essentially flat (+0.0%); idle/UI checkpoints slightly lower (≈−7 MB across launch/models/chat screens).
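
For readers who want to reproduce the Peak row, here is a minimal sketch of how it can be derived from the checkpoint values in the table above. The `Checkpoint` shape and the helper names are illustrative assumptions, not the actual memory-profile report schema. The table's deltas appear to be computed from unrounded measurements (91.8 → 83.1 is shown as −8.8 MB), so recomputing from the rounded values can differ in the last digit.

```typescript
// Sketch only: the Checkpoint shape and helper names are hypothetical,
// not taken from the memory-profile tooling.
interface Checkpoint {
  name: string;
  baselineMb: number;
  currentMb: number;
}

// Values copied from the iPhone 13 Pro table above.
const iphoneCheckpoints: Checkpoint[] = [
  { name: "app_launch", baselineMb: 91.8, currentMb: 83.1 },
  { name: "models_screen", baselineMb: 91.3, currentMb: 84.1 },
  { name: "chat_screen", baselineMb: 93.2, currentMb: 86.5 },
  { name: "model_loaded", baselineMb: 2148.4, currentMb: 2148.8 },
  { name: "chat_active", baselineMb: 2143.9, currentMb: 2141.8 },
  { name: "post_chat_idle", baselineMb: 2140.6, currentMb: 2133.8 },
  { name: "model_unloaded", baselineMb: 140.2, currentMb: 133.4 },
];

// Peak is the largest value observed across all checkpoints.
function peakMb(values: number[]): number {
  return values.reduce((max, v) => (v > max ? v : max), -Infinity);
}

// Relative change in percent, rounded to one decimal place.
function deltaPercent(baseline: number, current: number): number {
  return Math.round(((current - baseline) / baseline) * 1000) / 10;
}

const baselinePeak = peakMb(iphoneCheckpoints.map((c) => c.baselineMb));
const currentPeak = peakMb(iphoneCheckpoints.map((c) => c.currentMb));
const peakDeltaPct = deltaPercent(baselinePeak, currentPeak);
```

Both peaks land on the model_loaded checkpoint, which is why the Peak row repeats that row's values.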

Pixel 9 run pending; will follow up.


Generated by PocketPal Dev Team

@a-ghorbani a-ghorbani marked this pull request as ready for review May 24, 2026 20:46
a-ghorbani commented May 24, 2026

PR-740 (llama.rn 0.12.1 → 0.12.3) — bench results

Smoke + focused matrix, 3 devices, matched-settings (cpu+hex flash_attn=on, gpu flash_attn=off), pp=256 tg=64 pl=1 nr=3 inter_cell_settle_ms=30000. APK from run 26371966925. Compared against PR-728 (immediately-preceding llama.rn bump 0.12.0→0.12.1) and PR-713 (canonical baseline).
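
As a reading aid, the matched settings above could be written down as a small typed config. This is a sketch under assumptions: the type and field names are hypothetical and not taken from the bench harness, and the comments on `pp`, `tg`, `pl`, and `nr` follow the usual llama.cpp benchmark meaning of those abbreviations.

```typescript
// Hypothetical config shape; names are illustrative, not the bench harness API.
type Backend = "cpu" | "gpu" | "hexagon";

interface BenchCellSettings {
  pp: number; // prompt-processing tokens per run
  tg: number; // text-generation tokens per run
  pl: number; // parallel sequences
  nr: number; // repetitions per cell
  flashAttn: boolean;
}

// Pause between cells so the device can settle, as stated in the report.
const INTER_CELL_SETTLE_MS = 30000;

function settingsFor(backend: Backend): BenchCellSettings {
  return {
    pp: 256,
    tg: 64,
    pl: 1,
    nr: 3,
    // Matched settings from the report: flash attention on for cpu and
    // hexagon, off for gpu.
    flashAttn: backend !== "gpu",
  };
}
```

Keeping the per-backend flash-attention choice in one function makes it harder to compare cells that were accidentally run with mismatched settings.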

Coverage

| device | smoke | focused | notes |
| --- | --- | --- | --- |
| poco-myron (SD8 Elite, Adreno 840, HTP v81) | 27/27 | 60/60 | full matrix, no crashes |
| samsung-s23 (SD8 Gen 2, Adreno 740, HTP v73) | 27/27 | 41/60 | focused-gpu died at phi-4-mini/q4_0 (cell 13/20): known GPU pipeline crash after ~13 cells on this device, same failure mode as PR-728 |
| poco-x7-klee (MT6899, cpu only) | 9/9 | 15/20 | app crash at phi-4-mini/q8_0 (cell 16/20). gemma-4-e2b expected to OOM at load (~7.5 GiB RAM). Matches PR-728 Klee coverage. |

vs PR-728

Summary — median Δ per (device, backend)

Each Δ is the median of per-cell percent changes. Absolutes aren't aggregated here because mixing model/quant cells would mix workloads; see the representative-cell table for real tok/s.
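
The aggregation described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Cell` shape with one throughput value per PR; it is not the bench host's actual comparison script.

```typescript
// Percent change of `current` relative to `previous`.
function percentChange(previous: number, current: number): number {
  return ((current - previous) / previous) * 100;
}

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("median of empty list");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical cell shape: one (model, quant) measurement on both PRs.
interface Cell {
  previousTokPerSec: number;
  currentTokPerSec: number;
}

// Median of per-cell percent changes. Absolute tok/s values are never
// averaged, so cells with different workloads are not mixed.
function medianDelta(cells: Cell[]): number {
  return median(
    cells.map((c) => percentChange(c.previousTokPerSec, c.currentTokPerSec)),
  );
}

// One real cell from this report: poco-myron cpu smoke pp for
// qwen3-1.7b/q4_0, 300.8 tok/s (PR-728) vs 252.7 tok/s (PR-740).
const myronCpuSmokePp = percentChange(300.8, 252.7).toFixed(1); // "-16.0"
```

Taking the median of percentages rather than the percentage of summed throughputs keeps a single fast, small model from dominating the summary.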

smoke

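| device | backend | n | Δpp | Δtg | Δtotal_mib |
| --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 9 | -13.1% | -1.6% | +0.0% |
| poco-myron | gpu | 9 | -0.1% | +0.7% | +0.0% |
| poco-myron | hexagon | 9 | -13.9% | -15.0% | +0.0% |
| samsung-s23 | cpu | 9 | -4.8% | +1.6% | +0.0% |
| samsung-s23 | gpu | 9 | -9.4% | -11.3% | +0.0% |
| samsung-s23 | hexagon | 9 | +0.5% | +3.3% | +0.0% |
| poco-x7-klee | cpu | 9 | -3.2% | +2.6% | +0.0% |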

focused

| device | backend | n | Δpp | Δtg | Δtotal_mib |
| --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 20 | -21.3% | -3.0% | +0.0% |
| poco-myron | gpu | 20 | -9.1% | -5.7% | +0.0% |
| poco-myron | hexagon | 20 | -13.7% | -7.7% | +0.0% |
| samsung-s23 | cpu | 14 | -9.8% | -6.1% | +0.0% |
| samsung-s23 | gpu | 13 | -1.6% | -0.7% | +0.0% |
| samsung-s23 | hexagon | 14 | -0.7% | -4.2% | +0.0% |
| poco-x7-klee | cpu | 15 | -9.3% | +3.3% | +0.0% |

Representative cell — qwen3-1.7b/q4_0 vs PR-728

Single fixed cell so the absolutes are real tok/s, not mixed workloads. Picked because it runs on all 3 devices, all 3 backends, both smoke and focused matrices. Deltas track the summary medians closely, confirming the per-(device, backend) story is consistent across models — see the full per-cell tables on the bench host for any cell that deviates.

smoke

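| device | backend | pp PR-740 (tok/s) | pp PR-728 (tok/s) | Δpp | tg PR-740 (tok/s) | tg PR-728 (tok/s) | Δtg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 252.7 | 300.8 | -16.0% | 38.1 | 38.8 | -1.8% |
| poco-myron | gpu | 586.4 | 585.7 | +0.1% | 28.7 | 27.4 | +4.6% |
| poco-myron | hexagon | 727.2 | 844.6 | -13.9% | 31.6 | 37.2 | -15.0% |
| samsung-s23 | cpu | 112.0 | 117.3 | -4.5% | 17.6 | 18.8 | -6.5% |
| samsung-s23 | gpu | 228.6 | 261.0 | -12.4% | 12.4 | 14.5 | -14.5% |
| samsung-s23 | hexagon | 461.7 | 448.4 | +3.0% | 21.1 | 20.5 | +2.8% |
| poco-x7-klee | cpu | 165.8 | 176.8 | -6.2% | 21.9 | 20.7 | +5.5% |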

focused

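| device | backend | pp PR-740 (tok/s) | pp PR-728 (tok/s) | Δpp | tg PR-740 (tok/s) | tg PR-728 (tok/s) | Δtg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 212.1 | 299.0 | -29.1% | 38.0 | 38.1 | -0.4% |
| poco-myron | gpu | 536.7 | 588.2 | -8.8% | 27.1 | 27.8 | -2.7% |
| poco-myron | hexagon | 720.0 | 847.8 | -15.1% | 30.4 | 31.1 | -2.0% |
| samsung-s23 | cpu | 110.2 | 122.6 | -10.1% | 17.1 | 19.1 | -10.8% |
| samsung-s23 | gpu | 264.3 | 257.0 | +2.8% | 16.6 | 15.3 | +8.5% |
| samsung-s23 | hexagon | 459.9 | 444.9 | +3.4% | 20.9 | 20.4 | +2.3% |
| poco-x7-klee | cpu | 144.5 | 175.9 | -17.8% | 22.3 | 21.5 | +3.3% |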

vs PR-713 baseline

Summary — median Δ per (device, backend)

smoke

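| device | backend | n | Δpp | Δtg | Δtotal_mib |
| --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 9 | +16.6% | +0.5% | +0.0% |
| poco-myron | gpu | 9 | -1.7% | -0.9% | +0.0% |
| poco-myron | hexagon | 9 | +66.5% | +3.2% | +0.4% |
| samsung-s23 | cpu | 9 | +7.8% | +7.2% | +0.0% |
| samsung-s23 | gpu | 9 | -7.8% | -9.1% | +0.0% |
| samsung-s23 | hexagon | 9 | +72.8% | +3.6% | +0.0% |
| poco-x7-klee | cpu | 9 | +11.5% | -3.2% | +0.0% |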

focused

| device | backend | n | Δpp | Δtg | Δtotal_mib |
| --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 20 | +3.7% | +0.8% | +0.0% |
| poco-myron | gpu | 20 | -8.0% | -3.3% | +0.0% |
| poco-myron | hexagon | 20 | +77.4% | -0.8% | +0.4% |
| samsung-s23 | cpu | 14 | -3.1% | +1.0% | +0.0% |
| samsung-s23 | gpu | 13 | +3.8% | -7.0% | +0.0% |
| samsung-s23 | hexagon | 13 | +58.8% | -1.1% | +0.0% |
| poco-x7-klee | cpu | 15 | +0.2% | +0.4% | +0.0% |

Representative cell — qwen3-1.7b/q4_0 vs PR-713 baseline

smoke

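| device | backend | pp PR-740 (tok/s) | pp PR-713 (tok/s) | Δpp | tg PR-740 (tok/s) | tg PR-713 (tok/s) | Δtg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 252.7 | 204.8 | +23.4% | 38.1 | 37.5 | +1.4% |
| poco-myron | gpu | 586.4 | 586.7 | -0.1% | 28.7 | 28.7 | -0.0% |
| poco-myron | hexagon | 727.2 | 330.0 | +120.4% | 31.6 | 30.6 | +3.2% |
| samsung-s23 | cpu | 112.0 | 103.9 | +7.8% | 17.6 | 17.2 | +2.1% |
| samsung-s23 | gpu | 228.6 | 245.6 | -6.9% | 12.4 | 12.1 | +2.7% |
| samsung-s23 | hexagon | 461.7 | 242.4 | +90.5% | 21.1 | 20.4 | +3.4% |
| poco-x7-klee | cpu | 165.8 | 155.8 | +6.4% | 21.9 | 22.1 | -0.9% |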

focused

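| device | backend | pp PR-740 (tok/s) | pp PR-713 (tok/s) | Δpp | tg PR-740 (tok/s) | tg PR-713 (tok/s) | Δtg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| poco-myron | cpu | 212.1 | 204.8 | +3.6% | 38.0 | 37.5 | +1.2% |
| poco-myron | gpu | 536.7 | 586.7 | -8.5% | 27.1 | 28.7 | -5.7% |
| poco-myron | hexagon | 720.0 | 330.0 | +118.2% | 30.4 | 30.6 | -0.5% |
| samsung-s23 | cpu | 110.2 | 103.9 | +6.1% | 17.1 | 17.2 | -0.9% |
| samsung-s23 | gpu | 264.3 | 245.6 | +7.6% | 16.6 | 12.1 | +37.7% |
| samsung-s23 | hexagon | 459.9 | 242.4 | +89.8% | 20.9 | 20.4 | +2.3% |
| poco-x7-klee | cpu | 144.5 | 155.8 | -7.2% | 22.3 | 22.1 | +0.9% |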

Key findings

  1. Hex on HTP v81 (Myron) regresses ~13–15 % pp / 7–15 % tg vs PR-728 — consistent across all models and quants, in both smoke (cool device) and focused (warm). Not noise: every one of the 20 focused-hex cells on Myron is between -6.6 % and -18.6 % pp. Hex on HTP v73 (S23) is essentially flat (+0.5 % smoke / -0.7 % focused pp median). PR-740 still beats PR-713 by +66 % to +77 % pp on hex — the new code is a net win over baseline, but a partial walkback of the headline PR-728 hex gain on HTP v81. The hex deltas in the llama.cpp range (PAD HVX kernel, TRI op, NORM op, MROPE/IMROPE in HTP rope, Snapdragon toolchain v0.6) are the natural suspects.

  2. CPU on Myron regresses more on focused (-21 %) than smoke (-13 %) — likely thermal: PR-728's "Myron CPU +33 %" baseline number was already flagged as thermally favorable in its own report, and the focused matrix runs after smoke when the device is warmer. Net vs PR-713 baseline is still +3.7 % (focused) / +16.6 % (smoke). CPU regression on S23/Klee is smaller (-3 to -10 %) and within run-to-run noise.

  3. GPU (OpenCL) is approximately flat on both Myron and S23. Small per-tier wobble (S23 gpu pp median -9.4 % on smoke vs -1.6 % on focused, both vs PR-728) is consistent with the GPU pipeline noise floor we see across all PR runs. No regression worth flagging.

  4. Memory: total_mib unchanged across the board (all backends, all devices, 0.0 % median). An earlier draft of this comment flagged a +13.4 % Myron smoke-hex anomaly; that turned out to be a log-capture artifact in PR-728's myron smoke run (only the HTP0 compute_buffer log line reached the parser; HTP1..HTP5 were missed, leaving the cell ~234 MiB short). PR-728's myron focused, PR-740 (both tiers), and PR-713 baseline all agreed on the full per-cell values, so we patched the 9 affected cells in pr-728/reports/poco-myron.json by lifting memory_buffers from PR-728 focused (for qwen3.5-0.8b + qwen3-1.7b) and from the PR-713 baseline (for gemma-3-1b, which is smoke-only). The patch is recorded under patches[] and runs[].log_signals.memory_buffers_original in that file. Re-running the comparison gives the +0.0 % shown above. Nothing to flag on memory in PR-740.

  5. No backend fallbacks except the expected gpu → opencl rename on every gpu cell (label-only, same Adreno code path; same observation in PR-728).

Caveats

  • Coverage gaps: S23 focused-gpu lost 7 cells to the known per-launch GPU crash (phi-4-mini and gemma-4-e2b families). Klee focused lost 5 cells (phi-4-mini/q8_0 crash + all 4 gemma-4-e2b cells expected to OOM at load on 7.5 GiB RAM). Pattern matches PR-728 exactly; no new instability introduced by PR-740.
  • Thermal: focused-matrix cells run after the smoke matrix; the device is warmer for focused. The CPU deltas should be read with that in mind.
  • MTP not exercised: PR-740 ships MTP speculative-decoding parallel-API support (0.12.3), but the bench matrix is single-prompt non-speculative — this PR's MTP work is not measured here. Should be tested separately if the goal is to validate the MTP path.
  • total_mib is from log_signals.memory_buffers (reliable for myron and klee; S23 hex captures only HTP0 in every report we have, so S23 mem deltas vs other PRs cancel out but absolute values understate by ~3×46 MiB). peak_memory_mb deltas are not reliable — included only in the per-cell tables on the bench host for completeness.
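
The log-capture artifact from finding 4 and the S23 caveat above share one failure mode: some HTP session buffers never reach the parser, so total_mib is understated. A minimal sketch of a completeness check is shown here, assuming a hypothetical `MemoryBuffers` map from buffer label to MiB. Only the labels HTP0..HTP5 come from this report; the real `log_signals.memory_buffers` layout may differ, and the sizes in the usage lines are made up.

```typescript
// Hypothetical shape: buffer label (as parsed from logs) mapped to size in MiB.
type MemoryBuffers = { [label: string]: number };

// Returns the HTP session labels that are expected but absent from the capture.
function missingHtpSessions(
  buffers: MemoryBuffers,
  expectedSessions: number,
): string[] {
  const missing: string[] = [];
  for (let i = 0; i < expectedSessions; i++) {
    const label = "HTP" + i;
    if (!Object.prototype.hasOwnProperty.call(buffers, label)) {
      missing.push(label);
    }
  }
  return missing;
}

// Sums every captured buffer; the result is only trustworthy when
// missingHtpSessions() returns an empty list.
function totalMib(buffers: MemoryBuffers): number {
  let total = 0;
  for (const label in buffers) {
    if (Object.prototype.hasOwnProperty.call(buffers, label)) {
      total += buffers[label];
    }
  }
  return total;
}

// Illustrative capture where only HTP0 reached the parser (sizes are made up).
const partialCapture: MemoryBuffers = { HTP0: 40 };
const missingLabels = missingHtpSessions(partialCapture, 6);
```

A check like this, run before any comparison, would flag an incomplete capture instead of letting it surface later as an apparent memory delta.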

Recommendation

The Myron-hex perf walkback (-14 % pp median) is real and reproducible across every model/quant combination, but the absolute hex perf on Myron is still +66 % to +77 % pp above the PR-713 baseline, so users on HTP v81 still come out ahead vs anything before PR-728. HTP v73 (S23) is flat — same code path apparently doesn't hit the regression. Recommend merging if the upstream hex changes (kernel additions + toolchain v0.6) are wanted for other reasons; otherwise worth a focused investigation on whether the new HTP code paths can be tuned for v81 in a follow-up.


Reports on bench host
  • ~/bench-bundle/bench-results/pr-740/reports/SUMMARY.md
  • ~/bench-bundle/bench-results/pr-740/reports/divergence-vs-pr-728.md (full per-cell tables vs PR-728)
  • ~/bench-bundle/bench-results/pr-740/reports/divergence-vs-baseline.md (full per-cell tables vs PR-713)
  • Raw per-backend reports under ~/bench-bundle/bench-results/pr-740/reports/poco-myron-{smoke,focused}-{on,off}.json etc.

a-ghorbani (Owner, Author) commented

PR-740 — memory-profile, Pixel 9

memory-profile spec on Pixel 9 (real device, USB), model qwen3-1.7b. PR-740 commit c0ad170 vs the tracked baseline e2e/baselines/memory/pixel-9-qwen3-1.7b.json (commit b4d08b6).

| checkpoint | baseline | current | Δ | Δ% |
| --- | --- | --- | --- | --- |
| app_launch | 225.6 MB | 264.8 MB | +39.2 MB | +17.4 % |
| models_screen | 228.4 MB | 245.6 MB | +17.2 MB | +7.5 % |
| chat_screen | 216.9 MB | 250.6 MB | +33.8 MB | +15.6 % |
| model_loaded | 1732.6 MB | 1797.9 MB | +65.3 MB | +3.8 % |
| chat_active | 1809.7 MB | 1865.2 MB | +55.5 MB | +3.1 % |
| post_chat_idle | 1810.6 MB | 1864.4 MB | +53.8 MB | +3.0 % |
| model_unloaded | 345.5 MB | 396.6 MB | +51.1 MB | +14.8 % |
| Peak | 1810.6 MB | 1865.2 MB | +54.5 MB | +3.0 % |

Result: PASS (e2e/scripts/memory-compare.ts gate = ">10 % AND >200 MB"; nothing crosses both).
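
To make the two-condition gate concrete, here is a minimal sketch of the ">10 % AND >200 MB" rule as stated above. It is an illustration only, not the contents of e2e/scripts/memory-compare.ts, and the names `GateThresholds` and `isRegression` are hypothetical.

```typescript
// Hypothetical threshold shape; the values come from the gate quoted above.
interface GateThresholds {
  minPercent: number;
  minAbsoluteMb: number;
}

const DEFAULT_GATE: GateThresholds = { minPercent: 10, minAbsoluteMb: 200 };

// A checkpoint counts as a regression only when BOTH the relative and the
// absolute increase exceed their thresholds.
function isRegression(
  baselineMb: number,
  currentMb: number,
  gate: GateThresholds = DEFAULT_GATE,
): boolean {
  const deltaMb = currentMb - baselineMb;
  const deltaPercent = (deltaMb / baselineMb) * 100;
  return deltaPercent > gate.minPercent && deltaMb > gate.minAbsoluteMb;
}

// Pixel 9 app_launch: +17.4 % but only +39.2 MB, so the gate does not trip.
const appLaunchFlagged = isRegression(225.6, 264.8); // false
// Pixel 9 model_loaded: +65.3 MB and +3.8 %, under both thresholds.
const modelLoadedFlagged = isRegression(1732.6, 1797.9); // false
```

Requiring both conditions keeps small checkpoints such as app_launch from failing on a large percentage of a small number, while large checkpoints such as model_loaded are not failed on absolute growth alone.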

Notes

  • The pixel-9 baseline is at commit b4d08b6 (~2 months old), so these deltas reflect everything between that and PR-740's base, not just the llama.rn 0.12.1 → 0.12.3 bump.
  • App-launch and model-unloaded show the biggest relative increase (+15–17 %) but in absolute terms they're small (+39 / +51 MB) — well below the regression gate. Worth refreshing the baseline post-merge if anyone wants tighter signal on future PRs.
  • The load-bearing checkpoints for "does it leak / does it fit" — model_loaded, chat_active, post_chat_idle — move +3.0 to +3.8 %, which is within normal version-to-version drift.
  • iPhone 13 Pro memory-profile is still pending (no host with iOS device available right now).

Run details

  • Worktree: ~/Dev/pocketpal-dev-team/worktrees/TASK-20260524-2036
  • APK: bench-bundle's e2e CI artifact (app-e2e-releaseE2e.apk, MD5 2aabcc81…) — same artifact used for the perf comment above
  • Spec: e2e/specs/memory-profile.spec.ts, TEST_MODELS=qwen3-1.7b
  • Report: e2e/reports/2026-05-25T09-24-51-326/pixel-9-real/memory-profile.json
  • Comparison: e2e/reports/2026-05-25T09-24-51-326/pixel-9-real/memory-profile-comparison.json

@a-ghorbani a-ghorbani merged commit 61dcf8f into main May 25, 2026
5 checks passed
@a-ghorbani a-ghorbani deleted the feature/TASK-20260524-2036 branch May 25, 2026 09:33