chore(deps): upgrade llama.rn to 0.12.3 #740
Memory profile — iPhone 13 Pro (
| Checkpoint | Baseline | Current | Δ | Δ % |
|---|---|---|---|---|
| app_launch | 91.8 MB | 83.1 MB | −8.8 MB | −9.6% |
| models_screen | 91.3 MB | 84.1 MB | −7.2 MB | −7.9% |
| chat_screen | 93.2 MB | 86.5 MB | −6.7 MB | −7.2% |
| model_loaded | 2148.4 MB | 2148.8 MB | +0.4 MB | +0.0% |
| chat_active | 2143.9 MB | 2141.8 MB | −2.2 MB | −0.1% |
| post_chat_idle | 2140.6 MB | 2133.8 MB | −6.8 MB | −0.3% |
| model_unloaded | 140.2 MB | 133.4 MB | −6.8 MB | −4.8% |
| Peak | 2148.4 MB | 2148.8 MB | +0.4 MB | +0.0% |
Verdict: ✅ PASS — peak essentially flat (+0.0%); idle/UI checkpoints slightly lower (≈−7 MB across launch/models/chat screens).
Pixel 9 run pending; will follow up.
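For readers who want to see how a PASS verdict can fall out of the table, here is a minimal sketch of the gate, assuming the rule quoted in the PR description (regression threshold: >10% AND >200 MB). `Checkpoint` and `is_regression` are hypothetical names for illustration; this is not the actual logic of `e2e/scripts/memory-profile.sh`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str
    baseline_mb: float
    current_mb: float


def is_regression(cp, pct_threshold=10.0, abs_threshold_mb=200.0):
    """Flag a checkpoint only when growth exceeds BOTH thresholds (assumed rule)."""
    delta_mb = cp.current_mb - cp.baseline_mb
    delta_pct = 100.0 * delta_mb / cp.baseline_mb
    return delta_pct > pct_threshold and delta_mb > abs_threshold_mb


# Three rows copied from the iPhone 13 Pro table above.
checkpoints = [
    Checkpoint("app_launch", 91.8, 83.1),
    Checkpoint("model_loaded", 2148.4, 2148.8),
    Checkpoint("model_unloaded", 140.2, 133.4),
]

verdict = "FAIL" if any(is_regression(cp) for cp in checkpoints) else "PASS"
print(verdict)
```

Requiring both conditions keeps small UI checkpoints (where a few MB is a large percentage) and the multi-GB model checkpoints (where 200 MB is a small percentage) from tripping the gate on their own.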
Generated by PocketPal Dev Team
PR-740 (llama.rn 0.12.1 → 0.12.3) — bench results
Smoke + focused matrix, 3 devices, matched-settings (cpu+hex
Coverage
vs PR-728
Summary — median Δ per (device, backend)
Each Δ is the median of per-cell percent changes. Absolutes aren't aggregated here because mixing model/quant cells would mix workloads; see the representative-cell table for real tok/s.
smoke
focused
Representative cell —
| device | backend | pp PR-740 | pp PR-728 | Δpp | tg PR-740 | tg PR-728 | Δtg |
|---|---|---|---|---|---|---|---|
| poco-myron | cpu | 252.7 | 300.8 | -16.0% | 38.1 | 38.8 | -1.8% |
| poco-myron | gpu | 586.4 | 585.7 | +0.1% | 28.7 | 27.4 | +4.6% |
| poco-myron | hexagon | 727.2 | 844.6 | -13.9% | 31.6 | 37.2 | -15.0% |
| samsung-s23 | cpu | 112.0 | 117.3 | -4.5% | 17.6 | 18.8 | -6.5% |
| samsung-s23 | gpu | 228.6 | 261.0 | -12.4% | 12.4 | 14.5 | -14.5% |
| samsung-s23 | hexagon | 461.7 | 448.4 | +3.0% | 21.1 | 20.5 | +2.8% |
| poco-x7-klee | cpu | 165.8 | 176.8 | -6.2% | 21.9 | 20.7 | +5.5% |
focused
| device | backend | pp PR-740 | pp PR-728 | Δpp | tg PR-740 | tg PR-728 | Δtg |
|---|---|---|---|---|---|---|---|
| poco-myron | cpu | 212.1 | 299.0 | -29.1% | 38.0 | 38.1 | -0.4% |
| poco-myron | gpu | 536.7 | 588.2 | -8.8% | 27.1 | 27.8 | -2.7% |
| poco-myron | hexagon | 720.0 | 847.8 | -15.1% | 30.4 | 31.1 | -2.0% |
| samsung-s23 | cpu | 110.2 | 122.6 | -10.1% | 17.1 | 19.1 | -10.8% |
| samsung-s23 | gpu | 264.3 | 257.0 | +2.8% | 16.6 | 15.3 | +8.5% |
| samsung-s23 | hexagon | 459.9 | 444.9 | +3.4% | 20.9 | 20.4 | +2.3% |
| poco-x7-klee | cpu | 144.5 | 175.9 | -17.8% | 22.3 | 21.5 | +3.3% |
vs PR-713 baseline
Summary — median Δ per (device, backend)
smoke
| device | backend | n | Δpp | Δtg | Δtotal_mib |
|---|---|---|---|---|---|
| poco-myron | cpu | 9 | +16.6% | +0.5% | +0.0% |
| poco-myron | gpu | 9 | -1.7% | -0.9% | +0.0% |
| poco-myron | hexagon | 9 | +66.5% | +3.2% | +0.4% |
| samsung-s23 | cpu | 9 | +7.8% | +7.2% | +0.0% |
| samsung-s23 | gpu | 9 | -7.8% | -9.1% | +0.0% |
| samsung-s23 | hexagon | 9 | +72.8% | +3.6% | +0.0% |
| poco-x7-klee | cpu | 9 | +11.5% | -3.2% | +0.0% |
focused
| device | backend | n | Δpp | Δtg | Δtotal_mib |
|---|---|---|---|---|---|
| poco-myron | cpu | 20 | +3.7% | +0.8% | +0.0% |
| poco-myron | gpu | 20 | -8.0% | -3.3% | +0.0% |
| poco-myron | hexagon | 20 | +77.4% | -0.8% | +0.4% |
| samsung-s23 | cpu | 14 | -3.1% | +1.0% | +0.0% |
| samsung-s23 | gpu | 13 | +3.8% | -7.0% | +0.0% |
| samsung-s23 | hexagon | 13 | +58.8% | -1.1% | +0.0% |
| poco-x7-klee | cpu | 15 | +0.2% | +0.4% | +0.0% |
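A minimal sketch of how a "median Δ per (device, backend)" summary row can be derived, assuming each cell carries one PR-740 value and one reference value. The record shape and the `median_delta_by_group` helper are hypothetical; the first cell reuses the qwen3-1.7b/q4_0 Myron hexagon pp numbers from the representative-cell table, and the other two cells are invented purely to make the median non-trivial.

```python
from collections import defaultdict
from statistics import median


def pct_change(new, old):
    """Percent change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


def median_delta_by_group(cells):
    """Group per-cell percent changes by (device, backend), return (n, median)."""
    groups = defaultdict(list)
    for cell in cells:
        key = (cell["device"], cell["backend"])
        groups[key].append(pct_change(cell["pp_new"], cell["pp_old"]))
    return {key: (len(deltas), median(deltas)) for key, deltas in groups.items()}


cells = [
    # Real cell from the representative-cell table (smoke, vs PR-713).
    {"device": "poco-myron", "backend": "hexagon", "pp_new": 727.2, "pp_old": 330.0},
    # Invented cells, for illustration only.
    {"device": "poco-myron", "backend": "hexagon", "pp_new": 500.0, "pp_old": 400.0},
    {"device": "poco-myron", "backend": "hexagon", "pp_new": 900.0, "pp_old": 600.0},
]

summary = median_delta_by_group(cells)
n, med = summary[("poco-myron", "hexagon")]
print(n, f"{med:+.1f}%")
```

Taking the median of percent changes, rather than the percent change of summed tok/s, keeps a single fast small-model cell from dominating the row.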
Representative cell — qwen3-1.7b/q4_0 vs PR-713 baseline
smoke
| device | backend | pp PR-740 | pp PR-713 | Δpp | tg PR-740 | tg PR-713 | Δtg |
|---|---|---|---|---|---|---|---|
| poco-myron | cpu | 252.7 | 204.8 | +23.4% | 38.1 | 37.5 | +1.4% |
| poco-myron | gpu | 586.4 | 586.7 | -0.1% | 28.7 | 28.7 | -0.0% |
| poco-myron | hexagon | 727.2 | 330.0 | +120.4% | 31.6 | 30.6 | +3.2% |
| samsung-s23 | cpu | 112.0 | 103.9 | +7.8% | 17.6 | 17.2 | +2.1% |
| samsung-s23 | gpu | 228.6 | 245.6 | -6.9% | 12.4 | 12.1 | +2.7% |
| samsung-s23 | hexagon | 461.7 | 242.4 | +90.5% | 21.1 | 20.4 | +3.4% |
| poco-x7-klee | cpu | 165.8 | 155.8 | +6.4% | 21.9 | 22.1 | -0.9% |
focused
| device | backend | pp PR-740 | pp PR-713 | Δpp | tg PR-740 | tg PR-713 | Δtg |
|---|---|---|---|---|---|---|---|
| poco-myron | cpu | 212.1 | 204.8 | +3.6% | 38.0 | 37.5 | +1.2% |
| poco-myron | gpu | 536.7 | 586.7 | -8.5% | 27.1 | 28.7 | -5.7% |
| poco-myron | hexagon | 720.0 | 330.0 | +118.2% | 30.4 | 30.6 | -0.5% |
| samsung-s23 | cpu | 110.2 | 103.9 | +6.1% | 17.1 | 17.2 | -0.9% |
| samsung-s23 | gpu | 264.3 | 245.6 | +7.6% | 16.6 | 12.1 | +37.7% |
| samsung-s23 | hexagon | 459.9 | 242.4 | +89.8% | 20.9 | 20.4 | +2.3% |
| poco-x7-klee | cpu | 144.5 | 155.8 | -7.2% | 22.3 | 22.1 | +0.9% |
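To make the three-way relationship concrete, the sketch below works through one cell: Myron hexagon pp for qwen3-1.7b/q4_0 in the smoke tier, using the values from the representative-cell tables (PR-713 330.0, PR-728 844.6, PR-740 727.2). The "retained share" figure is an illustrative derived metric, not something the bench reports emit.

```python
def pct_change(new, old):
    """Percent change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


# pp tok/s, poco-myron / hexagon / qwen3-1.7b/q4_0, smoke tier.
pp = {"PR-713": 330.0, "PR-728": 844.6, "PR-740": 727.2}

vs_728 = pct_change(pp["PR-740"], pp["PR-728"])    # change vs previous PR
vs_713 = pct_change(pp["PR-740"], pp["PR-713"])    # change vs baseline
gain_728 = pct_change(pp["PR-728"], pp["PR-713"])  # PR-728's gain over baseline

# Share of PR-728's absolute gain over baseline that PR-740 keeps.
retained = (pp["PR-740"] - pp["PR-713"]) / (pp["PR-728"] - pp["PR-713"])

print(f"vs PR-728: {vs_728:+.1f}%")
print(f"vs PR-713: {vs_713:+.1f}%")
print(f"PR-728 gain over PR-713: {gain_728:+.1f}%")
print(f"retained share of that gain: {retained:.0%}")
```

For this cell the numbers come out to -13.9 % vs PR-728 and +120.4 % vs PR-713, with roughly 77 % of PR-728's gain retained.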
Key findings
- Hex on HTP v81 (Myron) regresses ~13–15 % pp / 7–15 % tg vs PR-728, consistently across all models and quants, in both smoke (cool device) and focused (warm). This is not noise: every one of the 20 focused-hex cells on Myron is between -6.6 % and -18.6 % pp. Hex on HTP v73 (S23) is essentially flat (+0.5 % smoke / -0.7 % focused pp median). PR-740 still beats PR-713 by +66 % to +77 % pp on hex, so the new code is a net win over baseline but a partial walkback of the headline PR-728 hex gain on HTP v81. The hex deltas in the llama.cpp range (PAD HVX kernel, TRI op, NORM op, MROPE/IMROPE in HTP rope, Snapdragon toolchain v0.6) are the natural suspects.
- CPU on Myron regresses more on focused (-21 %) than smoke (-13 %). This is likely thermal: PR-728's "Myron CPU +33 %" baseline number was already flagged as thermally favorable in its own report, and the focused matrix runs after smoke when the device is warmer. Net vs PR-713 baseline is still +3.7 % (focused) / +16.6 % (smoke). CPU regression on S23/Klee is smaller (-3 to -10 %) and within run-to-run noise.
- GPU (OpenCL) is approximately flat on both Myron and S23. Small per-tier wobble (S23 smoke gpu -9.4 % vs focused gpu +3.8 %) is consistent with the GPU pipeline noise floor we see across all PR runs. No regression worth flagging.
- Memory: total_mib unchanged across the board (all backends, all devices, 0.0 % median). An earlier draft of this comment flagged a +13.4 % Myron smoke-hex anomaly; that turned out to be a log-capture artifact in PR-728's myron smoke run (only the HTP0 `compute_buffer` log line reached the parser; HTP1..HTP5 were missed, leaving the cell ~234 MiB short). PR-728's myron focused, PR-740 (both tiers), and PR-713 baseline all agreed on the full per-cell values, so we patched the 9 affected cells in `pr-728/reports/poco-myron.json` by lifting `memory_buffers` from PR-728 focused (for qwen3.5-0.8b + qwen3-1.7b) and from the PR-713 baseline (for gemma-3-1b, which is smoke-only). The patch is recorded under `patches[]` and `runs[].log_signals.memory_buffers_original` in that file. Re-running the comparison gives the +0.0 % shown above. Nothing to flag on memory in PR-740.
- No backend fallbacks except the expected `gpu → opencl` rename on every gpu cell (label-only, same Adreno code path; same observation in PR-728).
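The log-capture artifact described under Memory is the kind of thing a small sanity check could catch before a comparison is published. The sketch below assumes, hypothetically, that `memory_buffers` is a mapping from buffer name to MiB and that a hexagon run on Myron should report HTP0 through HTP5; the real report schema may differ, and the 46.8 MiB per-session figure is invented so that five missing sessions add up to the ~234 MiB shortfall mentioned above.

```python
def missing_htp_sessions(memory_buffers, expected_sessions=6):
    """Return the expected HTP buffer names that are absent from the capture."""
    expected = [f"HTP{i}" for i in range(expected_sessions)]
    return [name for name in expected if name not in memory_buffers]


def total_mib(memory_buffers):
    """Sum all captured buffers, in MiB."""
    return sum(memory_buffers.values())


# Illustrative values only; not taken from any report.
full_capture = {f"HTP{i}": 46.8 for i in range(6)}
partial_capture = {"HTP0": 46.8}

missing = missing_htp_sessions(partial_capture)
shortfall = total_mib(full_capture) - total_mib(partial_capture)

print(missing)
print(f"{shortfall:.1f} MiB short")
```

A comparison script could refuse to compute a memory delta for any cell where `missing_htp_sessions` is non-empty on one side only, rather than reporting a spurious percentage.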
Caveats
- Coverage gaps: S23 focused-gpu lost 7 cells to the known per-launch GPU crash (phi-4-mini and gemma-4-e2b families). Klee focused lost 5 cells (phi-4-mini/q8_0 crash + all 4 gemma-4-e2b cells expected to OOM at load on 7.5 GiB RAM). Pattern matches PR-728 exactly; no new instability introduced by PR-740.
- Thermal: focused-matrix cells run after the smoke matrix; the device is warmer for focused. The CPU deltas should be read with that in mind.
- MTP not exercised: PR-740 ships MTP speculative-decoding parallel-API support (0.12.3), but the bench matrix is single-prompt non-speculative — this PR's MTP work is not measured here. Should be tested separately if the goal is to validate the MTP path.
- `total_mib` is from `log_signals.memory_buffers` (reliable for myron and klee; S23 hex captures only HTP0 in every report we have, so S23 mem deltas vs other PRs cancel out but absolute values understate by ~3×46 MiB). `peak_memory_mb` deltas are not reliable; they are included only in the per-cell tables on the bench host for completeness.
Recommendation
The Myron-hex perf walkback (-14 % pp median) is real and reproducible across every model/quant combination, but the absolute hex perf on Myron is still +66 % to +77 % pp above the PR-713 baseline, so users on HTP v81 still come out ahead vs anything before PR-728. HTP v73 (S23) is flat — same code path apparently doesn't hit the regression. Recommend merging if the upstream hex changes (kernel additions + toolchain v0.6) are wanted for other reasons; otherwise worth a focused investigation on whether the new HTP code paths can be tuned for v81 in a follow-up.
Reports on bench host
- `~/bench-bundle/bench-results/pr-740/reports/SUMMARY.md`
- `~/bench-bundle/bench-results/pr-740/reports/divergence-vs-pr-728.md` (full per-cell tables vs PR-728)
- `~/bench-bundle/bench-results/pr-740/reports/divergence-vs-baseline.md` (full per-cell tables vs PR-713)
- Raw per-backend reports under `~/bench-bundle/bench-results/pr-740/reports/poco-myron-{smoke,focused}-{on,off}.json` etc.
PR-740 — memory-profile, Pixel 9
Result: PASS (
Notes
Run details
Summary
Bumps `llama.rn` 0.12.1 → 0.12.3. Dependency-only upgrade (package.json + lockfiles). Native iOS + Android builds pass; targeted Jest suites green. Memory profile re-verification on iPhone 13 Pro + Pixel 9 is PENDING (must be run by a human via the memory-profile skill on physical devices before merge; see Verification below).

Effective llama.cpp range covered by this upgrade: b9204 → b9254 (50 commits) plus llama.rn-level additions.
Changes
- `package.json`: `"llama.rn": "0.12.1"` → `"llama.rn": "0.12.3"`
- `yarn.lock`: regenerated (llama.rn block only, 4+/4-)
- `ios/Podfile.lock`: `llama-rn (0.12.1)` → `llama-rn (0.12.3)`, checksum `30cce807…` → `2bb735f3…`

3 files, 7 insertions / 7 deletions. No consumer code touched; the llama.rn Jest mock surface is version-agnostic.
llama.cpp / llama.rn changelog (PocketPal-relevant)
Scoped to items that touch surfaces PocketPal actually ships. Dropped: server, web-UI, CUDA, SYCL, WebGPU, conversion-only items.
Speculative decoding (MTP)
- `qwen35.cpp` (model : clarify MTP layer comment in qwen35.cpp [no ci] ggml-org/llama.cpp#23338)

Hexagon NPU (Snapdragon)
OpenCL / Adreno
Metal (Apple)
Multimodal
- `fit_params` now takes mmproj into account (mtmd: fit_params now take into account mmproj ggml-org/llama.cpp#21489)

Core model
- `embeddings_pre_norm_masked=false` in `llama_context` ggml-org/llama.cpp#23256)

llama.rn sync points
Verification
- `yarn install` clean: `yarn.lock` change scoped to llama.rn block (4+/4-)
- `pod install` clean: `Podfile.lock` change scoped to llama-rn pod + checksum
- `yarn ios:build:release` succeeds (~182s, Build Succeeded, `PocketPal.app` produced)
- `yarn build:android:release` succeeds (~4m, BUILD SUCCESSFUL, `app-prod-release.aab` ~100 MB produced)

Draft until both memory-profile runs complete and report PASS (regression threshold: >10% AND >200 MB, per `e2e/scripts/memory-profile.sh` convention).

Risk
Dependency-only; mock surface (loadLlamaModelInfo, LlamaContext, completion, bench, getFormattedChat, initMultimodal) is unchanged. No `LlamaContextWrapper.mm` or `src/utils/*Versions.ts` edits required, which confirms the `quick` classification held. Pattern follows prior llama.rn upgrade PRs: #722 (0.12.0 stable), #728 (0.12.1).

Story: TASK-20260524-2036