chore(deps): upgrade llama.rn to 0.12.3 by a-ghorbani · Pull Request #740 · a-ghorbani/pocketpal-ai

a-ghorbani · 2026-05-24T19:00:41Z

Summary

Bumps llama.rn 0.12.1 → 0.12.3. Dependency-only upgrade (package.json + lockfiles). Native iOS + Android builds pass; targeted Jest suites green. Memory profile re-verification on iPhone 13 Pro + Pixel 9 is PENDING (must be run by human via memory-profile skill on physical devices before merge — see Verification below).

Effective llama.cpp range covered by this upgrade: b9204 → b9254 (50 commits) plus llama.rn-level additions.

Changes

package.json — "llama.rn": "0.12.1" → "llama.rn": "0.12.3"
yarn.lock — regenerated (llama.rn block only, 4+/4-)
ios/Podfile.lock — llama-rn (0.12.1) → llama-rn (0.12.3), checksum 30cce807… → 2bb735f3…

3 files, 7 insertions / 7 deletions. No consumer code touched; the llama.rn Jest mock surface is version-agnostic.

llama.cpp / llama.rn changelog (PocketPal-relevant)

Scoped to items that touch surfaces PocketPal actually ships. Dropped: server, web-UI, CUDA, SYCL, WebGPU, conversion-only items.

Verification

yarn install clean — yarn.lock change scoped to llama.rn block (4+/4-)
pod install clean — Podfile.lock change scoped to llama-rn pod + checksum
Targeted Jest suites pass — 10/10 suites, 246/246 tests pass (Node 22.21.0)
yarn ios:build:release succeeds (~182s, Build Succeeded, PocketPal.app produced)
yarn build:android:release succeeds (~4m, BUILD SUCCESSFUL, app-prod-release.aab ~100 MB produced)
PENDING — Memory profile re-verified on iPhone 13 Pro vs baseline (model qwen3-1.7b) — to be run by human via memory-profile skill on physical device
PENDING — Memory profile re-verified on Pixel 9 vs baseline (model qwen3-1.7b) — to be run by human via memory-profile skill on physical device

Draft until both memory-profile runs complete and report PASS (regression threshold: >10% AND >200 MB, per e2e/scripts/memory-profile.sh convention).

Risk

Dependency-only; mock surface (loadLlamaModelInfo, LlamaContext, completion, bench, getFormattedChat, initMultimodal) is unchanged. No LlamaContextWrapper.mm or src/utils/*Versions.ts edits required — confirms the quick classification held. Pattern follows prior llama.rn upgrade PRs: #722 (0.12.0 stable), #728 (0.12.1).

Story: TASK-20260524-2036

Generated by PocketPal Dev Team

a-ghorbani · 2026-05-24T20:38:20Z

Memory profile — iPhone 13 Pro (`agh`)

Model: qwen3-1.7b · iOS: 26.3 reported (device on 26.5) · Baseline: b4d08b6 → Current: c0ad170 · Threshold: regression if >10% AND >200 MB

Checkpoint	Baseline	Current	Δ	Δ %
app_launch	91.8 MB	83.1 MB	−8.8 MB	−9.6%
models_screen	91.3 MB	84.1 MB	−7.2 MB	−7.9%
chat_screen	93.2 MB	86.5 MB	−6.7 MB	−7.2%
model_loaded	2148.4 MB	2148.8 MB	+0.4 MB	+0.0%
chat_active	2143.9 MB	2141.8 MB	−2.2 MB	−0.1%
post_chat_idle	2140.6 MB	2133.8 MB	−6.8 MB	−0.3%
model_unloaded	140.2 MB	133.4 MB	−6.8 MB	−4.8%
Peak	2148.4 MB	2148.8 MB	+0.4 MB	+0.0%

Verdict: ✅ PASS — peak essentially flat (+0.0%); idle/UI checkpoints slightly lower (≈−7 MB across launch/models/chat screens).

Pixel 9 run pending; will follow up.

Generated by PocketPal Dev Team

a-ghorbani · 2026-05-24T22:46:17Z

PR-740 (llama.rn 0.12.1 → 0.12.3) — bench results

Smoke + focused matrix, 3 devices, matched-settings (cpu+hex flash_attn=on, gpu flash_attn=off), pp=256 tg=64 pl=1 nr=3 inter_cell_settle_ms=30000. APK from run 26371966925. Compared against PR-728 (immediately-preceding llama.rn bump 0.12.0→0.12.1) and PR-713 (canonical baseline).

Coverage

device	smoke	focused	notes
poco-myron (SD8 Elite, Adreno 840, HTP v81)	27/27	60/60	full matrix, no crashes
samsung-s23 (SD8 Gen 2, Adreno 740, HTP v73)	27/27	41/60	focused-gpu died at `phi-4-mini/q4_0` (cell 13/20) — known GPU pipeline crash after ~13 cells on this device, same failure mode as PR-728
poco-x7-klee (MT6899, cpu only)	9/9	15/20	app crash at `phi-4-mini/q8_0` (cell 16/20). gemma-4-e2b expected to OOM at load (~7.5 GiB RAM). Matches PR-728 Klee coverage.

vs PR-728

Summary — median Δ per (device, backend)

Each Δ is the median of per-cell percent changes. Absolutes aren't aggregated here because mixing model/quant cells would mix workloads; see the representative-cell table for real tok/s.

smoke

device	backend	n	Δpp	Δtg	Δtotal_mib
poco-myron	cpu	9	-13.1%	-1.6%	+0.0%
poco-myron	gpu	9	-0.1%	+0.7%	+0.0%
poco-myron	hexagon	9	-13.9%	-15.0%	+0.0%
samsung-s23	cpu	9	-4.8%	+1.6%	+0.0%
samsung-s23	gpu	9	-9.4%	-11.3%	+0.0%
samsung-s23	hexagon	9	+0.5%	+3.3%	+0.0%
poco-x7-klee	cpu	9	-3.2%	+2.6%	+0.0%

focused

device	backend	n	Δpp	Δtg	Δtotal_mib
poco-myron	cpu	20	-21.3%	-3.0%	+0.0%
poco-myron	gpu	20	-9.1%	-5.7%	+0.0%
poco-myron	hexagon	20	-13.7%	-7.7%	+0.0%
samsung-s23	cpu	14	-9.8%	-6.1%	+0.0%
samsung-s23	gpu	13	-1.6%	-0.7%	+0.0%
samsung-s23	hexagon	14	-0.7%	-4.2%	+0.0%
poco-x7-klee	cpu	15	-9.3%	+3.3%	+0.0%
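
To make the aggregation in the summary tables concrete, here is a minimal sketch of "median of per-cell percent changes" for one (device, backend) group. It is an illustration, not the bench tooling's actual code: the `pct_change` helper and the `cells` list are hypothetical, and only the first pair uses real numbers (the poco-myron hexagon smoke pp values for `qwen3-1.7b/q4_0` from the representative-cell table in this comment).

```python
from statistics import median


def pct_change(current: float, previous: float) -> float:
    """Percent change of `current` relative to `previous`."""
    return (current - previous) / previous * 100.0


# (pp PR-740, pp PR-728) pairs for a single (device, backend) group.
cells = [
    (727.2, 844.6),   # real cell: poco-myron / hexagon / smoke, qwen3-1.7b/q4_0
    (500.0, 560.0),   # hypothetical cell, for illustration only
    (900.0, 1080.0),  # hypothetical cell, for illustration only
]

# Compute the percent change per cell first, then take the median of those.
# Raw tok/s values are never summed across cells, so different workloads
# are not mixed into one number.
per_cell = [pct_change(cur, prev) for cur, prev in cells]
group_delta = median(per_cell)

print([round(d, 1) for d in per_cell], round(group_delta, 1))
```

Taking the median of the per-cell percentages keeps one unusually fast or slow model/quant cell from dominating the group figure.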

Representative cell — `qwen3-1.7b/q4_0` vs PR-728

Single fixed cell so the absolutes are real tok/s, not mixed workloads. Picked because it runs on all 3 devices, all 3 backends, both smoke and focused matrices. Deltas broadly track the summary medians (individual cells deviate by several points, e.g. Myron cpu focused pp at -29.1 % vs a -21.3 % median), suggesting the per-(device, backend) story is consistent across models. See the full per-cell tables on the bench host for any cell that deviates.

smoke

device	backend	pp PR-740	pp PR-728	Δpp	tg PR-740	tg PR-728	Δtg
poco-myron	cpu	252.7	300.8	-16.0%	38.1	38.8	-1.8%
poco-myron	gpu	586.4	585.7	+0.1%	28.7	27.4	+4.6%
poco-myron	hexagon	727.2	844.6	-13.9%	31.6	37.2	-15.0%
samsung-s23	cpu	112.0	117.3	-4.5%	17.6	18.8	-6.5%
samsung-s23	gpu	228.6	261.0	-12.4%	12.4	14.5	-14.5%
samsung-s23	hexagon	461.7	448.4	+3.0%	21.1	20.5	+2.8%
poco-x7-klee	cpu	165.8	176.8	-6.2%	21.9	20.7	+5.5%

focused

device	backend	pp PR-740	pp PR-728	Δpp	tg PR-740	tg PR-728	Δtg
poco-myron	cpu	212.1	299.0	-29.1%	38.0	38.1	-0.4%
poco-myron	gpu	536.7	588.2	-8.8%	27.1	27.8	-2.7%
poco-myron	hexagon	720.0	847.8	-15.1%	30.4	31.1	-2.0%
samsung-s23	cpu	110.2	122.6	-10.1%	17.1	19.1	-10.8%
samsung-s23	gpu	264.3	257.0	+2.8%	16.6	15.3	+8.5%
samsung-s23	hexagon	459.9	444.9	+3.4%	20.9	20.4	+2.3%
poco-x7-klee	cpu	144.5	175.9	-17.8%	22.3	21.5	+3.3%

vs PR-713 baseline

Summary — median Δ per (device, backend)

smoke

device	backend	n	Δpp	Δtg	Δtotal_mib
poco-myron	cpu	9	+16.6%	+0.5%	+0.0%
poco-myron	gpu	9	-1.7%	-0.9%	+0.0%
poco-myron	hexagon	9	+66.5%	+3.2%	+0.4%
samsung-s23	cpu	9	+7.8%	+7.2%	+0.0%
samsung-s23	gpu	9	-7.8%	-9.1%	+0.0%
samsung-s23	hexagon	9	+72.8%	+3.6%	+0.0%
poco-x7-klee	cpu	9	+11.5%	-3.2%	+0.0%

focused

device	backend	n	Δpp	Δtg	Δtotal_mib
poco-myron	cpu	20	+3.7%	+0.8%	+0.0%
poco-myron	gpu	20	-8.0%	-3.3%	+0.0%
poco-myron	hexagon	20	+77.4%	-0.8%	+0.4%
samsung-s23	cpu	14	-3.1%	+1.0%	+0.0%
samsung-s23	gpu	13	+3.8%	-7.0%	+0.0%
samsung-s23	hexagon	13	+58.8%	-1.1%	+0.0%
poco-x7-klee	cpu	15	+0.2%	+0.4%	+0.0%

Representative cell — `qwen3-1.7b/q4_0` vs PR-713 baseline

smoke

device	backend	pp PR-740	pp PR-713	Δpp	tg PR-740	tg PR-713	Δtg
poco-myron	cpu	252.7	204.8	+23.4%	38.1	37.5	+1.4%
poco-myron	gpu	586.4	586.7	-0.1%	28.7	28.7	-0.0%
poco-myron	hexagon	727.2	330.0	+120.4%	31.6	30.6	+3.2%
samsung-s23	cpu	112.0	103.9	+7.8%	17.6	17.2	+2.1%
samsung-s23	gpu	228.6	245.6	-6.9%	12.4	12.1	+2.7%
samsung-s23	hexagon	461.7	242.4	+90.5%	21.1	20.4	+3.4%
poco-x7-klee	cpu	165.8	155.8	+6.4%	21.9	22.1	-0.9%

focused

device	backend	pp PR-740	pp PR-713	Δpp	tg PR-740	tg PR-713	Δtg
poco-myron	cpu	212.1	204.8	+3.6%	38.0	37.5	+1.2%
poco-myron	gpu	536.7	586.7	-8.5%	27.1	28.7	-5.7%
poco-myron	hexagon	720.0	330.0	+118.2%	30.4	30.6	-0.5%
samsung-s23	cpu	110.2	103.9	+6.1%	17.1	17.2	-0.9%
samsung-s23	gpu	264.3	245.6	+7.6%	16.6	12.1	+37.7%
samsung-s23	hexagon	459.9	242.4	+89.8%	20.9	20.4	+2.3%
poco-x7-klee	cpu	144.5	155.8	-7.2%	22.3	22.1	+0.9%

Key findings

Hex on HTP v81 (Myron) regresses ~13–15 % pp / 7–15 % tg vs PR-728 — consistent across all models and quants, in both smoke (cool device) and focused (warm). Not noise: every one of the 20 focused-hex cells on Myron is between -6.6 % and -18.6 % pp. Hex on HTP v73 (S23) is essentially flat (+0.5 % smoke / -0.7 % focused pp median). PR-740 still beats PR-713 by +66 % to +77 % pp on hex — the new code is a net win over baseline, but a partial walkback of the headline PR-728 hex gain on HTP v81. The hex deltas in the llama.cpp range (PAD HVX kernel, TRI op, NORM op, MROPE/IMROPE in HTP rope, Snapdragon toolchain v0.6) are the natural suspects.
CPU on Myron regresses more on focused (-21 %) than smoke (-13 %) — likely thermal: PR-728's "Myron CPU +33 %" baseline number was already flagged as thermally favorable in its own report, and the focused matrix runs after smoke when the device is warmer. Net vs PR-713 baseline is still +3.7 % (focused) / +16.6 % (smoke). CPU regression on S23/Klee is smaller (-3 to -10 %) and within run-to-run noise.
GPU (OpenCL) is approximately flat on both Myron and S23. Small per-tier wobble (S23 gpu median Δpp of -9.4 % in smoke vs -1.6 % in focused) is consistent with the GPU pipeline noise floor we see across all PR runs. No regression worth flagging.
Memory: total_mib unchanged across the board (all backends, all devices, 0.0 % median). An earlier draft of this comment flagged a +13.4 % Myron smoke-hex anomaly; that turned out to be a log-capture artifact in PR-728's myron smoke run (only the HTP0 compute_buffer log line reached the parser; HTP1..HTP5 were missed, leaving the cell ~234 MiB short). PR-728's myron focused, PR-740 (both tiers), and PR-713 baseline all agreed on the full per-cell values, so we patched the 9 affected cells in pr-728/reports/poco-myron.json by lifting memory_buffers from PR-728 focused (for qwen3.5-0.8b + qwen3-1.7b) and from the PR-713 baseline (for gemma-3-1b, which is smoke-only). The patch is recorded under patches[] and runs[].log_signals.memory_buffers_original in that file. Re-running the comparison gives the +0.0 % shown above. Nothing to flag on memory in PR-740.
No backend fallbacks except the expected gpu → opencl rename on every gpu cell (label-only, same Adreno code path; same observation in PR-728).
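
The "partial walkback" on Myron hexagon can be checked with a few lines of arithmetic on the representative cell. This is a sketch under stated assumptions: it uses only the smoke pp tok/s values for `qwen3-1.7b/q4_0` on poco-myron hexagon from the tables in this comment, the variable names are made up for illustration, and the three numbers come from separate bench runs, so thermal state may differ between them.

```python
def pct_change(current: float, reference: float) -> float:
    """Percent change of `current` relative to `reference`."""
    return (current - reference) / reference * 100.0


# pp tok/s, poco-myron / hexagon / smoke, qwen3-1.7b/q4_0
pp_713 = 330.0   # PR-713 baseline
pp_728 = 844.6   # PR-728 (llama.rn 0.12.0 -> 0.12.1)
pp_740 = 727.2   # PR-740 (llama.rn 0.12.1 -> 0.12.3)

gain_728 = pct_change(pp_728, pp_713)   # PR-728 gain over baseline
net_740 = pct_change(pp_740, pp_713)    # PR-740 net change over baseline
step_740 = pct_change(pp_740, pp_728)   # PR-740 change vs PR-728

# Share of the PR-728 gain (in absolute tok/s) that PR-740 still keeps.
retained = (pp_740 - pp_713) / (pp_728 - pp_713)

print(f"PR-728 vs PR-713: {gain_728:+.1f}%")
print(f"PR-740 vs PR-713: {net_740:+.1f}%")
print(f"PR-740 vs PR-728: {step_740:+.1f}%")
print(f"gain retained:    {retained:.0%}")
```

On this one cell, PR-740 keeps roughly three quarters of the tok/s that PR-728 added over the baseline. That matches the reading that the new code is still a net win on HTP v81 while giving back part of the PR-728 gain.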

Caveats

Coverage gaps: S23 focused-gpu lost 7 cells to the known per-launch GPU crash (phi-4-mini and gemma-4-e2b families). Klee focused lost 5 cells (phi-4-mini/q8_0 crash + all 4 gemma-4-e2b cells expected to OOM at load on 7.5 GiB RAM). Pattern matches PR-728 exactly; no new instability introduced by PR-740.
Thermal: focused-matrix cells run after the smoke matrix; the device is warmer for focused. The CPU deltas should be read with that in mind.
MTP not exercised: PR-740 ships MTP speculative-decoding parallel-API support (0.12.3), but the bench matrix is single-prompt non-speculative — this PR's MTP work is not measured here. Should be tested separately if the goal is to validate the MTP path.
total_mib comes from log_signals.memory_buffers. It is reliable for myron and klee. S23 hex captures only HTP0 in every report we have, so S23 memory deltas vs other PRs cancel out, but the absolute values are understated by ~3×46 MiB. peak_memory_mb deltas are not reliable; they are included only in the per-cell tables on the bench host for completeness.

Recommendation

The Myron-hex perf walkback (-14 % pp median) is real and reproducible across every model/quant combination, but the absolute hex perf on Myron is still +66 % to +77 % pp above the PR-713 baseline, so users on HTP v81 still come out ahead vs anything before PR-728. HTP v73 (S23) is flat — same code path apparently doesn't hit the regression. Recommend merging if the upstream hex changes (kernel additions + toolchain v0.6) are wanted for other reasons; otherwise worth a focused investigation on whether the new HTP code paths can be tuned for v81 in a follow-up.

Reports on bench host

~/bench-bundle/bench-results/pr-740/reports/SUMMARY.md
~/bench-bundle/bench-results/pr-740/reports/divergence-vs-pr-728.md (full per-cell tables vs PR-728)
~/bench-bundle/bench-results/pr-740/reports/divergence-vs-baseline.md (full per-cell tables vs PR-713)
Raw per-backend reports under ~/bench-bundle/bench-results/pr-740/reports/poco-myron-{smoke,focused}-{on,off}.json etc.

a-ghorbani · 2026-05-25T09:32:02Z

PR-740 — memory-profile, Pixel 9

memory-profile spec on Pixel 9 (real device, USB), model qwen3-1.7b. PR-740 commit c0ad170 vs the tracked baseline e2e/baselines/memory/pixel-9-qwen3-1.7b.json (commit b4d08b6).

checkpoint	baseline	current	Δ	Δ%
app_launch	225.6 MB	264.8 MB	+39.2 MB	+17.4 %
models_screen	228.4 MB	245.6 MB	+17.2 MB	+7.5 %
chat_screen	216.9 MB	250.6 MB	+33.8 MB	+15.6 %
model_loaded	1732.6 MB	1797.9 MB	+65.3 MB	+3.8 %
chat_active	1809.7 MB	1865.2 MB	+55.5 MB	+3.1 %
post_chat_idle	1810.6 MB	1864.4 MB	+53.8 MB	+3.0 %
model_unloaded	345.5 MB	396.6 MB	+51.1 MB	+14.8 %
Peak	1810.6 MB	1865.2 MB	+54.5 MB	+3.0 %

Result: PASS (e2e/scripts/memory-compare.ts gate = ">10 % AND >200 MB"; nothing crosses both).
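
The two-part gate is easy to misread, so here is a minimal sketch of it applied to the Pixel 9 table above. This is not the code in e2e/scripts/memory-compare.ts. The `Checkpoint` class, the constants, and `is_regression` are hypothetical names, and the only thing taken from the PR is the rule itself (a checkpoint regresses only if it grows by more than 10 % AND by more than 200 MB) plus the table values.

```python
from dataclasses import dataclass

PCT_THRESHOLD = 10.0      # percent, from the ">10 %" half of the gate
ABS_THRESHOLD_MB = 200.0  # megabytes, from the ">200 MB" half of the gate


@dataclass
class Checkpoint:
    name: str
    baseline_mb: float
    current_mb: float


def is_regression(cp: Checkpoint) -> bool:
    """A checkpoint regresses only when BOTH thresholds are exceeded."""
    delta_mb = cp.current_mb - cp.baseline_mb
    delta_pct = delta_mb / cp.baseline_mb * 100.0
    return delta_pct > PCT_THRESHOLD and delta_mb > ABS_THRESHOLD_MB


# Pixel 9 values from the table above (baseline b4d08b6 vs current c0ad170).
checkpoints = [
    Checkpoint("app_launch", 225.6, 264.8),
    Checkpoint("models_screen", 228.4, 245.6),
    Checkpoint("chat_screen", 216.9, 250.6),
    Checkpoint("model_loaded", 1732.6, 1797.9),
    Checkpoint("chat_active", 1809.7, 1865.2),
    Checkpoint("post_chat_idle", 1810.6, 1864.4),
    Checkpoint("model_unloaded", 345.5, 396.6),
]

flagged = [cp.name for cp in checkpoints if is_regression(cp)]
verdict = "PASS" if not flagged else "FAIL"
print(verdict, flagged)
```

With this rule, app_launch (+17.4 %, +39.2 MB) exceeds the percentage threshold but not the absolute one, so it is not flagged. Requiring both conditions keeps small-footprint checkpoints from failing on a few tens of MB of drift.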

Notes

The pixel-9 baseline is at commit b4d08b6 (~2 months old), so these deltas reflect everything between that and PR-740's base, not just the llama.rn 0.12.1 → 0.12.3 bump.
App-launch, chat-screen, and model-unloaded show the biggest relative increases (+15–17 %), but in absolute terms they're small (+34 to +51 MB), well below the 200 MB half of the regression gate. Worth refreshing the baseline post-merge if anyone wants tighter signal on future PRs.
The load-bearing checkpoints for "does it leak / does it fit" — model_loaded, chat_active, post_chat_idle — move +3.0 to +3.8 %, which is within normal version-to-version drift.
iPhone 13 Pro memory-profile was already posted in the earlier comment on this PR (verdict: PASS), so both device runs are now complete.

Run details

Worktree: ~/Dev/pocketpal-dev-team/worktrees/TASK-20260524-2036
APK: bench-bundle's e2e CI artifact (app-e2e-releaseE2e.apk, MD5 2aabcc81…) — same artifact used for the perf comment above
Spec: e2e/specs/memory-profile.spec.ts, TEST_MODELS=qwen3-1.7b
Report: e2e/reports/2026-05-25T09-24-51-326/pixel-9-real/memory-profile.json
Comparison: e2e/reports/2026-05-25T09-24-51-326/pixel-9-real/memory-profile-comparison.json

chore(deps): upgrade llama.rn to 0.12.3

c0ad170

a-ghorbani marked this pull request as ready for review May 24, 2026 20:46

a-ghorbani merged commit 61dcf8f into main May 25, 2026
5 checks passed

a-ghorbani deleted the feature/TASK-20260524-2036 branch May 25, 2026 09:33

a-ghorbani mentioned this pull request May 25, 2026

chore(deps): upgrade llama.rn to 0.12.4 #743

Merged

7 tasks

