chore(deps): upgrade llama.rn to 0.12.0 (#722)
Bumps llama.rn from 0.12.0-rc.9 to 0.12.0 stable. Same shape as PR #689 (rc.5 → rc.8), PR #664 (rc.2 → rc.3), and PR #608 (0.11.0 → 0.11.3): exactly three files touched, no app-side code changes.

What ships with 0.12.0

llama.cpp bump b8827 → b9084 (257 commits, 2026-04-17 → 2026-05-09). Highlights relevant to PocketPal:

OpenCL / Adreno (Android GPU):
- opencl: add iq4_nl support (#22272, b8935) — addresses #721
- opencl: Adreno optimization for MoE — MxFP4 (#22301)
- opencl: q4_0 MoE GEMM for Adreno (#22731)
- opencl: refactor Adreno q4_0 (#22335)
- ggml: use CL_DEVICE_GLOBAL_MEM_SIZE for memory-fit estimate (#22688)

Hexagon HTP (Snapdragon NPU):
- hexagon: HMX flash attention (#22347)
- hexagon: process M-tail rows on HMX instead of HVX (#22724)
- hexagon: HTP kernel for GGML_OP_GATED_DELTA_NET (#22837)
- hexagon: L2 norm (#22816), DAIG (#22195), FILL (#22198), SOLVE_TRI (#21974) ops
- hexagon: non-contiguous row tensor support for unary ops (#22574)
- hexagon: configurable vmem and buffer size (#22487)
- hexagon: bump HMX frequency to max corner (#22334)

Metal (iOS):
- metal: optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)
- metal: workaround macOS GPU interactivity watchdog (#22216)
- metal: fix event synchronization (#22260)
- metal: print GPU description (#22318)

Memory / robustness:
- llama: add option to save memory in device buffers (#22679)
- model: don't crash on unsupported architecture (#22742)
- llama: add missing call to ggml_backend_load_all() (#22752)
- fix recurrent state serialization for partial reads/writes (#22362)
- llama: fix device state save/load (#22805)
- fix type casting for unaccounted memory calculation (#22424)

New model architectures:
- Mimo v2.5 (#22493)
- MiniCPM-V 4.6 (mtmd, #22529)
- Sarashina2.2-vision-3b (#22103)
- Reka Edge 2603 (mtmd, #21616)
- HunyuanVL update (#22037)
- Granite Speech 4.0-1b (mtmd, #22101)
- Gemma4 family: detection, parsing, NVFP4 variant
- Nemotron Nano 3 Omni convert (#22481)

Tool-calling / reasoning:
- chat: parallel_tool_calls default by model capability (#22217)
- common/autoparser: newline handling / forced tool-call fixes (#22654)
- common/autoparser: allow space after tool call (#22073)
- chat: fix handling of space in reasoning markers (#22353)
- common: re-arm reasoning budget after DONE on new <think> (#22323)
- common: don't pass prompt tokens to reasoning budget sampler (#22488)

Tokenizer:
- fix GLM-DSA crash in llama-tokenize when using vocab_only (#22102)

llama.rn fix included in 0.12.0:
- fix(cpp, jsi): avoid blocking ui during backend init (mybigday/llama.rn@09e69c2) — backend init and device probing moved off the JS thread, reducing UI jank during first initContext.

Changes

- package.json — pin bumped to 0.12.0
- yarn.lock — only the llama.rn@… entry resolved to the stable tarball; no unrelated drift
- ios/Podfile.lock — llama-rn (0.12.0) with refreshed checksum 3abe0ea5… → 06746f84…; no other pods drift

Verification (NATIVE_CHANGES=YES)

- pod install — clean
- iOS Release build (yarn ios:build:e2e) — PASS
- Android Release build (./gradlew assembleRelease) — PASS
- E2E quick-smoke on iPhone 17 Pro Simulator — 1/1 PASS
- Lint — 0 errors (4 pre-existing warnings)
- TypeCheck — PASS
- Jest — 159 suites / 2222 passed / 2 skipped / 0 failed

Story: TASK-20260512-0948
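For context on where the llama.rn fix called out above lands in app code, here is a minimal sketch of a first context init using llama.rn's documented initLlama API; the model path and parameter values are placeholders, and PocketPal's actual wrapper around this call is not shown.

```ts
// Minimal sketch (assumptions: llama.rn's documented initLlama API;
// model path and parameter values are placeholders, not PocketPal's real settings).
import {initLlama} from 'llama.rn';

export async function initFirstContext(modelPath: string) {
  // With 0.12.0, backend init and device probing run off the JS thread,
  // so this first call should no longer cause UI jank while backends load.
  const context = await initLlama({
    model: modelPath, // e.g. a local GGUF file path
    n_ctx: 2048, // context length
    n_gpu_layers: 99, // offload layers to the GPU backend where supported
  });
  return context;
}
```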
Force-pushed from 066fc98 to 61f0d0b.
Bench results — mapped to the llama.cpp features called out in the PR

Ran the standard bench matrix on both devices (Myron and S23), PR-713 baseline APK vs the PR-722 APK.
Hexagon HMX flash attention (#22347)

The biggest measured win in the PR. Lights up most clearly on small, attention-heavy models with natively HTP-supported quants.

Myron — HTP v81 — 12 / 12 cells faster, no regressions:
S23 — HTP v73 — 6 / 12 clear wins, rest flat, no regressions:
LFM2.5 (SSM hybrid) shows less benefit — expected, SSM has less attention compute for HMX FA to accelerate. q4_K_M / q6_K on S23 hex are flat because they aren't natively HTP-supported; they fall back through a non-FA dequantize path.

Adreno q4_0 GEMM + MoE optimizations (#22335, #22301, #22731)

Myron — Adreno 840 — 13 cells:
S23 — Adreno 740 — 11 cells:
qwen3-1.7b q4_0 GPU is the cleanest q4_0-GEMM win (+20.3 % pp / +53.3 % tg on S23, +21.6 % tg on Myron). The qwen3.5-0.8b q4_0 / q4_K_M cells on S23 GPU show a small pp drop (−7 % to −13 %) — worth a follow-up read; absolute numbers are very low on that pair (~60 pp tps), so it may be measurement noise, but flagging it.

OpenCL IQ4_NL (#22272, the #721 ask)

Ad-hoc run with 3 models × IQ4_NL × opencl. All 6 PR-722 cells loaded cleanly.

S23 — Adreno 740 — same APK pair (PR-713 baseline vs PR-722), same IQ4_NL GGUFs, same device:
So PR-713 already had a basic OpenCL IQ4_NL path (the kernels existed pre-b8935), but tg was around 5 t/s — barely usable — and phi-4-mini crashed at model load. PR-722's upstream sync lands the improved kernels: tg roughly triples / doubles, pp doubles on the larger model, and phi-4-mini now runs cleanly. Total memory unchanged.

Myron — Adreno 840 — PR-722 only (didn't re-run on the PR-713 APK to avoid the HyperOS install dance):
Note: even on PR-722, IQ4_NL prefill is still ~30–50 % slower than Q4_0 on Adreno 840 — expected since IQ4_NL is a more complex quant. The wins are kernel correctness, stability on large models, and the tg jump. This closes the verification ask in #721.

Metal (iOS) — not measured
Bonus: CPU wins (not predicted by the PR body)

Myron — 12 / 12 cells faster, +13 % to +48 % pp, no regressions:
S23 — 4 cells +10 % to +17 % pp, rest flat, no regressions:
Most likely a knock-on from the broader llama.cpp upstream sync rather than any of the named Adreno/Hexagon work.

Coverage gaps
Full raw JSON reports live on the bench host.
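For anyone re-deriving the percentage deltas quoted above from those raw reports, a rough TypeScript sketch follows; the report shape (one record per cell with pp_tps / tg_tps fields) and the file names are assumptions, not the bench host's actual schema.

```ts
// Rough sketch for re-deriving the pp/tg percentage deltas quoted above.
// Assumption: each raw report is a JSON array of cells shaped like BenchCell;
// the actual schema and file names on the bench host may differ.
import {readFileSync} from 'fs';

interface BenchCell {
  model: string; // e.g. "qwen3-1.7b"
  quant: string; // e.g. "q4_0"
  backend: string; // e.g. "opencl" | "hexagon" | "cpu"
  pp_tps: number; // prefill tokens/sec
  tg_tps: number; // generation tokens/sec
}

// +20.3 % pp means (after - before) / before = 0.203
const pct = (before: number, after: number) => ((after - before) / before) * 100;

const baseline: BenchCell[] = JSON.parse(readFileSync('pr-713.json', 'utf8'));
const candidate: BenchCell[] = JSON.parse(readFileSync('pr-722.json', 'utf8'));

for (const b of baseline) {
  const c = candidate.find(
    x => x.model === b.model && x.quant === b.quant && x.backend === b.backend,
  );
  if (!c) continue;
  console.log(
    `${b.model} ${b.quant} ${b.backend}: ` +
      `pp ${pct(b.pp_tps, c.pp_tps).toFixed(1)}%  tg ${pct(b.tg_tps, c.tg_tps).toFixed(1)}%`,
  );
}
```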
Summary
Routine native-dependency version bump:
llama.rn 0.12.0-rc.9 → 0.12.0 (stable). Same shape as PR #689 (rc.5 → rc.8), PR #664 (rc.2 → rc.3), and PR #608 (0.11.0 → 0.11.3): exactly three files touched and no app-side code changes.
What ships with 0.12.0
llama.cpp bump b8827 → b9084 (257 commits, 2026-04-17 → 2026-05-09). Highlights relevant to PocketPal:
- Metal (iOS): GGML_OP_MUL_MAT Tensor API optimization, macOS GPU watchdog workaround, event-sync fix.
- Memory / robustness: llama: add option to save memory in device buffers (#22679), graceful error on unsupported architecture, recurrent state serialization fix.
- llama.rn: fix(cpp, jsi): avoid blocking ui during backend init (mybigday/llama.rn@09e69c2) — backend init + device probing moved off the JS thread, reducing UI jank during first initContext.

Changes
- package.json — pin bumped to 0.12.0
- yarn.lock — only the llama.rn@… entry resolved to the stable tarball; no unrelated drift
- ios/Podfile.lock — llama-rn (0.12.0) with refreshed checksum 3abe0ea5… → 06746f84…; no other pods drift

Verification (NATIVE_CHANGES=YES)
- pod install — clean (Podfile.lock committed)
- iOS Release build (yarn ios:build:e2e) — PASS (.app artefact in worktree)
- Android Release build (./gradlew assembleRelease) — PASS (app-prod-release.apk in worktree)
- E2E quick-smoke on iPhone 17 Pro Simulator — 1/1 PASS (smollm2-135m loaded + streamed tokens through the new native bridge), e2e/reports/2026-05-12T08-02-52-111/summary.json

Story: TASK-20260512-0948
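As a rough illustration of what the quick-smoke exercise covers (load a small model, stream a few tokens through the native bridge), here is a sketch using llama.rn's completion API; the prompt, model path, and pass condition are illustrative placeholders, not the actual E2E test code.

```ts
// Illustrative sketch of the "load + stream tokens" smoke path.
// Assumptions: llama.rn's documented completion API; model path, prompt and
// the token-count check are placeholders, not the real E2E test.
import {initLlama} from 'llama.rn';

export async function smokeStream(modelPath: string): Promise<boolean> {
  const context = await initLlama({model: modelPath, n_ctx: 512});
  let streamed = 0;

  await context.completion(
    {prompt: 'Hello', n_predict: 16},
    () => {
      // Partial-completion callback: fires once per generated token.
      streamed += 1;
    },
  );

  await context.release();
  return streamed > 0; // smoke passes if at least one token streamed back
}
```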
Generated by PocketPal Dev Team