QVAC-11621 fix: KV cache memory validation to prevent OOM crashes by simon-iribarren · Pull Request #1642 · tetherto/qvac

simon-iribarren · 2026-04-17T09:47:12Z

Summary

- Reads model GGUF header to extract architecture params and compute exact KV cache bytes/token
- Validates estimated memory usage before loading, throws structured error with suggested ctx_size if it would OOM
- Falls back to file-size heuristic brackets when GGUF metadata unavailable
- unsafeDisableMemoryValidation config flag to bypass the check

Note: This is the SDK-side implementation. A future addon-side implementation can provide even more precise calculations using the actual KV cache quantization config (cache_type_k, cache_type_v).

How it works

Exact KV cache computation from GGUF metadata:

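- Parses GGUF header for block_count, head_count_kv, embedding_length, head_count
- Formula (bytes per token): 2 * n_layer * n_kv_heads * head_dim * dtype_size, where the factor 2 covers the K and V tensors, head_dim = embedding_length / head_count, and dtype_size defaults to f16 (2 bytes)
- Logs [exact] or [heuristic] in validation output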
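
As an illustration of the formula above, here is a minimal sketch, not the SDK's actual code. The `GgufArchParams` shape and the `kvBytesPerToken` name are hypothetical; the per-element sizes for `q8_0` and `q4_0` come from the ggml block layouts (32 values stored in 34 and 18 bytes), and the Llama 3.2 1B numbers are taken from its public model configuration.

```typescript
// Bytes per cached element for a few llama.cpp KV cache types.
// f16 is 2 bytes; q8_0 and q4_0 pack 32 values into 34 and 18 bytes.
const KV_DTYPE_BYTES = {
  f16: 2,
  q8_0: 34 / 32,
  q4_0: 18 / 32,
} as const;

type KvDtype = keyof typeof KV_DTYPE_BYTES;

// Hypothetical shape: fields mirror the GGUF keys named in this PR.
interface GgufArchParams {
  blockCount: number; // n_layer
  headCountKv: number; // n_kv_heads
  embeddingLength: number;
  headCount: number;
}

function kvBytesPerToken(p: GgufArchParams, dtype: KvDtype = "f16"): number {
  // Assumes head_dim = embedding_length / head_count
  // (no explicit key/value length override in the metadata).
  const headDim = p.embeddingLength / p.headCount;
  // The factor 2 accounts for storing both K and V.
  return 2 * p.blockCount * p.headCountKv * headDim * KV_DTYPE_BYTES[dtype];
}

// Llama 3.2 1B: 16 layers, 8 KV heads, head_dim = 2048 / 32 = 64.
const llama1b: GgufArchParams = {
  blockCount: 16,
  headCountKv: 8,
  embeddingLength: 2048,
  headCount: 32,
};

const f16PerToken = kvBytesPerToken(llama1b); // 32768 bytes (32 KiB)
const q4PerToken = kvBytesPerToken(llama1b, "q4_0"); // 9216 bytes
```

The f16 result matches the ~32KB/token figure quoted in the commit history for a 1B model. With a `q4_0` cache the same model needs roughly 72% less, so an f16-only estimate can substantially overstate the requirement when the cache is quantized.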

Plugin-level validation (validateBeforeLoad hook):

- New optional hook on QvacPlugin interface, called before createModel()
- Compares model_size + kv_cache + overhead against 80% of available memory
- Throws ModelMemoryExceededError (code 52211) with suggestedCtxSize
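
The check described above could look roughly like the sketch below. It is an assumption-laden illustration: the error class is a local stand-in for the SDK's `ModelMemoryExceededError` (only the code 52211 and `suggestedCtxSize` come from this PR), `checkMemory` and `MemoryCheckInput` are hypothetical names, the overhead term (128MB + 10% of model file size) is taken from the commit history, and the way `suggestedCtxSize` is derived here is a guess at one reasonable approach.

```typescript
const MIB = 1024 * 1024;

// Local stand-in for the SDK error; not the real class.
class ModelMemoryExceededError extends Error {
  readonly code = 52211;
  readonly suggestedCtxSize: number;
  constructor(message: string, suggestedCtxSize: number) {
    super(message);
    this.name = "ModelMemoryExceededError";
    this.suggestedCtxSize = suggestedCtxSize;
  }
}

// Hypothetical input shape; all sizes are in bytes.
interface MemoryCheckInput {
  modelFileSize: number;
  availableMemory: number; // already scaled by the platform fraction
  kvBytesPerToken: number;
  ctxSize: number;
}

function checkMemory(input: MemoryCheckInput): void {
  const { modelFileSize, availableMemory, kvBytesPerToken, ctxSize } = input;
  // Overhead model from the commit history: 128MB + 10% of the model file.
  const overhead = 128 * MIB + 0.1 * modelFileSize;
  // Only 80% of the available memory is treated as usable.
  const budget = 0.8 * availableMemory;
  const required = modelFileSize + kvBytesPerToken * ctxSize + overhead;
  if (required <= budget) return;

  // Assumption: suggest the largest ctx_size whose KV cache still fits.
  const roomForKv = budget - modelFileSize - overhead;
  const suggested = Math.max(0, Math.floor(roomForKv / kvBytesPerToken));
  throw new ModelMemoryExceededError(
    `Estimated ${Math.round(required / MIB)} MiB exceeds budget of ` +
      `${Math.round(budget / MIB)} MiB; try ctx_size ${suggested}`,
    suggested,
  );
}
```

For a 1000 MiB model at 32 KiB/token with 8 GiB available, a ctx_size of 4096 passes, while 3276000 (the value used in the e2e test) is rejected with a suggestion of 170419.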

Platform memory detection:

bare-os.totalmem() with conservative fractions (70% desktop, 65% mobile)
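
A small sketch of how the platform fractions might be applied. The function takes the total memory as a parameter (in the SDK it would come from bare-os.totalmem()); `PlatformClass`, `estimateAvailableMemory`, and `usableBudget` are hypothetical names used only for illustration.

```typescript
type PlatformClass = "desktop" | "mobile";

// Fractions of total memory treated as available, as stated in this PR.
const AVAILABLE_FRACTION: Record<PlatformClass, number> = {
  desktop: 0.7,
  mobile: 0.65,
};

// Conservative estimate of memory the app can realistically use.
function estimateAvailableMemory(totalBytes: number, platform: PlatformClass): number {
  return Math.floor(totalBytes * AVAILABLE_FRACTION[platform]);
}

// Budget after the validator's 80% threshold is applied on top.
function usableBudget(totalBytes: number, platform: PlatformClass): number {
  return Math.floor(0.8 * estimateAvailableMemory(totalBytes, platform));
}
```

Combined with the 80% threshold, the effective budget is about 56% of total memory on desktop and 52% on mobile.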

API Changes

// New plugin hook
interface ValidateBeforeLoadParams {
  modelConfig: Record<string, unknown>;
  modelFileSize: number;
  availableMemory: number;
  kvBytesPerToken?: number;
}

// New error
try {
  // ... load the model with the desired ctx_size ...
} catch (error) {
  if (error instanceof ModelMemoryExceededError) {
    console.log(`Try ctx_size: ${error.suggestedCtxSize}`);
  }
}

// Config flag
{ "unsafeDisableMemoryValidation": true }
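
One way a caller might act on the error is to retry once with the suggested value, which is what the "recover with suggested" e2e test exercises. The sketch below is hypothetical: `Loader` stands in for whatever SDK call loads the model, and the error is detected by its `suggestedCtxSize` property rather than by importing the real class.

```typescript
// Hypothetical loader signature; the real SDK call is not shown in this PR.
type Loader<T> = (ctxSize: number) => Promise<T>;

// Duck-typed check so the sketch does not depend on the SDK's error class.
function hasSuggestedCtxSize(e: unknown): e is { suggestedCtxSize: number } {
  return (
    typeof e === "object" &&
    e !== null &&
    typeof (e as { suggestedCtxSize?: unknown }).suggestedCtxSize === "number"
  );
}

async function loadWithCtxFallback<T>(
  load: Loader<T>,
  ctxSize: number,
): Promise<{ model: T; ctxSize: number }> {
  try {
    return { model: await load(ctxSize), ctxSize };
  } catch (error) {
    // Rethrow anything that is not a memory rejection with a usable suggestion.
    if (!hasSuggestedCtxSize(error) || error.suggestedCtxSize < 1) throw error;
    // Retry exactly once with the size the validator proposed.
    const retryCtx = error.suggestedCtxSize;
    return { model: await load(retryCtx), ctxSize: retryCtx };
  }
}
```

Retrying only once keeps the behavior predictable: if the suggested size is also rejected, the second error propagates to the caller.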

Testing

- 41 unit tests: exact KV computation (Llama-1B, Gemma-4B, Llama-8B), heuristic brackets, false-positive regression against all mobile test configs on 4/6/8/12GB devices
- 5 e2e test definitions: memory exceeded, default ctx safe, moderate ctx safe, recover with suggested, load after rejection
- Manual desktop testing with extreme ctx_size values

Recreated from #1464 to move branch onto the fork (post-migration). No code changes vs. the original PR — only rebased onto latest main.

jesusmb1995 · 2026-04-17T10:47:20Z

I think we need to consider KV quant from the start as well, not just F16; being 75% off seems too much. Especially given the current push/hype towards TurboQuant (6x smaller than F16 with minimal quality loss on long context) and our focus on embedded devices and mobile, which have tighter memory constraints.

github-actions · 2026-04-17T10:47:42Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

…crashes (QVAC-11621)

Add pre-load memory estimation for LLM models to prevent OOM crashes on memory-constrained devices (especially iOS Metal backend).

- Harden ctx_size schema: int().min(1).max(131072)
- Add validateBeforeLoad plugin hook to QvacPlugin interface
- Implement memory estimator using model file size as proxy for KV cache cost (256-2048 bytes/token based on model size bracket)
- LLM plugin checks estimated memory against 80% of available memory via bare-os, throws ModelMemoryExceededError with suggestedCtxSize
- Integrate validation in load-model handler after config resolution
- Add 20 unit tests for estimator and schema validation

Made-with: Cursor

…dling Made-with: Cursor

Replace HuggingFace URL with LLAMA_3_2_1B_INST_Q4_0 registry constant. Made-with: Cursor

os.freemem() on macOS/iOS returns only truly unallocated pages which is always tiny due to aggressive file caching. Use totalMemory with a conservative fraction (70% desktop, 50% mobile) for a realistic estimate of available memory. Add debug logging for memory validation. Made-with: Cursor

…ndle it

The .max(131072) cap on ctx_size caused a raw Zod union error instead of reaching the validateBeforeLoad hook, which provides an actionable error message with suggested ctx_size. Schema now only enforces positive integer; runtime memory validation handles the upper bound.

Made-with: Cursor

The validation was silently skipped when bare-fs stat failed (modelFileSize=0). Now: always call validateBeforeLoad regardless, log at info level so results are always visible, and warn on stat failures instead of silencing them. Made-with: Cursor

The previous heuristic used 256 bytes/token for small models, but actual llama.cpp KV cache (f16) is ~32KB/token for a 1B model. This caused the memory validator to approve allocations that would crash Metal.

New brackets calibrated from real model architectures:

- <1GB: 48KB/tok
- 1-3GB: 128KB/tok
- 3-6GB: 200KB/tok
- 6-15GB: 350KB/tok
- >15GB: 500KB/tok

Adds real-world regression tests (1B model + extreme ctx_size on 24GB).

Made-with: Cursor
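
The brackets in this commit message can be read as a simple lookup table. The sketch below is illustrative only: `heuristicKvBytesPerToken` is a hypothetical name, the units are assumed to be binary (1GB = 1024^3 bytes, 1KB = 1024 bytes), and the handling of exact boundary values (a file of exactly 1GB falls into the 1-3GB bracket) is an assumption since the message does not specify it.

```typescript
const KIB = 1024;
const GIB = 1024 ** 3; // assumption: the commit's "GB" is treated as binary

// [exclusive upper bound on model file size, KV bytes per token]
const KV_BRACKETS: ReadonlyArray<readonly [number, number]> = [
  [1 * GIB, 48 * KIB],
  [3 * GIB, 128 * KIB],
  [6 * GIB, 200 * KIB],
  [15 * GIB, 350 * KIB],
  [Infinity, 500 * KIB],
];

// Fallback estimate used when GGUF metadata cannot be read.
function heuristicKvBytesPerToken(modelFileSize: number): number {
  for (const [limit, bytesPerToken] of KV_BRACKETS) {
    if (modelFileSize < limit) return bytesPerToken;
  }
  // Only reached for non-finite input; fall back to the largest bracket.
  return 500 * KIB;
}
```

For a 1B model under 1GB on disk this yields 48KB/token, above the ~32KB/token f16 figure quoted above, so the fallback errs on the side of rejecting rather than approving a load that might crash.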

Loads LLAMA_3_2_1B_INST_Q4_0 with ctx_size=3276000 and verifies that MODEL_MEMORY_EXCEEDED is thrown instead of crashing the native addon. Runs on both mobile and desktop test consumers. Made-with: Cursor

Adds `memoryValidation: false` to qvac.config to disable the pre-load memory estimator. Enabled by default. Allows disabling on platforms where bare-os.totalmem() behavior is unverified (e.g. mobile). Made-with: Cursor

Inverts the boolean semantics and adds "unsafe" prefix to clearly communicate that disabling memory validation risks OOM crashes. Made-with: Cursor

- error-memory-default-ctx-safe: load with default ctx_size succeeds
- error-memory-moderate-ctx-safe: load with ctx_size=4096 succeeds
- error-memory-recover-with-suggested: reject extreme ctx_size, retry with suggestedCtxSize from the error, confirm it loads
- error-memory-load-after-rejection: reject extreme ctx_size, then load with default to verify SDK isn't left in a broken state

Made-with: Cursor

- Replace flat 512MB overhead with proportional model (128MB + 10% of model file size); avoids over-penalizing large models on small devices
- Raise mobile available memory fraction from 50% to 65%; matches real-world iOS/Android behavior (apps use 60-70% before jetsam)
- Add false-positive regression tests for every model+ctx_size config used in the mobile test suite against 4GB/6GB/8GB/12GB devices
- Verified: LLAMA 1B, QWEN 1.7B, SmolVLM 500M pass on 4GB+; SALAMANDRATA 2B and AFRICAN 4B pass on 8GB+ (realistic for those models)

Made-with: Cursor

Parse model GGUF header to extract architecture params (block_count, head_count_kv, embedding_length, head_count) and compute exact KV cache bytes per token using the llama.cpp formula:

2 * n_layer * n_kv_heads * head_dim * dtype_size

Falls back to file-size-based heuristic brackets when GGUF metadata is unavailable (e.g. stat/read failure). Logs whether validation used [exact] or [heuristic] path for debugging.

Addresses feedback to use the correct equation instead of conservative estimates that vary by model architecture and quantization.

Made-with: Cursor

Memory validation will be implemented on the addon side where the C++ layer has access to exact KV cache quantization config (cache_type_k, cache_type_v) and can compute precise memory requirements. Removes: validateBeforeLoad hook, memory-estimator, GGUF metadata reader, platform detection, ModelMemoryExceededError, config flag, all associated unit and e2e tests. Made-with: Cursor

This reverts commit 7304fa0.

simon-iribarren requested review from a team as code owners April 17, 2026 09:47

simon-iribarren had a problem deploying to release April 17, 2026 09:47 — with GitHub Actions Failure

simon-iribarren mentioned this pull request Apr 17, 2026

QVAC-11621: KV cache memory validation to prevent OOM crashes #1464

Closed

simon-iribarren changed the title ~~QVAC-11621: KV cache memory validation to prevent OOM crashes~~ QVAC-11621 fix: KV cache memory validation to prevent OOM crashes Apr 17, 2026

simon-iribarren had a problem deploying to release April 17, 2026 09:48 — with GitHub Actions Failure

simon-iribarren added 15 commits April 20, 2026 22:34

doc: add memory-safe-loading example showing OOM prevention error han…

de3dc62

doc: use SDK model constant in memory-safe-loading example

9ba4fb4

test: add error-memory-exceeded test case for mobile + desktop

93191d6

mod: rename memoryValidation to unsafeDisableMemoryValidation

587061b

Revert "mod: remove SDK-side KV cache memory validation"

f59a6c8

simon-iribarren force-pushed the fix/qvac-11621-kv-cache-memory-validation branch from d1e9e77 to f59a6c8 Compare April 20, 2026 20:34

simon-iribarren had a problem deploying to release April 20, 2026 20:35 — with GitHub Actions Failure

simon-iribarren added tier1 verify labels Apr 20, 2026

simon-iribarren had a problem deploying to release April 20, 2026 20:35 — with GitHub Actions Failure

simon-iribarren mentioned this pull request Apr 21, 2026

QVAC-12239 feat[api]: add qvac doctor command and system-requirements doc #1681

Merged

simon-iribarren closed this Apr 23, 2026

simon-iribarren wants to merge 15 commits into tetherto:main from simon-iribarren:fix/qvac-11621-kv-cache-memory-validation
