QVAC-11621 fix: KV cache memory validation to prevent OOM crashes#1642

Closed
simon-iribarren wants to merge 15 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-11621-kv-cache-memory-validation

Conversation

@simon-iribarren

Summary

  • Reads model GGUF header to extract architecture params and compute exact KV cache bytes/token
  • Validates estimated memory usage before loading, throws structured error with suggested ctx_size if it would OOM
  • Falls back to file-size heuristic brackets when GGUF metadata unavailable
  • unsafeDisableMemoryValidation config flag to bypass the check

Note: This is the SDK-side implementation. A future addon-side implementation can provide even more precise calculations using the actual KV cache quantization config (cache_type_k, cache_type_v).

How it works

Exact KV cache computation from GGUF metadata:

  • Parses GGUF header for block_count, head_count_kv, embedding_length, head_count
  • Formula (bytes per token): 2 * n_layer * n_kv_heads * head_dim * dtype_size, where dtype_size defaults to f16 (2 bytes)
  • Logs [exact] or [heuristic] in validation output
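
As a rough illustration of the formula above, here is a minimal sketch in TypeScript. The `GgufArchParams` interface and `kvBytesPerToken` function are hypothetical names, not the PR's actual code, and deriving head_dim as embedding_length / head_count is an assumption based on the parameters the PR says it reads. The example values are assumed to match a Llama-3.2-1B-style architecture.

```typescript
// Hypothetical container for the four GGUF header values named above.
interface GgufArchParams {
  blockCount: number;      // n_layer
  headCountKv: number;     // n_kv_heads
  embeddingLength: number; // hidden size
  headCount: number;       // attention heads
}

// KV cache bytes needed per token of context.
// dtypeSize defaults to 2 bytes (f16), the default stated in the PR.
function kvBytesPerToken(p: GgufArchParams, dtypeSize = 2): number {
  // Assumption: head_dim = embedding_length / head_count.
  const headDim = p.embeddingLength / p.headCount;
  // The leading 2 covers one K and one V tensor per layer.
  return 2 * p.blockCount * p.headCountKv * headDim * dtypeSize;
}

// Assumed 1B-class values: 16 layers, 8 KV heads, 2048 hidden, 32 heads.
const llama1b: GgufArchParams = {
  blockCount: 16,
  headCountKv: 8,
  embeddingLength: 2048,
  headCount: 32,
};

const perToken = kvBytesPerToken(llama1b); // 32768 bytes (32 KiB) per token
const at4096 = perToken * 4096;            // 134217728 bytes (128 MiB) at ctx_size 4096
```

Under these assumed values the result agrees with the roughly 32KB/token figure for a 1B model cited in the commit history below.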

Plugin-level validation (validateBeforeLoad hook):

  • New optional hook on QvacPlugin interface, called before createModel()
  • Compares model_size + kv_cache + overhead against 80% of available memory
  • Throws ModelMemoryExceededError (code 52211) with suggestedCtxSize
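
The comparison described above might look roughly like the following sketch. `checkMemory` and `MemoryCheck` are hypothetical names, and the function returns a result object instead of throwing so that it stays self-contained. The overhead term uses the proportional model from this PR's commit history (128MB + 10% of model file size), read here as MiB, which is an assumption. How the PR actually derives suggestedCtxSize is not shown in the description, so the derivation below is only one plausible reading.

```typescript
interface MemoryCheck {
  ok: boolean;
  estimatedBytes: number;
  budgetBytes: number;
  suggestedCtxSize: number;
}

const MiB = 1024 * 1024;

function checkMemory(
  modelFileSize: number,   // bytes on disk, used as a proxy for weight memory
  kvPerToken: number,      // KV cache bytes per token (exact or heuristic)
  ctxSize: number,         // requested ctx_size
  availableMemory: number, // bytes the platform layer considers usable
): MemoryCheck {
  // Overhead: 128 MiB plus 10% of the model file size (from the commit history).
  const overhead = 128 * MiB + 0.1 * modelFileSize;
  // Budget: 80% of available memory, as stated in the PR description.
  const budgetBytes = 0.8 * availableMemory;
  const estimatedBytes = modelFileSize + kvPerToken * ctxSize + overhead;

  // Assumed derivation: the largest ctx_size whose KV cache fits in what is left.
  const room = budgetBytes - modelFileSize - overhead;
  const suggestedCtxSize = Math.max(0, Math.floor(room / kvPerToken));

  return { ok: estimatedBytes <= budgetBytes, estimatedBytes, budgetBytes, suggestedCtxSize };
}

// Example: a 1 GiB model at 32 KiB/token with an extreme ctx_size on 16 GiB available.
const rejected = checkMemory(1024 * MiB, 32 * 1024, 3_276_000, 16 * 1024 * MiB);
// Retrying with the suggested value should fit within the same budget.
const retried = checkMemory(1024 * MiB, 32 * 1024, rejected.suggestedCtxSize, 16 * 1024 * MiB);
```

In the PR itself a failed check surfaces as ModelMemoryExceededError carrying suggestedCtxSize; the retry in the example mirrors the "recover with suggested" e2e test.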

Platform memory detection:

  • bare-os.totalmem() with conservative fractions (70% desktop, 65% mobile)
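
A minimal sketch of how the platform fractions could be applied, assuming the total memory figure has already been obtained (the PR reads it through bare-os). `usableMemory` and its `isMobile` parameter are hypothetical names used only for illustration.

```typescript
// Fraction of total memory treated as available, per the PR description.
const DESKTOP_FRACTION = 0.7;
const MOBILE_FRACTION = 0.65;

// Returns the number of bytes treated as "available" for validation.
function usableMemory(totalMemBytes: number, isMobile: boolean): number {
  const fraction = isMobile ? MOBILE_FRACTION : DESKTOP_FRACTION;
  return Math.floor(totalMemBytes * fraction);
}

// Example: a 4 GiB phone. The plugin then budgets 80% of this value,
// so the effective ceiling is about 52% of physical memory.
const phoneAvailable = usableMemory(4 * 1024 ** 3, true);
```

Using a fraction of total memory rather than free memory follows the reasoning in the commit history: free-page counts on macOS/iOS are consistently tiny because of file caching.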

API Changes

// New plugin hook
interface ValidateBeforeLoadParams {
  modelConfig: Record<string, unknown>;
  modelFileSize: number;
  availableMemory: number;
  kvBytesPerToken?: number;
}

// New error
try {
  // ... load the model here (call not shown in the original description) ...
} catch (error) {
  if (error instanceof ModelMemoryExceededError) {
    console.log(`Try ctx_size: ${error.suggestedCtxSize}`);
  }
}

// Config flag
{ "unsafeDisableMemoryValidation": true }

Testing

  • 41 unit tests: exact KV computation (Llama-1B, Gemma-4B, Llama-8B), heuristic brackets, false-positive regression against all mobile test configs on 4/6/8/12GB devices
  • 5 e2e test definitions: memory exceeded, default ctx safe, moderate ctx safe, recover with suggested, load after rejection
  • Manual desktop testing with extreme ctx_size values

Recreated from #1464 to move branch onto the fork (post-migration). No code changes vs. the original PR — only rebased onto latest main.

@simon-iribarren simon-iribarren requested review from a team as code owners April 17, 2026 09:47
@simon-iribarren simon-iribarren changed the title QVAC-11621: KV cache memory validation to prevent OOM crashes QVAC-11621 fix: KV cache memory validation to prevent OOM crashes Apr 17, 2026
@jesusmb1995

jesusmb1995 commented Apr 17, 2026

I think we need to consider KV quant from the start as well, not just F16; being 75% off seems too much. Especially given the current push/hype toward TurboQuant (6x smaller than F16 with minimal quality loss on long context) and our focus on embedded and mobile devices, which have tighter memory constraints.

@github-actions

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

…crashes (QVAC-11621)

Add pre-load memory estimation for LLM models to prevent OOM crashes
on memory-constrained devices (especially iOS Metal backend).

- Harden ctx_size schema: int().min(1).max(131072)
- Add validateBeforeLoad plugin hook to QvacPlugin interface
- Implement memory estimator using model file size as proxy for KV
  cache cost (256-2048 bytes/token based on model size bracket)
- LLM plugin checks estimated memory against 80% of available memory
  via bare-os, throws ModelMemoryExceededError with suggestedCtxSize
- Integrate validation in load-model handler after config resolution
- Add 20 unit tests for estimator and schema validation

Made-with: Cursor
Replace HuggingFace URL with LLAMA_3_2_1B_INST_Q4_0 registry constant.

Made-with: Cursor
os.freemem() on macOS/iOS returns only truly unallocated pages which
is always tiny due to aggressive file caching. Use totalMemory with
a conservative fraction (70% desktop, 50% mobile) for a realistic
estimate of available memory. Add debug logging for memory validation.

Made-with: Cursor
…ndle it

The .max(131072) cap on ctx_size caused a raw Zod union error instead
of reaching the validateBeforeLoad hook which provides an actionable
error message with suggested ctx_size. Schema now only enforces
positive integer; runtime memory validation handles the upper bound.

Made-with: Cursor
The validation was silently skipped when bare-fs stat failed (modelFileSize=0).
Now: always call validateBeforeLoad regardless, log at info level so
results are always visible, and warn on stat failures instead of silencing them.

Made-with: Cursor
The previous heuristic used 256 bytes/token for small models, but actual
llama.cpp KV cache (f16) is ~32KB/token for a 1B model. This caused the
memory validator to approve allocations that would crash Metal.

New brackets calibrated from real model architectures:
  <1GB: 48KB/tok, 1-3GB: 128KB/tok, 3-6GB: 200KB/tok,
  6-15GB: 350KB/tok, >15GB: 500KB/tok

Adds real-world regression tests (1B model + extreme ctx_size on 24GB).
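
The brackets above could be expressed as a simple lookup, sketched below. `heuristicKvBytesPerToken` is a hypothetical name; treating GB as GiB, KB as 1024 bytes, and the choice of which side each boundary falls on are all assumptions, since the commit message does not specify them.

```typescript
const KiB = 1024;
const GiB = 1024 ** 3;

// Fallback estimate of KV cache bytes per token from model file size alone.
function heuristicKvBytesPerToken(modelFileSize: number): number {
  if (modelFileSize < 1 * GiB) return 48 * KiB;   // <1GB
  if (modelFileSize < 3 * GiB) return 128 * KiB;  // 1-3GB
  if (modelFileSize < 6 * GiB) return 200 * KiB;  // 3-6GB
  if (modelFileSize <= 15 * GiB) return 350 * KiB; // 6-15GB
  return 500 * KiB;                                // >15GB
}

// Example: a file of about 0.7 GiB falls in the smallest bracket.
const smallModelEstimate = heuristicKvBytesPerToken(0.7 * GiB); // 49152 bytes
```

The smallest bracket (48KB/tok) sits above the ~32KB/token measured for a 1B model, so the fallback errs toward over-estimating memory rather than approving a load that would crash.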

Made-with: Cursor
Loads LLAMA_3_2_1B_INST_Q4_0 with ctx_size=3276000 and verifies that
MODEL_MEMORY_EXCEEDED is thrown instead of crashing the native addon.
Runs on both mobile and desktop test consumers.

Made-with: Cursor
Adds a `memoryValidation` option to qvac.config; setting it to `false`
disables the pre-load memory estimator. Validation is enabled by default.
Allows disabling on platforms where bare-os.totalmem() behavior is
unverified (e.g. mobile).

Made-with: Cursor
Inverts the boolean semantics and adds "unsafe" prefix to clearly
communicate that disabling memory validation risks OOM crashes.

Made-with: Cursor
- error-memory-default-ctx-safe: load with default ctx_size succeeds
- error-memory-moderate-ctx-safe: load with ctx_size=4096 succeeds
- error-memory-recover-with-suggested: reject extreme ctx_size, retry
  with suggestedCtxSize from the error, confirm it loads
- error-memory-load-after-rejection: reject extreme ctx_size, then
  load with default to verify SDK isn't left in a broken state

Made-with: Cursor
- Replace flat 512MB overhead with proportional model (128MB + 10% of
  model file size) — avoids over-penalizing large models on small devices
- Raise mobile available memory fraction from 50% to 65% — matches
  real-world iOS/Android behavior (apps use 60-70% before jetsam)
- Add false-positive regression tests for every model+ctx_size config
  used in the mobile test suite against 4GB/6GB/8GB/12GB devices
- Verified: LLAMA 1B, QWEN 1.7B, SmolVLM 500M pass on 4GB+;
  SALAMANDRATA 2B and AFRICAN 4B pass on 8GB+ (realistic for those models)

Made-with: Cursor
Parse model GGUF header to extract architecture params (block_count,
head_count_kv, embedding_length, head_count) and compute exact KV cache
bytes per token using the llama.cpp formula:
  2 * n_layer * n_kv_heads * head_dim * dtype_size

Falls back to file-size-based heuristic brackets when GGUF metadata
is unavailable (e.g. stat/read failure). Logs whether validation
used [exact] or [heuristic] path for debugging.

Addresses feedback to use the correct equation instead of conservative
estimates that vary by model architecture and quantization.

Made-with: Cursor
Memory validation will be implemented on the addon side where the C++
layer has access to exact KV cache quantization config (cache_type_k,
cache_type_v) and can compute precise memory requirements.

Removes: validateBeforeLoad hook, memory-estimator, GGUF metadata
reader, platform detection, ModelMemoryExceededError, config flag,
all associated unit and e2e tests.

Made-with: Cursor
2 participants