QVAC-11621 fix: KV cache memory validation to prevent OOM crashes#1642
Closed
simon-iribarren wants to merge 15 commits into
Closed
QVAC-11621 fix: KV cache memory validation to prevent OOM crashes#1642simon-iribarren wants to merge 15 commits into
simon-iribarren wants to merge 15 commits into
Conversation
Contributor
|
I think we need to consider KV quant from the start as well, not just F16, being 75% off seems too much. Specially since the push/hype towards TurboQuant (6x smaller than F16 with minimal quality loss on long contex) that is going on right now and our focus on embedded devices and mobile with more memory constraints. |
Contributor
Tier-based Approval Status |
…crashes (QVAC-11621) Add pre-load memory estimation for LLM models to prevent OOM crashes on memory-constrained devices (especially iOS Metal backend). - Harden ctx_size schema: int().min(1).max(131072) - Add validateBeforeLoad plugin hook to QvacPlugin interface - Implement memory estimator using model file size as proxy for KV cache cost (256-2048 bytes/token based on model size bracket) - LLM plugin checks estimated memory against 80% of available memory via bare-os, throws ModelMemoryExceededError with suggestedCtxSize - Integrate validation in load-model handler after config resolution - Add 20 unit tests for estimator and schema validation Made-with: Cursor
…dling Made-with: Cursor
Replace HuggingFace URL with LLAMA_3_2_1B_INST_Q4_0 registry constant. Made-with: Cursor
os.freemem() on macOS/iOS returns only truly unallocated pages which is always tiny due to aggressive file caching. Use totalMemory with a conservative fraction (70% desktop, 50% mobile) for a realistic estimate of available memory. Add debug logging for memory validation. Made-with: Cursor
…ndle it The .max(131072) cap on ctx_size caused a raw Zod union error instead of reaching the validateBeforeLoad hook which provides an actionable error message with suggested ctx_size. Schema now only enforces positive integer; runtime memory validation handles the upper bound. Made-with: Cursor
The validation was silently skipped when bare-fs stat failed (modelFileSize=0). Now: always call validateBeforeLoad regardless, log at info level so results are always visible, and warn on stat failures instead of silencing them. Made-with: Cursor
The previous heuristic used 256 bytes/token for small models, but actual llama.cpp KV cache (f16) is ~32KB/token for a 1B model. This caused the memory validator to approve allocations that would crash Metal. New brackets calibrated from real model architectures: <1GB: 48KB/tok, 1-3GB: 128KB/tok, 3-6GB: 200KB/tok, 6-15GB: 350KB/tok, >15GB: 500KB/tok Adds real-world regression tests (1B model + extreme ctx_size on 24GB). Made-with: Cursor
Loads LLAMA_3_2_1B_INST_Q4_0 with ctx_size=3276000 and verifies that MODEL_MEMORY_EXCEEDED is thrown instead of crashing the native addon. Runs on both mobile and desktop test consumers. Made-with: Cursor
Adds `memoryValidation: false` to qvac.config to disable the pre-load memory estimator. Enabled by default. Allows disabling on platforms where bare-os.totalmem() behavior is unverified (e.g. mobile). Made-with: Cursor
Inverts the boolean semantics and adds "unsafe" prefix to clearly communicate that disabling memory validation risks OOM crashes. Made-with: Cursor
- error-memory-default-ctx-safe: load with default ctx_size succeeds - error-memory-moderate-ctx-safe: load with ctx_size=4096 succeeds - error-memory-recover-with-suggested: reject extreme ctx_size, retry with suggestedCtxSize from the error, confirm it loads - error-memory-load-after-rejection: reject extreme ctx_size, then load with default to verify SDK isn't left in a broken state Made-with: Cursor
- Replace flat 512MB overhead with proportional model (128MB + 10% of model file size) — avoids over-penalizing large models on small devices - Raise mobile available memory fraction from 50% to 65% — matches real-world iOS/Android behavior (apps use 60-70% before jetsam) - Add false-positive regression tests for every model+ctx_size config used in the mobile test suite against 4GB/6GB/8GB/12GB devices - Verified: LLAMA 1B, QWEN 1.7B, SmolVLM 500M pass on 4GB+; SALAMANDRATA 2B and AFRICAN 4B pass on 8GB+ (realistic for those models) Made-with: Cursor
Parse model GGUF header to extract architecture params (block_count, head_count_kv, embedding_length, head_count) and compute exact KV cache bytes per token using the llama.cpp formula: 2 * n_layer * n_kv_heads * head_dim * dtype_size Falls back to file-size-based heuristic brackets when GGUF metadata is unavailable (e.g. stat/read failure). Logs whether validation used [exact] or [heuristic] path for debugging. Addresses feedback to use the correct equation instead of conservative estimates that vary by model architecture and quantization. Made-with: Cursor
Memory validation will be implemented on the addon side where the C++ layer has access to exact KV cache quantization config (cache_type_k, cache_type_v) and can compute precise memory requirements. Removes: validateBeforeLoad hook, memory-estimator, GGUF metadata reader, platform detection, ModelMemoryExceededError, config flag, all associated unit and e2e tests. Made-with: Cursor
This reverts commit 7304fa0.
d1e9e77 to
f59a6c8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ctx_sizeif it would OOMunsafeDisableMemoryValidationconfig flag to bypass the checkNote: This is the SDK-side implementation. A future addon-side implementation can provide even more precise calculations using the actual KV cache quantization config (
cache_type_k,cache_type_v).How it works
Exact KV cache computation from GGUF metadata:
block_count,head_count_kv,embedding_length,head_count2 * n_layer * n_kv_heads * head_dim * dtype_size(f16 default)[exact]or[heuristic]in validation outputPlugin-level validation (
validateBeforeLoadhook):QvacPlugininterface, called beforecreateModel()model_size + kv_cache + overheadagainst 80% of available memoryModelMemoryExceededError(code 52211) withsuggestedCtxSizePlatform memory detection:
bare-os.totalmem()with conservative fractions (70% desktop, 65% mobile)API Changes
Testing
ctx_sizevaluesRecreated from #1464 to move branch onto the fork (post-migration). No code changes vs. the original PR — only rebased onto latest
main.