UPSTREAM PR #19770: quantize : fail-early on missing imatrix; refactor + optimize#1208
UPSTREAM PR #19770: quantize : fail-early on missing imatrix; refactor + optimize#1208
Conversation
OverviewThis analysis evaluates 44 commits refactoring quantization logic in llama.cpp, covering 112,827 total functions (185 modified, 157 new, 7 removed, 112,478 unchanged) across 15 binaries. Power consumption changes are negligible across all binaries: build.bin.libllama.so (+0.5%), build.bin.llama-cvector-generator (+0.049%), build.bin.llama-tts (-0.027%), build.bin.llama-bench (+0.069%), build.bin.llama-quantize (-0.1%), build.bin.llama-tokenize (-0.072%), build.bin.libmtmd.so (0.0%), build.bin.libggml-base.so (0.0%), build.bin.libggml-cpu.so (0.0%), build.bin.libggml.so (0.0%), build.bin.llama-gemma3-cli (0.0%), build.bin.llama-gguf-split (0.0%), build.bin.llama-llava-cli (0.0%), build.bin.llama-minicpmv-cli (0.0%), build.bin.llama-qwen2vl-cli (0.0%). The refactor successfully optimizes quantization preprocessing while preserving inference performance. Function AnalysisMost performance variations occur in C++ STL functions without source code changes, indicating compiler optimization differences rather than algorithmic regressions. Architectural optimizations offset individual regressions: commit 6a8d084 moved regex compilation from per-tensor loop to initialization (1000x reduction), commits 6b85b49 and ba4ed79 implemented pre-allocation strategies reducing allocator overhead. Additional FindingsZero power change in GGML backend libraries (libggml-base.so, libggml-cpu.so, libggml.so) confirms matrix operations and attention mechanisms—which dominate 70-90% of inference time—remain unaffected. The refactor targets quantization preprocessing (one-time cost) rather than runtime inference. STL regressions occur in initialization paths (model loading, argument parsing, template processing) that do not propagate to inference hot paths. The consistent pattern of STL performance variations without code changes suggests build configuration differences (compiler versions, optimization flags, debug assertions) between base and target environments. 🔎 Full breakdown: Loci Inspector |
13648e6 to
1d064d0
Compare
1a6345e to
decff8b
Compare
OverviewAnalysis of commit Function Analysisllama_model_quantize_impl (libllama.so) improved significantly: response time decreased 10,506 ns (-0.81%), throughput time decreased 2,441 ns (-26.85%). Changes introduced preliminary validation loop for early imatrix failure detection and metadata caching to eliminate redundant string operations, directly explaining the performance gains. quantize_state_impl constructor (libllama.so) shows intentional initialization overhead: response time increased 48,916 ns (+49,106%), throughput time increased 195 ns (+196%). This reflects moving regex pattern compilation from per-tensor loop to one-time initialization—an amortization strategy that eliminates 1,000+ compilations in the main processing loop, yielding net positive performance. build_delta_net_autoregressive (libllama.so) improved despite no source changes: response time decreased 86 ns (-2.61%), throughput time decreased 86 ns (-4.73%), likely from compiler optimization differences. std::unordered_map::operator[] (libllama.so) shows acceptable trade-off: throughput time increased 63 ns (+45.71%) due to increased hash map usage frequency, but replaces expensive O(n×m) string matching with O(1) lookups, improving overall algorithmic complexity. Other analyzed functions (STL utilities, regex internals, vector operations) showed mixed performance variations within normal compiler optimization variance, with negligible practical impact on quantization or inference workflows. Additional FindingsNo GPU backend modifications were made—all changes isolated to CPU-side quantization logic. Inference hot paths (matrix operations, attention mechanisms, KV cache management) remain completely unchanged, maintaining GPU and CPU inference performance. The refactoring demonstrates proper optimization principles: fail-fast validation prevents wasted computation, metadata caching eliminates redundant operations, and regex amortization trades 49 μs initialization cost for 50+ ms savings in typical quantization workflows. 🔎 Full breakdown: Loci Inspector |
551dfb5 to
55a969e
Compare
5ac00d6 to
998dd7a
Compare
e3ea641 to
efc22ce
Compare
945fa3a to
0e8e1d6
Compare
Note
Source pull request: ggml-org/llama.cpp#19770
Currently, if a quantization requires an importance matrix and one isn't provided, the program doesn't discover this until it reaches the offending tensor during the main quantization loop. Depending on model size and target type, this can mean wasting anywhere from 5 minutes to 12 hours before the process aborts, leaving the user with a non-functional partial GGUF.
This PR adds a preliminary pass over all tensors that determines each tensor's target quantization type before the main quantization loop. This lets us check imatrix requirements upfront rather than discovering them mid-quantization. The old ftype-based imatrix guard in
quantize.cppis removed.Along the way, I refactored much of
src/llama-quant.cppto be more organized and efficient.Fail-early for missing imatrix
If an importance matrix is required but missing, quantization will now fail immediately with an error identifying the offending tensor and its target type:
tensor_requires_imatrix(renamed fromtensor_type_requires_imatrix) now uses aswitchondst_typeinstead of a boolean expression, and correctly exemptsper_layer_token_embd.weightin addition totoken_embd.weight.Performance optimizations
MoE quantization with expert-parallel threading: the old code would launch
nthreadworkersn_expertstimes per tensor. The newllama_tensor_quantizefunction detects when there are enough experts to saturate all threads and instead launches threads once, and each thread pulls in experts as they become free (work-sharing).Combined with pre-allocated work buffers (sized once from a preliminary scan of all tensor dimensions rather than being resized on every tensor), this gives a ~14% wall-clock speedup on a pure Q8_0 quantization of Qwen3.5-122B-A10B (232 GiB MoE): 12m37s -> 10m51s.
Speedup details
Hardware:
master @
3769fe6eb:this PR @
ba4ed7968:Refactoring
Extracted functions to reduce the size of
llama_model_quantize_impland make the logic reusable across the preliminary and main loops:tensor_allows_quantization: all the "should we quantize this tensor?" checks (norm tensors, RWKV weights, conv1d, positional embeddings, ...) previously inlined in the main loopllama_tensor_get_type/llama_tensor_get_type_impl: split the type resolution into a wrapper (handles overrides, fallbacks, incompatible shapes) and the core mixture/architecture logicllama_tensor_quantize: per-tensor quantization extracted from the main loop, including chunk size calculation and the expert slicing loopllama_ftype_get_default_type: the ftype-to-ggml_type switch, extracted and organized by categoryOther changes
quantize_state_impltoquantization_state_impl; managed asunique_ptr; regex patterns compiled once in the constructor instead of per-tensortensorstoweights,quantizetodo_quantizeftype, we should default to a rock-solid type like Q8_0; Q5_1 is more iffy. in 99.9% of cases this should have no effect)