
UPSTREAM PR #19770: quantize : fail-early on missing imatrix; refactor + optimize #1208

Open
loci-dev wants to merge 1 commit into main from loci/pr-19770-llama-quant-refactor-2

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19770

Currently, if a quantization requires an importance matrix and one isn't provided, the program doesn't discover this until it reaches the offending tensor during the main quantization loop. Depending on model size and target type, this can mean wasting anywhere from 5 minutes to 12 hours before the process aborts, leaving the user with a non-functional partial GGUF.

This PR adds a preliminary pass over all tensors that determines each tensor's target quantization type before the main quantization loop. This lets us check imatrix requirements upfront rather than discovering them mid-quantization. The old ftype-based imatrix guard in quantize.cpp is removed.
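The two-pass structure can be sketched as follows (a minimal illustration; `resolve_type`, `quantize_model`, and the type names are placeholders, not the actual llama.cpp internals):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

struct tensor { std::string name; };
enum class qtype { Q8_0, IQ2_XXS };

// Placeholder type resolution: pretend ffn tensors get an i-quant target.
qtype resolve_type(const tensor & t) {
    return t.name.find("ffn") != std::string::npos ? qtype::IQ2_XXS : qtype::Q8_0;
}

void quantize_model(const std::vector<tensor> & tensors, bool have_imatrix) {
    // pass 1: cheap preliminary scan — fail before any expensive work is done
    std::vector<qtype> types;
    types.reserve(tensors.size());
    for (const auto & t : tensors) {
        qtype dst = resolve_type(t);
        if (dst == qtype::IQ2_XXS && !have_imatrix) {
            throw std::runtime_error("tensor '" + t.name + "' requires an importance matrix");
        }
        types.push_back(dst);
    }
    // pass 2: the main quantization loop reuses the resolved types
    for (size_t i = 0; i < tensors.size(); ++i) {
        (void) types[i]; // ... quantize tensors[i] to types[i] ...
    }
}
```

The key point is that pass 1 touches only metadata, so an error surfaces in milliseconds instead of hours into the run.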

Along the way, I refactored much of src/llama-quant.cpp to be more organized and efficient.

Fail-early for missing imatrix

If an importance matrix is required but missing, quantization will now fail immediately with an error identifying the offending tensor and its target type:

[Screenshot (2026-02-26): the new error message naming the offending tensor and its target type]

tensor_requires_imatrix (renamed from tensor_type_requires_imatrix) now uses a switch on dst_type instead of a boolean expression, and correctly exempts per_layer_token_embd.weight in addition to token_embd.weight.

Performance optimizations

MoE quantization with expert-parallel threading: the old code launched nthread workers n_experts times per tensor. The new llama_tensor_quantize function detects when there are enough experts to saturate all threads; in that case it launches the threads once, and each thread claims the next unprocessed expert as soon as it finishes its current one (work-sharing).
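The work-sharing scheme amounts to a shared atomic counter that threads draw from; a minimal sketch (illustrative only — the real llama_tensor_quantize also handles chunking within a tensor):

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Launch the workers once; each thread repeatedly claims the next
// unprocessed expert index from a shared atomic counter.
void quantize_experts(int n_expert, int nthread, const std::function<void(int)> & quantize_one) {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    workers.reserve(nthread);
    for (int t = 0; t < nthread; ++t) {
        workers.emplace_back([&] {
            for (int e = next.fetch_add(1); e < n_expert; e = next.fetch_add(1)) {
                quantize_one(e); // quantize this expert's slice of the tensor
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```

Compared with launching nthread threads per expert, thread creation overhead drops from O(n_experts × nthread) to O(nthread), and no thread idles while unclaimed experts remain.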

Combined with pre-allocated work buffers (sized once from a preliminary scan of all tensor dimensions rather than being resized on every tensor), this gives a ~14% wall-clock speedup on a pure Q8_0 quantization of Qwen3.5-122B-A10B (232 GiB MoE): 12m37s -> 10m51s.
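The pre-allocation side reduces to one scan over the tensor shapes before any buffer is touched; a hedged sketch (the helper name is invented for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scan all tensor shapes once and return the largest element count, so the
// work buffers can be sized a single time and reused for every tensor
// instead of being resized per tensor.
size_t max_tensor_elements(const std::vector<std::vector<int64_t>> & shapes) {
    size_t max_n = 0;
    for (const auto & shape : shapes) {
        size_t n = 1;
        for (int64_t d : shape) {
            n *= (size_t) d;
        }
        if (n > max_n) {
            max_n = n;
        }
    }
    return max_n;
}
```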

Speedup details

Hardware:

  • Ryzen 7 7700X (8c/16t, using 16 threads)
  • 128GB DDR5 @ 4800 MT/s

master @ 3769fe6eb:

llama_model_quantize_impl: model size  = 232985.51 MiB (16.01 BPW)
llama_model_quantize_impl: quant size  = 123845.04 MiB (8.51 BPW)

main: quantize time = 751904.92 ms
main:    total time = 751904.92 ms
[ perf record: Woken up 3412 times to write data ]
[ perf record: Captured and wrote 862.253 MB perf.master.3769fe6eb.data (4482571 samples) ]

real    12m37.274s
user    12m30.772s
sys     3m3.481s

this PR @ ba4ed7968:

llama_model_quantize_impl: model size  = 232985.51 MiB (16.01 BPW)
llama_model_quantize_impl: quant size  = 123845.04 MiB (8.51 BPW)

main: quantize time = 646176.59 ms
main:    total time = 646176.59 ms
[ perf record: Woken up 2446 times to write data ]
[ perf record: Captured and wrote 616.245 MB perf.llama-quant-refactor-2.ba4ed7968.data (3699558 samples) ]

real    10m51.131s
user    12m36.011s
sys     2m48.953s

Refactoring

Extracted functions to reduce the size of llama_model_quantize_impl and make the logic reusable across the preliminary and main loops:

  • tensor_allows_quantization: all the "should we quantize this tensor?" checks (norm tensors, RWKV weights, conv1d, positional embeddings, ...) previously inlined in the main loop
  • llama_tensor_get_type / llama_tensor_get_type_impl: split the type resolution into a wrapper (handles overrides, fallbacks, incompatible shapes) and the core mixture/architecture logic
  • llama_tensor_quantize: per-tensor quantization extracted from the main loop, including chunk size calculation and the expert slicing loop
  • llama_ftype_get_default_type: the ftype-to-ggml_type switch, extracted and organized by category
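The wrapper/impl split can be illustrated roughly like this (the override handling and the Q6_K shape fallback below are simplified assumptions, not the actual llama-quant.cpp logic):

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class quant_type { Q8_0, Q6_K, F16 };

// Core per-architecture / mixture defaults would live here (placeholder).
quant_type get_type_impl(const std::string & /*name*/, quant_type default_type) {
    return default_type;
}

// Wrapper: user overrides win, then the impl's choice, then a shape
// fallback when the tensor row cannot satisfy the type's block size.
quant_type get_type(const std::string & name, quant_type default_type,
                    const std::map<std::string, quant_type> & overrides,
                    int64_t n_cols) {
    auto it = overrides.find(name);
    if (it != overrides.end()) {
        return it->second;
    }
    quant_type t = get_type_impl(name, default_type);
    if (t == quant_type::Q6_K && n_cols % 256 != 0) {
        return quant_type::F16; // incompatible shape: fall back with a warning
    }
    return t;
}
```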

Other changes

  • Renamed quantize_state_impl to quantization_state_impl; managed as unique_ptr; regex patterns compiled once in the constructor instead of per-tensor
  • Renamed tensors to weights, quantize to do_quantize
  • Incompatible tensor shape fallback now falls back to F16 with a warning instead of aborting
  • Removed dead MXFP4 sanity check code
  • Default quantization type changed from Q5_1 to Q8_0 (if an external program does not specify an ftype, we should default to a rock-solid type like Q8_0; Q5_1 is iffier; in 99.9% of cases this has no effect)
  • Logging cleanup

@loci-review

loci-review bot commented Feb 28, 2026

Overview

This analysis evaluates 44 commits refactoring quantization logic in llama.cpp, covering 112,827 total functions (185 modified, 157 new, 7 removed, 112,478 unchanged) across 15 binaries. Power consumption changes are negligible across all binaries: build.bin.libllama.so (+0.5%), build.bin.llama-cvector-generator (+0.049%), build.bin.llama-tts (-0.027%), build.bin.llama-bench (+0.069%), build.bin.llama-quantize (-0.1%), build.bin.llama-tokenize (-0.072%), build.bin.libmtmd.so (0.0%), build.bin.libggml-base.so (0.0%), build.bin.libggml-cpu.so (0.0%), build.bin.libggml.so (0.0%), build.bin.llama-gemma3-cli (0.0%), build.bin.llama-gguf-split (0.0%), build.bin.llama-llava-cli (0.0%), build.bin.llama-minicpmv-cli (0.0%), build.bin.llama-qwen2vl-cli (0.0%). The refactor successfully optimizes quantization preprocessing while preserving inference performance.

Function Analysis

Most performance variations occur in C++ STL functions without source code changes, indicating compiler optimization differences rather than algorithmic regressions. std::_Rb_tree::end() (build.bin.llama-cvector-generator) shows response time increase from 79.8ns to 263.1ns (+230%), throughput time from 59.8ns to 243.1ns (+307%). std::vector<jinja::token>::begin() (build.bin.llama-cvector-generator) increases from 84.0ns to 264.8ns response time (+215%), 62.5ns to 243.3ns throughput (+289%). __gnu_cxx::__ops::__val_comp_iter (build.bin.libllama.so) in typical sampler increases from 119.6ns to 288.1ns response time (+141%), 77.5ns to 246.0ns throughput (+218%), contributing ~80ms per token. std::_Bit_const_iterator::operator*() (build.bin.llama-tts) increases from 180.5ns to 362.4ns response time (+101%), 137.5ns to 319.4ns throughput (+132%), affecting TTS batch processing in tight loops.

Architectural optimizations offset individual regressions: commit 6a8d084 moved regex compilation from per-tensor loop to initialization (1000x reduction), commits 6b85b49 and ba4ed79 implemented pre-allocation strategies reducing allocator overhead. std::vector<ggml_type>::begin() (build.bin.llama-tts) improved from 264.8ns to 84.0ns response time (-68%), 243.3ns to 62.5ns throughput (-74%).

Additional Findings

Zero power change in GGML backend libraries (libggml-base.so, libggml-cpu.so, libggml.so) confirms matrix operations and attention mechanisms—which dominate 70-90% of inference time—remain unaffected. The refactor targets quantization preprocessing (one-time cost) rather than runtime inference. STL regressions occur in initialization paths (model loading, argument parsing, template processing) that do not propagate to inference hot paths. The consistent pattern of STL performance variations without code changes suggests build configuration differences (compiler versions, optimization flags, debug assertions) between base and target environments.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 13648e6 to 1d064d0 Compare March 3, 2026 02:17
@loci-dev loci-dev force-pushed the loci/pr-19770-llama-quant-refactor-2 branch from 1a6345e to decff8b Compare March 4, 2026 02:17
@loci-review

loci-review bot commented Mar 4, 2026

Overview

Analysis of commit decff8b ("quantize: imatrix-fail early + code cleanup") across 112,848 functions (31 modified, 97 new, 6 removed, 112,714 unchanged) reveals minor positive impact focused on quantization pipeline optimization. Power consumption changes are negligible across all binaries: build.bin.libllama.so (+0.26%, +666 nJ), build.bin.llama-quantize (-0.07%, -29 nJ), with build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-qwen2vl-cli, build.bin.libggml-base.so, build.bin.libggml-cpu.so, and build.bin.libggml.so showing no measurable change (0.00%).

Function Analysis

llama_model_quantize_impl (libllama.so) improved significantly: response time decreased 10,506 ns (-0.81%), throughput time decreased 2,441 ns (-26.85%). Changes introduced preliminary validation loop for early imatrix failure detection and metadata caching to eliminate redundant string operations, directly explaining the performance gains.

quantize_state_impl constructor (libllama.so) shows intentional initialization overhead: response time increased 48,916 ns (+49,106%), throughput time increased 195 ns (+196%). This reflects moving regex pattern compilation from per-tensor loop to one-time initialization—an amortization strategy that eliminates 1,000+ compilations in the main processing loop, yielding net positive performance.
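The amortization pattern described here is simply hoisting compilation out of the per-tensor loop; a minimal sketch under invented names (the real patterns and state type differ):

```cpp
#include <regex>
#include <string>
#include <vector>

// Compile the tensor-name patterns once at construction, then reuse the
// compiled std::regex objects for every tensor name tested.
struct quant_state {
    std::vector<std::regex> patterns;

    explicit quant_state(const std::vector<std::string> & pats) {
        patterns.reserve(pats.size());
        for (const auto & p : pats) {
            patterns.emplace_back(p); // compiled once, not per tensor
        }
    }

    bool matches(const std::string & name) const {
        for (const auto & re : patterns) {
            if (std::regex_search(name, re)) {
                return true;
            }
        }
        return false;
    }
};
```

With ~1,000 tensors, this trades a one-time constructor cost for eliminating a regex compilation on every loop iteration, which is the roughly 49 μs vs. 50+ ms trade-off the review describes.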

build_delta_net_autoregressive (libllama.so) improved despite no source changes: response time decreased 86 ns (-2.61%), throughput time decreased 86 ns (-4.73%), likely from compiler optimization differences.

std::unordered_map::operator[] (libllama.so) shows acceptable trade-off: throughput time increased 63 ns (+45.71%) due to increased hash map usage frequency, but replaces expensive O(n×m) string matching with O(1) lookups, improving overall algorithmic complexity.

Other analyzed functions (STL utilities, regex internals, vector operations) showed mixed performance variations within normal compiler optimization variance, with negligible practical impact on quantization or inference workflows.

Additional Findings

No GPU backend modifications were made—all changes isolated to CPU-side quantization logic. Inference hot paths (matrix operations, attention mechanisms, KV cache management) remain completely unchanged, maintaining GPU and CPU inference performance. The refactoring demonstrates proper optimization principles: fail-fast validation prevents wasted computation, metadata caching eliminates redundant operations, and regex amortization trades 49 μs initialization cost for 50+ ms savings in typical quantization workflows.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 551dfb5 to 55a969e Compare March 11, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from e3ea641 to efc22ce Compare March 19, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 945fa3a to 0e8e1d6 Compare March 20, 2026 02:17
