Skip to content

UPSTREAM PR #19288: ggml-cpu: use LUT for converting e8->f32 scales on x86#1152

Open
loci-dev wants to merge 2 commits intomainfrom
loci/pr-19288-mxfp4-cpu-scale
Open

UPSTREAM PR #19288: ggml-cpu: use LUT for converting e8->f32 scales on x86#1152
loci-dev wants to merge 2 commits intomainfrom
loci/pr-19288-mxfp4-cpu-scale

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Feb 3, 2026

Note

Source pull request: ggml-org/llama.cpp#19288

perf showed the e8m0->f32 function as a bottleneck. Use a LUT instead. Tested only on x86

Model Test t/s topk-cuda-refactor t/s mxfp4-cpu-scale Speedup
gpt-oss 20B MXFP4 MoE pp1024 237.74 257.53 1.08
gpt-oss 20B MXFP4 MoE pp2048 228.16 246.40 1.08
gpt-oss 20B MXFP4 MoE pp4096 211.92 227.59 1.07
gpt-oss 20B MXFP4 MoE pp8192 185.53 197.05 1.06

@loci-review
Copy link

loci-review bot commented Feb 3, 2026

Overview

Analysis of llama.cpp identified minimal performance impact from a single commit adding E8M0 quantization format support. Of 115,475 total functions, 9 were modified (0.008%), 3 added, and 0 removed, with 115,463 unchanged.

Power Consumption Changes:

  • build.bin.libggml-cpu.so: -0.109% (only measurable change)
  • build.bin.libllama.so, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-bench, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-base.so, build.bin.libggml.so: 0.0%

Overall power consumption decreased by 0.010%.

Function Analysis

ggml_cpu_init (build.bin.libggml-cpu.so): Response time increased from 1,505.73ns to 1,671.51ns (+165.78ns, +11.01%); throughput time increased from 212.80ns to 262.08ns (+49.28ns, +23.15%). This one-time initialization function added a 256-iteration loop to precompute E8M0→FP32 lookup table (1KB), enabling O(1) runtime conversions during MXFP4 dequantization. The change is intentional and beneficial, trading negligible startup cost for improved inference performance.

ggml_compute_forward_xielu (build.bin.libggml-cpu.so): Response time increased from 2,900.80ns to 3,000.37ns (+99.57ns, +3.43%); throughput time increased from 162.22ns to 234.08ns (+71.86ns, +44.30%). No source code changes detected; regression appears compiler-induced. This specialized activation function is rarely used in mainstream models, resulting in negligible real-world impact.

apply_unary_op (build.bin.libggml-cpu.so): Response time increased from 1,876.30ns to 1,884.36ns (+8.06ns, +0.43%); throughput time increased from 900.12ns to 918.72ns (+18.60ns, +2.07%). No source changes; minimal compiler artifact with sub-20ns impact on unary negation operations.

Additional Findings

The E8M0 lookup table optimization specifically targets x86 platforms for MXFP4 quantization support. No changes affected GPU backends (CUDA, Metal, HIP, Vulkan) or performance-critical paths (matrix operations, attention mechanisms). The modification follows established lookup table patterns and maintains backward compatibility while expanding quantization format support.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-review
Copy link

loci-review bot commented Feb 3, 2026

Overview

Analysis of llama.cpp identified 9 modified functions and 3 new functions among 115,475 total functions across 15 binaries. Changes target E8M0 quantization format optimization for x86 platforms through lookup table precomputation.

Power Consumption Changes:

  • build.bin.libggml-cpu.so: -0.109% (only measurable change)
  • All other binaries (llama-tts, libmtmd.so, libllama.so, llama-cvector-generator, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-tokenize, llama-qwen2vl-cli, llama-bench, libggml.so, libggml-base.so): 0.0%

Function Analysis

ggml_cpu_init (build.bin.libggml-cpu.so): Throughput time increased from 212.80ns to 262.08ns (+49.28ns, +23.15%); response time increased from 1505.73ns to 1671.51ns (+165.78ns, +11.01%). Added 256-entry E8M0 lookup table initialization. One-time startup cost is negligible and justified by expected 2-3x speedup in E8M0→FP32 conversions during quantized inference hot paths.

ggml_compute_forward_xielu (build.bin.libggml-cpu.so): Throughput time increased from 162.22ns to 234.08ns (+71.86ns, +44.30%); response time increased from 2900.80ns to 3000.37ns (+99.57ns, +3.43%). No source code changes to this function—regression stems from indirect effects (compiler optimization differences, cache pressure from new lookup table). XiELU is a specialized activation used in <5% of models, making impact negligible.

apply_unary_op (build.bin.libggml-cpu.so): Throughput time increased from 900.12ns to 918.72ns (+18.60ns, +2.07%); response time increased from 1876.30ns to 1884.36ns (+8.06ns, +0.43%). Affected by type conversion infrastructure changes where lookup table replaces register-only bit manipulation, introducing memory latency in tight vectorized loops. Impacts all unary operations on quantized tensors but represents <1% of total inference time.

Additional Findings

Changes are isolated to CPU backend with no GPU impact. Static analysis captures initialization overhead but cannot measure the primary benefit: expected 2-3x speedup in E8M0 conversions within quantized matrix multiplication kernels (70-90% of inference time). Real-world quantized inference workloads on x86 platforms should see net performance improvements despite minor regressions in initialization and unary operations. The optimization represents a strategic trade-off: minimal one-time startup cost and small regression in non-critical paths for substantial gains in performance-critical quantized inference hot paths.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from a92fe2a to 6495042 Compare February 27, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 61b4303 to ef246cc Compare March 1, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 0db6c47 to 8019888 Compare March 8, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6fa8e23 to f2637dc Compare March 15, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants