UPSTREAM PR #19288: ggml-cpu: use LUT for converting e8->f32 scales on x86 #1152
Conversation
Overview

Analysis of llama.cpp identified minimal performance impact from a single commit adding E8M0 quantization format support. Of 115,475 total functions, 9 were modified (0.008%), 3 added, and 0 removed, with 115,463 unchanged.

Power Consumption Changes:

Overall power consumption decreased by 0.010%.

Function Analysis

ggml_cpu_init (build.bin.libggml-cpu.so): Response time increased from 1,505.73ns to 1,671.51ns (+165.78ns, +11.01%); throughput time increased from 212.80ns to 262.08ns (+49.28ns, +23.15%). This one-time initialization function added a 256-iteration loop to precompute a 1KB E8M0→FP32 lookup table, enabling O(1) runtime conversions during MXFP4 dequantization. The change is intentional and beneficial, trading negligible startup cost for improved inference performance.

ggml_compute_forward_xielu (build.bin.libggml-cpu.so): Response time increased from 2,900.80ns to 3,000.37ns (+99.57ns, +3.43%); throughput time increased from 162.22ns to 234.08ns (+71.86ns, +44.30%). No source code changes were detected; the regression appears compiler-induced. This specialized activation function is rarely used in mainstream models, so the real-world impact is negligible.

apply_unary_op (build.bin.libggml-cpu.so): Response time increased from 1,876.30ns to 1,884.36ns (+8.06ns, +0.43%); throughput time increased from 900.12ns to 918.72ns (+18.60ns, +2.07%). No source changes; a minimal compiler artifact with sub-20ns impact on unary negation operations.

Additional Findings

The E8M0 lookup table optimization specifically targets x86 platforms for MXFP4 quantization support. No changes affected GPU backends (CUDA, Metal, HIP, Vulkan) or performance-critical paths (matrix operations, attention mechanisms). The modification follows established lookup table patterns and maintains backward compatibility while expanding quantization format support.

🔎 Full breakdown: Loci Inspector.
Overview

Analysis of llama.cpp identified 9 modified functions and 3 new functions among 115,475 total functions across 15 binaries. Changes target E8M0 quantization format optimization for x86 platforms through lookup table precomputation.

Power Consumption Changes:

Function Analysis

ggml_cpu_init (build.bin.libggml-cpu.so): Throughput time increased from 212.80ns to 262.08ns (+49.28ns, +23.15%); response time increased from 1505.73ns to 1671.51ns (+165.78ns, +11.01%). Added 256-entry E8M0 lookup table initialization. The one-time startup cost is negligible and justified by an expected 2-3x speedup in E8M0→FP32 conversions during quantized inference hot paths.

ggml_compute_forward_xielu (build.bin.libggml-cpu.so): Throughput time increased from 162.22ns to 234.08ns (+71.86ns, +44.30%); response time increased from 2900.80ns to 3000.37ns (+99.57ns, +3.43%). No source code changes to this function—the regression stems from indirect effects (compiler optimization differences, cache pressure from the new lookup table). XiELU is a specialized activation used in <5% of models, making the impact negligible.

apply_unary_op (build.bin.libggml-cpu.so): Throughput time increased from 900.12ns to 918.72ns (+18.60ns, +2.07%); response time increased from 1876.30ns to 1884.36ns (+8.06ns, +0.43%). Affected by type conversion infrastructure changes in which the lookup table replaces register-only bit manipulation, introducing memory latency in tight vectorized loops. This impacts all unary operations on quantized tensors but represents <1% of total inference time.

Additional Findings

Changes are isolated to the CPU backend with no GPU impact. Static analysis captures the initialization overhead but cannot measure the primary benefit: an expected 2-3x speedup in E8M0 conversions within quantized matrix multiplication kernels (70-90% of inference time). Real-world quantized inference workloads on x86 platforms should see net performance improvements despite minor regressions in initialization and unary operations. The optimization represents a strategic trade-off: a minimal one-time startup cost and small regressions in non-critical paths for substantial gains in performance-critical quantized inference hot paths.

🔎 Full breakdown: Loci Inspector.
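Both reports describe the same mechanism. A minimal sketch of the idea, assuming E8M0 is an 8-bit exponent-only format whose value 2^(e−127) maps directly onto the FP32 exponent field (identifiers below are illustrative, not the actual llama.cpp names):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: a 256-entry (1KB) E8M0 -> FP32 lookup table,
 * filled once at init time, as described for ggml_cpu_init above.
 * Names are hypothetical, not the real llama.cpp identifiers. */
static float e8m0_lut[256];

static void e8m0_lut_init(void) {
    for (int e = 0; e < 256; ++e) {
        /* Shifting the 8-bit value into bits 23..30 places it in the
         * FP32 exponent field, yielding 2^(e - 127) for normal values. */
        uint32_t bits = (uint32_t) e << 23;
        float f;
        memcpy(&f, &bits, sizeof f);  /* type-pun without UB */
        e8m0_lut[e] = f;
    }
}

/* Hot path: a single indexed load instead of shift + memcpy per element. */
static inline float e8m0_to_f32(uint8_t e) {
    return e8m0_lut[e];
}
```

With the 1KB table resident in cache, each dequantized scale costs one load, which is the O(1) conversion the analysis refers to.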
Note
Source pull request: ggml-org/llama.cpp#19288
`perf` showed the e8m0->f32 conversion function as a bottleneck. Use a LUT instead. Tested only on x86.
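A hedged sketch of the kind of per-element conversion `perf` would flag, assuming e8m0 encodes 2^(e−127) via the FP32 exponent field (the name is illustrative): this runs once per scale inside the dequantization loop, and the PR replaces it with a single lookup into a precomputed 256-entry table.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: the per-call conversion pattern a LUT replaces. */
static inline float e8m0_to_f32_compute(uint8_t e) {
    uint32_t bits = (uint32_t) e << 23;  /* e -> FP32 exponent field */
    float f;
    memcpy(&f, &bits, sizeof f);         /* type-pun without UB */
    return f;
}
```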