UPSTREAM PR #19288: ggml-cpu: use LUT for converting e8->f32 scales on x86 #1152
Conversation
Overview

Analysis of llama.cpp identified minimal performance impact from a single commit adding E8M0 quantization format support. Of 115,475 total functions, 9 were modified (0.008%), 3 added, and 0 removed, with 115,463 unchanged.

Power Consumption Changes:

Overall power consumption decreased by 0.010%.

Function Analysis

ggml_cpu_init (build.bin.libggml-cpu.so): Response time increased from 1,505.73ns to 1,671.51ns (+165.78ns, +11.01%); throughput time increased from 212.80ns to 262.08ns (+49.28ns, +23.15%). This one-time initialization function added a 256-iteration loop to precompute a 1KB E8M0→FP32 lookup table, enabling O(1) runtime conversions during MXFP4 dequantization. The change is intentional and beneficial, trading negligible startup cost for improved inference performance.

ggml_compute_forward_xielu (build.bin.libggml-cpu.so): Response time increased from 2,900.80ns to 3,000.37ns (+99.57ns, +3.43%); throughput time increased from 162.22ns to 234.08ns (+71.86ns, +44.30%). No source code changes were detected; the regression appears compiler-induced. This specialized activation function is rarely used in mainstream models, so the real-world impact is negligible.

apply_unary_op (build.bin.libggml-cpu.so): Response time increased from 1,876.30ns to 1,884.36ns (+8.06ns, +0.43%); throughput time increased from 900.12ns to 918.72ns (+18.60ns, +2.07%). No source changes; a minimal compiler artifact with sub-20ns impact on unary negation operations.

Additional Findings

The E8M0 lookup table optimization specifically targets x86 platforms for MXFP4 quantization support. No changes affected GPU backends (CUDA, Metal, HIP, Vulkan) or performance-critical paths (matrix operations, attention mechanisms). The modification follows established lookup table patterns and maintains backward compatibility while expanding quantization format support.

🔎 Full breakdown: Loci Inspector.
Overview

Analysis of llama.cpp identified 9 modified functions and 3 new functions among 115,475 total functions across 15 binaries. Changes target E8M0 quantization format optimization for x86 platforms through lookup table precomputation.

Power Consumption Changes:

Function Analysis

ggml_cpu_init (build.bin.libggml-cpu.so): Throughput time increased from 212.80ns to 262.08ns (+49.28ns, +23.15%); response time increased from 1505.73ns to 1671.51ns (+165.78ns, +11.01%). Added 256-entry E8M0 lookup table initialization. The one-time startup cost is negligible and justified by an expected 2-3x speedup in E8M0→FP32 conversions during quantized inference hot paths.

ggml_compute_forward_xielu (build.bin.libggml-cpu.so): Throughput time increased from 162.22ns to 234.08ns (+71.86ns, +44.30%); response time increased from 2900.80ns to 3000.37ns (+99.57ns, +3.43%). No source code changes to this function—the regression stems from indirect effects (compiler optimization differences, cache pressure from the new lookup table). XiELU is a specialized activation used in <5% of models, making the impact negligible.

apply_unary_op (build.bin.libggml-cpu.so): Throughput time increased from 900.12ns to 918.72ns (+18.60ns, +2.07%); response time increased from 1876.30ns to 1884.36ns (+8.06ns, +0.43%). Affected by type conversion infrastructure changes in which the lookup table replaces register-only bit manipulation, introducing memory latency in tight vectorized loops. This impacts all unary operations on quantized tensors but represents <1% of total inference time.

Additional Findings

Changes are isolated to the CPU backend with no GPU impact. Static analysis captures the initialization overhead but cannot measure the primary benefit: an expected 2-3x speedup in E8M0 conversions within quantized matrix multiplication kernels (70-90% of inference time). Real-world quantized inference workloads on x86 platforms should see net performance improvements despite minor regressions in initialization and unary operations. The optimization represents a strategic trade-off: a minimal one-time startup cost and small regressions in non-critical paths for substantial gains in performance-critical quantized inference hot paths.

🔎 Full breakdown: Loci Inspector.
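Both reports describe the same mechanism. A minimal sketch of the idea, assuming E8M0 is an 8-bit exponent-only format whose value 2^(e−127) maps directly onto the FP32 exponent field (identifiers below are illustrative, not the actual llama.cpp names):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: a 256-entry (1KB) E8M0 -> FP32 lookup table,
 * filled once at init time, as described for ggml_cpu_init above.
 * Names are hypothetical, not the real llama.cpp identifiers. */
static float e8m0_lut[256];

static void e8m0_lut_init(void) {
    for (int e = 0; e < 256; ++e) {
        /* Shifting the 8-bit value into bits 23..30 places it in the
         * FP32 exponent field, yielding 2^(e - 127) for normal values. */
        uint32_t bits = (uint32_t) e << 23;
        float f;
        memcpy(&f, &bits, sizeof f);  /* type-pun without UB */
        e8m0_lut[e] = f;
    }
}

/* Hot path: a single indexed load instead of shift + memcpy per element. */
static inline float e8m0_to_f32(uint8_t e) {
    return e8m0_lut[e];
}
```

With the 1KB table resident in cache, each dequantized scale costs one load, which is the O(1) conversion the analysis refers to.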
Note
Source pull request: ggml-org/llama.cpp#19288
`perf` showed the e8m0->f32 conversion function as a bottleneck. Use a LUT instead. Tested only on x86.
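A hedged sketch of the kind of per-element conversion `perf` would flag, assuming e8m0 encodes 2^(e−127) via the FP32 exponent field (the name is illustrative): this runs once per scale inside the dequantization loop, and the PR replaces it with a single lookup into a precomputed 256-entry table.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: the per-call conversion pattern a LUT replaces. */
static inline float e8m0_to_f32_compute(uint8_t e) {
    uint32_t bits = (uint32_t) e << 23;  /* e -> FP32 exponent field */
    float f;
    memcpy(&f, &bits, sizeof f);         /* type-pun without UB */
    return f;
}
```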