Analysis of version 17eb8e97 compared to baseline 52cd5469 reveals minimal performance variations across the llama.cpp codebase. The changes are primarily related to Python conversion scripts for compressed-tensors quantization support, with no direct modifications to core C++ inference functions.
Key Findings
Performance Metrics:
Highest Response Time Change: std::vector<llm_bigram_spm>::pop_back() improved by 0.10% (-0.067 ns absolute; both versions round to 67 ns)
Core Function Impact:
No core inference functions (llama_decode, llama_encode, llama_tokenize) show measurable performance changes. The observed variations occur in STL utility functions used during tokenization preprocessing, not in the primary inference pipeline. Tokens per second performance remains unaffected as no critical path functions experienced meaningful response time or throughput changes.
Power Consumption Analysis:
All binaries show negligible power consumption changes (<0.001%):
- libllama.so: -0.0009 nJ
- llama-run: -0.0012 nJ
- llama-cvector-generator: +0.0037 nJ
- llama-tts: -0.0001 nJ
Energy efficiency remains stable across all components.
Flame Graph and CFG Analysis:
The pop_back() function exhibits a simple single-frame execution profile with identical assembly code between versions. The 0.067 ns improvement represents measurement variance rather than algorithmic change: both versions execute identical instruction sequences with no structural differences in control flow.
GitHub Code Review Insights:
The PR introduces compressed-tensors quantization support in Python conversion scripts without affecting C++ runtime performance. Changes include new dequantization methods and lazy tensor operator fixes that improve model conversion robustness but don't impact inference execution.
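To make the symmetric/zero-point distinction concrete, here is a minimal sketch of the usual affine dequantization scheme that formats like these follow. The function name and signature are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def dequantize(q, scale, zero_point=None):
    """Affine dequantization: w ~= scale * (q - zero_point).

    symmetric = true  -> no zero point (zero_point is None, treated as 0)
    symmetric = false -> an explicit integer zero point is subtracted first
    """
    q = q.astype(np.float32)
    if zero_point is not None:
        q = q - zero_point
    return scale * q

q = np.array([-8, 0, 7], dtype=np.int8)
print(dequantize(q, scale=0.5))                 # symmetric: [-4.  0.  3.5]
print(dequantize(q, scale=0.5, zero_point=-8))  # with zero point: [0.  4.  7.5]
```

Symmetric quantization maps zero in the original weights exactly to quantized zero, so no offset is stored; asymmetric formats trade that simplicity for a better fit to skewed weight distributions.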
Conclusion:
The analysis reveals stable performance characteristics with variations within measurement noise. No actionable performance optimizations are required as the changes maintain inference efficiency while expanding quantization format support.
Mirrored from ggml-org/llama.cpp#17069
(alternative to #17064, cc @ngxson)
This adds support for a few formats in the `compressed-tensors` quant method:
- `pack-quantized`
  - `symmetric = true` (without zero point)
  - `symmetric = false` (with zero point)
- `int-quantized`
- `float-quantized`
- `naive-quantized`

I've also re-tested plain `fp8` with https://huggingface.co/Qwen/Qwen3-4B-FP8 to make sure I didn't break it.

I found a problem in the lazy tensors related to skipping metadata changes for binary operators, which I've fixed. Without that fix, the broadcast shift (used when unpacking) didn't produce the correct final shape.
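The broadcast-shift unpacking mentioned above can be sketched as follows. This is a hypothetical NumPy illustration of how `pack-quantized` int4 values are typically unpacked with a broadcast shift (the function name, shapes, and packing order are assumptions for illustration, not llama.cpp's actual conversion code):

```python
import numpy as np

def unpack_int4(packed: np.ndarray, bits: int = 4) -> np.ndarray:
    """Unpack int32 words, each holding 32 // bits values, into signed ints."""
    per_word = 32 // bits                                 # 8 nibbles per word
    shifts = np.arange(per_word, dtype=np.int32) * bits   # shape (8,)
    # Broadcast shift: (..., n, 1) >> (8,) -> (..., n, 8). The trailing axis
    # must then be folded into the row, so the final shape is (..., n * 8) --
    # getting this reshape wrong is exactly a wrong-final-shape bug.
    vals = (packed[..., np.newaxis] >> shifts) & ((1 << bits) - 1)
    vals = vals.reshape(*packed.shape[:-1], packed.shape[-1] * per_word)
    # Sign-extend from 4 bits: raw values >= 8 represent negatives.
    return (vals ^ (1 << (bits - 1))) - (1 << (bits - 1))

packed = np.array([[0x76543210]], dtype=np.int32)
print(unpack_int4(packed))  # nibbles 0..7, lowest nibble first
```

The key point is that after the broadcast shift the unpacked axis appears as a new trailing dimension, and the final reshape must merge it into the original row length.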