Analysis of version 17eb8e97 compared to baseline 52cd5469 reveals minimal performance variations across the llama.cpp codebase. The changes are primarily related to Python conversion scripts for compressed-tensors quantization support, with no direct modifications to core C++ inference functions.
Key Findings
Performance Metrics:
Highest Response Time Change: std::vector<llm_bigram_spm>::pop_back() improved by 0.10% (-0.067 ns absolute; both versions round to 67 ns)
Core Function Impact:
No core inference functions (llama_decode, llama_encode, llama_tokenize) show measurable performance changes. The observed variations occur in STL utility functions used during tokenization preprocessing, not in the primary inference pipeline. Tokens per second performance remains unaffected as no critical path functions experienced meaningful response time or throughput changes.
Power Consumption Analysis:
All binaries show negligible power consumption changes (<0.001%):
- libllama.so: -0.0009 nJ
- llama-run: -0.0012 nJ
- llama-cvector-generator: +0.0037 nJ
- llama-tts: -0.0001 nJ
Energy efficiency remains stable across all components.
Flame Graph and CFG Analysis:
The pop_back() function exhibits a simple single-frame execution profile with identical assembly code between versions. The 0.067 ns improvement represents measurement variance rather than algorithmic change: both versions execute identical instruction sequences with no structural differences in control flow.
GitHub Code Review Insights:
The PR introduces compressed-tensors quantization support in Python conversion scripts without affecting C++ runtime performance. Changes include new dequantization methods and lazy tensor operator fixes that improve model conversion robustness but don't impact inference execution.
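To make the symmetric/zero-point distinction concrete, here is a minimal sketch of the usual affine dequantization scheme that formats like these follow. The function name and signature are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def dequantize(q, scale, zero_point=None):
    """Affine dequantization: w ~= scale * (q - zero_point).

    symmetric = true  -> no zero point (zero_point is None, treated as 0)
    symmetric = false -> an explicit integer zero point is subtracted first
    """
    q = q.astype(np.float32)
    if zero_point is not None:
        q = q - zero_point
    return scale * q

q = np.array([-8, 0, 7], dtype=np.int8)
print(dequantize(q, scale=0.5))                 # symmetric: [-4.  0.  3.5]
print(dequantize(q, scale=0.5, zero_point=-8))  # with zero point: [0.  4.  7.5]
```

Symmetric quantization maps zero in the original weights exactly to quantized zero, so no offset is stored; asymmetric formats trade that simplicity for a better fit to skewed weight distributions.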
Conclusion:
The analysis reveals stable performance characteristics with variations within measurement noise. No actionable performance optimizations are required as the changes maintain inference efficiency while expanding quantization format support.
Mirrored from ggml-org/llama.cpp#17069
(alternative to #17064, cc @ngxson)
This adds support for a few formats in the `compressed-tensors` quant method:
- `pack-quantized`
  - `symmetric = true` (without zero point)
  - `symmetric = false` (with zero point)
- `int-quantized`
- `float-quantized`
- `naive-quantized`

I've also re-tested plain `fp8` with https://huggingface.co/Qwen/Qwen3-4B-FP8 to make sure I didn't break it.

I found a problem in the lazy tensors related to skipping metadata changes for binary operators, which I've fixed. Without that fix, the broadcast shift (used when unpacking) didn't produce the correct final shape.
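The broadcast-shift unpacking mentioned above can be sketched as follows. This is a hypothetical NumPy illustration of how `pack-quantized` int4 values are typically unpacked with a broadcast shift (the function name, shapes, and packing order are assumptions for illustration, not llama.cpp's actual conversion code):

```python
import numpy as np

def unpack_int4(packed: np.ndarray, bits: int = 4) -> np.ndarray:
    """Unpack int32 words, each holding 32 // bits values, into signed ints."""
    per_word = 32 // bits                                 # 8 nibbles per word
    shifts = np.arange(per_word, dtype=np.int32) * bits   # shape (8,)
    # Broadcast shift: (..., n, 1) >> (8,) -> (..., n, 8). The trailing axis
    # must then be folded into the row, so the final shape is (..., n * 8) --
    # getting this reshape wrong is exactly a wrong-final-shape bug.
    vals = (packed[..., np.newaxis] >> shifts) & ((1 << bits) - 1)
    vals = vals.reshape(*packed.shape[:-1], packed.shape[-1] * per_word)
    # Sign-extend from 4 bits: raw values >= 8 represent negatives.
    return (vals ^ (1 << (bits - 1))) - (1 << (bits - 1))

packed = np.array([[0x76543210]], dtype=np.int32)
print(unpack_int4(packed))  # nibbles 0..7, lowest nibble first
```

The key point is that after the broadcast shift the unpacked axis appears as a new trailing dimension, and the final reshape must merge it into the original row length.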