UPSTREAM PR #18096: ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) #589
Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #589

Overview
PR #589 introduces ARM64-optimized Q8_0 quantized matrix operations using tensor repacking with NEON DOTPROD and I8MM instructions. The implementation adds 599 lines across 4 files, implementing specialized GEMV and GEMM kernels for Q8_0 quantization format. Analysis shows mixed performance characteristics with significant improvements in specific ARM64 operations but regressions in utility functions.

Key Findings

Performance-Critical Functions Impact
Matrix Operations (Performance-Critical Area): The new Q8_0 repack implementation targets
Parameter Infrastructure Functions: Multiple parameter-setting functions show consistent regressions:
These functions exhibit 25-65% callee overhead, suggesting shared infrastructure changes affecting parameter validation or storage logic. The consistent pattern across multiple implementations indicates systematic modification to parameter handling.
Optimized Functions:

Inference Performance Impact
Token Generation Functions: No direct modifications to
Estimated Impact:
Affected Functions:

Power Consumption Analysis
Binary-Level Impact: Power consumption analysis shows a 3.09% increase in
Breakdown:
Context:

Code Change Analysis
Implementation Scope: The PR implements a well-structured optimization following established patterns from previous quantization repack work (Q4_0, Q4_K, IQ4_NL). Key additions include:
Performance Trade-offs: The implementation prioritizes ARM64 prompt processing performance (2-3.7x speedup reported in PR description) at the cost of increased complexity in repack type selection. The 317 ns regression in
The parameter-setting function regressions (13-20 ns) appear to stem from shared infrastructure modifications, possibly related to validation logic or parameter buffer handling that affects all quantization formats, not just Q8_0.
Mirrored from ggml-org/llama.cpp#18096
This is similar work to #17494 and #16739.
The primary motivation for the q8_0 repack is to improve prefill/prompt processing performance, which is useful for audio models, for example.
Token generation/decode results are mixed: slightly better on the M4 and slightly worse on the RPi5, so the effect is probably device dependent.
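To make the idea of "repacking" concrete, here is a minimal sketch of interleaving four Q8_0-style row blocks so that a kernel can load the matching chunk of all four rows in one contiguous read. Everything in it is an assumption made for illustration: the type names `BlockQ8` and `BlockQ8x4`, the helpers `repack4` and `dot4`, and the use of a `float` scale for brevity. It is not the layout or the code from this PR; the chunk size of 4 or 8 bytes only mirrors the 4x4 (dotprod) and 4x8 (i8mm) naming used later in this description.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified types for illustration only.
constexpr int QK = 32;  // quantized values per block

struct BlockQ8 {
    float  d;       // per-block scale (float here for brevity)
    int8_t qs[QK];  // quantized values
};

struct BlockQ8x4 {
    float  d[4];        // one scale per packed row
    int8_t qs[4 * QK];  // four rows, interleaved in fixed-size chunks
};

// Interleave four row blocks in chunks of `chunk` bytes:
// [row0 chunk0][row1 chunk0][row2 chunk0][row3 chunk0][row0 chunk1]...
BlockQ8x4 repack4(const std::array<BlockQ8, 4>& rows, int chunk) {
    assert(chunk > 0 && QK % chunk == 0);
    BlockQ8x4 out{};
    for (int r = 0; r < 4; ++r) out.d[r] = rows[r].d;
    for (int c = 0; c < QK / chunk; ++c)
        for (int r = 0; r < 4; ++r)
            for (int k = 0; k < chunk; ++k)
                out.qs[(c * 4 + r) * chunk + k] = rows[r].qs[c * chunk + k];
    return out;
}

// Scalar reference: dot product of one activation block with four packed rows.
std::array<float, 4> dot4(const BlockQ8x4& w, const BlockQ8& x, int chunk) {
    std::array<float, 4> acc{};
    for (int r = 0; r < 4; ++r) {
        int32_t sum = 0;  // integer accumulation, scaled once at the end
        for (int c = 0; c < QK / chunk; ++c)
            for (int k = 0; k < chunk; ++k)
                sum += int32_t(w.qs[(c * 4 + r) * chunk + k]) * int32_t(x.qs[c * chunk + k]);
        acc[r] = float(sum) * w.d[r] * x.d;
    }
    return acc;
}
```

A SIMD kernel would replace the inner loops of `dot4` with dot-product or matrix-multiply instructions; the scalar version is only meant to show that the interleaved layout yields the same per-row results as the plain layout.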
This PR implements:
M4 max - i8mm path (GGML_NATIVE=ON native+dotprod+i8mm+nosve+sme)
build: 26ef677 (7365)
Rpi5 - dotprod path (GGML_NATIVE=ON cortex-a76+crypto+dotprod+noi8mm+nosve)
Perplexity
Perplexity is slightly different for the generic implementations, but I found no algorithmic differences other than the order of operations. Generic i8mm (4x8) gets better perplexity than the original implementation, while generic dotprod (4x4) is slightly worse. Repack vs no-repack gives the same values.
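As a small illustration of why the order of operations alone can shift results, the sketch below sums the same floats two ways: strictly left to right, and through per-lane partial sums that are combined at the end, which is roughly how a vectorized kernel accumulates. The function names `sum_sequential` and `sum_lanes` are hypothetical and the input is contrived to exaggerate the effect; the differences seen in perplexity are far smaller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right accumulation in a single float.
float sum_sequential(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Accumulate into `lanes` partial sums (element i goes to lane i % lanes),
// then combine the lanes, similar in spirit to a SIMD reduction.
float sum_lanes(const std::vector<float>& v, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<float> partial(lanes, 0.0f);
    for (std::size_t i = 0; i < v.size(); ++i) partial[i % lanes] += v[i];
    float s = 0.0f;
    for (float p : partial) s += p;
    return s;
}
```

With the input `{1e8f, 1.0f, -1e8f, 1.0f}`, the sequential sum loses the first `1.0f` to rounding while the two-lane sum keeps both, so the two functions return different values for mathematically identical sums.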
Decode
Following the advice from previous PRs, I checked the outputs of the GEMVs, and all are under the threshold defined in
test-backend-ops (same as the one used here: ggml-org/llama.cpp#17494 (comment)), although there are some individual weight differences. Looking directly at llama-completion and llama-server, I didn't find artifacts or bad answers. CI passes locally when routing through the generic path.
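A threshold check of this kind can be sketched as a normalized mean squared error (NMSE) between a reference output and the kernel output. This is an assumption about the general shape of such a comparison, not a copy of the test-backend-ops code; the names `nmse` and `within_threshold` and the `max_err` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Normalized mean squared error: sum((out - ref)^2) / sum(ref^2).
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    assert(ref.size() == out.size());
    double err = 0.0, norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(out[i]) - double(ref[i]);
        err  += d * d;
        norm += double(ref[i]) * double(ref[i]);
    }
    if (norm == 0.0) {
        // All-zero reference: only an exact match counts as zero error.
        return err == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return err / norm;
}

// `max_err` is a caller-chosen tolerance (hypothetical parameter).
bool within_threshold(const std::vector<float>& ref,
                      const std::vector<float>& out,
                      double max_err) {
    return nmse(ref, out) <= max_err;
}
```

Because the error is normalized by the magnitude of the reference, a few individual elements can differ while the aggregate still passes, which is consistent with seeing small per-weight differences under the threshold.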
Though these are not exactly accurate tests:
Some llama-cli queries
NO REPACK
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the enchanting city of lights, is renowned for its iconic landmarks such as the Eiffel Tower and the historic Louvre Museum, attracting millions of visitors from around the globe each year to experience its timeless charm, rich history, and vibrant culture.
[ Prompt: 356.3 t/s | Generation: 140.9 t/s ]
=================================================
i8mm
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the City of Light, is renowned for its stunning architecture, artistic heritage, and romantic ambiance, making it a quintessential destination for travelers worldwide.
[ Prompt: 551.6 t/s | Generation: 118.8 t/s ]
Yes, Paris is indeed the capital of France. It serves as the political, cultural, and economic heart of the country.
[ Prompt: 562.1 t/s | Generation: 131.3 t/s ]
===============================================================
DOTPROD
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the City of Light, is renowned globally for its rich history, stunning architecture, and cultural treasures like the Eiffel Tower and the Louvre Museum.
[ Prompt: 540.8 t/s | Generation: 119.9 t/s ]
Paris is the capital of France.
[ Prompt: 552.4 t/s | Generation: 133.3 t/s ]
===============================================================
Generic
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
Paris, the enchanting capital city of France, is renowned for its timeless charm, iconic landmarks such as the Eiffel Tower, and romantic atmosphere that attracts millions of visitors each year.
[ Prompt: 139.7 t/s | Generation: 240.1 t/s ]