I want to perform inference on a quantized LLAMA model (W8A16) on Armv9 (with SVE) using oneDNN. The weights are per-group quantized.
As I understand it, the weights should be prepacked to avoid the cost of repacking on every call. However, packing disrupts the layout of the per-group quantization scales and zero points. Dequantization therefore needs to be fused into the compute kernel: if it were instead fused into the packing step, that would amount to storing a second copy of the weights in FP16, defeating the purpose of quantization.
I haven't figured out how to combine prepacking with per-group dequantization.
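For illustration, one way the two can be combined is sketched below in stdlib-only C++ (all names, block sizes, and layouts here are hypothetical, not oneDNN API): the scales stay in their original, unpacked order, and the kernel recovers the original `k` coordinate from the packed layout, so the group index `k / G` can be computed on the fly instead of baking the scales into the packed weights.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: pack int8 weights into an N-blocked layout while
// keeping per-group scales separate; dequantize inside the kernel.
constexpr int NB = 4; // block width along N (would match the SVE vector length)
constexpr int G  = 8; // quantization group size along K

// Pack B[K][N] (row-major int8) into blocks: packed[nb][k][j], j in [0, NB).
std::vector<int8_t> pack_weights(const std::vector<int8_t>& B, int K, int N) {
    std::vector<int8_t> packed(K * N);
    for (int nb = 0; nb < N / NB; ++nb)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < NB; ++j)
                packed[(nb * K + k) * NB + j] = B[k * N + nb * NB + j];
    return packed;
}

// C[M][N] = A[M][K] * dequant(B): the scale is indexed by (n, k / G), so
// the packed layout never needs the scales baked in.
void matmul_w8(const std::vector<float>& A, const std::vector<int8_t>& packed,
               const std::vector<float>& scales, // N x (K/G), unpacked order
               std::vector<float>& C, int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int nb = 0; nb < N / NB; ++nb)
            for (int j = 0; j < NB; ++j) {
                int n = nb * NB + j;
                float acc = 0.f;
                for (int k = 0; k < K; ++k) {
                    // Group index recovered from the original k, which the
                    // packed layout preserves; no scale reshuffling needed.
                    float w = scales[n * (K / G) + k / G]
                              * packed[(nb * K + k) * NB + j];
                    acc += A[m * K + k] * w;
                }
                C[m * N + n] = acc;
            }
}
```

A real kernel would vectorize the inner loop and hoist the scale load per group, but the indexing idea is the same.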
Which interface should I use for prepacking? The SVE vector length is implementation-defined (commonly 256 or 512 bits); how does oneDNN choose a packed layout for the target vector length?
After prepacking and saving the packed weights, how do I fuse dequantization into the kernel at compute time?
@oleotiger, we are working on enabling per-group quantization in oneDNN. You can find a description of the proposed design for fused weight decompression here. The implementation is not yet available on any platform, though. For now, the only option is to decompress the weights separately, as you indicated.
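The workaround described above can be sketched as follows (stdlib-only C++; `dequantize_weights` and the layout choices are hypothetical): fully decompress the per-group-quantized weights to float once, up front, then feed the dense buffer to a regular f16/f32 matmul. As the question notes, this costs an extra dense copy of the weights.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the current workaround: decompress the whole
// weight matrix once, applying each group's scale and zero point, and hand
// the resulting dense float buffer to an ordinary matmul primitive.
std::vector<float> dequantize_weights(
        const std::vector<int8_t>& B,           // K x N, row-major int8
        const std::vector<float>& scales,       // N x (K/G)
        const std::vector<int8_t>& zero_points, // N x (K/G)
        int K, int N, int G) {
    std::vector<float> out(K * N);
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n) {
            int g = k / G; // group index along the reduction dimension
            out[k * N + n] = scales[n * (K / G) + g]
                             * (B[k * N + n] - zero_points[n * (K / G) + g]);
        }
    return out;
}
```

The decompressed buffer can then go through the usual prepack-and-reuse path, since it is an ordinary dense float matrix from oneDNN's point of view.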