New/other Matrix multiplication algorithm implementation #1971
Hi @vineel96
@mgouicem thanks for the information.
When it comes to lossless compression, oneDNN supports sparse tensors with two packing schemes.
Another compression approach involves converting one of the matrices to low precision (int8 or int4). This is a lossy compression method, but it is widely used to accelerate large language models. See the GPT-Q RFC and the matmul weights decompression example for more details. If you have a proof of concept for other techniques, please share.
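For readers unfamiliar with the weights decompression idea, here is a minimal, self-contained C++ sketch of it: weights are stored as int8 with one scale per output column and dequantized on the fly inside a naive matmul. This is not the oneDNN API; all names and the per-column scaling choice are illustrative assumptions, it only shows the essence of the lossy compression described above.

```cpp
// Illustration only: int8 weight compression with per-column scales, NOT the oneDNN API.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize an f32 weight matrix (K x N, row-major) to int8 with one scale per column.
void quantize_weights(const std::vector<float> &w_f32, int K, int N,
                      std::vector<int8_t> &w_i8, std::vector<float> &scales) {
    w_i8.resize(static_cast<size_t>(K) * N);
    scales.resize(N);
    for (int n = 0; n < N; ++n) {
        float amax = 0.f;
        for (int k = 0; k < K; ++k) amax = std::max(amax, std::fabs(w_f32[k * N + n]));
        scales[n] = amax > 0.f ? amax / 127.f : 1.f;
        for (int k = 0; k < K; ++k)
            w_i8[k * N + n] = static_cast<int8_t>(std::lround(w_f32[k * N + n] / scales[n]));
    }
}

// C(M x N) = A(M x K) * dequant(W_i8): weights are decompressed on the fly,
// so only int8 data (plus N scales) has to be kept in memory.
void matmul_decompress(const std::vector<float> &a, const std::vector<int8_t> &w_i8,
                       const std::vector<float> &scales, std::vector<float> &c,
                       int M, int N, int K) {
    c.assign(static_cast<size_t>(M) * N, 0.f);
    for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k)
            for (int n = 0; n < N; ++n)
                c[m * N + n] += a[m * K + k] * (w_i8[k * N + n] * scales[n]);
}
```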
Hi @vpirogov,
Sparse storage formats like COO and CSR require relatively high levels of zeroes in the weights tensor to bring performance value, think ~80% or more. In the context of LLMs, hardware-specific weight compression on SPR has been demonstrated by OpenVINO to bring performance benefits on some models.
GPT-Q is supported in oneDNN v3.5, which has int8 weight support on CPUs with Intel AMX and int8/int4 support on Intel GPUs. See the release notes.
The paper seems to offer a generalization of the CSR format that exploits data duplication in addition to dropping zeroes. I would expect it to be a marginal improvement over classical CSR storage schemes.
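To make the ~80% figure concrete, below is a generic CSR sketch (not oneDNN's sparse API; types and names are illustrative). CSR stores a value plus a column index per nonzero, so memory only shrinks once density drops well below 50%, and the irregular accesses in the sparse kernel push the performance break-even point much higher.

```cpp
// Generic CSR storage and a sparse-times-dense matmul; illustration only, not oneDNN code.
#include <cstdint>
#include <vector>

struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<float> values;    // nonzero values
    std::vector<int32_t> col_idx; // column index of each nonzero
    std::vector<int32_t> row_ptr; // size rows + 1, start of each row in values
};

// Build CSR from a dense row-major matrix, dropping exact zeros.
CsrMatrix to_csr(const std::vector<float> &dense, int rows, int cols) {
    CsrMatrix m;
    m.rows = rows; m.cols = cols;
    m.row_ptr.push_back(0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float v = dense[r * cols + c];
            if (v != 0.f) { m.values.push_back(v); m.col_idx.push_back(c); }
        }
        m.row_ptr.push_back(static_cast<int32_t>(m.values.size()));
    }
    return m;
}

// C(rows x n) = A_csr * B(cols x n), with B and C dense row-major.
// Note the indirect, irregular access into B via col_idx: this is why sparsity
// must be high before the sparse path beats a dense GEMM.
void spmm(const CsrMatrix &a, const std::vector<float> &b, std::vector<float> &c, int n) {
    c.assign(static_cast<size_t>(a.rows) * n, 0.f);
    for (int r = 0; r < a.rows; ++r)
        for (int32_t i = a.row_ptr[r]; i < a.row_ptr[r + 1]; ++i)
            for (int j = 0; j < n; ++j)
                c[r * n + j] += a.values[i] * b[a.col_idx[i] * n + j];
}
```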
Thanks @vpirogov
@vineel96, we are working on initial COO support and a GPU-optimized version towards the next release, oneDNN v3.6. Feel free to contribute an optimized implementation for your favorite platform. Choosing an algorithm based on the input matrix structure is possible with oneDNN, though this check would have to be done during primitive execution and may negatively affect performance.
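A hypothetical sketch of what such an execution-time check could look like (all names are made up; this is not oneDNN internals): the density scan runs on every call unless cached, which is exactly the overhead mentioned above.

```cpp
// Hypothetical dispatch that inspects the weights at execution time and picks a
// dense or a sparse kernel. The kernels are placeholder stubs; only the check matters here.
#include <cstddef>
#include <vector>

using Mat = std::vector<float>;

// Placeholder kernels standing in for real dense and sparse implementations.
void dense_matmul(const Mat &a, const Mat &w, Mat &c) { /* ... */ }
void sparse_matmul(const Mat &a, const Mat &w, Mat &c) { /* ... */ }

// Fraction of nonzero elements in the weight tensor.
float density(const Mat &w) {
    std::size_t nnz = 0;
    for (float v : w) nnz += (v != 0.f);
    return w.empty() ? 1.f : static_cast<float>(nnz) / w.size();
}

// The O(size) scan in density() repeats on every execution unless its result is
// cached, which is why checking the matrix structure at primitive execution time
// can negatively affect performance.
void matmul_dispatch(const Mat &a, const Mat &w, Mat &c, float sparse_threshold = 0.2f) {
    if (density(w) <= sparse_threshold)
        sparse_matmul(a, w, c);
    else
        dense_matmul(a, w, c);
}
```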
@vpirogov
@vpirogov, has anyone worked on and completed this RFC: https://github.com/oneapi-src/oneDNN/tree/rfcs/rfcs/20200812-multidim-matmul, or does it still need to be implemented? Does option 1 give any performance boost compared to BRGEMM? Any comments on this RFC?
@vineel96, the RFC you are referring to is related to the API. We extended the matmul primitive on all platforms to support the batched case. Performance is implementation specific, and a comparison to BRGEMM makes no sense here.
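For reference, a minimal sketch of a batched (3-D) matmul along the lines of the oneDNN v3.x C++ API; the dimensions and the f32 / plain "abc" layout are arbitrary illustration values, and the exact API should be checked against the oneDNN documentation.

```cpp
// Sketch of a batched matmul with the oneDNN v3.x C++ API (values are illustrative).
#include <unordered_map>
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim B = 8, M = 64, K = 128, N = 32; // batch and matrix sizes

    // 3-D descriptors: one primitive handles all B independent (M x K) * (K x N) products.
    auto src_md = memory::desc({B, M, K}, memory::data_type::f32, memory::format_tag::abc);
    auto wei_md = memory::desc({B, K, N}, memory::data_type::f32, memory::format_tag::abc);
    auto dst_md = memory::desc({B, M, N}, memory::data_type::f32, memory::format_tag::abc);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);
    auto prim = matmul(pd);

    memory src_mem(src_md, eng), wei_mem(wei_md, eng), dst_mem(dst_md, eng);

    prim.execute(strm, {{DNNL_ARG_SRC, src_mem},
                        {DNNL_ARG_WEIGHTS, wei_mem},
                        {DNNL_ARG_DST, dst_mem}});
    strm.wait();
    return 0;
}
```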
Hello,
Thanks.