New/other Matrix multiplication algorithm implementation #1971

Open
vineel96 opened this issue Jun 20, 2024 · 11 comments
Labels
enhancement (A feature or an optimization request) · help wanted · platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64)

Comments

@vineel96

Hello,

  1. The current implementation of matrix multiplication uses the BRGEMM algorithm. Is there any implementation of a "low-rank approximation" approach for matrix multiplication in oneDNN? Has the Intel team run any experiments with a low-rank approximation implementation and found bottlenecks for this algorithm compared to BRGEMM?
  2. Have any experiments been done to replace the broadcast kernel of the BRGEMM algorithm with other kernels? Are there any performance comparison results for the same?

Thanks.
@vineel96 vineel96 changed the title Matrix multiplication algorithm implementation New/other Matrix multiplication algorithm implementation Jun 20, 2024
@mgouicem
Contributor

Hi @vineel96

  1. I guess you are referring to sampling algorithms, right? oneDNN typically does not do such things and implements algorithms that are equivalent (in terms of operations) to a naive matrix multiplication. Regarding brgemm, it is not an algorithm per se, but mostly a flexible low-level abstraction for assembly kernels.
  2. BRGEMM has multiple strategies depending on problem shapes and ISA (e.g. AMX kernels do not rely on the broadcast strategy). For implementations working on vector registers, we indeed use broadcast kernels (with either implicit or explicit broadcasts). I think we did some experiments at some point with a dot-product-based approach (with horizontal reduction), but performance was typically low for the problem shapes encountered in Deep Learning.
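
For illustration, here is a minimal scalar C++ sketch (not oneDNN code) contrasting the two inner-loop structures described above: the broadcast (outer-product) formulation vectorizes along N and keeps accumulators in registers, while the dot-product formulation reduces along K and ends with a horizontal sum per output element.

```cpp
#include <cstddef>

// Broadcast style: a(i,k) is broadcast against a row of B, so the inner
// loop over j vectorizes directly and the accumulators can stay in
// vector registers -- no horizontal reduction is needed.
void matmul_broadcast(const float *A, const float *B, float *C,
                      size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t k = 0; k < K; ++k) {
            const float a_ik = A[i * K + k]; // value that gets broadcast
            for (size_t j = 0; j < N; ++j)   // vectorizable over j
                C[i * N + j] += a_ik * B[k * N + j];
        }
}

// Dot-product style: each C(i,j) is a reduction over k, which on vector
// ISAs finishes with a horizontal sum -- typically slower for the
// skinny/batched shapes common in deep learning workloads.
void matmul_dotprod(const float *A, const float *B, float *C,
                    size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```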

@vineel96
Author

@mgouicem thanks for the information.

  1. Yes. Is there a chance to implement lossless compression methods for matrices that reduce the overall GFLOP count of matrix multiplication and thereby reduce its execution time in oneDNN?
  2. Is there any scope to further improve MatMul by reducing its overall GFLOP count, e.g. by introducing new algorithms?

@vpirogov
Member

When it comes to lossless compression, oneDNN supports sparse tensors with two packing schemes:

  • Compressed sparse row (CSR), widely used in scientific applications. Support for the coordinate format (COO) is coming shortly.
  • Hardware-specific packed formats.

You can find more information in the Developer Guide.
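
As a rough illustration of what the CSR packing stores, here is a plain C++ sketch (not the oneDNN sparse API): values and column indices per nonzero plus per-row offsets, with a sparse-times-dense product whose work scales with the number of nonzeroes.

```cpp
#include <vector>

// Minimal CSR container, for illustration only.
struct CsrMatrix {
    std::vector<float> values;   // nnz nonzero values
    std::vector<int>   col_idx;  // nnz column indices
    std::vector<int>   row_ptr;  // rows + 1 offsets; row i spans [row_ptr[i], row_ptr[i+1])
    int rows = 0, cols = 0;
};

// C (rows x n) += A_csr (rows x cols) * B (cols x n), B and C dense row-major.
// Work is proportional to nnz rather than rows * cols.
void csr_matmul(const CsrMatrix &A, const float *B, float *C, int n) {
    for (int i = 0; i < A.rows; ++i)
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            const float a = A.values[p];
            const int   k = A.col_idx[p];
            for (int j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}
```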

Another compression approach involves converting one of the matrices to low precision (int8 or int4). This is a lossy compression method, but it is widely used to accelerate large language models. See the GPT-Q RFC and the matmul weights decompression example for more details.
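
For intuition, here is a minimal sketch of the weights-decompression idea (plain C++, not the oneDNN implementation; the per-output-column scale layout is an assumption for illustration): int8 weights plus float scales are expanded on the fly inside the matmul.

```cpp
#include <cstdint>
#include <cstddef>

// C = A * dequant(W): W is stored as int8 with one f32 scale per output
// column, cutting the weight footprint ~4x versus f32 at the cost of
// quantization error (hence "lossy").
void matmul_int8_weights(const float *A, const int8_t *W,
                         const float *scale_per_n, float *C,
                         size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * static_cast<float>(W[k * N + j]);
            C[i * N + j] = acc * scale_per_n[j]; // dequantize once per output
        }
}
```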

If you have a proof of concept for other techniques please share.

@vineel96
Author

vineel96 commented Jun 22, 2024

Hi @vpirogov,
Thanks for the reply.

  1. Does COO give a performance boost only if the DL model uses sparse tensors? Most of the DL models from Hugging Face do not use sparse tensors, right? Also, is anyone already implementing this work, or can anyone take it up to complete it?
  2. Also, is anyone working on GPT-Quantization (GPT-Q) enablement? If not, can anyone take it up to complete it?
  3. Currently I am looking into lossless compression methods for matrix multiplication where we either reduce the total FLOP count of matmul or increase FLOP/s as it scales across the number of threads. I am looking for papers in this regard and found one (https://arxiv.org/pdf/2203.14540), but I am not sure how it helps. Any links/papers/suggestions in this regard are appreciated.

Thanks.

@vpirogov
Member

Does COO give a performance boost only if the DL model uses sparse tensors? Most of the DL models from Hugging Face do not use sparse tensors, right? Also, is anyone already implementing this work, or can anyone take it up to complete it?

Sparse storage formats like COO and CSR require relatively high levels of zeroes in the weights tensor to bring performance value, think ~80% or more. In the context of LLMs, hardware-specific weight compression on SPR has been demonstrated by OpenVINO to bring performance benefits on some models.
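
A back-of-the-envelope footprint comparison illustrates why (assuming f32 values with int32 indices, which is an illustration rather than any exact oneDNN layout): on storage alone CSR only breaks even around 50% zeroes, and the irregular memory access of sparse kernels pushes the practical threshold much higher.

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    const long rows = 4096, cols = 4096;
    for (double density : {0.5, 0.2, 0.1}) {
        const long nnz = static_cast<long>(rows * cols * density);
        const long dense_bytes = rows * cols * 4;              // f32 values only
        const long csr_bytes = nnz * (4 + 4) + (rows + 1) * 4; // value + column index + row pointers
        std::printf("density %.2f: dense %ld KiB, CSR %ld KiB\n",
                    density, dense_bytes / 1024, csr_bytes / 1024);
    }
    return 0;
}
```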

  • Also, is anyone working on GPT-Quantization (GPT-Q) enablement? If not, can anyone take it up to complete it?

GPT-Q is supported in oneDNN v3.5, which has int8 weights support on CPUs with Intel AMX and int8/int4 support on Intel GPUs. See release notes.

  • Currently I am looking into lossless compression methods for matrix multiplication where we either reduce the total FLOP count of matmul or increase FLOP/s as it scales across the number of threads. I am looking for papers in this regard and found one (https://arxiv.org/pdf/2203.14540), but I am not sure how it helps. Any links/papers/suggestions in this regard are appreciated.

The paper seems to offer a generalization of CSR format that exploits data duplication in addition to dropping zeroes. I would expect it to be a marginal improvement over classical CSR storage schemes.

@vineel96
Author

vineel96 commented Jun 28, 2024

Thanks @vpirogov

  1. For the COO RFC, is anyone already implementing this work, or can anyone take it up to complete it?
  2. Also, is it allowed in oneDNN to add extra checks on the input matrix, say whether it is a diagonal matrix or some other patterned matrix, and based on that run an optimized algorithm? Similar to this paper: https://arxiv.org/abs/2208.07129 (Fast hardware-aware matrix-free algorithm for higher-order finite-element discretized matrix multivector products on distributed systems)

@vpirogov
Member

@vineel96, we are working on initial COO support and a GPU-optimized version for the next release, oneDNN v3.6. Feel free to contribute an optimized implementation for your favorite platform.

Choosing an algorithm based on the input matrix structure is possible with oneDNN, though this check would have to be done during primitive execution and may negatively affect performance.
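
A hypothetical sketch of what such an execution-time dispatch could look like (plain C++, not oneDNN code; the function names are made up for illustration). If A turns out to be diagonal, C = A * B reduces to scaling rows of B, O(M*N) instead of O(M*N*K); the O(M*K) scan is exactly the execution-time overhead mentioned above.

```cpp
#include <cstddef>

// Returns true only if the square matrix A has nonzeroes on the diagonal alone.
static bool is_diagonal(const float *A, size_t M, size_t K) {
    if (M != K) return false;
    for (size_t i = 0; i < M; ++i)
        for (size_t k = 0; k < K; ++k)
            if (i != k && A[i * K + k] != 0.f) return false;
    return true;
}

void matmul_with_dispatch(const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K) {
    if (is_diagonal(A, M, K)) {
        // Fast path: scale each row of B by the corresponding diagonal entry.
        for (size_t i = 0; i < M; ++i)
            for (size_t j = 0; j < N; ++j)
                C[i * N + j] = A[i * K + i] * B[i * N + j];
        return;
    }
    // General path: reference triple loop standing in for the optimized
    // (e.g. BRGEMM-based) implementation.
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```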

@vineel96
Author

vineel96 commented Jul 2, 2024

@vpirogov
Are COO support and GPT-Q support available for Arm CPUs? Or should someone take up this task?

@vpirogov
Member

vpirogov commented Jul 9, 2024

No, neither of these are supported on Arm CPUs.

@jondea, @milpuz01, do you have anything in the works for GPT-Q or sparse formats?

@vpirogov added the enhancement (A feature or an optimization request), platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64), and help wanted labels and removed the question label on Jul 10, 2024
@vineel96
Author

@vpirogov, has anyone worked on and completed this RFC: https://github.com/oneapi-src/oneDNN/tree/rfcs/rfcs/20200812-multidim-matmul, or does it still need to be implemented? Does option 1 give any performance boost compared to BRGEMM? Any comments on this RFC?

@vpirogov
Member

@vineel96, the RFC you are referring to is related to the API. We extended the matmul primitive on all platforms to support the batched case. Performance is implementation-specific, and a comparison to BRGEMM makes no sense here.
