Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rotc to blas #1067

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

add rotc to blas #1067

wants to merge 2 commits into from

Conversation

thijssteel
Copy link
Collaborator

@thijssteel thijssteel commented Oct 21, 2024

This PR adds a new routine to the BLAS: rotc.

The routine accepts an m x (n+1) matrix A and two n x k matrices C and S. The matrices C and S each store a cosine and a sine of a Givens rotation and the routine applies these to the matrix A one by one. The amount of computational work this requires makes this fit for a level 3 BLAS routine.

This routine would be very useful for optimizing eigenvalue algorithms. The QR-SVD algorithm currently uses lasr, which is not really optimized at all. The Hessenberg-triangular reduction and the multishift-QR/QZ implementation also apply chains of rotation sequences, but they accumulate blocks of rotations into bigger matrices than can then be applied using gemm. This routine can be faster than that.

This PR just adds a reference implementation. In a separate repository, I implemented a highly optimized version for double precision to prove that this routine actually can be optimized .

At the moment, only the implementation itself is in this PR. I think I will need some help with the other things like testblas, cblas, ... anyone willing to help?

Some minor things:

  • Why BLAS and not LAPACK?
    Optimizing this routine requires a significant amount of hardware specific optimizations, similar to a matrix-matrix multiplication. This goes against the portable performance ideas of LAPACK.
  • Why are the output arguments in the middle
    There is somewhat of a convention in BLAS/LAPACK to put output arguments at the end. However, I'm trying to match the level 1 blas routine rot with the current interface.
  • What is dzrotc and why do we need it?
    In some cases, you can to apply real rotations to a complex matrix. For example, when computing the SVD of a complex matrix. This variant facilitates that. I think it is important enough to warrant an extra variant, especially since it is not that much extra work to provide an optimized version. It should be able to reuse the real valued kernels.
  • Why .f90 and not .f?
    Honestly, I just like looking at the free form files better. There is nothing in the implementation that requires free form or modern fortran so I can change it if it causes issues.
  • How much extra work is this for optimized BLAS implementations?
    Using a shuffling kernel as I did in my implementation is a bit of work, but accumulating the rotations and then using gemm also gives reasonable performance. That can be a good alternative while the optimized implementation is being made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant