This PR adds a new routine to the BLAS: `rotc`.
The routine accepts an m x (n+1) matrix A and two n x k matrices C and S. Each pair of entries C(i,j) and S(i,j) stores the cosine and sine of a Givens rotation, and the routine applies these rotations to A one by one. The amount of computational work this involves makes it a good fit for a level 3 BLAS routine.
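To make the semantics concrete, here is a minimal sketch of how I read the operation described above. The argument names, the loop order, and the convention that rotation (i, j) acts on columns i and i+1 of A are assumptions for illustration, not the actual `rotc` interface:

```fortran
! Minimal sketch: apply k sequences of n Givens rotations to the columns of A.
! Names, loop order, and the convention that rotation (i,j) acts on columns
! i and i+1 of A are assumptions for illustration, not the rotc interface.
subroutine rotc_sketch(m, n, k, A, lda, C, ldc, S, lds)
   implicit none
   integer, intent(in) :: m, n, k, lda, ldc, lds
   double precision, intent(inout) :: A(lda, *)
   double precision, intent(in) :: C(ldc, *), S(lds, *)
   integer :: i, j, l
   double precision :: temp

   do j = 1, k          ! apply the j-th rotation sequence
      do i = 1, n       ! rotation i acts on columns i and i+1 of A
         do l = 1, m
            temp      = C(i, j)*A(l, i) + S(i, j)*A(l, i+1)
            A(l, i+1) = C(i, j)*A(l, i+1) - S(i, j)*A(l, i)
            A(l, i)   = temp
         end do
      end do
   end do
end subroutine rotc_sketch
```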
This routine would be very useful for optimizing eigenvalue algorithms. The QR-SVD algorithm currently uses `lasr`, which is not really optimized at all. The Hessenberg-triangular reduction and the multishift QR/QZ implementations also apply chains of rotation sequences, but they accumulate blocks of rotations into bigger matrices that can then be applied using `gemm`. This routine can be faster than that.

This PR just adds a reference implementation. In a separate repository, I implemented a highly optimized version for double precision to prove that this routine can actually be optimized.
At the moment, only the implementation itself is in this PR. I think I will need some help with the other things (testblas, cblas, ...). Anyone willing to help?
Some minor things:
- Optimizing this routine requires a significant amount of hardware-specific optimization, similar to a matrix-matrix multiplication. This goes against the portable-performance ideas of LAPACK.
- There is somewhat of a convention in BLAS/LAPACK to put output arguments at the end. However, I'm trying to match the level 1 BLAS routine `rot` with the current interface.
- What is `dzrotc` and why do we need it? In some cases, you need to apply real rotations to a complex matrix, for example when computing the SVD of a complex matrix. This variant facilitates that. I think it is important enough to warrant an extra variant, especially since it is not that much extra work to provide an optimized version: it should be able to reuse the real-valued kernels.
- Why free form? Honestly, I just like looking at the free-form files better. There is nothing in the implementation that requires free form or modern Fortran, so I can change it if it causes issues.
- Using a shuffling kernel as I did in my implementation is a bit of work, but accumulating the rotations and then applying them with `gemm` also gives reasonable performance. That can be a good alternative while the optimized implementation is being made (see the sketch below).
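As a rough illustration of that alternative, here is a sketch that accumulates one sequence of n rotations into an (n+1) x (n+1) orthogonal factor and applies it with `dgemm`. The names, the rotation convention, and the use of `dlacpy` with a full workspace copy are assumptions for illustration; a practical implementation would accumulate and apply in blocks rather than form one large factor.

```fortran
! Rough sketch of the accumulate-then-gemm alternative: build the product of
! one rotation sequence as an (n+1) x (n+1) orthogonal factor Q, then apply
! it to A with dgemm. Names and the rotation convention are assumptions;
! a practical implementation would accumulate and apply in blocks.
subroutine rotc_via_gemm_sketch(m, n, A, lda, C, S)
   implicit none
   integer, intent(in) :: m, n, lda
   double precision, intent(inout) :: A(lda, *)
   double precision, intent(in) :: C(*), S(*)
   double precision :: Q(n+1, n+1), work(m, n+1)
   double precision :: temp
   integer :: i, l
   external :: dgemm, dlacpy

   ! Accumulate Q = G_1 * G_2 * ... * G_n, where G_i rotates columns i, i+1
   Q = 0.0d0
   do i = 1, n+1
      Q(i, i) = 1.0d0
   end do
   do i = 1, n
      do l = 1, n+1
         temp      = C(i)*Q(l, i) + S(i)*Q(l, i+1)
         Q(l, i+1) = C(i)*Q(l, i+1) - S(i)*Q(l, i)
         Q(l, i)   = temp
      end do
   end do

   ! Apply the accumulated factor: A <- A * Q
   call dgemm('N', 'N', m, n+1, n+1, 1.0d0, A, lda, Q, n+1, 0.0d0, work, m)
   call dlacpy('A', m, n+1, work, m, A, lda)
end subroutine rotc_via_gemm_sketch
```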