
Implement xGEMMTR and cblas_xGEMMTR #887

Merged: 23 commits, Jun 28, 2024

Conversation

@grisuthedragon (Contributor) commented Jul 24, 2023

Description

The xGEMMT routine implements the GEMM operation, but only the lower or the upper triangle of C is accessed and updated. The routine is provided by many optimized BLAS libraries but not by the reference implementation. Since projects like MUMPS already use it, it should be provided by the reference implementation as well. This enables projects that use this function to run in environments that allow exchanging the BLAS library, such as update-alternatives on Debian/Ubuntu or FlexiBLAS.

The xGEMMT function is provided, for example, by MKL, OpenBLAS, ArmPL, ESSL, and BLIS.

The PR includes subroutines, the tests, and the CBLAS interface.

EDIT: After the discussion, the routine is now called xGEMMTR.
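
As a quick illustration, here is a minimal sketch of a call to the double-precision routine. The argument order (UPLO, TRANSA, TRANSB, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC, i.e., GEMM with an extra UPLO argument and with M = N) is my reading of this PR; the sources in BLAS/SRC are authoritative.

  program demo_dgemmtr
     implicit none
     integer, parameter :: n = 4, k = 2
     double precision :: a(n,k), b(k,n), c(n,n)
     integer :: i, j
     external dgemmtr

     call random_number(a)
     b = transpose(a)      ! makes A*B symmetric by construction
     c = 0.0d0
     ! Lower triangle of C := 1.0*A*B + 0.0*C; the strictly upper
     ! triangle of C is neither referenced nor updated.
     call dgemmtr('L', 'N', 'N', n, k, 1.0d0, a, n, b, k, 0.0d0, c, n)
     do i = 1, n
        print '(*(f9.4))', (c(i,j), j = 1, i)
     end do
  end program demo_dgemmtr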

Checklist

  • The documentation has been updated.
  • CBLAS interface added.
  • Tests for BLAS and CBLAS added.

@codecov (bot) commented Jul 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (69992ad) to head (f90c7ae).
Report is 8 commits behind head on master.

Current head f90c7ae differs from pull request most recent head c57c156

Please upload reports for the commit c57c156 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #887     +/-   ##
=========================================
  Coverage    0.00%    0.00%             
=========================================
  Files        1937     1918     -19     
  Lines      190566   188614   -1952     
=========================================
+ Misses     190566   188614   -1952     

☔ View full report in Codecov by Sentry.

@langou (Contributor) commented Jul 24, 2023

Thanks Martin.

Two parts: (1) overall in favor; (2) a rant against this specific GEMMT routine. To help me: who knows of usages of GEMMT?

(1) Oh my. Adding a routine to the reference Level 3 BLAS? I am overall in favor, since GEMMT already exists in many BLAS implementations. And I am happy to see the Level 3 BLAS evolve and adapt.

(2-1) I must say that overall I am not a big fan of this GEMMT routine. I wonder why people need it. It seems popular enough. In many libraries. (Thanks for the survey.) But why do users need it? What are some examples of usage?

(2-2) I would rather see a routine like C <- alpha * A * D * A^T + beta * C where C is n-by-n symmetric in input and D is m-by-m symmetric in input too. (And so A is n-by-m.) That would be useful. In the BLAS Technical Forum Standard, they had BLAS_xSY_TRIDIAG_RK where, in addition of D being symmetric, the matrix D was symmetric tridiagonal. This BLAS_xSY_TRIDIAG_RK routine would be useful too.

(2-3) I wonder whether users need C <- alpha * A * D * A^T + beta * C and then they first perform GEMM with B <= 1. * A * D + 0. * B and then GEMMT with C <- alpha * A * B + beta * C. Well, if that's the case, it would be really, in my opinion, much better to use C <- alpha * A * D * A^T + beta * C.

(2-4) If we use C <- alpha * A * D * A^T + beta * C, we can reduce data movement from A and A^T by "combining" the data movement of A and A^T as shown in
https://doi.org/10.1145/3490148.3538587
https://doi.org/10.1109/SC41404.2022.00034
https://doi.org/10.1145/3558481.3591072
as opposed to using a GEMM followed by a GEMMT.

(2-5) Also I do not like the fact that the symmetry of A*B^T relies on what A and B are. Argh. I do not like this. When we look at the interface C <- alpha * A * D * A^T + beta * C, C in input is 100% symmetric because we assume the upper part from the lower part, D in input is 100% symmetric because we assume the upper part from the lower part. So the output C is 100% symmetric. There is no arguing. C in output is symmetric. When we look at C <- alpha * A * B + beta * C. Argh. The symmetry of output C relies on the "quality" of A and the "quality" of B and the "quality" of A*B. I think there is more uncertainty on whether A*B^T or B^T * A is the same. It should be the same. Is it? With the GEMMT interface, it is far from clear which one you should use. Argh. I think this is error prone, and not good for stability. I do not have good examples. But I do not like it.

(2-6) Speaking of C <- alpha * A * D * A^T + beta * C, I am not sure what a reference algorithm would be. Mmm. Is there a "more stable way" to perform A * D * A^T?

(2-7) One argument that I can see for GEMMT is that the user has a fast way to right-multiply the symmetric matrix D by A. So, the user wants to do C <- alpha * A * D * A^T + beta * C, but they have a fast way to compute B <- A * D. So then yes, GEMMT can be useful.
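
To make the two-call pattern in (2-3)/(2-7) concrete, here is a hedged sketch; the helper name sy_update is hypothetical, and the dgemmtr argument order is assumed to be GEMM-like with an extra UPLO.

  ! Lower triangle of C := alpha*A*D*A^T + beta*C, with A n-by-m and
  ! D m-by-m symmetric (lower triangle stored). WORK is an n-by-m buffer.
  subroutine sy_update(n, m, alpha, a, d, beta, c, work)
     implicit none
     integer, intent(in) :: n, m
     double precision, intent(in) :: alpha, beta, a(n,m), d(m,m)
     double precision, intent(inout) :: c(n,n)
     double precision, intent(out) :: work(n,m)
     external dsymm, dgemmtr

     ! B = A*D; DSYMM exploits the symmetry of D (a plain DGEMM works too)
     call dsymm('R', 'L', n, m, 1.0d0, d, m, a, n, 0.0d0, work, n)
     ! Lower triangle of C := alpha*B*A^T + beta*C
     call dgemmtr('L', 'N', 'T', n, m, alpha, work, n, a, n, beta, c, n)
  end subroutine sy_update

The WORK buffer is exactly the allocation issue raised below for a fused C <- alpha * A * D * A^T + beta * C routine with a general D.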

@langou (Contributor) commented Jul 24, 2023

And let me add: the name, GEMMT. I understand that it is institutionalized in several BLAS libraries. Fine with me. But I find GEMMT a very cryptic name. I would not have guessed the functionality from the GEMMT name.

langou previously approved these changes Jul 24, 2023

@martin-frbg (Collaborator)

I share the doubts about adding industry-standard extensions to what is (or used to be?) seen as the reference implementation of something that was largely agreed on by working groups decades ago. Though to some extent that would hold for Reference-LAPACK as well: is it the classical reference if old algorithms get upgraded and new ones added, or is it the new and improved UCDenver LAPACK?

@grisuthedragon (Contributor, Author)

> (2-1) I must say that overall I am not a big fan of this GEMMT routine. I wonder why people need it. It seems popular enough. In many libraries. (Thanks for the survey.) But why do users need it? What are some examples of usage?

At least one library using it is MUMPS 5.6.1, in src/{s,c,d,z}fac_front_aux.F. A short look shows that they use it in an LDL^T factorization to update the L matrices. Thus, this seems to be the C <- alpha * A * D * A^T + beta * C approach.

> (2-2) I would rather see a routine like C <- alpha * A * D * A^T + beta * C where C is n-by-n symmetric in input and D is m-by-m symmetric in input too. (And so A is n-by-m.) That would be useful. In the BLAS Technical Forum Standard, they had BLAS_xSY_TRIDIAG_RK where, in addition of D being symmetric, the matrix D was symmetric tridiagonal. This BLAS_xSY_TRIDIAG_RK routine would be useful too.

IMHO, only allowing D to be symmetric tridiagonal is too large a restriction.

> (2-3) I wonder whether users need C <- alpha * A * D * A^T + beta * C and then they first perform GEMM with B <= 1. * A * D + 0. * B and then GEMMT with C <- alpha * A * B + beta * C. Well, if that's the case, it would be really, in my opinion, much better to use C <- alpha * A * D * A^T + beta * C.

For most application cases I can think of, the combined variant would be better. For example, when solving a Lyapunov equation with the Bartels-Stewart algorithm and the Schur decomposition, one ends up with this type of update.

> (2-4) If we use C <- alpha * A * D * A^T + beta * C, we can reduce data movement from A and A^T by "combining" the data movement of A and A^T as shown in https://doi.org/10.1145/3490148.3538587 https://doi.org/10.1109/SC41404.2022.00034 https://doi.org/10.1145/3558481.3591072 as opposed to using a GEMM followed by a GEMMT.

There is also an implementation of this functionality in https://github.com/SLICOT/SLICOT-Reference/blob/main/src/MB01RU.f, which is used in the case I mentioned in (2-3). But that approach is horribly slow.

> (2-5) Also I do not like the fact that the symmetry of A*B^T relies on what A and B are. Argh. I do not like this. When we look at the interface C <- alpha * A * D * A^T + beta * C, C in input is 100% symmetric because we assume the upper part from the lower part, D in input is 100% symmetric because we assume the upper part from the lower part. So the output C is 100% symmetric. There is no arguing. C in output is symmetric. When we look at C <- alpha * A * B + beta * C. Argh. The symmetry of output C relies on the "quality" of A and the "quality" of B and the "quality" of A*B. I think there is more uncertainty on whether A*B^T or B^T * A is the same. It should be the same. Is it? With the GEMMT interface, it is far from clear which one you should use. Argh. I think this is error prone, and not good for stability. I do not have good examples. But I do not like it.

> (2-6) Speaking of C <- alpha * A * D * A^T + beta * C, I am not sure what a reference algorithm would be. Mmm. Is there a "more stable way" to perform A * D * A^T?

Again, I could mention https://github.com/SLICOT/SLICOT-Reference/blob/main/src/MB01RU.f at this point.

> (2-7) One argument that I can see for GEMMT is that the user has a fast way to right-multiply the symmetric matrix D by A. So, the user wants to do C <- alpha * A * D * A^T + beta * C, but they have a fast way to compute B <- A * D. So then yes, GEMMT can be useful.

I implemented the routine for the sake of completeness, since many of the optimized libraries provide it, not because of a particular meaningful usage, which is truly a medium-to-big problem as you pointed out. On the other hand, ignoring the BLAS-like extensions that are common to almost all the optimized libraries is not a good way either. First, if the reference implementation provides a routine, there is a bigger chance that the optimized versions behave the same way. Second, while developing on top of BLAS and LAPACK, many developers (at least the ones I know) switch back to the reference implementation for debugging purposes. Once a BLAS-like extension is used, this requires either code changes or a build system that checks for the availability of the extension. Third, if applications/libraries use BLAS-like extensions, they can easily break under library-exchange systems like update-alternatives.

@grisuthedragon (Contributor, Author)

> I share the doubts about adding industry-standard extensions to what is (or used to be?) seen as the reference implementation of something that was largely agreed on by working groups decades ago. Though to some extent that would hold for Reference-LAPACK as well: is it the classical reference if old algorithms get upgraded and new ones added, or is it the new and improved UCDenver LAPACK?

Even though the BLAS standard is now very old, one can slowly adjust it a bit to match the users' changed requirements (as long as no old functionality gets disturbed). Furthermore, there are some bad designs in the reference interface as well that can only be fixed by adding new routines, like the missing second scalar in daxpy, which is covered by daxpby in many BLAS implementations and is also not yet in the reference.

@vladimir-ch (Contributor)

Would it be thinkable that the reference implements C <- alpha * A * D * A^T + beta * C under a non-cryptic name? It would apparently cover a majority of the use cases of GEMMT, and by being the reference it is in the privileged position to show how things should be done properly.

@langou (Contributor) commented Jul 25, 2023

One major issue with C <- alpha * A * D * A^T + beta * C is that we would want A to be input only.

And if D is tridiagonal, that's easy enough to do. If D is a general m-by-m matrix, that's not going to work out without allocation. So that's probably the reason for the two alternatives: (1) GEMMT, and (2) BLAS_xSY_TRIDIAG_RK.

I had forgotten this "detail".

@robert-mijakovic

Off the top of my head, I can say that GEMMT is used by ABINIT and possibly other computational chemistry, molecular modelling, and materials science packages. I can have a look into other packages that use GEMMT if necessary.

@robert-mijakovic

Two more packages: PETSc and VASP.

@robert-mijakovic

After some searching, I came up with the following packages, which depend on MUMPS, PETSc, or ABINIT and could possibly benefit from GEMMT.

  1. MUMPS dependent packages: Trilinos, Goma, Palace, xSDK, MFEM, CEED, FrontISTR, Ipopt, HPDDM, FEniCS, z-PARES, Akantu, Tandem, Elmer FEM, Camelia, Clp (Coin-or linear programming), FreeFEM++, Cbc (Coin-or branch and cut), Kwant, OpenSees, TELEMAC-MASCARET, TOY2DAC

  2. PETSc dependent packages: PISM, DAMASK-grid, Sundials, Ratel, SLEPc, AMReX, ExaGO, Sowing, Akantu, libMesh, FreeFEM, openCARP, Alquimia, DAMASK-mesh, PHIST, DFT-FE, ELSI, AMP, libpressio, STEPS (STochastic Engine for Pathway Simulation), FEniCS-DOLFINx, SfePy, Gmsh

  3. ABINIT dependent packages: Yambo, BigDFT

@ilayn (Contributor) commented Jul 28, 2023

> Would it be thinkable that the reference implements C <- alpha * A * D * A^T + beta * C under a non-cryptic name?

I was about to type this a few days ago but got distracted by work. xGEMMT indeed reads like matmul(m, transpose(m)). I am also from the control theory side, and I know why SLICOT uses such routines extensively: control algorithms typically love working with quadratic forms.

However, if this function is going to be implemented in the reference, it should actually bring some significant benefit beyond implicit 2x DGEMM calls, if possible, in my opinion.

@langou (Contributor) commented Jul 28, 2023

> I was about to type this a few days ago but got distracted by work. xGEMMT indeed reads like matmul(m, transpose(m)).

(1) I find that the fact that the input/output matrix C is symmetric is a significant omission in the GEMMT interface name; I was not expecting any symmetry with a name such as GEMMT. (2) "GEMMT indeed reads like matmul(m, transpose(m))": I think I understand what you mean, but (2a) if it were matmul(m, transpose(m)), that would be a SYRK; (2b) if it were matmul(m, transpose(m)), we would have only one argument M in the interface and not an A and a B; (2c) GEMMT enables four variants: matmul(a,b), matmul(a, transpose(b)), matmul(transpose(a), b), and matmul(transpose(a), transpose(b)). All in all, I understand that the intent is that it is something like matmul(m, transpose(m)).

In any case, I think at this point the name of the subroutine is set. The reason to include the routine is that it is widely used. Changing the name now would defeat the reason for including it.

@langou (Contributor) commented Jul 28, 2023

> However, if this function is going to be implemented in the reference, it should actually bring some significant benefit beyond implicit 2x DGEMM calls, if possible, in my opinion.

I think by "this function" you mean C <- alpha * A * D * A^T + beta * C for a general D, in which case I strongly agree with you.

Advantage of BLAS_xSY_TRIDIAG_RK over GEMM + GEMMT: If D is tridiagonal, I think this C <- alpha * A * D * A^T + beta * C can bring some benefit over GEMM + GEMMT by (1) reducing data movement of A, (2) avoiding a B buffer to store A * D.

Advantage of GEMMT over GEMM: GEMMT is half the FLOPs of GEMM (roughly n^2*k flops to update one triangle of the n-by-n result instead of 2*n^2*k for the full product).

@ilayn (Contributor) commented Jul 29, 2023

> I think by "this function" you mean C <- alpha * A * D * A^T + beta * C for a general D, in which case I strongly agree with you.

Sorry for my terse style; I was typing quicker than I could make sense. Indeed, that's what I meant.

> I find that the fact that the input/output matrix C is symmetric is a significant omission in the GEMMT interface name,

Indeed. And I fully agree with your following points.

> Changing the name now would defeat the reason for including it.

I think we can live with a name change, because there is no precedent for it and, like you mentioned, the naming is already not optimal. LAPACK names are already cryptic enough, so in my seriously humble opinion there is no reason to carry on the legacy of unofficial sources or suboptimal decisions made elsewhere.

There is enough legacy within LAPACK to live with as it is. Anyone who needs to switch to the reference implementation of this "GEMMT" will need to modify code anyway. Thus I think this gives us the freedom to correct the naming. But then again, that's just my opinion.

Functionality-wise, similar routines might be requested, such as the triangular-triangular multiplication that comes up in Sylvester/Schur-related algorithms. Hence I'm not sure halving the matmul FLOPs is a killer feature. But again, these are all personal takes; if folks are using it, apparently they need it.

@langou (Contributor) commented Jul 31, 2023

> Functionality-wise, similar routines might be requested, such as the triangular-triangular multiplication that comes up in Sylvester/Schur-related algorithms. Hence I'm not sure halving the matmul FLOPs is a killer feature. But again, these are all personal takes; if folks are using it, apparently they need it.

I agree that "similar routines might be requested, such as triangular-triangular multiplication". We also had needs for similar "weird" kernels in <T>LAPACK. We were thinking of giving names like matmult_AtrBtrCtr, where the shape of each matrix is encoded in the interface name (here tr stands for triangular, for example). "I'm not sure halving the matmul flops is a killer feature": FLOPs is one argument, but there are also arguments for (1) less storage, (2) in-place vs. out-of-place, and (3) data access.

@rileyjmurray

Joining the thread because I'm interested in this functionality.

I'm working on an iterative linear system solver with @PTNobel based on a block Krylov subspace method. This method (and many like it) needs to compute small square matrices of the form P' G P, where G is PSD. The context is that we'll have already computed G P for other reasons in the algorithm. So we'd benefit from a function to compute P' Q, where Q is a general matrix.

I have a feature request for the hypothetical function that exploits the known symmetry of P' Q. In the uplo argument, it would be great to allow a "general" flag so that only one triangle is computed in terms of dot products, and the result is then copied to the remaining triangle. This ensures explicit symmetry of the output matrix. We find in our applications that computing M = P' Q with GEMM results in a matrix that is symmetric only up to a normwise relative error of 5e-16 in double precision, but this difference is apparently enough for one of chol(M, "lower") and chol(M, "upper") to fail when the other succeeds.

(Please excuse the lazy notation above w.r.t. chol.)
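
For what it's worth, a hypothetical sketch of the requested "general" behavior, computing one triangle and then mirroring it (the helper name symm_crossprod is made up, and the dgemmtr argument order is assumed to be GEMM-like with an extra UPLO):

  ! C := P^T * Q with only the lower triangle computed via dot
  ! products, then mirrored so the output is exactly symmetric.
  subroutine symm_crossprod(n, k, p, q, c)
     implicit none
     integer, intent(in) :: n, k
     double precision, intent(in) :: p(k,n), q(k,n)
     double precision, intent(out) :: c(n,n)
     integer :: i, j
     external dgemmtr

     call dgemmtr('L', 'T', 'N', n, k, 1.0d0, p, k, q, k, 0.0d0, c, n)
     do j = 1, n - 1
        do i = j + 1, n
           c(j,i) = c(i,j)    ! copy the lower triangle to the upper
        end do
     end do
  end subroutine symm_crossprod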

[Inline review comment on BLAS/SRC/zgemmt.f (outdated, resolved)]
@mgates3 (Contributor) commented Aug 10, 2023

gemmt was introduced in Intel MKL, circa 2015.

In cuBLAS, similar functionality is called syrkx and herkx, and it predated the MKL gemmt function (I am not sure exactly which version of CUDA introduced it).

Likewise in rocBLAS / hipBLAS, they are syrkx and herkx.

herk does

C = A A^H + C

for a Hermitian matrix C, computing only the lower or upper triangle.
herkx is an extended version of herk that does

C = A B^H + C

but again computes only the lower or upper triangle. In general, A B^H would be non-Hermitian, but herkx is useful when the user can guarantee that the result is Hermitian. As discussed above, a typical use case is

C = B D B^H, thus A = B D

where D is Hermitian. A specific instance of that is when D is diagonal, e.g., generating a Hermitian matrix from its EVD or computing the backward error of an eigenvalue problem,

A - Q Lambda Q^H.

As a completely different application, it is also useful to compute a triangular matrix, instead of a Hermitian matrix, when the user knows that one of the triangles will be zero. This could be used in larft to generate the T matrix for a block Householder reflector. (The input matrices are still rectangular, not triangular, we just have extra knowledge that the output is triangular.)

I agree the t in gemmt doesn't fit into the BLAS naming scheme at all. Does it mean transpose? Does it mean triangular? I personally like the herkx name, as it signifies that the output matrix is Hermitian, although that doesn't fit the rather odd use case above where the output matrix is triangular, not Hermitian. gemmt is also a bit more general than herkx in that it can specify both transA and transB, whereas herkx has only a single trans.

Also consider if we had this routine in other formats like packed or RFP, obviously the gemmt name doesn't generalize well, whereas hprkx (packed) or hfrkx (RFP) would make sense.
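
As a concrete instance of the diagonal-D case, here is a sketch in real arithmetic (the helper name evd_reconstruct is made up, and the dgemmtr argument order is assumed as elsewhere in this thread). Since B = Q * diag(lambda) is just a column scaling, no extra GEMM is needed before the triangular update.

  ! Lower triangle of A := Q * diag(lambda) * Q^T from an eigendecomposition.
  subroutine evd_reconstruct(n, q, lambda, a, work)
     implicit none
     integer, intent(in) :: n
     double precision, intent(in) :: q(n,n), lambda(n)
     double precision, intent(out) :: a(n,n), work(n,n)
     integer :: j
     external dgemmtr

     do j = 1, n
        work(:,j) = q(:,j) * lambda(j)   ! B = Q * diag(lambda)
     end do
     call dgemmtr('L', 'N', 'T', n, n, 1.0d0, work, n, q, n, 0.0d0, a, n)
  end subroutine evd_reconstruct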

@mgates3 (Contributor) commented Aug 11, 2023

Another use case is in the polar decomposition. The QDWH iteration converges to a unitary matrix U, and then computes the Hermitian factor H = U^H A. Again, we have special knowledge that U^H A is Hermitian. This is distinct from the use cases of A B A^H with B Hermitian.
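
In real arithmetic this final step is a single triangular-update call; a sketch under the same assumed dgemmtr interface as above (the wrapper name polar_h_factor is made up):

  ! Symmetric polar factor after QDWH convergence: lower triangle of
  ! H := U^T * A, where U is the orthogonal polar factor of A.
  subroutine polar_h_factor(n, u, a, h)
     implicit none
     integer, intent(in) :: n
     double precision, intent(in) :: u(n,n), a(n,n)
     double precision, intent(out) :: h(n,n)
     external dgemmtr

     call dgemmtr('L', 'T', 'N', n, n, 1.0d0, u, n, a, n, 0.0d0, h, n)
  end subroutine polar_h_factor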

@luszczek suggests herkmm instead of herkx.

@langou (Contributor) commented Aug 12, 2023

@mgates3 very nice application! Yes, indeed, if for a given matrix A you know U, the unitary polar factor of A, then you can recover the Hermitian polar factor by computing U^H * A. And indeed the QDWH iteration needs GEMMT: it converges to U, the unitary polar factor of A, and "after convergence" it computes the Hermitian factor as U^H * A, which is exactly GEMMT. This is a very nice application; thanks for bringing it up.

@langou (Contributor) commented Aug 12, 2023

We need to speak about names.

The PR named the routine GEMMT.

The names HERKX and GEMMT are already used for this functionality in some BLAS libraries.

Then HERKMM is kind of nice and was suggested by @luszczek.

Then I suggested something like HEGEGEMM.

But @mgates3 mentioned that this can be used for triangular matrix too, so maybe we could call this TRGEGEMM.

If you have a strong opinion, write it below.

@mgates3 (Contributor) commented Aug 14, 2023

The triangular application is rather weird and probably rare. How often is A*B a triangular matrix? So I wouldn't name it based on triangular. Interpreting the output as Hermitian (or symmetric) is much more common.

I'm not sure why you're proposing repeating the ge in hegegemm. Is it to specify the type of each matrix A, B, C? We've never done that before, and if we did, it's unclear what the order should be: hegege or gegehe (since C is Hermitian).

hegemm would be okay.

I prefer some variant of herk because that's what it most looks like: the output matrix C is Hermitian, whereas in hemm the input matrix A or B is Hermitian.

@weslleyspereira (Collaborator)

Hi @grisuthedragon. Would you be able to merge to/rebase with the current master? The script for the GitHub Actions needed to be updated. Thanks!

@grisuthedragon (Contributor, Author)

> Hi @grisuthedragon. Would you be able to merge to/rebase with the current master? The script for the GitHub Actions needed to be updated. Thanks!

Done.

@grisuthedragon (Contributor, Author)

So finally, it seems that only the naming is the problem. I see that, from the point of view of the typical BLAS naming scheme, the GEMMT name is confusing, so I have no problem renaming it.

As far as I gather from the discussion, there are the following options:

Already in use:

  • xGEMMT, bad from the naming-scheme point of view, but already in MKL, OpenBLAS, ArmPL, ESSL, BLIS
  • xSYRKX/xHERKX, a good name from the BLAS point of view, already in cuBLAS and rocBLAS

Potential names by functionality:

  • xSYGEMM/xHEGEMM
  • xSYRKMM/xHERKMM
  • xSYGEGEMM/xHEGEGEMM

Since the functionality already exists in some implementations, I would go for one of the existing names.

@grisuthedragon changed the title from "Implement xGEMMT and cblas_xGEMMT" to "Implement xGEMMTR and cblas_xGEMMTR" on Jun 24, 2024
langou previously approved these changes Jun 24, 2024

@langou (Contributor) commented Jun 24, 2024

Everyone, please review this commit. Comments welcome. Please comment by Thursday June 27th.

[Inline review comments on BLAS/SRC/cgemmtr.f, dgemmtr.f, sgemmtr.f, zgemmtr.f (outdated, resolved)]
@mgates3 (Contributor) left a comment


Just the minor Doxygen group fix. Otherwise it seems fine.

updated:
 * BLAS/SRC/sgemmtr.f
 * BLAS/SRC/zgemmtr.f
 * BLAS/SRC/cgemmtr.f
 * BLAS/SRC/dgemmtr.f
@grisuthedragon (Contributor, Author)

I changed and added the doxygen group.

@langou merged commit 694e337 into Reference-LAPACK:master on Jun 28, 2024
12 checks passed