Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63
Conversation
This is very impressive work, but I am skeptical of any performance conclusions based on Boost.uBLAS, as it is notoriously slow. It might be worthwhile to switch the tests to depend on either the reference BLAS, OpenBLAS, ATLAS, BLIS, MKL, or one of their ilk.
@poulson Boost uBLAS is used only for (defect) testing and more convenient matrix construction. The solver itself is implemented in pure Fortran, independent of Boost, and uses the reference BLAS in this repository. In the benchmark program, the timer is started right before calling the GSVD solver (DGGSVD3 or DGGQRCS) and stopped immediately afterwards.
In that case, I would recommend removing the Boost dependency in the PR... Boost is a very heavy dependency to be used only for convenience in one place. For what it's worth, the Travis build is failing because Boost is not installed by default. Thankfully, should you want to keep the Boost dependency, Travis whitelists Boost (https://github.com/travis-ci/apt-package-whitelist/blob/master/ubuntu-precise), so it would be a trivial change to the
What's the status on this?
This is not top priority for me currently, especially since the maintainers have not commented on this algorithm yet.
@cconrads-scicomp Please don't be discouraged or take it personally. It's great work and there is a significant time investment involved in handling this type of change set.
@cconrads-scicomp I pulled it into my repo and will see how it goes.
I think this last test would have passed if a newer Apple compiler had been chosen, as the GCC chosen is very old. Travis bombed with
@gaming-hacker did you get around to doing any tests? While I do not feel competent to review this code, it should be possible to accept it into OpenBLAS to give it more exposure if desired (at least OpenBLAS does not strive to provide a reference implementation of anything). (@christoph-conrads did you receive any criticism of your thesis work in the meantime, and/or did any further publications arise out of it?)
@martin-frbg I ran some basic tests on the algorithm, nothing intense or complex, after building with the clang-5.0 compiler, and nothing has bombed or crashed at this point.
@martin-frbg Soon I will have a computer at home again and then I will finish the implementation for matrices with complex values, say within the next month. No further publications arose out of it because my code is based on a transformation well known in the literature; see, for example, Sections 8.7.4 and 8.7.5 in the fourth edition of Matrix Computations by Golub and Van Loan, or this paper by Bai. So far no one had implemented this because there was no implementation of the CS decomposition in LAPACK until recently, and I happened to compare these two approaches.
7ab5d25 to ac60a23
(double precision real only)
- add file to CMakeLists.txt
- fix syntax errors etc.
This test code uses C++, Boost.Test, and Boost uBLAS.
- workspace queries are signalled by LWORK=-1, not INFO=-1
- fix undeclared, uninitialized variable errors
- add test case where both matrices are zero
- add test case where m, n, p are pairwise different
- fix workspace queries
- fix DORCSD2BY1 parameter containing the number of rows of X11
- avoid scaling the matrix B if B is zero
- handle the case where both matrices A, B are zero
- check INFO after calling DORGQR
Fix the computation of an n-by-n orthogonal matrix Q by DORGRQ by ensuring that the R elementary reflectors from DGERQF can be found in the last R rows of Q.
Fix the computation of the upper triangular matrix [0, R].
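The [0, R] shape mentioned in this commit is the general structure of an RQ factorization of a wide matrix. As a hypothetical illustration (not the PR's Fortran code), SciPy's `rq()` wraps the same xGERQF/xORGRQ pair fixed here: the triangular block lands in the last m columns of R, and the reflectors determine the last rows of the n-by-n orthogonal factor Q.

```python
import numpy as np
from scipy.linalg import rq

rng = np.random.default_rng(0)
m, n = 3, 5                     # wide matrix: m < n
A = rng.standard_normal((m, n))

# A = R @ Q with R upper trapezoidal and Q n-by-n orthogonal
R, Q = rq(A, mode='full')       # R: (m, n), Q: (n, n)

assert np.allclose(A, R @ Q)                    # factorization holds
assert np.allclose(Q @ Q.T, np.eye(n))          # Q is orthogonal
assert np.allclose(R[:, :n - m], 0.0)           # R has the form [0, T]
assert np.allclose(R[:, n - m:], np.triu(R[:, n - m:]))  # T is upper triangular
```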
Ensure the number of columns indicated to DORCSD2BY1 is less than or equal to the number of rows during workspace queries.
* make xGGQRCS test data easily reproducible
* allow larger matrices in non-stop test
For matrices with m rows, test xGGQRCS with leading dimensions > m in the randomized test.
The infinite xGGQRCS test attempts to show a message every minute.
Replace a C++ expression because g++ 9.2.0 was generating broken code due to this expression. Maybe there was a compiler bug, something was wrong with the object files, or the expression caused undefined behavior. TODO: Find out if `std::max(a+b, c)` is still undefined in C++11.
The code checking the xGGQRCS solution was super slow: For m=434, n=937, p=978, and OpenBLAS 0.3.9, SGGQRCS computes a solution within one second but checking it took 23 seconds. This commit uses xGEMM to compute U^* U and reduces this time to 2 seconds on the author's computer.
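The speed-up described in this commit can be sketched in NumPy (a hypothetical analogue, not the PR's C++ code): instead of checking U^T U entry by entry with generic loops, form G = U^T U in a single matrix-matrix product (xGEMM in the original) and compare it against the identity.

```python
import numpy as np

def orthogonality_defect(U):
    """Return max |(U^T U - I)_ij|; small values mean U has orthonormal columns."""
    n = U.shape[1]
    G = U.T @ U                 # one GEMM instead of n^2 dot-product loops
    return np.abs(G - np.eye(n)).max()

rng = np.random.default_rng(42)
# A matrix with orthonormal columns passes, a random Gaussian matrix fails.
Q, _ = np.linalg.qr(rng.standard_normal((500, 200)))
assert orthogonality_defect(Q) < 1e-12
assert orthogonality_defect(rng.standard_normal((500, 200))) > 1e-2
```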
The xORCSD2BY1/xUNCSD2BY1 output matrix U2 was clearly not orthogonal/unitary for certain input matrix dimensions m, p, and q (e.g., m = 260, p=130, q=131). The reason was an accidental overwrite of data by xORGQR()/xUNGQR() when the WORK array was apparently large enough to use blocking.
Speed up matrix assembly by calling BLAS `xGEMM()` instead of Boost uBLAS' `prod()`.
Previously, xUNCSD2BY1 allowed workspace queries only via LWORK=-1 but not via LRWORK=-1 (note the missing "R"). The new commit makes xUNCSD2BY1 behave like, e.g., xHEEVD.
Reduce the number of rows of matrices A, B to achieve 50% chance of having more columns than rows in [A, B].
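The PR's actual test generator is not shown here, but the 50% target can be illustrated with a hypothetical sketch: if, for a fixed column count n, the row counts m and p are drawn uniformly from {1, ..., n}, the stacked matrix [A; B] with m + p rows has more columns than rows (n > m + p) with probability (n - 1)(n - 2) / (2n^2), i.e. just under 50%.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100                         # fixed number of columns
trials = 20000
m = rng.integers(1, n + 1, size=trials)   # rows of A
p = rng.integers(1, n + 1, size=trials)   # rows of B

# fraction of draws where [A; B] has more columns than rows
freq = np.mean(m + p < n)
theory = (n - 1) * (n - 2) / (2 * n**2)   # = 0.4851 for n = 100

assert abs(freq - theory) < 0.02
```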
eb6afd5 to e22ff6c
The generalized singular value decomposition can be computed directly with xGGSVD, xGGSVD3 (see LAWN 46) or by computing QR factorizations and a 2-by-1 CS decomposition. In the following, let us call the former approach direct GSVD and the latter QR+CSD. In my master's thesis I used both approaches and it turned out that QR+CSD was always significantly faster than the direct GSVD for pairs of square matrices with up to 3600 columns (see Section 3.5 of my master's thesis).
I am convinced that the speed difference is persistent across platforms because the iterative phase of the solver is much faster for QR+CSD. The direct GSVD reduces both matrices A, B to upper triangular form before the iterative solver xTGSJA is called, whereas the iterative phase of QR+CSD operates on a 2-by-1 isometric matrix Q^T = [Q1^T, Q2^T]^T, where Q1, Q2 are bidiagonal (A isometric <=> A^T A = I). Since the matrix Q is always well-conditioned, the CSD solver xBBCSD needs only a few iterations to compute a solution. For comparison, the maximum number of iterations per singular value/eigenvalue is 6 in xBBCSD but 40 in xTGSJA.
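The two-phase QR+CSD approach can be sketched densely in NumPy. This is a hypothetical illustration, not the PR's Fortran code: the 2-by-1 CS decomposition is emulated with an SVD of the top block instead of xORCSD2BY1, and only the generic full-rank case with all s_i > 0 is handled.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, n = 6, 5, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, n))

# Phase 1: one QR factorization of the stacked matrix [A; B].
Q, R = np.linalg.qr(np.vstack([A, B]))   # Q: (m+p, n), orthonormal columns
Q1, Q2 = Q[:m], Q[m:]

# Phase 2: 2-by-1 CS decomposition of [Q1; Q2] via an SVD of Q1
# (a dense stand-in for xORCSD2BY1).
U1, c, Vt = np.linalg.svd(Q1, full_matrices=False)  # Q1 = U1 diag(c) Vt
W = Q2 @ Vt.T                  # W^T W = I - diag(c)^2 = diag(s)^2
s = np.linalg.norm(W, axis=0)  # s_i = sqrt(1 - c_i^2)
U2 = W / s                     # normalize columns: Q2 = U2 diag(s) Vt

# GSVD: A = U1 diag(c) X and B = U2 diag(s) X with shared factor X = Vt R;
# the generalized singular values are the ratios c_i / s_i.
X = Vt @ R
assert np.allclose(A, U1 @ np.diag(c) @ X)
assert np.allclose(B, U2 @ np.diag(s) @ X)
assert np.allclose(c**2 + s**2, 1.0)
```

Note that the only iterative work happens inside the SVD of the well-conditioned isometric block, which is the structural reason the real xBBCSD-based solver converges quickly.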
I implemented the QR+CSD solver named xGGQRCS for real double precision floats as a proof of concept. Computing the GSVD of a pair of 1000-by-1000 matrices with entries uniformly distributed in [-1, +1) takes 27.9s with QR+CSD (and debugging statements enabled) and 243s with the direct GSVD on my computer. You can repeat these measurements by calling the program
bin/ggsvd_vs_ggqrcs
in the build directory. This pull request contains the following changes:
- `SRC/dggqrcs.f` with debugging statements,
- `TESTING/cpp`, and
- `TESTING/cpp/ggsvd_vs_ggqrcs.cpp`.

If there is enough interest, then I will implement the other variants of the QR+CSD solver ({C,S,Z}GGQRCS) and remove the debugging code.