
Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63

Closed

Conversation

@christoph-conrads (Contributor) commented Oct 1, 2016

The generalized singular value decomposition can be computed directly with xGGSVD, xGGSVD3 (see LAWN 46) or by computing QR factorizations and a 2-by-1 CS decomposition. In the following, let us call the former approach the direct GSVD and the latter QR+CSD. In my master's thesis I used both approaches, and it turned out that QR+CSD was always significantly faster than the direct GSVD for pairs of square matrices with up to 3600 columns (see Section 3.5 of my master's thesis).

I am convinced that the speed difference persists across platforms because the iterative phase of the solver is much faster for QR+CSD. The direct GSVD reduces both matrices A, B to upper triangular form before the iterative solver xTGSJA is called, whereas the iterative phase of QR+CSD operates on a 2-by-1 isometric matrix Q = [Q1^T, Q2^T]^T, where Q1, Q2 are bidiagonal (A isometric <=> A^T A = I). Since the matrix Q is always well-conditioned, the CSD solver xBBCSD needs only a few iterations to compute a solution. For comparison, the maximum number of iterations per singular value/eigenvalue is

  • 6 for the CSD (see variable MAXITR in xbbcsd.f, line 356),
  • 30 for the symmetric eigenvalue problem with Francis QR (MAXIT, dsteqr.f:154), and
  • 40 for the GSVD (MAXIT, dtgsja.f:402).
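
For reference, a sketch of the 2-by-1 CS decomposition underlying this approach (stated here for the full-column-rank case; the general block sizes are more involved):

```latex
% 2-by-1 CS decomposition: Q has orthonormal columns and is
% partitioned into an m-by-n block Q_1 and a p-by-n block Q_2.
\begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix}
=
\begin{bmatrix} U_1 & 0 \\ 0 & U_2 \end{bmatrix}
\begin{bmatrix} C \\ S \end{bmatrix}
V^T,
\qquad
C^T C + S^T S = I
```

Here U_1, U_2, and V are orthogonal and C, S are diagonal with nonnegative entries, so the cosine/sine pairs (c_i, s_i) yielding the generalized singular values can be read off directly; xBBCSD iterates on the blocks after they have been reduced to bidiagonal form.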

I implemented the QR+CSD solver named xGGQRCS for real double precision floats as a proof of concept. Computing the GSVD for a pair of 1000-by-1000 matrices with entries uniformly distributed in [-1, +1) takes 27.9s with QR+CSD (and debugging statements enabled) and 243s with the direct GSVD on my computer. You can repeat these measurements by calling the program bin/ggsvd_vs_ggqrcs in the build directory.
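
The QR+CSD approach described above can be sketched in a few lines of NumPy (this is an illustration, not the PR's Fortran code): it assumes [A; B] has full column rank and obtains the 2-by-1 CS decomposition from an SVD of the top block rather than from xORCSD2BY1.

```python
import numpy as np

def gsvd_qr_csd(A, B):
    """GSVD of (A, B) via one QR factorization and a 2-by-1 CSD.

    Simplified sketch: assumes [A; B] has full column rank.  Returns
    U1, U2, c, s, X with A = U1 @ diag(c) @ X and B = U2 @ diag(s) @ X.
    """
    m, n = A.shape
    # Step 1: QR of the stacked matrix G = [A; B] = Q R.
    Q, R = np.linalg.qr(np.vstack([A, B]))
    Q1, Q2 = Q[:m], Q[m:]
    # Step 2: 2-by-1 CS decomposition of [Q1; Q2], obtained here from an
    # SVD of Q1 (LAPACK's xORCSD2BY1 computes it more carefully).
    U1, c, Vt = np.linalg.svd(Q1, full_matrices=False)
    T = Q2 @ Vt.T                  # columns are mutually orthogonal
    s = np.linalg.norm(T, axis=0)  # sines: s_i = sqrt(1 - c_i^2)
    U2 = T / s                     # normalize columns (assumes s_i > 0)
    return U1, U2, c, s, Vt @ R

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, (50, 20))
B = rng.uniform(-1.0, 1.0, (30, 20))
U1, U2, c, s, X = gsvd_qr_csd(A, B)
print(np.allclose(A, U1 @ np.diag(c) @ X),
      np.allclose(B, U2 @ np.diag(s) @ X),
      np.allclose(c**2 + s**2, 1.0))
```

The generalized singular values are the ratios c_i/s_i; only the thin QR and the CSD of a well-conditioned isometric matrix appear, which is where the speed advantage over the direct reduction comes from.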

This pull request contains the following changes:

  • a real double precision QR+CSD solver in SRC/dggqrcs.f with debugging statements,
  • tests using C++14, Boost.Test and Boost uBLAS in TESTING/cpp , and
  • a benchmark program in TESTING/cpp/ggsvd_vs_ggqrcs.cpp.

If there is enough interest, then I will implement the other variants of the QR+CSD solver ({C,S,Z}GGQRCS) and remove the debugging code.

@poulson commented Oct 1, 2016

This is very impressive work, but I am skeptical of any performance conclusions based on boost.ublas, as it is notoriously slow. It might be worthwhile to switch the tests to depend on either the reference BLAS, OpenBLAS, ATLAS, BLIS, MKL, or one of their ilk.

@christoph-conrads (Contributor, Author) commented Oct 1, 2016

@poulson Boost uBLAS is used only for (defect) testing and more convenient matrix construction. The solver itself is implemented in pure Fortran, independent of Boost, and uses the reference BLAS in this repository. In the benchmark program, the timer is started right before calling the GSVD solver (DGGSVD3 or DGGQRCS) and stopped immediately afterwards.

@poulson commented Oct 2, 2016

In that case, I would recommend removing the Boost dependency from the PR... Boost is a very heavy dependency to only be used for convenience in one place.

For what it's worth, the Travis build is failing due to boost not being installed by default. Thankfully, should you want to keep the boost dependency, Travis whitelists boost (https://github.com/travis-ci/apt-package-whitelist/blob/master/ubuntu-precise), so it would be a trivial change to the .travis.yml to install it.

@gaming-hacker

What's the status on this?

@christoph-conrads (Contributor, Author)

This is not top priority for me currently, especially since the maintainers have not commented on this algorithm yet.

@poulson commented Feb 15, 2017

@cconrads-scicomp Please don't be discouraged or take it personally. It's great work and there is a significant time investment involved in handling this type of change set.

@gaming-hacker

@cconrads-scicomp I pulled it into my repo and will see how it goes.

@gaming-hacker

I think this last test would have passed if an updated Apple compiler had been chosen, as the gcc chosen is very old. Travis bombed with
`error: invalid value 'c++14' in '-std=c++14'`

@martin-frbg (Collaborator)

@gaming-hacker did you get around to doing any tests? While I do not feel competent to review this code, it should be possible to accept it into OpenBLAS to give it more exposure if desired (at least OpenBLAS does not strive to provide a reference implementation of anything). (@christoph-conrads did you receive any criticism of your thesis work in the meantime, and/or did any further publications arise out of it?)

@gaming-hacker

@martin-frbg I ran some basic tests on the algorithm, nothing intense or complex, after building with the clang-5.0 compiler, and nothing has bombed or crashed at this point.

@christoph-conrads (Contributor, Author)

@martin-frbg Soon I will have a computer at home again, and then I will finish the implementation for matrices with complex values, say within the next month. No further publications arose out of it because my code is based on a transformation well known in the literature; see for example Sections 8.7.4 and 8.7.5 in the fourth edition of Matrix Computations by Golub and Van Loan, or this paper by Bai. So far, no one has implemented this because there was no implementation of the CS decomposition in LAPACK until recently, and I happened to compare these two approaches.

  • add file to CMakeLists.txt
  • fix syntax errors etc. (this test code uses C++, Boost.Test, and Boost uBLAS)
  • workspace queries are signalled by LWORK=-1, not INFO=-1
  • fix undeclared, uninitialized variable errors
  • add test case where both matrices are zero
  • add test case where m, n, p are pairwise different
  • fix workspace queries
  • fix DORCSD2BY1 parameter containing the number of rows of X11
  • avoid scaling the matrix B if B is zero
  • handle the case where both matrices A, B are zero
  • check INFO after calling DORGQR
  • fix the computation of an n-by-n orthogonal matrix Q by DORGRQ by ensuring that the R elementary reflectors from DGERQF can be found in the last R rows of Q
  • fix the computation of the upper triangular matrix [0, R]
  • ensure the number of columns indicated to DORCSD2BY1 is less than or equal to the number of rows during workspace queries
  • make xGGQRCS test data easily reproducible
  • allow larger matrices in the non-stop test
  • for matrices with m rows, test xGGQRCS with leading dimensions > m in the randomized test
  • make the infinite xGGQRCS test attempt to show a message every minute
  • replace a C++ expression because g++ 9.2.0 was generating broken code due to this expression; maybe there was a compiler bug, something was wrong with the object files, or the expression caused undefined behavior (TODO: find out if `std::max(a+b, c)` is still undefined in C++11)
  • speed up checking the xGGQRCS solution: for m=434, n=937, p=978, and OpenBLAS 0.3.9, SGGQRCS computes a solution within one second, but checking it took 23 seconds; this commit uses xGEMM to compute U^* U and reduces this time to 2 seconds on the author's computer
  • fix the xORCSD2BY1/xUNCSD2BY1 output matrix U2 being clearly not orthogonal/unitary for certain input matrix dimensions m, p, and q (e.g., m=260, p=130, q=131); the reason was an accidental overwrite of data by xORGQR()/xUNGQR() when the WORK array was apparently large enough to use blocking
  • speed up matrix assembly by calling LAPACK's `xGEMM()` instead of Boost uBLAS' `prod()`
  • previously, xUNCSD2BY1 only allowed workspace queries by passing LWORK=-1 (note the missing "R"); the new commit makes xUNCSD2BY1 behave, e.g., like xHEEVD
  • reduce the number of rows of matrices A, B to achieve a 50% chance of having more columns than rows in [A, B]
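
One of the commits above replaces a slow elementwise solution check with a single xGEMM call computing U^* U. A minimal NumPy sketch of that idea (the function name and tolerance here are illustrative, not from the PR):

```python
import numpy as np

def orthogonality_defect(U):
    """Frobenius norm of U^T U - I for a real matrix U.

    A single matrix-matrix product (xGEMM/xSYRK territory in BLAS)
    replaces the slow elementwise orthogonality check.
    """
    n = U.shape[1]
    return np.linalg.norm(U.T @ U - np.eye(n))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((434, 100)))
print(orthogonality_defect(Q))  # tiny: Q has orthonormal columns
```

The same one-product check applies to the U1, U2 factors returned by xGGQRCS; for complex matrices the conjugate transpose U^* takes the place of U^T.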