Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63

christoph-conrads · 2016-10-01T10:54:01Z

The generalized singular value decomposition can be computed directly with xGGSVD, xGGSVD3 (see LAWN 46) or by computing QR factorizations and a 2-by-1 CS decomposition. In the following, let us call the former approach direct GSVD and the latter QR+CSD. In my master's thesis I used both approaches and it turned out that the QR+CSD was always signifcantly faster than the direct GSVD for pairs of square matrices with up to 3600 columns (see Section 3.5 of my master's thesis).

I am convinced that the speed difference is persistent across platforms because the iterative phase of the solver is much faster for QR+CSD. The direct GSVD reduces both matrices A, B to upper triangular form before the iterative solver xTGSJA is called whereas the iterative phase of QR+CSD operates on a 2-by-1 isometric matrix Q^T = [Q1^T, Q2^T]^T, where Q1, Q2 are bidiagonal (A isometric <=> A^T A = I). Since the matrix Q is always well-conditioned, the CSD solver xBBCSD needs only few iterations to compute a solution. For comparison, the maximum number of iterations per singular value/eigenvalue is

6 for the CSD (see variable MAXITR in xbbcsd.f, line 356),
30 for the symmetric eigenvalue problem with Francis QR (MAXIT, dsteqr.f:154), and
40 for the GSVD (MAXIT, dtgsja.f:402).

I implemented the QR+CSD solver named xGGQRCS for real double precision floats as a proof of concept. Computing the GSVD for a pair of 1000-by-1000 matrices and entries uniformly distributed in [-1, +1) takes 27.9s with the QR+CSD (and debugging statements enabled) and 243s with the direct GSVD on my computer. You can repeat these measurements by calling the program bin/ggsvd_vs_ggqrcs in the build directory.

This pull request contains the following changes:

a real double precision QR+CSD solver in SRC/dggqrcs.f with debugging statements,
tests using C++14, Boost.Test and Boost uBLAS in TESTING/cpp , and
a benchmark program in TESTING/cpp/ggsvd_vs_ggqrcs.cpp.

If there is enough interest, then I will implement the other variants of the QR+CSD solver ({C,S,Z}GGQRCS) and remove the debugging code.

poulson · 2016-10-01T15:36:42Z

This is very impressive work, but I am skeptical of any performance conclusions based on boost.ublas, as it is notoriously slow. It might be worthwhile to switch the tests to depend on either the reference BLAS, OpenBLAS, ATLAS, BLIS, MKL, or one of their ilk.

christoph-conrads · 2016-10-01T17:11:19Z

@poulson Boost uBLAS is used only for (defect) testing and more convenient matrix construction. The solver itself is implemented in pure Fortran, independent of Boost, and uses the reference BLAS in this repository. In the benchmark program, the timer is started right before calling the GSVD solver (DGGSVD3 or DGGQRCS) and stopped immediately afterwards.

poulson · 2016-10-02T00:45:15Z

In that case, I would recommend removing the boost dependency in the PR...Boost it is a very heavy dependency to only be used for convenience in one place.

For what it's worth, the Travis build is failing due to boost not being installed by default. Thankfully, should you want to keep the boost dependency, Travis whitelists boost (https://github.com/travis-ci/apt-package-whitelist/blob/master/ubuntu-precise), so it would be a trivial change to the .travis.yml to install it.

gaming-hacker · 2017-02-15T00:19:16Z

what's status on this?

christoph-conrads · 2017-02-15T15:56:54Z

This is not top priority for me currently, especially since the maintainers have not commented on this algorithm yet.

poulson · 2017-02-15T16:30:20Z

@cconrads-scicomp Please don't be discouraged or take it personally. It's great work and there is a significant time investment involved in handling this type of change set.

gaming-hacker · 2017-02-20T07:40:11Z

@cconrads-scicomp i pulled into my repo and see how it goes

gaming-hacker · 2017-06-03T21:23:07Z

i think this last test would have passed if an updated apple compiler was choosen as the gcc choosen is very old. travis bombed with
error: invalid value 'c++14' in '-std=c++14'

martin-frbg · 2017-11-11T20:02:28Z

@gaming-hacker did you get around to doing any tests ? While I do not feel competent to review this code it should be possible to accept it into OpenBLAS to give it more exposure if desired (at least OpenBLAS does not strive to provide a reference implementation of anything). (@christoph-conrads did you receive any criticism of your thesis work in the meantime, and/or did any further publications arise out of it ?)

gaming-hacker · 2017-11-12T22:02:26Z

@martin-frbg i ran some basic tests on the algorithm, nothing intense or complex after building with the clang-5.0 compiler and well nothing has bombed or crashed at this point.

christoph-conrads · 2017-11-13T22:06:55Z

@martin-frbg Soon I will have a computer at home again and then I will finish the implementation for matrices with complex values, say within the next month. No further publications arose out of it because my code is based on a transformation well known in the literature, see for example Section 8.7.4 and Section 8.7.5 in the fourth edition of Matrix Computations by Golub and Van Loan or in this paper by Bai. So far, no one implemented this because there was no implementation of the CS decomposition in LAPACK until recently and I happened to compare these two approaches.

(double precision real only)

- add file to CMakeLists.txt - fix syntax errors etc.

This test code uses C++, Boost.Test, and Boost uBLAS.

- workspace queries are signalled by LWORK=-1, not INFO=-1 - fix undeclared, uninitialized variable errors

- add test case where both matrices are zero - add test case where m, n, p are pairwise different

- fix workspace queries - fix DORCSD2BY1 parameter containing the number of rows of X11 - avoid scaling the matrix B if B is zero - handle the case where both matrices A, B are zero - check INFO after calling DORGQR

Fix the computation of an n-by-n orthogonal matrix Q by DORGRQ by ensuring that the R elementary reflectors from DGERQF can be found in the last R rows of Q.

Fix the computation of the upper triangular matrix [0, R].

Ensure the number of columns indicated to DORCSD2BY1 is less than or equal to the number of rows during workspace queries.

* make xGGQRCS test data easily reproducible * allow larger matrices in non-stop test

For matrices with m rows, test xGGQRCS with leading dimensions > m in the randomized test.

The infinite xGGQRCS test attempts to show a message every minute.

Replace a C++ expression because g++ 9.2.0 was generating broken code due to this expression. Maybe there was a compiler bug, something was wrong with the object files, or the expression caused undefined behavior. TODO: Find out if `std::max(a+b, c)` is still undefined in C++11.

The code checking the xGGQRCS solution was super slow: For m=434, n=937, p=978, and OpenBLAS 0.3.9, SGGQRCS computes a solution within one second but checking it took 23 seconds. This commit uses xGEMM to compute U^* U and reduces this time to 2 seconds on the author's computer.

The xORCSD2BY1/xUNCSD2BY1 output matrix U2 was clearly not orthogonal/unitary for certain input matrix dimensions m, p, and q (e.g., m = 260, p=130, q=131). The reason was an accidental overwrite of data by xORGQR()/xUNGQR() when the WORK array was apparently large enough to use blocking.

Speed up matrix assembly by calling LAPACK's `xGEMM()` instead of Boost uBLAS' `prod()`.

Previously, xUNCSD2BY1 only allowed workspace queries by passing LWORK=-1 (note the missing "R"). The new commit makes xUNCSD2BY1 behave, e.g., like xHEEVD.

Reduce the number of rows of matrices A, B to achieve 50% chance of having more columns than rows in [A, B].

julielangou added Type: Enhancement Priority: Medium Status: Not Confirmed Related: Algorithm Warning: Revisit labels Oct 11, 2016

christoph-conrads force-pushed the qr+cs-gsvd branch from 7ab5d25 to ac60a23 Compare April 21, 2020 20:18

christoph-conrads added 13 commits April 22, 2020 16:08

Add GSVD solver based on QR, CS decompositions

a5838fe

(double precision real only)

Make DGGQRCS compile

345f7c8

- add file to CMakeLists.txt - fix syntax errors etc.

Test DGGQRCS

301ad25

This test code uses C++, Boost.Test, and Boost uBLAS.

Fix DGGQRCS

7ca5fa8

- workspace queries are signalled by LWORK=-1, not INFO=-1 - fix undeclared, uninitialized variable errors

Tests: check xGGQRCS results thoroughly

ab903a3

Tests: add xGGQRCS tests

c54d6a3

- add test case where both matrices are zero - add test case where m, n, p are pairwise different

Fix many DGGQRCS bugs

d5f10f6

- fix workspace queries - fix DORCSD2BY1 parameter containing the number of rows of X11 - avoid scaling the matrix B if B is zero - handle the case where both matrices A, B are zero - check INFO after calling DORGQR

Fix matrix computation in DGGQRCS

42a1217

Fix the computation of an n-by-n orthogonal matrix Q by DORGRQ by ensuring that the R elementary reflectors from DGERQF can be found in the last R rows of Q.

Tests: fix matrix computation in xGGQRCS tests

c43a3f2

Fix the computation of the upper triangular matrix [0, R].

Tests: better rectangular xGGQRCS test matrices

1adaf81

Tests: fix xGGQRCS diagonal matrix reconstruction

d263b28

Avoid DGGQRCS workspace query memory overflows

3935757

DGGQRCS: ensure #columns <= #rows with DORCSD2BY1

d65d3c2

Ensure the number of columns indicated to DORCSD2BY1 is less than or equal to the number of rows during workspace queries.

christoph-conrads added 26 commits April 22, 2020 16:08

Test: improve randomized xGGQRCS tests

778055d

* make xGGQRCS test data easily reproducible * allow larger matrices in non-stop test

DGGQRCS: set all THETA entries to NaN

02735d5

xGGQRCS: disallow zero dimensions

1b80e51

Test: test xGGQRCS leading dimension handling

61f6539

For matrices with m rows, test xGGQRCS with leading dimensions > m in the randomized test.

CGGQRCS: fix LRWORK computation

91e3167

Test: XERBLA in xGGQRCS tests shows error messages

93b34a5

Test: infinite xGGQRCS test show message regularly

1832823

The infinite xGGQRCS test attempts to show a message every minute.

Test: add CGGQRCS test

3c6dbed

Test: always check a return value property

2b3153c

Test: change infinite xGGQRCS output again

ee66919

CGGQRCS: fix documentation

25e4d47

Test: add complex128 prototypes to LAPACK C header

236ac88

Test: speed up xGGQRCS test matrix assembly

91093ce

Speed up matrix assembly by calling LAPACK's `xGEMM()` instead of Boost uBLAS' `prod()`.

Test: update copyright header

0431fb4

Test: add missing copyright notice

2a4e2a7

xUNCSD2BY1: allow workspace query with LRWORK=-1

f49b9c1

Previously, xUNCSD2BY1 only allowed workspace queries by passing LWORK=-1 (note the missing "R"). The new commit makes xUNCSD2BY1 behave, e.g., like xHEEVD.

Test: test xUNCSD2BY1 workspace query with LRWORK

7e3bf11

Test: avoid double xGGQRCS workspace query

d09b18b

Test: avoid harmless out-of-bounds access

cbd0f25

Test: tune xGGQRCS random matrix dimension size

f1c5c02

Reduce the number of rows of matrices A, B to achieve 50% chance of having more columns than rows in [A, B].

Add ZGGQRCS, COMPLEX*16 GSVD via QR, CSD

91e1ce3

Test: enable generic xGGQRCS tests for ZGGQRCS

db1a905

xGGQRCS: complete list of external subroutines

e22ff6c

christoph-conrads force-pushed the qr+cs-gsvd branch from eb6afd5 to e22ff6c Compare April 22, 2020 14:12

Reference-LAPACK deleted a comment from codecov bot Apr 22, 2020

christoph-conrads closed this Apr 22, 2020

christoph-conrads mentioned this pull request Apr 22, 2020

Add GSVD with QR factorizations, 2-by-1 CS decomposition #406

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63

Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63

christoph-conrads commented Oct 1, 2016 •

edited

Loading

poulson commented Oct 1, 2016

christoph-conrads commented Oct 1, 2016 •

edited

Loading

poulson commented Oct 2, 2016

gaming-hacker commented Feb 15, 2017

christoph-conrads commented Feb 15, 2017

poulson commented Feb 15, 2017

gaming-hacker commented Feb 20, 2017

gaming-hacker commented Jun 3, 2017

martin-frbg commented Nov 11, 2017

gaming-hacker commented Nov 12, 2017

christoph-conrads commented Nov 13, 2017

Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63

Faster GSVD with QR factorizations, 2-by-1 CS decomposition (REAL(8) only) #63

Conversation

christoph-conrads commented Oct 1, 2016 • edited Loading

poulson commented Oct 1, 2016

christoph-conrads commented Oct 1, 2016 • edited Loading

poulson commented Oct 2, 2016

gaming-hacker commented Feb 15, 2017

christoph-conrads commented Feb 15, 2017

poulson commented Feb 15, 2017

gaming-hacker commented Feb 20, 2017

gaming-hacker commented Jun 3, 2017

martin-frbg commented Nov 11, 2017

gaming-hacker commented Nov 12, 2017

christoph-conrads commented Nov 13, 2017

christoph-conrads commented Oct 1, 2016 •

edited

Loading

christoph-conrads commented Oct 1, 2016 •

edited

Loading