Add GSVD with QR factorizations, 2-by-1 CS decomposition #406

Open · wants to merge 101 commits into master
Conversation

christoph-conrads (Contributor)

This pull request adds to LAPACK a generalized singular value decomposition computed by means of a QR decomposition with column pivoting and the 2-by-1 CS decomposition. The PR requires #405.

Given a pair of matrices A, B with appropriate dimensions, the GSVD computes a decomposition

  • A = U1 D1 R Q^*,
  • B = U2 D2 R Q^*.

I would like to have feedback on the computation of the matrix R. Currently it is always computed but this is expensive because of an additional matrix-matrix multiplication followed by an RQ decomposition. Should I make this optional?

The PR is based on #63 but provides all implementations (single-precision real, double-precision real, single-precision complex, and double-precision complex) with tests removed because of the C++ dependency. The tests can be found in fb5dfb3.

@christoph-conrads commented Apr 22, 2020

Here is the motivation for the new GSVD solver from #63:

The generalized singular value decomposition can be computed directly with xGGSVD, xGGSVD3 (see LAWN 46) or by computing QR factorizations and a 2-by-1 CS decomposition. In the following, let us call the former approach the direct GSVD and the latter QR+CSD. In my master's thesis I used both approaches, and QR+CSD turned out to be significantly faster than the direct GSVD in every test with pairs of square matrices with up to 3600 columns (see Section 3.5 of the thesis).

I am convinced that the speed difference persists across platforms because the iterative phase of the solver is much faster for QR+CSD. The direct GSVD reduces both matrices A, B to upper triangular form before the iterative solver xTGSJA is called, whereas the iterative phase of QR+CSD operates on a 2-by-1 isometric matrix Q = [Q1^T, Q2^T]^T, where Q1, Q2 are bidiagonal (A is isometric iff A^T A = I). Since the matrix Q is always well conditioned, the CSD solver xBBCSD needs only a few iterations to compute a solution. For comparison, the maximum number of iterations per singular value/eigenvalue is

  • 6 for the CSD (see variable MAXITR in xbbcsd.f, line 356),
  • 30 for the symmetric eigenvalue problem with Francis QR (MAXIT, dsteqr.f:154), and
  • 40 for the GSVD (MAXIT, dtgsja.f:402).
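
To make the QR+CSD pipeline concrete, here is a minimal NumPy/SciPy sketch (an illustration, not the PR's code): it substitutes a thin SVD of Q1 for the bidiagonalization-based 2-by-1 CS decomposition (xORCSD2BY1/xBBCSD) and assumes the stacked matrix has full column rank.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
m, p, n = 5, 6, 3                         # A is m-by-n, B is p-by-n
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, n))

# Step 1: QR with column pivoting of the stacked matrix G = [A; B].
G = np.vstack([A, B])
Q, R, piv = qr(G, pivoting=True)          # G[:, piv] = Q @ R
r = n                                     # numerical rank, assumed full here
Q1, Q2 = Q[:m, :r], Q[m:, :r]             # [Q1; Q2] has orthonormal columns

# Step 2: 2-by-1 CS decomposition of [Q1; Q2] via a thin SVD of Q1,
# Q1 = U1 diag(c) Vt with cosine values c in [0, 1].
U1, c, Vt = np.linalg.svd(Q1, full_matrices=False)
W = Q2 @ Vt.T                             # W^T W = I - diag(c)^2 = diag(s)^2
s = np.linalg.norm(W, axis=0)             # sine values, assumed nonzero
U2 = W / s                                # orthonormal columns

# Step 3: assemble X so that A = U1 diag(c) X and B = U2 diag(s) X.
X = (Vt @ R[:r, :])[:, np.argsort(piv)]   # undo the column pivoting

assert np.allclose(A, U1 @ np.diag(c) @ X)
assert np.allclose(B, U2 @ np.diag(s) @ X)
# This sketch pairs A with the cosines; SGGQRCS pairs A with the sines.
print("generalized singular values of (A, B):", c / s)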

codecov bot commented Apr 22, 2020

Codecov Report

Merging #406 (2a8c4dd) into master (4b3c7c2) will decrease coverage by 0.43%.
The diff coverage is 0.29%.

❗ Current head 2a8c4dd differs from pull request most recent head 8a0bef1. Consider uploading reports for the commit 8a0bef1 to get more accurate results.

@@            Coverage Diff             @@
##           master     #406      +/-   ##
==========================================
- Coverage   82.37%   81.93%   -0.44%     
==========================================
  Files        1894     1904      +10     
  Lines      190681   191698    +1017     
==========================================
+ Hits       157067   157068       +1     
- Misses      33614    34630    +1016     
Impacted Files Coverage Δ
SRC/cggqrcs.f 0.00% <0.00%> (ø)
SRC/clasrtr.f 0.00% <0.00%> (ø)
SRC/dggqrcs.f 0.00% <0.00%> (ø)
SRC/dlasrti.f 0.00% <0.00%> (ø)
SRC/dlasrtr.f 0.00% <0.00%> (ø)
SRC/sggqrcs.f 0.00% <0.00%> (ø)
SRC/slasrti.f 0.00% <0.00%> (ø)
SRC/slasrtr.f 0.00% <0.00%> (ø)
SRC/zggqrcs.f 0.00% <0.00%> (ø)
SRC/zlasrtr.f 0.00% <0.00%> (ø)
... and 11 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@christoph-conrads commented Apr 23, 2020

I would like to have feedback on the computation of the matrix R. Currently it is always computed but this is expensive because of an additional matrix-matrix multiplication followed by an RQ decomposition. Should I make this optional?

Yes, because then it is possible to compute only the generalized singular values at the cost of one QR decomposition with column pivoting, one unitary matrix assembly, and the 2-by-1 CS value computation. AFAIK xGGSVD3() must compute the upper-triangular factor R by design and would be considerably slower in this scenario.

TODO:

  • Re-read LAWN 46 and understand the xGGSVD3() algorithm
  • Benchmark this idea before modifying the code

@christoph-conrads

TODO:

  • Re-read LAWN 46 and understand the xGGSVD3() algorithm

The whole theory behind xGGSVD3() is based on simultaneous transformations of a pair of upper triangular matrices.

@christoph-conrads commented May 2, 2020

The latest xGGQRCS code contains row scaling ({s,d}LASRTR) in an attempt to handle matrix pencils where the input matrices A, B differ significantly in norm. This does not even yield satisfactory singular values. The input matrices must be scaled for a backward stable algorithm (backward stable with respect to A, B individually, not with respect to [A, B]).

The remainder of this post will discuss how to compute the backward stable GSVD of (A, B) using matrix scaling and xGGQRCS (QR factorization followed by 2-by-1 CS decomposition).

Set w to the power of 2 nearest to norm(A) / norm(B) and compute the GSVD of the pair (A, wB) with xGGQRCS. This yields orthogonal/unitary matrices U1, U2 and matrices with sine and cosine values S, C such that

  • A = U1 S X,
  • wB = U2 C X,
  • S^* S + C^* C = I,
  • X has full rank.

We now have to adjust X and the singular values to acquire the GSVD of (A, B).

We can immediately fix the singular values: if σ is a generalized singular value of the matrix pencil (A, wB), then wσ is a generalized singular value of (A, B). xGGQRCS does not return singular values directly; instead, it returns angles θ in radians such that σ = tan(θ). Thus, the corrected angle is θ' = arctan(w * tan(θ)), and the updated matrices Σ, Γ can be computed with the aid of the new angles (Σ(i,i) = sin(θ'), Γ(i,i) = cos(θ')). This operation should be numerically stable because

  • The arc tangent has maximum derivative 1 over its entire range and is perfectly conditioned.
  • w * tan(θ) will be calculated accurately because it is a single floating-point multiplication.
  • tan(θ) is
    • well conditioned for small angles (less than π/4) [i.e., there are vectors x such that ||wBx|| >= ||Ax||],
    • infinite (or already negative) for angles near π/2; then the multiplication by w does not change anything [i.e., there are vectors x such that ||Ax|| > 0, ||wBx|| = 0],
    • badly conditioned only for θ much larger than π/4 (this should not happen because it would indicate the existence of vectors x with ||wBx|| << ||Ax|| although wB has norm comparable to A).

The question is now whether we need to change X, and how. Let

|A|   |V1   | |Σ|
| | = |     | | | Y
|B|   |   V2| |Γ|

be the GSVD of (A, B) such that Σ^* Σ + Γ^* Γ = I. Observe

A^* A = X^* S^* S X = Y^* Σ^* Σ Y.
w^2 B^* B = X^* C^* C X = w^2 Y^* Γ^* Γ Y.

Moreover, S^* S and Σ^* Σ are diagonal, which is extremely helpful. For demonstration purposes assume that

S^* S = |0  |
        |  s|

C^* C = |1  |
        |  c|

Σ^* Σ = |0  |
        |  σ|

Γ^* Γ = |1  |
        |  γ|

Furthermore, we assume that the (1,1) matrix entry arises from the matrix dimensions (meaning it is free of round-off error). With simple row scalings we can compute Y from X and the old and new singular values:


                      |0 0| |1   0| |1   0|             |0 0|
X^* S^* S X = X^* S^* |   | |     | |     | X = X^* S^* |   | Y.
                      |0 s| |0 σ/s| |0 s/σ|             |0 σ|

                      |1 0| |1   0| |1   0|             |1 0|
X^* C^* C X = X^* C^* |   | |     | |     | X = X^* C^* |   | wY.
                      |0 c| |0 γ/c| |0 c/γ|             |0 γ|

Thus

Y[1,:] = X[1,:] / w                          # first row of Y
Y[2,:] = X[2,:] * s/σ = X[2,:] * (c/γ) / w   # second row of Y

Observe that S(1,1) is zero but C(1,1) is not, so there is a unique row scaling for the first row of X (divide by w). Similarly, C(i,i) might be zero, but then S(i,i) must be one. Moreover, whenever S(i,i) =/= 0 and C(i,i) =/= 0, there are two equations for each row, and both should yield the same result.

Here is a NumPy demo with a random test matrix causing a large backward error with the current xGGQRCS version:

#!/usr/bin/python3

# This program demonstrates how to adjust singular values and right-hand side
# matrix of the generalized singular value decomposition computed with SGGQRCS
# when one of the input matrices had to be scaled.
#
# Author: Christoph Conrads (https://christoph-conrads.name) 

import numpy as np
import numpy.linalg as L


def make_matrix(x):
    return np.matrix(x, dtype=np.float32)


def main():
    eps = np.finfo(np.float32).eps

    # unmodified input matrices with norm(B) >> norm(A)
    a = make_matrix([[-8.519847412e+02, +6.469862671e+02]])
    b = make_matrix([
        [+5.485938125e+05, -4.166526250e+05],
        [+1.846850781e+05, -1.402660781e+05],
        [+5.322575625e+05, -4.042448438e+05],
        [-1.630551465e+04, +1.238360352e+04],
        [-1.286453438e+05, +9.770555469e+04],
        [-1.323287812e+05, +1.005026797e+05],
        [+5.681228750e+05, -4.314841250e+05],
        [-3.107875312e+05, +2.360408594e+05],
        [+1.456551719e+05, -1.106233281e+05],
        [+1.365355156e+05, -1.036972344e+05]
    ])
    # GSVD computed by SGGQRCS with input A, B/(2**10)
    u1 = make_matrix([[1]])
    # SGGQRCS returns square matrices but we can ignore the other columns
    # because the matrix pencil has only rank 2
    u2 = make_matrix([
        [-5.237864256e-01, +3.081814051e-01],
        [-1.694306731e-01, -5.048786998e-01],
        [-5.036462545e-01, -1.015115157e-01],
        [+1.279635727e-02, +2.352355272e-01],
        [+1.268841326e-01, -4.299084842e-01],
        [+1.265842170e-01, -9.544299543e-02],
        [-5.364804864e-01, -2.056131810e-01],
        [+2.987869084e-01, -3.556257784e-01],
        [-1.335151047e-01, -4.078340828e-01],
        [-1.268201172e-01, -2.355325669e-01]
    ])
    x = make_matrix([
        [-1.029685547e+03, +7.820372925e+02],
        [-8.520648804e+02, +6.470472412e+02]
    ])
    # SGGQRCS returns radians values instead of singular values
    theta = np.float32(1.55708992481e+00)
    # scaling factor
    w = np.power(np.float32(2), np.float32(-10))

    # check for typos
    assert L.norm(u2.H*u2 - np.eye(2)) <= np.sqrt(2) * eps
    assert theta >= 0
    assert theta <= np.pi / 2

    # add helper functions
    copy = lambda x: np.matrix(np.copy(x))
    compute_relative_error = \
        lambda a, u, d, x: L.norm(a - u*d*x) / (L.norm(a) * eps)


    # assemble diagonal matrices
    d1 = make_matrix([[0, np.sin(theta)]])
    d2 = make_matrix([[1, 0], [0, np.cos(theta)]])


    # print information
    print('norm(B) / norm(A) = {:8.2e}'.format(L.norm(b)/L.norm(a)))
    print('Relative error A: {:8.2e}'.format(compute_relative_error(a,u1,d1,x)))
    print('Relative error B: {:8.2e}'.format(compute_relative_error(b,u2,d2,x)))


    # fix values
    theta_fixed = np.arctan(w * np.tan(theta))
    d1_fixed = make_matrix([[0, np.sin(theta_fixed)]])
    d2_fixed = make_matrix([[1, 0], [0, np.cos(theta_fixed)]])

    x_fixed = copy(x) / w
    # alternative: x_fixed[1,:] = x[1,:] * d1[0,1]/d1_fixed[0,1]
    x_fixed[1,:] *= d2[1,1]/d2_fixed[1,1]


    # recompute backward error
    fmt = 'Relative error {:s} after fixes: {:8.2e}'
    print(fmt.format("A", compute_relative_error(a,u1,d1_fixed,x_fixed)))
    print(fmt.format("B", compute_relative_error(b,u2,d2_fixed,x_fixed)))


if __name__ == '__main__':
    main()

Script output on my computer:

norm(B) / norm(A) = 1.24e+03
Relative error A: 1.73e+00
Relative error B: 8.38e+06
Relative error A after fixes: 2.71e+00
Relative error B after fixes: 1.49e+00

@christoph-conrads

@julielangou It's done.

Given a pair of matrices (A, B), xGGQRCS computes the GSVD such that A = U1 D1 X, B = U2 D2 X.

The tests can be found in the branch qr+cs-gsvd in TESTING/cpp/ggqrcs.hpp due to the C++ and Boost.Test dependency.

Bonus functionality (not used by xGGQRCS because row sorting has no effect):

  • {s,d}LASRTR: row sorting
  • {s,d}LASRTI: sort an array of indices based on the values they reference (used by xLASRTR); a small illustration follows below
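
A rough NumPy analogue of the two helpers (an illustration; the sort keys and sort order are assumptions, not the exact contract of the Fortran routines):

import numpy as np

values = np.array([0.5, 2.0, 1.0, 0.25], dtype=np.float32)

# xLASRTI-like: sort an array of indices by the values they reference.
indices = np.arange(values.size)
indices = indices[np.argsort(values[indices], kind="stable")]
print(indices)                            # [3 0 2 1] for increasing order

# xLASRTR-like: reorder matrix rows, here by decreasing row sup-norm.
A = np.arange(12, dtype=np.float32).reshape(4, 3)
keys = np.abs(A).max(axis=1)
print(A[np.argsort(-keys, kind="stable"), :])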

@christoph-conrads

The branch qr+cs-gsvd with the C++ tests and approximately 90% line coverage for xGGQRCS can be found here: https://travis-ci.org/github/christoph-conrads/lapack/branches

@christoph-conrads commented May 17, 2020

Benchmarks

This post compares the performance of the GSVD solvers xGGSVD3 and xGGQRCS.

Introduction

Of course, the comparison will be very limited because of the huge number of parameters relevant to the GSVD, e.g.,

  • the matrix dimensions,
  • generalized singular value distribution,
  • sparsity (all-zero columns, rows),
  • matrix condition numbers,
  • floating-point format,
  • field (real or complex values),
  • computing all or none of the GSVD matrices,
  • BLAS implementation (Netlib, OpenBLAS, MKL...),
  • number of threads,
  • compilers,
  • CPU
    • instruction set architecture
    • CPU architecture
  • ...

In this post we benchmark with matrices A = U1 D1 X, B = U2 D2 X, where

  • A, B have the same number of rows,
  • the number of rows and columns is between 8 and 128 inclusive,
  • the generalized singular values are uniformly distributed either in
    • [0, π/4),
    • [0, π/2000), or
    • [π1999/8000, π/4),
  • X has condition number 2**(d/4), where d is the number of significand digits,
  • U1, U2 are random,
  • all matrices must be computed or none,
  • Netlib BLAS,
  • CPU time is measured,
  • at least 100 random matrices must be decomposed and the CPU time must sum up to more than one second.

Code: TESTING/cpp/gsvd_benchmarks.cpp
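
For illustration, a test pair along these lines can be generated as follows (a sketch of the setup above, not the benchmark code; the Haar-distributed orthogonal factors and the logarithmic spacing of the singular values of X are my assumptions):

import numpy as np

rng = np.random.default_rng(42)

def random_orthogonal(k):
    # QR of a Gaussian matrix; fixing the signs of R's diagonal makes
    # the orthogonal factor Haar-distributed.
    q, r = np.linalg.qr(rng.standard_normal((k, k)))
    return q * np.sign(np.diag(r))

m = n = 32                                # square pair, as in the benchmark
d = np.finfo(np.float32).nmant + 1        # significand digits (24 for float32)

theta = rng.uniform(0.0, np.pi / 4, n)    # angles uniform in [0, pi/4)
sv = np.logspace(0.0, -np.log10(2.0 ** (d / 4)), n)   # cond(X) = 2**(d/4)
X = (random_orthogonal(n) * sv) @ random_orthogonal(n)

A = (random_orthogonal(m) @ np.diag(np.sin(theta)) @ X).astype(np.float32)
B = (random_orthogonal(m) @ np.diag(np.cos(theta)) @ X).astype(np.float32)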

Attention: xGGSVD3 computes a decomposition A = U1 D1 R Q^*, B = U2 D2 R Q^*. Depending on the problem at hand, it may not suffice to measure only the execution time of the GSVD solver.

Results

  • xGGQRCS is faster than xGGSVD3 if A, B are (almost) square.
  • xGGQRCS is the slowest relative to xGGSVD3 if A, B have only a small number of columns compared to the number of rows. (Remember A, B have the same number of rows in the benchmarks.)
  • Switching from real to complex arithmetic has a smaller impact on CPU time for xGGQRCS than for xGGSVD3.
  • When computing only generalized singular values, the CPU time savings are smaller for xGGQRCS than for xGGSVD3.
  • When computing only generalized singular values and for matrices with more columns than rows, the larger the number of columns, the more xGGQRCS slows down relative to xGGSVD3.
  • The larger the matrices, the larger the maximum differences in CPU time.
  • xGGQRCS needs a larger workspace than xGGSVD3.
  • The larger the number of rows, the larger the workspace of xGGQRCS relative to xGGSVD3.

Plots

The plots below compare the CPU time consumed by xGGQRCS with that of xGGSVD3 in single precision when

  • computing generalized singular values and matrices (upper row),
  • computing only generalized singular values (lower row),

in

  • real arithmetic (SGGQRCS, SGGSVD3) in the left half,
  • complex arithmetic (CGGQRCS, CGGSVD3) in the right half.

The colors indicate the number of matrix rows; the brighter the colors, the larger the number of rows.

The x axis shows the number of columns in the matrices, the y axis shows the CPU time needed by xGGQRCS divided by the CPU time needed by xGGSVD3.

[Plot: xGGQRCS-vs-xGGSVD3-cputime-float32]

@thijssteel (Collaborator) left a comment

Besides a few minor remarks, looks good to me.

I would worry about one thing: I'm pretty sure the SVD is a little better at identifying the numerical rank than RRQR. I haven't looked into this, but is it possible that the direct GSVD is then also better at this than this method?

@christoph-conrads

I would worry about one thing: I'm pretty sure the SVD is a little better at identifying the numerical rank than RRQR. I haven't looked into this, but is it possible that the direct GSVD is then also better at this than this method?

The backward error introduced by a QR factorization of a matrix K is on the order of eps ||K|| (see, e.g., Matrix Computations, 4th ed., Section 5.2), meaning all small singular values of the R factor will have a large relative error. Consequently, rank determination is a heuristic.

Both the direct GSVD and QR+CS initially apply one or more QR factorizations with column pivoting, see xGGSVP3 or Computing the Generalized Singular Value Decomposition (Equation (1.5)) and xGGQRCS. Given that the QR factorization is a major source of backward error, I do not see one method being generally better than the other with respect to rank determination.
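
Kahan's matrix is the standard illustration of the gap between the two rank detectors: QR with column pivoting leaves it essentially untouched, and the smallest diagonal entry of R overestimates the smallest singular value by orders of magnitude. A small NumPy/SciPy experiment (my example, not part of the PR):

import numpy as np
from scipy.linalg import qr

def kahan(n, c=0.2, tie_break=1e-7):
    # Upper triangular Kahan matrix: every column has unit 2-norm, so
    # column pivoting finds nothing to improve; the tiny extra column
    # scaling breaks the ties so that pivoting keeps the natural order.
    s = np.sqrt(1.0 - c * c)
    K = np.eye(n) - c * np.triu(np.ones((n, n)), k=1)
    K = np.diag(s ** np.arange(n)) @ K
    return K @ np.diag((1.0 - tie_break) ** np.arange(n))

K = kahan(100)
_, R, _ = qr(K, pivoting=True)
sigma = np.linalg.svd(K, compute_uv=False)
print("min |R(i,i)| =", np.abs(np.diag(R)).min())   # about 1e-1
print("sigma_min    =", sigma.min())                # orders of magnitude smaller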

@christoph-conrads

@thijssteel Please let me know if you are satisfied by the changes.

@christoph-conrads commented Feb 14, 2021

Benchmarks

This post compares the performance of the GSVD solvers xGGSVD3 and xGGQRCS.

[snip]

Plots

The plots below compare the CPU time consumed by xGGQRCS with that of xGGSVD3 in single precision when

  • computing generalized singular values and matrices (upper row),
  • computing only generalized singular values (lower row),

in

  • real arithmetic (SGGQRCS, SGGSVD3) in the left half,
  • complex arithmetic (CGGQRCS, CGGSVD3) in the right half.

The colors indicate the number of matrix rows; the brighter the colors, the larger the number of rows.

The x axis shows the number of columns in the matrices, the y axis shows the CPU time needed by xGGQRCS divided by the CPU time needed by xGGSVD3.

Intel Cascade Lake results (commit 6a61522bd8fd74dadf406bef1c628f94b3947216) on a virtual private server. Note that CGGQRCS now takes more than twice the CPU time of CGGSVD3 for matrix pairs with fewer than 20 columns when only singular values are requested. This difference was less pronounced in the previous set of benchmarks.

[Plot: xGGQRCS-vs-xGGSVD3-CascadeLake-20210214]
[Attachment: gsvd-benchmarks-6a61522bd8fd74dadf406bef1c628f94b3947216.log]
[Attachment: plot-gsvd-benchmark.py.gz]

@h-vetinari (Contributor)

How consistent is the scaling behaviour observed in the benchmarks? Would it be reliable enough to specify a convenience function (e.g. for SV-only) that switches between xGGSVD3 and xGGQRCS in a way that (mostly) avoids the doubled time for small n, m, but benefits in the larger regime? I'd be very tempted to do that as a user (if I had benchmarked things myself), but it would be much nicer if the library already provided it than having to come up with it oneself.

Eyeballing the graphs, I'm thinking of something along the lines of n <= 20 || m/n < 3, or perhaps log(m)/n < ?...

@langou commented Feb 24, 2021

Hi @h-vetinari,

Yes this is a good idea.

(1) We could write higher-level wrappers that would call the "best" function. The current structure (a clean interface for a standalone xGGSVD3 and a clean interface for a standalone xGGQRCS) is good with me. What we might be missing is higher-level wrappers.

(2) However, my preference is, for now, to leave this kind of tuning to levels higher than LAPACK, such as Matlab, Octave, NumPy, SciPy, etc. This kind of tuning could appear in many places in LAPACK. For example, for QR there is TSQR (for tall and skinny matrices) and there is the standard QR algorithm; when n < a_function( m ), TSQR is better than QR. For SYEV, there are four different algorithms (SYEVR, SYEV, SYEVX, SYEVD). Sometimes there are switches from one algorithm to the next when the choice is "obvious", but as soon as the choice gets complicated / machine dependent, these switches are not there.

(3) For now, we could add some comments in the headers of these subroutines to give users brief explanations of the various choices. Timings like those done by @christoph-conrads are also useful to explain the trade-offs to users.

(4) For the future, I am not against starting some higher-level wrappers, like QR(m,n,A,lda). This would leave opportunities for software like MKL and OpenBLAS to optimize the algorithm used by these functions. We could provide reference implementations for these high-level wrappers. We should look at what is in LAPACKE.

So, bottom line, my personal opinion is to leave things as they are for now. But it is a possible to-come-soon agenda item. Opinions welcome.

Cheers, Julien.

@weslleyspereira (Collaborator)

Nice job @christoph-conrads! I think we should add tests for the new routines. Maybe we can just replicate the tests for xGGSVD3. What do you think?

@christoph-conrads

Nice job @christoph-conrads! I think we should add tests for the new routines. Maybe we can just replicate the tests for xGGSVD3. What do you think?

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

@christoph-conrads commented Feb 24, 2021

Would it be reliable enough to specify a convenience function (e.g. for SV-only) that switches between xGGSVD3 and xGGQRCS in a way that (mostly) avoids the doubled time for small n,m, but benefits in the larger regime?

No, because there is no such regime. Consider the matrix pencil (A, B) below:

    |1 0|      |0 0|
A = |   |  B = |   |
    |0 0|      |0 1|

Judging only by the dimensions, this is the optimal case for xGGQRCS, yet in practice xGGSVD3 will be much faster here.

The xGGSVD3 preprocessing(!) stage will determine, after executing two QR factorizations, that this matrix pencil has the singular value pairs (1, 0) and (0, 1) (or that it has the singular values zero and infinity). xGGQRCS must execute in its entirety to arrive at the same conclusion. The critical property in this example is the nullspaces of the matrices, not their dimensions.

This is of course a mathematical counter-example, but I really doubt that such a switch is reliably possible in practice because there are too many relevant variables, including: the number of rows of A, the number of rows of B, the number of columns, the rank of A, the rank of B, the rank of (A, B), and whether or not the matrices U1, U2, and X are assembled. This list contains only quantities from mathematics; in addition, there are factors like the BLAS implementation, the CPU architecture, or the number of cores.

Finally, the major problem with xGGQRCS is its large memory consumption in comparison to xGGSVD3. A method deciding on the fly between xGGQRCS and xGGSVD3 would burden every user with the large workspace requirements of xGGQRCS.

@weslleyspereira (Collaborator)

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

Yes, I saw these tests. I just thought a quick way to get the new methods covered by the current Lapack/TESTING/EIG suite would be to adapt the CERRGG and CGSVTS3 subroutines, and maybe add your new tests in a different PR.

@christoph-conrads

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

Yes, I saw these tests. I just thought a quick way to get the new methods covered by the current Lapack/TESTING/EIG suite would be to adapt the CERRGG and CGSVTS3 subroutines, and maybe add your new tests in a different PR.

(Emphasis mine)

Adapting xGSVTS3 is not trivial because xGGSVD3 computes factorizations of the form A = U C R V^* while xGGQRCS computes A = U C X. I will try to get it done this weekend.

@christoph-conrads


*       SUBROUTINE SERRGG( PATH, NUNIT )                                                                                                
*                                                                                                                                       
*       .. Scalar Arguments ..                                                                                                          
*       CHARACTER*3        PATH                                                                                                         
*       INTEGER            NUNIT                                                                                                        
* [snip]
      ELSE IF( LSAMEN( 3, PATH, 'GQR' ) ) THEN                                                                                          
*                                                                                                                                       
*        SGGQRF                                                                                                                         
*                                                                                                                                       
         SRNAMT = 'SGGQRF'                                                                                                              
         INFOT = 1                                                                                                                      
         CALL SGGQRF( -1, 0, 0, A, 1, R1, B, 1, R2, W, LW, INFO )                                                                       
* [snip]
*                                                                                                                                       
*        SGGRQF                                                                                                                         
*                                                                                                                                       
         SRNAMT = 'SGGRQF'

@weslleyspereira Which three-letter identifier should I use? If GQR is reused (xGGQRCS), the code would be tested together with the QR and RQ factorizations.

@weslleyspereira (Collaborator)

@weslleyspereira Which three letter identifier should I use? If GQR is reused (xG_GQR_CS), the code would be tested together with the QR and RQ factorization.

I see... Maybe just xQRCS would be clear enough, no?

@christoph-conrads

@weslleyspereira Which three letter identifier should I use? If GQR is reused (xG_GQR_CS), the code would be tested together with the QR and RQ factorization.

I see... Maybe just xQRCS would be clear enough, no?

Think like it's 1977, not 2021:

*       CHARACTER*3        PATH                                                                                                         

@weslleyspereira commented Feb 24, 2021

Ah... now I saw your mention of the three-letter PATH. I thought it could be the SRNAMT id, sorry.
I would suggest gQR, but the cases don't matter. That takes me to QRC, which doesn't sound good, or UCX, which represents the final decomposition well but not the routine's name. I don't know...

@christoph-conrads

How consistent is the scaling behaviour observed in the benchmarks?

(1) we could write higher levels wrappers that would call the "best" function. The current structure (clean interface for a stand alone xGGSVD3 and clean interface for a stand alone xGGQRCS) is good with me. What we might be missing is higher levels wrappers.

To have xGGQRCS outperform xGGSVD3 more consistently, xGGQRCS could preprocess the matrices with the function xGGSVP3.

xGGSVP3 computes at most four QR factorizations to detect

  • the common nullspace of the matrices,
  • the singular value pairs (1,0) (singular value infinity), and
  • the singular value pairs (0,1) (singular value zero).

xGGSVP3 reduces the matrices to upper triangular form, which xGGQRCS does not need, but the preprocessing step would stop xGGQRCS from underperforming when the matrices are heavily rank deficient.
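
In exact arithmetic these counts are plain rank computations, which the following NumPy sketch illustrates (np.linalg.matrix_rank stands in for the pivoted QR factorizations inside xGGSVP3; the example pencil is mine):

import numpy as np

rng = np.random.default_rng(1)
n = 6
Z = rng.standard_normal((n, n))                     # full-rank basis mixing
A = rng.standard_normal((n, n)) @ np.diag([1., 1, 1, 1, 0, 0]) @ Z  # rank 4
B = rng.standard_normal((n, n)) @ np.diag([1., 1, 1, 0, 1, 0]) @ Z  # rank 4

r_ab = np.linalg.matrix_rank(np.vstack([A, B]))     # rank of the pencil: 5
r_a = np.linalg.matrix_rank(A)
r_b = np.linalg.matrix_rank(B)

print("common nullspace dimension:", n - r_ab)      # 1
print("pairs (1,0), value infinity:", r_ab - r_b)   # 1: Bx = 0, Ax != 0
print("pairs (0,1), value zero    :", r_ab - r_a)   # 1: Ax = 0, Bx != 0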

@weslleyspereira commented May 21, 2021

Hi.

The PR christoph-conrads#3 has some bug fixes and test cases for the new routines. Consider merging it before accepting this one.

* These changes are also in christoph-conrads#2. I just split the branches to ease the merge.

@christoph-conrads commented May 22, 2021

TODO list:

@christoph-conrads

Hi @weslleyspereira,

I do not want to merge christoph-conrads#2 because the current code already checks for overflows. Moreover, the checks for when a matrix norm is too small are purely heuristic and do not necessarily indicate (user) errors. In such a case, I prefer to stick with simplicity.

@christoph-conrads

@langou @weslleyspereira The column pre-processing requires a new function that copies and transposes matrices; it was named SGETRP. Should I keep the name, or should it be renamed to, e.g., SLATRP because its functionality is similar to SLACPY?

@langou commented May 24, 2021

@langou @weslleyspereira The column pre-processing requires a new function that copies and transposes matrices; it was named SGETRP. Should I keep the name, or should it be renamed to, e.g., SLATRP because its functionality is similar to SLACPY?

Hi @christoph-conrads,

Thanks for asking.

Reading https://github.com/christoph-conrads/lapack/blob/qr+cs-gsvd/SRC/sgetrp.f

The prefix LA is for "auxiliary" and it was used consistently when LAPACK had a clearer distinction between "driver", "computational", and "auxiliary" routines. There are lots of xLAwyz subroutines out there. :) There is some merit to having a distinction between "driver", "computational", and "auxiliary", but this distinction has slowly become more and more foreign over time. Also, one can argue whether a transpose routine should be considered "driver", "computational", or "auxiliary". This is not obvious, it depends on the definitions of these three categories, and the definitions seem to depend on whom you ask.

I like SGETRP more than SLATRP. I do not dislike SLATRP though.

So SGETRP is an out-of-place transpose. Sounds good. Thanks for the contribution.

Note that Gustavson and Swirszcz have a nice in-place transposition algorithm. (This is not in LAPACK, but, this is in the wishlist.) See Gustavson F.G., Swirszcz T. (2007) In-Place Transposition of Rectangular Matrices. In: Kågström B., Elmroth E., Dongarra J., Waśniewski J. (eds) Applied Parallel Computing. State of the Art in Scientific Computing. PARA 2006. Lecture Notes in Computer Science, vol 4699. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75755-9_68

Note as well that MKL has mkl_?imatcopy for scaling and in-place transposition/copying of matrices, mkl_?omatcopy for scaling and out-of-place transposition/copying of matrices, and mkl_?omatcopy2 for a two-strided version of mkl_?omatcopy.

@h-vetinari (Contributor)

Since 3.10.0 was released now, this should probably be re-milestoned.

@langou modified the milestones: LAPACK 3.10.0 → LAPACK 3.11.0 (Jun 29, 2021)
@langou commented Jun 29, 2021

Good point @h-vetinari. Thanks! (Done.)

@christoph-conrads commented Oct 24, 2021

Debugging the failing test regression_preprocessing_20210606

tl;dr: SGGQRCS fails because SORCSD2BY1 computes incorrect factors U1 in the presence of tiny, numerically irrelevant values. edit: continued in #634 /edit

Test regression_preprocessing_20210606 is in the branch christoph-conrads/qr+cs-gsvd.

The test fails with a correctly computed matrix B and an incorrect matrix A = U1 S X. The fact that B = U2 C X is computed correctly is an important hint because it implies that

  • the singular values in the matrix C are correct and, by extension, so is the matrix S,
  • the matrix X is correct.

A looks as follows:

-1.21e+00 +3.80e-01 +9.33e-02 -4.93e-02 +1.16e-01 -3.58e-07 +4.64e-02
+0.00e+00 +0.00e+00 +0.00e+00 +0.00e+00 +0.00e+00 +0.00e+00 +0.00e+00

The matrix assembled from U1 S X (which should be equal to A) looks as follows:

-8.59e-01 +2.69e-01 +6.60e-02 -3.49e-02 +8.20e-02 -9.78e-08 +3.28e-02
-8.59e-01 +2.69e-01 +6.60e-02 -3.49e-02 +8.20e-02 -9.78e-08 +3.28e-02

Since the pre-processing of A is disabled, the matrix U1 is computed by SORCSD2BY1 and never modified by SGGQRCS, allowing one to focus on SORCSD2BY1 as the problem source, or on a memory problem (i.e., a buffer overread or a buffer overflow).

edit: Moreover, the Frobenius norms of the matrices U1 computed with and without pre-processing of B, respectively, are identical, hinting at a mathematical problem with the matrix U1 rather than a memory bug. /edit

Given an isometric matrix Q = [X1; X2], SORCSD2BY1 computes the CS decomposition of Q.
The input matrix Q for SORCSD2BY1 looks as follows, where the first two rows belong to the matrix X1 = U1 S V^* and the other two rows to the matrix X2 = U2 C V^*; SORCSD2BY1 computes U1, U2, S, C, and V. U1, S, and C are identical to the matrices mentioned above in the context of SGGQRCS.

("isometric": Q has no more rows than columns and Q^* Q is the identity)

-2.38418579E-07 1.00000000     -2.66202893E-10
 0.00000000     0.00000000      2.66202893E-10
 1.00000000     2.38418579E-07 -5.55111512E-17
 0.00000000     2.66202893E-10  1.00000000

In single precision, this matrix can be considered isometric. Moreover, U1 should be a permutation of the identity matrix with (almost) arbitrary signs, e.g.,

-1  0
 0 -1

(If you have trouble seeing this, then set all the matrix entries small in modulus to zero.)

Instead, the following matrix is computed:

-1/sqrt(2) +1/sqrt(2)
-1/sqrt(2) -1/sqrt(2)

This is wrong. After a bit of experimenting, one can construct the test xUNCSD2BY1_regression_20211024 triggering the problem with the following simpler matrix, shown below together with a small NumPy check (ε is the machine epsilon of single- or double-precision arithmetic, respectively; the test triggers the misbehavior in double precision, too, but the constant c was chosen by experimenting with single-precision code):

c := 2^-45
μ := c ε

    | 0  1 -μ |
Q = | 0  0 +μ |
    | 1  0  0 |
    | 0  μ  1 |
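
For concreteness, a short double-precision NumPy check of this construction (it verifies that Q is numerically isometric and shows the near-identity U1 that a correct CS decomposition must produce; it does not call the LAPACK routine):

import numpy as np

eps = np.finfo(np.float64).eps
mu = 2.0 ** -45 * eps                     # mu = c * eps with c = 2**-45
Q = np.array([
    [0.0, 1.0, -mu],
    [0.0, 0.0, +mu],
    [1.0, 0.0, 0.0],
    [0.0,  mu, 1.0],
])

# Q is numerically isometric: mu**2 vanishes next to 1 in double precision.
assert np.allclose(Q.T @ Q, np.eye(3), atol=4 * eps)

# The top block X1 = Q[:2] satisfies X1 = U1 diag(c) Vt where U1 must be a
# signed permutation of the identity up to O(mu); a plain SVD shows this.
U1, c, Vt = np.linalg.svd(Q[:2], full_matrices=False)
print(U1)                                 # approximately [[+-1, 0], [0, +-1]]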

@briansutton

About the failed test reported by @christoph-conrads just above:

There seems to be a discrepancy between two comparisons with zero around the "twice is enough" procedure of SORBDB6. Within this subroutine, SLASSQ is used to measure the squared norm of a vector, but the calling routine SORBDB5 uses SNRM2 to measure the norm in a few places. Under certain conditions, SLASSQ returns zero but SNRM2 returns nonzero for the same vector.

I think the least disruptive fix is for SORBDB6 to explicitly set all entries to zero when SLASSQ measures the norm to be zero.

While tracking this down, I noticed a second bug: In lines 296-301 of SORBDB5, the increments INCX1 and INCX2 are not currently used.

Here is a diff that may fix both problems:

--- lapack-3.10.0/SRC/sorbdb6.f
+++ sorbdb6.f
@@ -167,14 +167,13 @@
 *  =====================================================================
 *
 *     .. Parameters ..
-      REAL               ALPHASQ, REALONE, REALZERO
-      PARAMETER          ( ALPHASQ = 0.01E0, REALONE = 1.0E0,
-     $                     REALZERO = 0.0E0 )
+      REAL               ALPHASQ, REALZERO
+      PARAMETER          ( ALPHASQ = 0.01E0, REALZERO = 0.0E0 )
       REAL               NEGONE, ONE, ZERO
       PARAMETER          ( NEGONE = -1.0E0, ONE = 1.0E0, ZERO = 0.0E0 )
 *     ..
 *     .. Local Scalars ..
-      INTEGER            I
+      INTEGER            I, IX
       REAL               NORMSQ1, NORMSQ2, SCL1, SCL2, SSQ1, SSQ2
 *     ..
 *     .. External Subroutines ..
@@ -215,12 +214,21 @@
 *     space
 *
       SCL1 = REALZERO
-      SSQ1 = REALONE
+      SSQ1 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       SCL2 = REALZERO
-      SSQ2 = REALONE
+      SSQ2 = REALZERO
       CALL SLASSQ( M2, X2, INCX2, SCL2, SSQ2 )
       NORMSQ1 = SCL1**2*SSQ1 + SCL2**2*SSQ2
+      IF ( NORMSQ1 .EQ. 0 ) THEN
+         DO IX = 1, 1 + (M1-1)*INCX1, INCX1
+           X1( IX ) = ZERO
+         END DO
+         DO IX = 1, 1 + (M2-1)*INCX2, INCX2
+           X2( IX ) = ZERO
+         END DO
+        RETURN
+      END IF
 *
       IF( M1 .EQ. 0 ) THEN
          DO I = 1, N
@@ -239,10 +247,10 @@
      $            INCX2 )
 *
       SCL1 = REALZERO
-      SSQ1 = REALONE
+      SSQ1 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       SCL2 = REALZERO
-      SSQ2 = REALONE
+      SSQ2 = REALZERO
       CALL SLASSQ( M2, X2, INCX2, SCL2, SSQ2 )
       NORMSQ2 = SCL1**2*SSQ1 + SCL2**2*SSQ2
 *
@@ -255,6 +263,12 @@
       END IF
 *
       IF( NORMSQ2 .EQ. ZERO ) THEN
+         DO IX = 1, 1 + (M1-1)*INCX1, INCX1
+           X1( IX ) = ZERO
+         END DO
+         DO IX = 1, 1 + (M2-1)*INCX2, INCX2
+           X2( IX ) = ZERO
+         END DO
          RETURN
       END IF
 *
@@ -281,10 +295,10 @@
      $            INCX2 )
 *
       SCL1 = REALZERO
-      SSQ1 = REALONE
+      SSQ1 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       SCL2 = REALZERO
-      SSQ2 = REALONE
+      SSQ2 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       NORMSQ2 = SCL1**2*SSQ1 + SCL2**2*SSQ2
 *
@@ -293,11 +307,11 @@
 *     truncate it to zero.
 *
       IF( NORMSQ2 .LT. ALPHASQ*NORMSQ1 ) THEN
-         DO I = 1, M1
-            X1(I) = ZERO
+         DO IX = 1, 1 + (M1-1)*INCX1, INCX1
+            X1(IX) = ZERO
          END DO
-         DO I = 1, M2
-            X2(I) = ZERO
+         DO IX = 1, 1 + (M2-1)*INCX2, INCX2
+            X2(IX) = ZERO
          END DO
       END IF
 *

christoph-conrads added a commit to christoph-conrads/lapack that referenced this pull request Oct 26, 2021
This patch was authored by Brian D. Sutton and posted to the discussion
of LAPACK pull request Reference-LAPACK#406.

* fix indexing for vector increments different from one
* always set vectors that are numerically zero to zero

Previously SORBDB6 would only set vectors to zero if a second iteration
of Gram-Schmidt was necessary. This would cause problems on the caller
site if the test for a zero vector differed from the SORBDB6 test for
zero.
@christoph-conrads

About the failed test reported by @christoph-conrads just above:

There seems to be a discrepancy between two comparisons with zero around the "twice is enough" procedure of SORBDB6. Within this subroutine, SLASSQ is used to measure the squared norm of a vector, but the calling routine SORBDB5 uses SNRM2 to measure the norm in a few places. Under certain conditions, SLASSQ returns zero but SNRM2 returns nonzero for the same vector.

Yes, I reached the same conclusion in #634.

I think the least disruptive fix is for SORBDB6 to explicitly set all entries to zero when SLASSQ measures the norm to be zero.

This is probably the best fix if the API should be left intact (one could return a boolean).

While tracking this down, I noticed a second bug: In lines 296-301 of SORBDB5, the increments INCX1 and INCX2 are not currently used.

Good catch.

Here is a diff that may fix both problems:
[snip]

This works in single precision. Your code can be found in commit 9ef9480, which belongs to the branch 634-fix-xORBDB5-zero-check. I will port this to xUNBDB5.

@christoph-conrads commented Oct 26, 2021

There is a problem in the diff:

+      SSQ1 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       SCL2 = REALZERO
-      SSQ2 = REALONE
+      SSQ2 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       NORMSQ2 = SCL1**2*SSQ1 + SCL2**2*SSQ2

It says SLASSQ( M1, X1, INCX1, SCL1, SSQ1 ) twice now. I will fix this tomorrow in the code.

@christoph-conrads

There is a problem in the diff:

+      SSQ1 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       SCL2 = REALZERO
-      SSQ2 = REALONE
+      SSQ2 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       NORMSQ2 = SCL1**2*SSQ1 + SCL2**2*SSQ2

It says SLASSQ( M1, X1, INCX1, SCL1, SSQ1 ) twice now. I will fix this tomorrow in the code.

No, this is a bug in the xORBDB6/xUNBDB6 code:

lapack/SRC/sorbdb6.f, lines 283 to 294 in 5d4180c:

SCL1 = REALZERO
SSQ1 = REALONE
CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
SCL2 = REALZERO
SSQ2 = REALONE
CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
NORMSQ2 = SCL1**2*SSQ1 + SCL2**2*SSQ2
*
* If second projection is sufficiently large in norm, then do
* nothing more. Alternatively, if it shrunk significantly, then
* truncate it to zero.
*

SCL2 and SSQ2 are not updated after the second Gram-Schmidt orthogonalization; the second call should read CALL SLASSQ( M2, X2, INCX2, SCL2, SSQ2 ).

christoph-conrads added a commit to christoph-conrads/lapack that referenced this pull request Nov 11, 2021
christoph-conrads added a commit to christoph-conrads/lapack that referenced this pull request Jan 30, 2022
@h-vetinari (Contributor)

@christoph-conrads
What's the current status on this? I ask because @weslleyspereira mentioned in #629 that 3.11.0 is not too far away anymore:

I think a minor release, 3.11.0, is a better place to have new algorithms. I don't think it will take long to have it released.

@christoph-conrads

Hi @h-vetinari, #647 must be applied first, and there were failing tests (see the qr+cs-gsvd branch) the last time I checked. I will take a look at it next weekend.

christoph-conrads added a commit to christoph-conrads/lapack that referenced this pull request May 29, 2022
christoph-conrads added a commit to christoph-conrads/lapack that referenced this pull request Jul 10, 2022
christoph-conrads added a commit to christoph-conrads/lapack that referenced this pull request Jul 16, 2022
@christoph-conrads

I am looking at the code, trying to understand the cause of a bug where all matrices except the matrix U1 (A = U1 S X) are numerically accurate when the pre-processing of A is active.

@christoph-conrads commented Dec 7, 2023

An update regarding the state of this PR:

Last update: Jan 5, 2024
