Add GSVD with QR factorizations, 2-by-1 CS decomposition #406
Conversation
Here is the motivation for the new GSVD solver from #63:
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #406      +/-   ##
==========================================
- Coverage   82.37%   81.93%   -0.44%
==========================================
  Files        1894     1904      +10
  Lines      190681   191698    +1017
==========================================
+ Hits       157067   157068       +1
- Misses      33614    34630    +1016
```

Continue to review full report at Codecov.
Yes, because then it is possible to compute only the generalized singular values for the cost of one QR decomposition with column pivoting, the unitary matrix assembly, and the 2-by-1 CS value computation.
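To make the cost argument concrete, here is a minimal NumPy/SciPy sketch of such a singular-values-only path. The function name is made up, and a dense SVD of the top block of `Q` stands in for the LAPACK 2-by-1 CS routine (xORCSD2BY1), whose CS values it matches:

```python
import numpy as np
from scipy.linalg import qr

def generalized_singular_values(A, B):
    m = A.shape[0]
    G = np.vstack([A, B])
    # one QR factorization with column pivoting reveals the numerical rank
    Q, R, _ = qr(G, mode='economic', pivoting=True)
    d = np.abs(np.diag(R))
    tol = max(G.shape) * np.finfo(R.dtype).eps * (d[0] if d.size else 0.0)
    r = int(np.sum(d > tol))
    # the 2-by-1 CS values of [Q1; Q2] equal the singular values of the
    # top block Q1, so a plain SVD stands in for xORCSD2BY1 in this sketch
    s = np.linalg.svd(Q[:m, :r], compute_uv=False)  # sin(theta)
    c = np.sqrt(np.clip(1.0 - s * s, 0.0, None))    # cos(theta)
    return s, c  # the generalized singular values are s / c
```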
The remainder of this post will discuss how to compute the backward stable GSVD of a matrix pencil $(A, B)$ given the GSVD of the scaled pencil $(A, wB)$, where $w > 0$ is the scaling factor. Set

$$A = U_1 D_1 X, \qquad wB = U_2 D_2 X$$

to be the GSVD of $(A, wB)$, where $D_1 = \operatorname{diag}(\sin \theta_i)$ and $D_2 = \operatorname{diag}(\cos \theta_i)$. We now have to adjust this decomposition so that it reproduces $B$ instead of $wB$. We can immediately fix the singular values because if $\tan \theta$ is a generalized singular value of $(A, wB)$, then $\tan \hat\theta = w \tan \theta$ is the corresponding generalized singular value of $(A, B)$, i.e., $\hat\theta = \arctan(w \tan \theta)$.

The question is now if we need to change $U_1$, $U_2$, or $X$. The unitary factors can be left untouched, and the diagonal entries of $D_1$ and $D_2$ that are exactly zero or one need no correction. Furthermore, we assume the (1,1) matrix entry equal to one arises through the matrix dimensions (meaning it is free of round-off error). With simple row scalings we can compute $\hat X$: the row of $X$ belonging to the pair $(\sin \theta_i, \cos \theta_i)$ is multiplied by $\sin \theta_i / \sin \hat\theta_i$, which equals $\cos \theta_i / (w \cos \hat\theta_i)$ and is to be read as $1/w$ for $\theta_i = 0$. Thus

$$A = U_1 \hat D_1 \hat X, \qquad B = U_2 \hat D_2 \hat X$$

with $\hat D_1 = \operatorname{diag}(\sin \hat\theta_i)$ and $\hat D_2 = \operatorname{diag}(\cos \hat\theta_i)$. Observe that only the singular values and $X$ are modified.
Here is a NumPy demo with a random test matrix causing a large backward error with the current code:

```python
#!/usr/bin/python3
# This program demonstrates how to adjust singular values and right-hand side
# matrix of the generalized singular value decomposition computed with SGGQRCS
# when one of the input matrices had to be scaled.
#
# Author: Christoph Conrads (https://christoph-conrads.name)

import numpy as np
import numpy.linalg as L


def make_matrix(x):
    return np.matrix(x, dtype=np.float32)


def main():
    eps = np.finfo(np.float32).eps

    # unmodified input matrices with norm(B) >> norm(A)
    a = make_matrix([[-8.519847412e+02, +6.469862671e+02]])
    b = make_matrix([
        [+5.485938125e+05, -4.166526250e+05],
        [+1.846850781e+05, -1.402660781e+05],
        [+5.322575625e+05, -4.042448438e+05],
        [-1.630551465e+04, +1.238360352e+04],
        [-1.286453438e+05, +9.770555469e+04],
        [-1.323287812e+05, +1.005026797e+05],
        [+5.681228750e+05, -4.314841250e+05],
        [-3.107875312e+05, +2.360408594e+05],
        [+1.456551719e+05, -1.106233281e+05],
        [+1.365355156e+05, -1.036972344e+05]
    ])
    # GSVD computed by SGGQRCS with input A, B/(2**10)
    u1 = make_matrix([[1]])
    # SGGQRCS returns square matrices but we can ignore the other columns
    # because the matrix pencil has only rank 2
    u2 = make_matrix([
        [-5.237864256e-01, +3.081814051e-01],
        [-1.694306731e-01, -5.048786998e-01],
        [-5.036462545e-01, -1.015115157e-01],
        [+1.279635727e-02, +2.352355272e-01],
        [+1.268841326e-01, -4.299084842e-01],
        [+1.265842170e-01, -9.544299543e-02],
        [-5.364804864e-01, -2.056131810e-01],
        [+2.987869084e-01, -3.556257784e-01],
        [-1.335151047e-01, -4.078340828e-01],
        [-1.268201172e-01, -2.355325669e-01]
    ])
    x = make_matrix([
        [-1.029685547e+03, +7.820372925e+02],
        [-8.520648804e+02, +6.470472412e+02]
    ])
    # SGGQRCS returns radians values instead of singular values
    theta = np.float32(1.55708992481e+00)
    # scaling factor
    w = np.power(np.float32(2), np.float32(-10))

    # check for typos
    assert L.norm(u2.H * u2 - np.eye(2)) <= np.sqrt(2) * eps
    assert theta >= 0
    assert theta <= np.pi / 2

    # add helper functions
    copy = lambda x: np.matrix(np.copy(x))
    compute_relative_error = \
        lambda a, u, d, x: L.norm(a - u * d * x) / (L.norm(a) * eps)

    # assemble diagonal matrices
    d1 = make_matrix([[0, np.sin(theta)]])
    d2 = make_matrix([[1, 0], [0, np.cos(theta)]])

    # print information
    print('norm(B) / norm(A) = {:8.2e}'.format(L.norm(b) / L.norm(a)))
    print('Relative error A: {:8.2e}'.format(compute_relative_error(a, u1, d1, x)))
    print('Relative error B: {:8.2e}'.format(compute_relative_error(b, u2, d2, x)))

    # fix values
    theta_fixed = np.arctan(w * np.tan(theta))
    d1_fixed = make_matrix([[0, np.sin(theta_fixed)]])
    d2_fixed = make_matrix([[1, 0], [0, np.cos(theta_fixed)]])
    x_fixed = copy(x) / w
    # alternative: x_fixed[1,:] *= d1[0,1] / d1_fixed[0,1]
    x_fixed[1, :] *= d2[1, 1] / d2_fixed[1, 1]

    # recompute backward error
    fmt = 'Relative error {:s} after fixes: {:8.2e}'
    print(fmt.format("A", compute_relative_error(a, u1, d1_fixed, x_fixed)))
    print(fmt.format("B", compute_relative_error(b, u2, d2_fixed, x_fixed)))


if __name__ == '__main__':
    main()
```

Script output on my computer:
@julielangou It's done. Given a pair of matrices `A`, `B`, the solver computes a GSVD `A = U1 D1 R Q^*`, `B = U2 D2 R Q^*`. The tests can be found in the branch qr+cs-gsvd in my fork; bonus functionality (not used by the GSVD solver) lives there as well.
### Benchmarks

This post compares the performance of the GSVD solvers xGGSVD3 and xGGQRCS.

#### Introduction

Of course, the comparison will be very limited because of the huge number of parameters relevant to the GSVD, e.g., the number of rows of each matrix, the number of columns, the matrix ranks, and whether or not the factors are assembled.
In this post we benchmark with matrices of varying row and column counts.
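The measurement loop is roughly the following (a hypothetical Python stand-in for the actual benchmark driver, which is not part of this thread; `solver` and `pencils` are placeholders):

```python
import time

def cpu_seconds(solver, pencils, repetitions=10):
    # CPU time (not wall time) per repetition over a set of matrix pencils
    start = time.process_time()
    for _ in range(repetitions):
        for A, B in pencils:
            solver(A, B)
    return (time.process_time() - start) / repetitions
```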
Code: see the qr+cs-gsvd branch.

Attention: xGGSVD3 computes a decomposition of the form `A = U1 D1 (0 R) Q^T`, whereas xGGQRCS computes `A = U1 D1 X`, so the two solvers do not assemble identical factors.

#### Results
#### Plots

The plots below compare the CPU time consumed by xGGQRCS and xGGSVD3 in single precision. The colors indicate the number of matrix rows; the brighter the color, the larger the number of rows. The x-axis shows the number of columns in the matrices; the y-axis shows the CPU time needed by xGGQRCS divided by the CPU time needed by xGGSVD3.
Besides a few minor remarks, this looks good to me.

I would worry about one thing: I'm pretty sure the SVD is a little better at identifying the numerical rank than a rank-revealing QR. I haven't looked into this, but is it possible that the direct GSVD is then also better at identifying the rank than this method?
The backward error introduced by a QR factorization of a matrix is small, and both the direct GSVD and the QR+CS approach initially apply one or more permuted QR factorizations (compare the xGGSVD3 preprocessing in xGGSVP3), so I would not expect the two methods to differ much in rank detection.
@thijssteel Please let me know if you are satisfied with the changes.
Intel Cascade Lake results (commit 6a61522bd8fd74dadf406bef1c628f94b3947216) on a virtual private server. Note that CGGQRCS now takes more than twice the CPU time of CGGSVD3 for matrix pairs with fewer than 20 columns when only singular values are requested. This difference was less pronounced in the last set of benchmarks.
How consistent is the scaling behaviour observed in the benchmarks? Would it be reliable enough to specify a convenience function (e.g. for the singular-values-only case) that switches between xGGQRCS and xGGSVD3 based on the problem dimensions? Eyeballing the graphs, I'm thinking of something along the lines of a simple dimension-based cutoff.
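Something like the following, entirely hypothetical sketch (the wrapper name, the solver arguments, and the cutoff are all invented for illustration, not measured):

```python
def gsvd_values_auto(A, B, qrcs_solver, svd3_solver):
    # hypothetical dispatcher: the solver arguments stand in for
    # bindings to xGGQRCS and xGGSVD3; the cutoff is a placeholder,
    # not a measured crossover point
    m, n = A.shape
    p = B.shape[0]
    if n <= (m + p) // 4:   # "tall" pencils: QR+CS tends to win
        return qrcs_solver(A, B)
    return svd3_solver(A, B)
```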
Hi @h-vetinari,

Yes, this is a good idea.

(1) We could write higher-level wrappers that would call the "best" function. The current structure (a clean interface for a stand-alone xGGSVD3 and a clean interface for a stand-alone xGGQRCS) is good with me. What we might be missing is higher-level wrappers.

(2) However, my preference is, for now, to leave this kind of tuning to levels higher than LAPACK, such as Matlab, Octave, NumPy, SciPy, etc. There are many places in LAPACK where this kind of tuning could show up. For example, for QR, there is TSQR (for tall and skinny matrices) and there is the standard QR algorithm; when n < a_function( m ), TSQR is better than QR. For SYEV, there are four different algorithms (SYEVR, SYEV, SYEVX, SYEVD). Sometimes there are switches from one algorithm to the next when the choice is "obvious", but as soon as the choice gets complicated / machine dependent, these switches are not there.

(3) For now, we could have some comments in the headers of these subroutines to give users brief explanations of the various choices. Timings as done by @christoph-conrads are also useful to explain the trade-offs to users.

(4) For the future, I am not against starting some higher-level wrappers, like QR(m,n,A,lda). This would leave opportunities for software like MKL or OpenBLAS to optimize the algorithm used by these functions. We could provide reference implementations for these high-level wrappers. We should look at what is in LAPACKE.

So, bottom line, my personal opinion is to leave things as they are for now, but this is a possible to-come-soon agenda item. Opinions welcome. Cheers, Julien.
Nice job @christoph-conrads! I think we should add tests for the new routines. Maybe we can just replicate the tests for xGGSVD3. What do you think?
There are tests in the christoph-conrads/qr+cs-gsvd branch in my fork.
No, because there is no such regime. Consider the matrix pencil (A, B) below:
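For instance, an illustrative stand-in (the exact entries do not matter, only the nullspace structure):

```python
import numpy as np

# a tall-and-skinny pencil: by dimensions alone xGGQRCS should win,
# but A and B are rank deficient with complementary nullspaces, so the
# pencil has only the singular value pairs (1, 0) and (0, 1)
A = np.zeros((10, 2)); A[0, 0] = 1.0
B = np.zeros((10, 2)); B[0, 1] = 1.0
```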
Judging only by the dimensions, this is the optimal case for xGGQRCS, yet in practice xGGSVD3 will be much faster here. The xGGSVD3 preprocessing(!) stage will determine after executing two QR factorizations that this matrix pencil has the singular value pairs (1, 0) and (0, 1) (or, put differently, that it has the singular values zero and infinity). xGGQRCS must execute in its entirety to arrive at the same conclusion. The critical property in this example is the nullspaces of the matrices, not their dimensions. This is of course a counter-example from mathematics, but I really doubt that such a switch is reliably possible in practice because there are too many relevant variables, including: the number of rows of A, the number of rows of B, the number of columns, the rank of A, the rank of B, the rank of (A, B), and whether or not to assemble the matrices U1, U2, and X. This list contains only quantities from mathematics; in addition, there are factors like the BLAS implementation, the CPU architecture, or the number of cores. Finally, the major problem with xGGQRCS is its large memory consumption in comparison to xGGSVD3: a method deciding on the fly between xGGQRCS and xGGSVD3 would burden every user with the storage appetite of xGGQRCS.
Yes, I saw these tests. I just thought a quick way to get the new methods covered by the current LAPACK/TESTING/EIG would be to adapt the CERRGG and CGSVTS3 subroutines, and maybe add your new tests in a different PR.
(Emphasis mine) Adopting the structure of the existing error-exit tests raises the question of the path identifier; compare SERRGG:

```fortran
*     SUBROUTINE SERRGG( PATH, NUNIT )
*
*     .. Scalar Arguments ..
*     CHARACTER*3        PATH
*     INTEGER            NUNIT
*     [snip]
      ELSE IF( LSAMEN( 3, PATH, 'GQR' ) ) THEN
*
*        SGGQRF
*
         SRNAMT = 'SGGQRF'
         INFOT = 1
         CALL SGGQRF( -1, 0, 0, A, 1, R1, B, 1, R2, W, LW, INFO )
*        [snip]
*
*        SGGRQF
*
         SRNAMT = 'SGGRQF'
```

@weslleyspereira Which three-letter identifier should I use?
I see... Maybe just pick a new three-letter identifier.
Think like it's 1977, not 2021:
Ah... now I see your mention of the three-letter identifier.
To have xGGQRCS outperform xGGSVD3 more consistently, xGGQRCS could preprocess the matrices with the function xGGSVP3. xGGSVP3 computes at most four QR factorizations to detect the effective numerical ranks of the input matrices.

xGGSVP3 reduces the matrices to upper triangular form, which xGGQRCS does not need, but the preprocessing step would stop xGGQRCS from underperforming when the matrices are heavily rank deficient; see the sketch below.
Hi. The PR christoph-conrads#3 has some bug fixes and test cases for the new routines. Consider merging that one before accepting this.

* These changes are also in christoph-conrads#2. I just split the branches to ease the merge.
TODO list:
Hi @weslleyspereira, I do not want to merge christoph-conrads#2 because the current code already checks for overflows. Moreover, the checks for when a matrix norm is too small are purely heuristic and do not necessarily indicate (user) errors. In such a case, I prefer to stick with simplicity.
@langou @weslleyspereira The column pre-processing requires a new function that copies and transposes matrices; it was named xGETRP. Is this naming acceptable?
Thanks for asking. Reading https://github.com/christoph-conrads/lapack/blob/qr+cs-gsvd/SRC/sgetrp.f

The prefix LA is for "auxiliary" and it was used consistently when LAPACK had a clearer distinction between "driver", "computational", and "auxiliary" routines. There are lots of xLAwyz subroutines out there. :) There is some merit to a distinction between "driver", "computational", and "auxiliary", but this distinction has slowly become more and more foreign over time. Also, one can argue whether a transpose routine should be considered "driver", "computational", or "auxiliary". This is not obvious and depends on the definitions of these three categories; it seems that the definitions depend on whom you ask. I like SGETRP more than SLATRP. I do not dislike SLATRP though. So SGETRP is an out-of-place transpose. Sounds good. Thanks for the contribution.

Note that Gustavson and Swirszcz have a nice in-place transposition algorithm. (This is not in LAPACK, but it is on the wishlist.) See Gustavson F.G., Swirszcz T. (2007) In-Place Transposition of Rectangular Matrices. In: Kågström B., Elmroth E., Dongarra J., Waśniewski J. (eds) Applied Parallel Computing. State of the Art in Scientific Computing. PARA 2006. Lecture Notes in Computer Science, vol 4699. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75755-9_68

Note as well that MKL has mkl_?imatcopy for scaling and in-place transposition/copying of matrices, mkl_?omatcopy for scaling and out-of-place transposition/copying of matrices, and mkl_?omatcopy2 for a two-strided version of mkl_?omatcopy.
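For reference, an out-of-place transpose in the mkl_?omatcopy style (plain Python; whether SGETRP also applies a scaling factor is an assumption here, not a statement about the actual routine):

```python
import numpy as np

def getrp(A, alpha=1.0):
    # out-of-place transposition: B <- alpha * A**T
    m, n = A.shape
    B = np.empty((n, m), dtype=A.dtype)
    for i in range(m):
        for j in range(n):
            B[j, i] = alpha * A[i, j]
    return B
```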
Since 3.10.0 was released now, this should probably be re-milestoned. |
Good point @h-vetinari. Thanks! (Done.) |
Debugging the failing test
About the failed test reported by @christoph-conrads just above: there seems to be a discrepancy between two comparisons with zero around the "twice is enough" procedure of SORBDB6. Within this subroutine, SLASSQ is used to measure the squared norm of a vector, but the calling routine SORBDB5 uses SNRM2 to measure the norm in a few places. Under certain conditions, SLASSQ returns zero but SNRM2 returns nonzero for the same vector. I think the least disruptive fix is for SORBDB6 to explicitly set all entries to zero when SLASSQ measures the norm to be zero. While tracking this down, I noticed a second bug: in lines 296-301 of SORBDB5, the increments INCX1 and INCX2 are not currently used. Here is a diff that may fix both problems:
This patch was authored by Brian D. Sutton and posted to the discussion of LAPACK pull request Reference-LAPACK#406.

* fix indexing for vector increments different from one
* always set vectors that are numerically zero to zero

Previously SORBDB6 would only set vectors to zero if a second iteration of Gram-Schmidt was necessary. This would cause problems on the caller site if the test for a zero vector differed from the SORBDB6 test for zero.
Yes, I reached the same conclusion in #634.
This is probably the best fix if the API should be left intact (one could return a boolean).
Good catch.
This works in single precision. Your code can be found in commit 9ef9480, which belongs to the branch 634-fix-xORBDB5-zero-check. I will port this to xUNBDB5.
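For illustration, here is the behavior in question as Python pseudocode (a sketch, not the actual Fortran fix; the zero threshold is invented):

```python
import numpy as np

def project_away_twice(Q, x):
    # "twice is enough" Gram-Schmidt: project x against the orthonormal
    # columns of Q, repeating once to limit cancellation errors
    norm0 = np.linalg.norm(x)
    for _ in range(2):
        x = x - Q @ (Q.T @ x)
    # a vector judged numerically zero is set to exactly zero so that
    # every caller, whatever norm routine it uses, reaches the same verdict
    if np.linalg.norm(x) <= 100.0 * np.finfo(x.dtype).eps * norm0:
        x = np.zeros_like(x)
    return x
```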
There is a problem in the diff:

```diff
+      SSQ1 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       SCL2 = REALZERO
-      SSQ2 = REALONE
+      SSQ2 = REALZERO
       CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )
       NORMSQ2 = SCL1**2*SSQ1 + SCL2**2*SSQ2
```

It says `CALL SLASSQ( M1, X1, INCX1, SCL1, SSQ1 )` twice; judging by the `NORMSQ2` line, the second call should presumably be `CALL SLASSQ( M2, X2, INCX2, SCL2, SSQ2 )`.
No, this is a bug in the xORBDB6/xUNBDB6 code (lines 283 to 294 at commit 5d4180c).
@christoph-conrads What is the current status of this PR?
Hi @h-vetinari, #647 must be applied and there were failing tests (see the qr+cs-gsvd branch) the last time I checked. I will take a look at it next weekend. |
I am looking at the code and trying to understand the cause of a bug where all matrices except one are computed correctly.
An update regarding the state of this PR:
Last update: Jan 5, 2024
This pull request adds to LAPACK a generalized singular value decomposition (GSVD) computed by means of a QR decomposition with column pivoting and the 2-by-1 CS decomposition. The PR requires #405.
Given a pair of matrices `A`, `B` with appropriate dimensions, the GSVD computes a decomposition `A = U1 D1 R Q^*`, `B = U2 D2 R Q^*`.

I would like to have feedback on the computation of the matrix `R`. Currently it is always computed, but this is expensive because of an additional matrix-matrix multiplication followed by an RQ decomposition. Should I make this optional?

The PR is based on #63 but provides all implementations (single-precision real, double-precision real, single-precision complex, and double-precision complex) with tests removed because of the C++ dependency. The tests can be found in fb5dfb3.
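As a closing illustration, a small NumPy helper for checking factors of this form (the function name, argument order, and dense matrix representation are illustrative, not the LAPACK interface):

```python
import numpy as np

def relative_residuals(A, B, U1, U2, D1, D2, R, Q):
    # backward error of A = U1 D1 R Q^* and B = U2 D2 R Q^*,
    # measured in units of the machine epsilon
    eps = np.finfo(A.dtype).eps
    RQ = R @ Q.conj().T
    err_a = np.linalg.norm(A - U1 @ D1 @ RQ) / (np.linalg.norm(A) * eps)
    err_b = np.linalg.norm(B - U2 @ D2 @ RQ) / (np.linalg.norm(B) * eps)
    return err_a, err_b
```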