Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes for "Test 5" needed in xeigtst* #1043

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

frjohnst
Copy link

We are seeing a number of "Test 5" failures from the
xeigtst* tests with *ed.in and *gd.in input files.
We believe these are the consequence of limited
numerical precision, and not the result of a computational
error, so we would like to change "Test 5" as outlined below.

In the following, we will discuss only xeigtsts
failures with sed.in, since all "Test 5" failures related to
this PR follow a similar pattern.

Example of failures from xeigtsts run with sed.in :

...
N=   20, IWK= 1, seed=2570,2564,2644, 913, type 21, test( 5)=  .839E+07
N=   20, IWK= 2, seed=2570,2564,2644, 913, type 21, test( 5)=  .839E+07
SEV:    8 out of  1100 tests failed to pass the threshold

These "fail"s are happening because the sed.in test, described in part as

Tests of the Nonsymmetric Eigenvalue Problem Driver
SGEEV (eigenvalues and eigevectors)

compares eigenvalues computed from calls to SGEEV requesting eigenvalues
only to calls to SGEEV requesting both eigenvalues and eigenvectors, and
demands that the eigenvalues (real and imaginary parts) be identical:
(original "Test (5)" from EIG/sdrvev.f).

*
*              Do Test (5)
*
               DO 150 J = 1, N
                  IF( WR( J ).NE.WR1( J ) .OR. WI( J ).NE.WI1( J ) )
     $               RESULT( 5 ) = ULPINV
  150          CONTINUE
*

As best we can determine, a matrix is updated as eigenvalues are computed,
and these matrix updates ("H" in the code just below) may be different
between the eigenvalues-only and eigenvalues+eigenvectors cases.
For example, the snippet below is from SLAQR3, called eventually by
SGEEV, where WANTT is true when eigenvectors are computed:

*
*        ==== Update horizontal slab in H ====
*
         IF( WANTT ) THEN
            DO 80 KCOL = KBOT + 1, N, NH
               KLN = MIN( NH, N-KCOL+1 )
               CALL SGEMM( 'C', 'N', JW, KLN, JW, ONE, V, LDV,
     $                     H( KWTOP, KCOL ), LDH, ZERO, T, LDT )
               CALL SLACPY( 'A', JW, KLN, T, LDT, H( KWTOP, KCOL ),
     $                      LDH )
   80       CONTINUE
         END IF
*

As best we can determine, from the mathematical point of view the eigenvalues
should be identical in the two cases, but finite computer precision makes the
two sets of eigenvalues differ slightly. To confirm this, the "Test (5)"
comparison of the two sets of eigenvalues was changed to compare the relative
eigenvalue differences to a small tolerance (instead of demanding exact
bitwise equality):
(proposed modification for sdrvev.f)

*
*              Do Test (5)
*
       EVAL_5 = .FALSE.
       DO 150 J = 1, N
           IF( WR( J ).NE.WR1( J ) .OR. WI( J ).NE.WI1( J ) )
     $         EVAL_5 = .TRUE.
  150  CONTINUE
*
       IF (EVAL_5) THEN
        WTOL= ULP*THRESH
        DO 300 J = 1, N
          TEMPR = ABS(WR(J)-WR1(J))/ (1+ABS(WR1(J)))
          TEMPI = ABS(WI(J)-WI1(J))/ (1+ABS(WI1(J)))
          IF ( (TEMPR .GT. WTOL) .OR. (TEMPI .GT. WTOL) ) THEN
                     RESULT( 5 ) = ULPINV
          ENDIF
  300   CONTINUE
       ENDIF

and with this change the test passes. So this PR proposes that "Test 5"
executed for *ev.in and *gd.in should generally be changed in this manner.

Note that changes made for generalized eigenvalue "Test 5" differ somewhat in the
details (since alphas/betas are computed in this case instead of
eigenvalues) but the basic idea is the same.

@angsch
Copy link
Collaborator

angsch commented Aug 19, 2024

Does this happen with reference-BLAS or are you linking against an optimized BLAS implementation?

@frjohnst
Copy link
Author

We are using an optimized OpenBLAS implementation, but we believe that optimized OpenBLAS library functions correctly, since it passes all the other BLAS/LAPACK tests we run internally.

@frjohnst
Copy link
Author

And xeigtst* Test 5 also passes, if modified as described in this PR to require the two sets of eigenvalues to agree to within a small tolerance, rather than requiring that they be exactly the same.

@angsch
Copy link
Collaborator

angsch commented Aug 19, 2024

As best we can determine, from the mathematical point of view the eigenvalues
should be identical in the two cases, but finite computer precision makes the
two sets of eigenvalues differ slightly.

The optimized BLAS implementation likely uses different kernels for the two cases. The kernels are selected based on matrix dimensions. The result is a slightly different accumulation order, as you analyzed. Reference-BLAS should produce the same result regardless of the matrix sizes.

@martin-frbg
Copy link
Collaborator

Indeed OpenBLAS will use a different algorithm (different blocking and hence different accumulation of rounding errors) on several hardware platforms when the matrix size is "small", and it will use GEMV instead of GEMM when one matrix dimension is equal to one.
I have not had time to look at this PR in detail, but
(1) I believe the deviation falls within the nebulous range of what the LAPACK FAQ calls "minor errors" such as may be caused by different implementations of compiler optimization, math library or the use of another BLAS implementation... (2) my point (1) should probably become part of the documentation of OpenBLAS (now that that is beginning to take form), including a warning not to open testsuite issues here unless the problem is reproducible with a pure Reference build as well
(3) I will probably merge this PR into OpenBLAS' copy of the testsuite , which already deviates by using some relaxed thresholds elsewhere
(4)I expect this will come back to haunt Reference-LAPACK some day, especially if/when parts of it really get rewritten in C++ as proposed (see the various other open issues regarding testsuite fragility)

@langou
Copy link
Contributor

langou commented Aug 20, 2024

With respect to the initial post of this discussion thread ( #1043 (comment) ) and the lines of codes at:

lapack/SRC/slaqr3.f

Lines 659 to 667 in 9c0ef66

IF( WANTT ) THEN
DO 80 KCOL = KBOT + 1, N, NH
KLN = MIN( NH, N-KCOL+1 )
CALL SGEMM( 'C', 'N', JW, KLN, JW, ONE, V, LDV,
$ H( KWTOP, KCOL ), LDH, ZERO, T, LDT )
CALL SLACPY( 'A', JW, KLN, T, LDT, H( KWTOP, KCOL ),
$ LDH )
80 CONTINUE
END IF

Indeed when we do not want the Schur factor T, (so when WANTT is FALSE,) then we should be able to safely skip the operations in the IF( WANTT ) statement above. The goal is to save some operations that are unnecessary when the Schur factor T is not desired. These operations should not affect the computation of the eigenvalues. These operations should only be working on pieces of H (which converges to T as the QR iteration algorithm converges) that needs to be updated to get a correct T, but these operations do not impact the quasidiagonal of T (where the eigenvalues of A will be) in the subsequent operations. Therefore skipping these operations (WANTT is FALSE) or performing them (WANTT is TRUE) should result in the (exactly bit-to-bit) same eigenvalues.

Now I might be mistaken, or there might be a bug in the code, or this comes from some parts of the HSEQR algorithm that I do not understand, but my current understanding is that skipping these operations (WANTT is FALSE) or performing them (WANTT is TRUE) should result in the (exactly bit-to-bit) same eigenvalues

@langou
Copy link
Contributor

langou commented Aug 20, 2024

⏸️ I would like to press the pause button on this. ⏸️

I am not saying "no", but I would like that we give time for folks to express their opinions and maybe some more investigations.

Three comments:

short version

(1) I do not understand why the two sets of eigenvalues are not bit-to-bit the same
(2) it can be desirable to have the two sets of eigenvalues bit-to-bit similar
(3) if we do not have bit-to-bit similarity, then the check might need to take into account the condition numbers of the eigenvalues.

TL;DR

(1) First it is not obvious to me why, if provided with same ILAENV parameters (so algorithmic block size, early agressive window size, etc.) HSEQR with eigenvectors and HSEQR without eigenvectors do not result in the (exactly bit-by bit) same eigenvalues. I thought both would return the (exactly bit-by bit) same eigenvalues. When you compute the eigenvectors, you perform some extra operations, but these operations should not impact the eigenvalues. These operations are confined to part of the Schur form that will not influence the quasi diagonal part of T. (Or, at least, that is my understanding.) So either (a) my understanding is not correct, or (b) HSEQR with eigenvectors and HSEQR without eigenvectors (used in this test) use different block sizes, or (c) the BLAS (used in this test) is not reproducible, or (d) maybe there is a weird bug somewhere. I would like to drill down a little on this to try to understand. In short, there is something here that I do not understand. I have not reviewed the codes and I probably should.

(2) There are some arguments for having a software that returns (exactly bit-by bit) same eigenvalues whether or not eigenvectors are requested. (See a related discussion for the symmetric eigenvalue problem at #997 (comment).) Before given up on "having a software that returns (exactly bit-by bit) same eigenvalues whether or not eigenvectors are requested", I would like to understand impact on performance, reason why, etc. I would rather have a software that returns (exactly bit-by bit) same eigenvalues whether or not eigenvectors are requested if possible and if cost is not prohibitive.

(3) If we give up on "having a software that returns (exactly bit-to-bit) same eigenvalues whether or not eigenvectors are requested", then, the call with eigenvectors will return eigenvalues but also Schur form (T and Q) so we can check backward error with the Schur form. When we test the call with no eigenvectors, there is no Schur form and so we only have eigenvalues. For this reason, we compare the two sets of eigenvalues. The one from the call with no eigenvectors (that we want to check) with respect to the one from the call with eigenvectors (that we trust since we have computed the backward error.) When we compare the two sets of eigenvalues, we do it bit-to-bit. (See reason why above.) Now if we cannot guarantee bit-to-bit similar set of eigenvalues, yes, we want to check some kind of relative error between two eigenvalues and we would like these relative errors to be close to machine precision. This is called a forward error. For the nonsymmetric eigenvalue problem, we might want to introduce the condition number of the eigenvalue in multiplication to the forward error. So not just the forward error. Unless we know that the eigenvalues are well conditioned.

@langou
Copy link
Contributor

langou commented Aug 20, 2024

For a software like OpenBLAS, during the execution of a program that calls twice GEMM (say) with twice the exact same parameters (but at different times in the execution) then does OpenBLAS guarantee bit-to-bit similar answers? For example, does OpenBLAS allow itself to change block sizes of GEMM for two calls with similar parameters during the same execution of a program?

@martin-frbg
Copy link
Collaborator

Repeat calls to GEMM with the exact same parameters should produce the exact same computation and result, "similar" parameters might result in a different partitioning of the problem (depending on number of threads available) which I think might affect rounding errors even though the actual GEMM kernel (and its block size) is constant for a given cpu. If the code paths taken in LAPACK produce vastly different matrix sizes, say one calling GEMM with dimensions 1000x1000 and the other 20x20 or sometimes even 1x20, you might end up in different assembly kernels with results no longer guaranteed to be bit-to-bit identical. It gets a bit murkier when the eigenvector and non-eigenvector paths in LAPACK call two different BLAS functions to achieve the same goal, as there is no guarantee that the algorithms and implementations are exactly the same as in the Reference BLAS, only that the individual result is correct "to the computational precision of the data type" (or in practice, that it passes the tests from Reference-BLAS (among others).
So I think the LAPACK testsuite is still great for its original purpose - making sure that Reference-LAPACK built against Reference-BLAS, both ideally with compiler optimizations disabled, returns exactly the expected results. Only its utility for validating any given alternative implementation is limited by insisting on bit-wise identical floating point results - this has probably always been the case, but is aggravated by the availability of FMA and vector instructions in modern hardware.
(BTW I think @frjohnst never mentioned what compiler, architecture and cpu model the deviations were seen with ?)

@langou
Copy link
Contributor

langou commented Aug 20, 2024

Repeat calls to GEMM with the exact same parameters should produce the exact same computation and result,

This answers my question.

"similar" parameters might result in a different partitioning of the problem (depending on number of threads available) which I think might affect rounding errors even though the actual GEMM kernel (and its block size) is constant for a given cpu.

Yes, this is understood. I am asking about "exactly same" parameters. By "similar", I mean "exactly same".

If the code paths taken in LAPACK produce vastly different matrix sizes, say one calling GEMM with dimensions 1000x1000 and the other 20x20 or sometimes even 1x20, you might end up in different assembly kernels with results no longer guaranteed to be bit-to-bit identical.

Yes, this is understood and fine. If we create ourselves our own GEMM based on smaller GEMMs, and compare different block sizes, we should not expect bit-to-bit reproducibility between two similar calls.

It gets a bit murkier when the eigenvector and non-eigenvector paths in LAPACK call two different BLAS functions to achieve the same goal, as there is no guarantee that the algorithms and implementations are exactly the same as in the Reference BLAS, only that the individual result is correct "to the computational precision of the data type" (or in practice, that it passes the tests from Reference-BLAS (among others).

Yes, this is understood and fine. If we use two different BLAS libraries in an execution, we should not expect bit-to-bit reproducibility between two similar calls.

I meant

  • two calls of the same BLAS functions at different time in the execution
  • "exact same parameters", (the only thing that would change are the pointers, but the data inside the arrays would be bit-to-bit similar,) maybe the LDAs might change though, let me know if LDAs matters, LDA might actually matter, we might need to enforce that LDA is the same
  • same BLAS library (it is OK for two BLAS libraries to not return the same bit-to-bit answers)
  • same execution of the executable (it might be OK for two distinct executions of the same executable -same number of threads- to not return the same bit-to-bit answers, I do not really know about whether that's OK or not but here for sake of argumentation, let's assume same execution)

So I think the LAPACK testsuite is still great for its original purpose - making sure that Reference-LAPACK built against Reference-BLAS, both ideally with compiler optimizations disabled, returns exactly the expected results. Only its utility for validating any given alternative implementation is limited by insisting on bit-wise identical floating point results - this has probably always been the case, but is aggravated by the availability of FMA and vector instructions in modern hardware.

Even with FMA and vector instructions enabled, one could expect bit-to-bit reproducibility when the parameters are the same. The BLAS library would have to guarantee to pretty much have no "runtime" decision making. Or if it makes some "runtime" decisions, then it needs to stick to these decisions for the whole execution.

Given a set of input parameters, during. a given execution, () units on which computation are done (CPU, GPU, etc.), () block sizes, () number of threads, () microkernels, () reduction tree, etc. would need to be fixed.

The requirement "() block sizes, () number of threads, () reduction tree" could be waived somewhat if one uses ideas expressed in https://www.siam.org/publications/siam-news/articles/reproducible-blas-make-addition-associative-again/ But for a given execution, an easy way to achieve reproducibility is to either have no runtime decision, or stick to decisions once one has been made.

I am asking what I can expect from a BLAS library. Maybe the fact that the LDAs might not be the same leads to not-bit-to-bit-similar outputs. Maybe the fact that the pointers are not pointing to the same pieces of memory matters and leads to not-bit-to-bit-similar outputs.

Note that the MPI-3.0 standard reads (page 175, lines 9–13): "It is strongly recommended that MPI REDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of ranks." My understanding of this is once MPI decides (at runtime or at compile time) a reduction tree, it needs to stick to it for the whole execution as long as the number of processes are the same, the data size, data type are the same.

@thijssteel
Copy link
Collaborator

thijssteel commented Aug 20, 2024

I just talked to @langou did not read the entire discussion here, sorry if I missed anything.

I think that https://github.com/Reference-LAPACK/lapack/blob/master/SRC/dlaqr5.f#L787 is probably the bad guy here.
Especially for the vertical multiply, the starting index of the multiplication is different. With blocking, that might lead to slightly different rounding.

A possible solution would be to split the gemm call into two pieces, one of which only gets called if wantt is true. Then the gemm call that is important for reproducibility is at least consistent.

I have on one occasion noticed a situation where this was important. Slightly different rounding errors led to a different shift and a different eigenvalue converging first. But that issue was then exaggerated because my code was also bugged at the time. I don't know if this could lead to a big issue with the current HQR implementation. While I agree that requiring bitwise reproducibility for all implementations is too strict, if it is (relatively) easily attainable for reference LAPACK, i think a reasonable effort should be made to make it so.

@frjohnst
Copy link
Author

@martin-frbg We were using IBM's openxl compilers. As I recall, we saw this behavior on multiple Power machines (e.g. Power8).

@martin-frbg
Copy link
Collaborator

Given a set of input parameters, during. a given execution, () units on which computation are done (CPU, GPU, etc.), () >block sizes, () number of threads, () microkernels, () reduction tree, etc. would need to be fixed.

For a given set of input parameters, this is fixed (but not necessarily the same for the entire range of possible input parameters). I have no idea to what extent cpu hardware (and possible microcode flaws) might have an influence in the future, with two or three tiers of differently fast/capable cores within one cpu becoming popular. But all runtime decisions made in or "by" OpenBLAS are based on (1) model of cpu (2) problem size, so should always be fully reproducible. I do not think physical location of data during individual runs plays a role on any supported architecture, the optimizations in OpenBLAS tend to be a lot more mundane, judicious loop unrolling and making use of fma instructions where available

Difference in LDA might lead to non-bit-to-bit identity (between cases with different LDA of course) if blocking/loop unrolling is not equally well implemented for either dimension.

ISTR there was a case discussed in some issue here fairly recently, where the result of some fairly complex calculation was compared against exact zero, and that comparison determined the further code flow... not sure if I can find it or if it was just a bad dream though...

@angsch
Copy link
Collaborator

angsch commented Aug 20, 2024

A possible solution would be to split the gemm call into two pieces, one of which only gets called if wantt is true. Then the gemm call that is important for reproducibility is at least consistent.

I upvote this.

While it is not directly related to this issue, I would like to highlight that deviations between job configurations are already accepted for some routines in LAPACK. GESVD, for example, uses different algorithms depending on whether or not the singular vectors are requested and bitwise equality cannot be expected.

@langou
Copy link
Contributor

langou commented Aug 20, 2024

While it is not directly related to this issue, I would like to highlight that deviations between job configurations are already accepted for some routines in LAPACK. GESVD, for example, uses different algorithms depending on whether or not the singular vectors are requested and bitwise equality cannot be expected.

Correct.

A cross reference where this fact is mentioned is https://github.com/Reference-LAPACK/lapack/blob/master/SRC/dlaqr5.f#L787

@langou
Copy link
Contributor

langou commented Aug 20, 2024

@thijssteel wrote "A possible solution would be to split the gemm call into two pieces, one of which only gets called if wantt is true. Then the gemm call that is important for reproducibility is at least consistent."

@angsch wrote: "I upvote this."

@langou writes: "I upvote this too."

@langou
Copy link
Contributor

langou commented Aug 20, 2024

To summarize where we are at: @thijssteel established that, even with a BLAS that returns bit-to-bit-similar answers for similar calls, GEEV from LAPACK might not return bit-to-bit-similar eigenvalues when called "with eigenvectors computation enabled" and called "without eigenvectors computation enabled".

The reason is that some BLAS calls that affect the eigenvalue regions are done with different sizes (larger sizes for "with eigenvectors computation enabled", smaller sizes for "without eigenvectors computation enabled"). There is a possible fix. We could break the BLAS call in the "with eigenvectors computation enabled" so that the block sizes match the block sizes in "without eigenvectors computation enabled"and then deal with the remaining region (that do not affect eigenvalues) separately. (See @thijssteel's post.)

Thanks to @frjohnst for pointing out this matter and thanks to @thijssteel for explaining why this happens.

WRT this PR, I am not sure what to do. This PR is relevant when we do not expect bit-to-bit-similar eigenvalues. (Which is the current state of LAPACK.) However, I am leery to merge the PR. I would rather that we fix LAPACK GEEV to return bit-to-bit-similar eigenvalues when called "with eigenvectors computation enabled" and called "without eigenvectors computation enabled". I feel that, if we merge this PR, then we'll be less motivated to work on LAPACK GEEV.

@frjohnst
Copy link
Author

Please note that we see this issue for xGGEV and xGGEV3 (generalized eigenvalue) as well as
xGEEV, so several groups of routines would potentially need to be updated, if the opensource
LAPACK community chose to update LAPACK source instead of the test cases.
We would be okay with either approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants