Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bench_blas.py::test_gesdd in Codspeed CI job #4776

Closed
rgommers opened this issue Jul 2, 2024 · 9 comments · Fixed by #4819
Closed

bench_blas.py::test_gesdd in Codspeed CI job #4776

rgommers opened this issue Jul 2, 2024 · 9 comments · Fixed by #4819

Comments

@rgommers
Copy link
Contributor

rgommers commented Jul 2, 2024

This failure just showed up in CI, from this log:

 benchmarks/bench_blas.py ............................................... [ 75%]
  .........F.....                                                          [100%]
  ================================ 62 benchmarked ================================
  
  =================================== FAILURES ===================================
  ______________________________ test_gesdd[mn1-s] _______________________________
  
  benchmark = <pytest_codspeed.plugin.BenchmarkFixture object at 0x152f7980>
  mn = (1000, 222), variant = 's'
  
      @pytest.mark.parametrize('variant', ['s', 'd'])
      @pytest.mark.parametrize('mn', gesdd_sizes)
      def test_gesdd(benchmark, mn, variant):
          m, n = mn
          rndm = np.random.RandomState(1234)
          dtyp = dtype_map[variant]
      
          a = np.array(rndm.uniform(size=(m, n)), dtype=dtyp, order='F')
      
          gesdd_lwork = ow.get_func('gesdd_lwork', variant)
      
          lwork, info = gesdd_lwork(m, n)
          lwork = int(lwork)
          assert info == 0
      
          gesdd = ow.get_func('gesdd', variant)
          u, s, vt, info = benchmark(run_gesdd, a, lwork, gesdd)
      
  >       assert info == 0
  E       assert 12 == 0
  
  benchmarks/bench_blas.py:237: AssertionError
  =========================== short test summary info ============================
  FAILED benchmarks/bench_blas.py::test_gesdd[mn1-s] - assert 12 == 0
  ================== 1 failed, 61 passed in 2080.74s (0:34:40) ===================
   ** On entry to SLASCL parameter number  4 had an illegal value
   ** On entry to SLASCL parameter number  4 had an illegal value
   ** On entry to SLASCL parameter number  4 had an illegal value
   ** On entry to SLASCL parameter number  4 had an illegal value

This benchmark was introduced in gh-4678 one and a half months ago. I'm not sure it failed before like this. @ev-br you may want to look into this?

@ev-br
Copy link
Contributor

ev-br commented Jul 2, 2024

Hm... Actually, it did fail like this before: https://github.com/OpenMathLib/OpenBLAS/actions/workflows/codspeed-bench.yml
And the single-precision variant has been introduced in #4763, so it's relatively recent. Hm. Am taking a look.

@ev-br
Copy link
Contributor

ev-br commented Jul 2, 2024

A few quick observations:

  1. the mistake is almost certainly a failure of thread safety in the LAPACK implementation or in a Julia wrapper. I suspect someone is misusing a static or global variable.

The benchmark is using OPENBLAS_NUM_TREADS=1, so it's not thread safety but it's still deep in OpenBLAS.

  • I cannot repro it locally with either scipy_openbpas32 wheel or self-built develop branch.

All in all, this looks like a possibly genuine edge case in OpenBLAS (or reference LAPACK? not sure where the single-precision gesdd kernel comes from)

I'll try to smoke it out on CI without codspeed now. Meanwhile, maybe we should just remove the assertion for the time being. The assertion is not required for the benchmark itself; it's just generally nicer to benchmark correctly working code, they say.

EDIT: #4777

@martin-frbg
Copy link
Collaborator

Weird error, INFO=4 would mean this input argument (the denominator of the scale factor) is either NAN or zero

@martin-frbg
Copy link
Collaborator

There is one call graph where this input argument is provided by SNRM2...

@ev-br
Copy link
Contributor

ev-br commented Jul 2, 2024

This is indeed codspeed-specific: https://github.com/OpenMathLib/OpenBLAS/actions/runs/9765188405/job/26955309472?pr=4777

I asked on their discord (https://discord.com/channels/1065233827569598464/1065686090452828251/threads/1257753281342738502 -- might need to join to see, no idea; will repost any insights here anyway).

Meanwhile, do we want to disable the assertion so that the runs are uploaded and we can at least see the flamegraphs?

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jul 2, 2024

Guess it would make sense to disable the assertion, especially if the problem cannot be reproduced outside codspeed. (This is unlikely to be cpu-specific as there is only one assembly kernel for SNRM2 in use on all x86_64 targets (except the plain C "GENERIC" one). Or if it is, it would have to be due to compiler/assembler misbehaviour)
Also the NRM2 code path in OpenBLAS runs single-threaded in any case, and the Reference-LAPACK codebase is not multithreaded except for a handful of routines that use OpenMP parallelization if compiled for it.

ev-br added a commit to ev-br/OpenBLAS that referenced this issue Jul 3, 2024
In OpenMathLib#4776
we're hitting
** On entry to SLASCL parameter number  4 had an illegal value

on codspeed, but not outside (either locally or on github runners)
@ev-br
Copy link
Contributor

ev-br commented Jul 3, 2024

okay, I repurposed #4777 to ignore the sgesdd failure.

ev-br added a commit to ev-br/OpenBLAS that referenced this issue Jul 3, 2024
In OpenMathLib#4776
we're hitting
** On entry to SLASCL parameter number  4 had an illegal value

on codspeed, but not outside (either locally or on github runners)
@martin-frbg
Copy link
Collaborator

martin-frbg commented Jul 18, 2024

The plot thickens ... this is almost certainly fallout from #4728 (where I fixed SSCAL to return NAN in certain corner cases like 0*NAN or 0*INFINITY without adequately appreciating its internal use in various functions that do not expect this behavior). Indeed #4794 gets flagged as a performance regression by codspeed, as far as I can tell solely for the fact that the sgesdd benchmark is now traversing a code path it did not enter before.

@ev-br
Copy link
Contributor

ev-br commented Jul 19, 2024

One check that the problem's fixed is to re-enable the sdgesdd codspeed benchmark:

$ git diff
diff --git a/benchmark/pybench/benchmarks/bench_blas.py b/benchmark/pybench/benchmarks/bench_blas.py
index 628c0cb2a..8127dd0c7 100644
--- a/benchmark/pybench/benchmarks/bench_blas.py
+++ b/benchmark/pybench/benchmarks/bench_blas.py
@@ -234,14 +234,10 @@ def test_gesdd(benchmark, mn, variant):
     gesdd = ow.get_func('gesdd', variant)
     u, s, vt, info = benchmark(run_gesdd, a, lwork, gesdd)
 
-    if variant != 's':
-        # On entry to SLASCL parameter number  4 had an illegal value
-        # under codspeed (cannot repro locally or on CI w/o codspeed)
-        # https://github.com/OpenMathLib/OpenBLAS/issues/4776
-        assert info == 0
-
-        atol = {'s': 1e-5, 'd': 1e-13}
-        np.testing.assert_allclose(u @ np.diag(s) @ vt, a, atol=atol[variant])
+    assert info == 0
+
+    atol = {'s': 1e-5, 'd': 1e-13}
+    np.testing.assert_allclose(u @ np.diag(s) @ vt, a, atol=atol[variant])
 
 
 # linalg.eigh

cc @art049 of codspeed fame, in case you've time to isolate the problem under codspeed, as discussed elsewhere. Would like to thank you publicly for building codspeed regardless

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants