Segmentation fault for inverse iteration solver #157

Closed

jordidj opened this issue Dec 9, 2023 · 12 comments


jordidj commented Dec 9, 2023

Issue description

Using the inverse iteration solver results in a segmentation fault.

Bug report

Minimal example for reproduction: config.par

&gridlist
  gridpoints = 51
/

&solvelist
  solver = "inverse-iteration"
  sigma = (0.0d0, 0.015d0)
  maxiter = 1000
  tolerance = 1.0d-7
/

&equilibriumlist
  equilibrium_type = "resistive_tearing"
  boundary_type = "wall_weak"
/

&savelist
  write_eigenfunctions = .false.
/
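
Assuming the usual Legolas invocation (the -i flag here is from memory and may differ between versions), this is reproduced with:

legolas -i config.par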

Actual result
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Expected result
Computation of the tearing mode.

Version info

  • Legolas version: 2.0.6 (master) and 2.1.0 (develop)
  • Operating system: macOS Sonoma 14.1.2
  • C(XX) compiler: AppleClang 15.0.0.15000040
  • gfortran 13.2.0
  • cmake 3.27.7
  • LAPACK 3.9.1 (from the macOS Accelerate framework: https://developer.apple.com/documentation/accelerate/blas/)
  • arpack-ng (commit 569a385)
n-claes added the bug label Dec 10, 2023

n-claes commented Dec 10, 2023

Checked this on master and develop and I'm not able to reproduce it on either one: the solver converges after only 2 iterations and the tearing mode looks fine.
Is it only with "wall_weak" as boundary condition, or does it throw a segfault for the "wall" boundary as well (or for other equilibria/solvers)?

I am running gfortran 12.2.0 instead of 13 though, so maybe there is an issue there. Can you check if the matrices look okay without actually solving the problem (i.e. solver = "none")?
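
For reference, that check only needs the solvelist changed (other namelists as in the example above):

&solvelist
  solver = "none"
/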

Here is how the matrices look in my case, same setup as the example above but with 5 gridpoints:

[screenshot: matrix structure for the 5-gridpoint setup]


jordidj commented Dec 11, 2023

Segfaults also appear for the wall BC and other equilibria (not for other solvers as far as I have seen). The matrices look fine, though there are some minor differences with yours (on the order of machine precision, it seems).

[screenshot: matrices]


jordidj commented Dec 12, 2023

I ran the tests locally and the first inverse iteration unit test [Test]: inverse iteration (AX = wX, sigma = -4.5 - 0.2i) failed. I am unsure if this is related to this problem, but 3 regression tests also failed:

FAILED test_mri_accretion.py::TestMRI_AccretionQRCholesky::test_derived_eigenfunction[w=-0.00202800+0.62772200j]
FAILED test_mri_accretion.py::TestMRI_AccretionQRCholesky::test_derived_eigenfunction[w=-0.00186038+0.58049560j]
FAILED test_mri_accretion.py::TestMRI_AccretionQRCholesky::test_derived_eigenfunction[w=-0.00173970+0.54439670j]

[failed-diff plots for the three derived eigenfunction cases above]


n-claes commented Dec 14, 2023

The matrices look fine at first glance; there are some minor differences at machine-precision level, but that should be a non-issue.

> I ran the tests locally and the first inverse iteration unit test [Test]: inverse iteration (AX = wX, sigma = -4.5 - 0.2i) failed. I am unsure if this is related to this problem, but 3 regression tests also failed

That only those regression tests fail, for the derived eigenfunctions of that specific case, is something I can accept; the differences you show above are quite small. What's the RMS difference between the actual and stored solutions? We should double-check whether it's due to numerical differences, and if so we can tweak the tolerance a bit for that particular case.

The failing unit test is a red flag though: those solutions are well-defined and the tolerances are more than acceptable. Since it really should NOT fail, this points to a deeper issue, so I'll look into that.

What are your FC and CC environment variables set to? Are you indeed using the Clang compiler, or actually using gfortran? CMake should print this info if you do a clean build; you're looking for these lines:

-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- The Fortran compiler identification is GNU 12.2.0

I'll bump our testing suite to include gfortran-13 first and see what that gives; I'm hoping it's compiler-specific.

n-claes self-assigned this Dec 14, 2023
n-claes added this to the Legolas 2.1 milestone Dec 14, 2023

n-claes commented Dec 14, 2023

Tests are fine with gfortran-13 in #158, so at least that's something I guess... I'll see if I can include a macOS build for testing as well (it's about time we had that). It might be a local issue too; can you double-check the environment variables for the compilers (CC, FC, CXX, etc.)?


jordidj commented Dec 16, 2023

I was using AppleClang

-- The C compiler identification is AppleClang 15.0.0.15000100
-- The CXX compiler identification is AppleClang 15.0.0.15000100
-- The Fortran compiler identification is GNU 13.2.0

but the problem persists after updating CC and CXX to

-- The C compiler identification is GNU 13.2.0
-- The CXX compiler identification is GNU 13.2.0
-- The Fortran compiler identification is GNU 13.2.0

The inverse iteration test fails due to a segmentation fault.


n-claes commented Dec 19, 2023

Does it crash during the inverse iteration process itself, or before actually calling the solver?
I'd really like to know 1) in which code module/part the crash occurs and 2) what is being referenced. I'm unable to reproduce this, so if you could narrow down the line where the segfault occurs, that would be super helpful!


jordidj commented Dec 20, 2023

The logging output looks like this before it throws a segmentation fault:

[screenshot: logging output]


jordidj commented Dec 20, 2023

Line 151 in smod_inverse_iteration.f08 is the issue:

ev = zdotc(N, x, 1, s, 1) / zdotc(N, x, 1, r, 1)
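! annotation (not in the source): zdotc(n, zx, incx, zy, incy) returns
! sum(conjg(zx)*zy), so ev here is the eigenvalue update (x^H s) / (x^H r)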

I added debug statements to reach this conclusion:
[screenshots: debug output around the zdotc call]


n-claes commented Jan 6, 2024

The more I look into this the more confused I get...

The problematic line is nothing more than the dot product of two vectors, which are well-defined and well-allocated 1D arrays, and zdotc itself is a simple level-1 BLAS vector operation. The r and s vectors used in that product originate from the matrix-vector products zhbmv and zgbmv, respectively, both level-2 BLAS routines.

As far as I can see there are 3 possibilities here:

  1. One of the dot products somehow throws a segfault. Could you check which one of the two (or both) it is?
  2. Something is going wrong when r or s are calculated. Could you check these vectors for length, contents, etc? Though since both these routines exit normally I don't suspect problems here.
  3. There is an issue with the x-vector. This seems very unlikely, since this is still the first iteration, where x holds only unit values...

You can always double-check by quickly writing a dot-product routine with some sanity checks on the input variables, and calling that instead of BLAS; a minimal sketch follows.
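
Something like this (illustrative only, not the actual Legolas code; assumes a module procedure so the assumed-shape interface is explicit, and uses complex(8) for brevity):

! checked_zdotc is a hypothetical stand-in for BLAS zdotc with a sanity check
complex(8) function checked_zdotc(x, y)
  complex(8), intent(in) :: x(:), y(:)

  if (size(x) /= size(y)) error stop "checked_zdotc: length mismatch"
  checked_zdotc = dot_product(x, y)  ! = sum(conjg(x) * y), same result as zdotc
end function checked_zdotc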
Sorry I can't be of more help here; I have tried a ton of things, but I'm unable to reproduce this issue in the first place (either locally on macOS or in the CI/CD pipeline on Linux).


jordidj commented Jan 8, 2024

It seems Apple's vecLib framework is the issue, as described here. As they suggest, adding -ff2c to the CMakeLists.txt file in the target directory, like so,

get_filename_component(Fortran_COMPILER_NAME ${CMAKE_Fortran_COMPILER} NAME)
if (${Coverage})
    message(STATUS "====================")
    message(STATUS "Building with code coverage enabled.")
    message(STATUS "Default Fortran flags are disabled, optimisation disabled.")
    message(STATUS "====================")
    set(CMAKE_Fortran_FLAGS "--coverage -O0 -g -cpp -ff2c")
else()
    set(CMAKE_Fortran_FLAGS "-fcheck=all -fbounds-check -Wall \
                             -Wextra -Wconversion -pedantic -fbacktrace -cpp -ff2c")
endif()

fixes the issue. This flag affects all complex computations though, so I am not sure if we want this as the permanent solution.

Since there are no issues when installing openblas and lapack through Homebrew, I suggest just making this the recommended way to install on macOS (and indicating this on the website).
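
For reference, that install route is a one-liner (current Homebrew formula names):

brew install openblas lapack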


n-claes commented Jan 9, 2024

Ooh, this is subtle. It's really the function zdotc specifically, when BLAS from the vecLib framework is used in combination with a GNU compiler... Good catch! (The likely mechanism: f2c-convention BLAS returns complex function results via a hidden argument, while gfortran by default returns them by value, so calling Accelerate's zdotc from gfortran mismatches the ABI.)
Enabling the -ff2c flag by default is indeed probably not what we want:

> However, if your code exports any functions returning single-precision or (single- or double-precision) complex results, then any code that calls those functions will be forced to follow F2C conventions as well.

This raises some concerns, because I think this applies to our code base.

I'll see if it is possible to configure CMake in such a way that it detects this particular combo and only then enables the flag (along the lines of the sketch below). Making Homebrew the default way to get openblas and lapack is a good idea; we can put it as such in the installation guide.
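
A possible sketch of such a guard (untested; it assumes FindBLAS sets BLAS_LIBRARIES to something matching the Accelerate/vecLib path when the framework is picked up):

if(APPLE AND CMAKE_Fortran_COMPILER_ID STREQUAL "GNU"
        AND BLAS_LIBRARIES MATCHES "Accelerate|vecLib")
    set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -ff2c")
endif()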

jordidj closed this as completed Dec 17, 2024