Segmentation fault for inverse iteration solver #157

Closed

jordidj opened this issue Dec 9, 2023 · 12 comments


jordidj commented Dec 9, 2023

Issue description

Using the inverse iteration solver results in a segmentation fault.

Bug report

Minimal example for reproduction: config.par

&gridlist
  gridpoints = 51
/

&solvelist
  solver = "inverse-iteration"
  sigma = (0.0d0, 0.015d0)
  maxiter = 1000
  tolerance = 1.0d-7
/

&equilibriumlist
  equilibrium_type = "resistive_tearing"
  boundary_type = "wall_weak"
/

&savelist
  write_eigenfunctions = .false.
/
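
Assuming the usual Legolas invocation (the -i flag here is from memory and may differ between versions), this is reproduced with:

legolas -i config.par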

Actual result
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Expected result
Computation of the tearing mode.

Version info

  • Legolas version: 2.0.6 (master) and 2.1.0 (develop)
  • Operating system: macOS Sonoma 14.1.2
  • C(XX) compiler: AppleClang 15.0.0.15000040
  • gfortran 13.2.0
  • cmake 3.27.7
  • LAPACK 3.9.1 (from the macOS Accelerate framework: https://developer.apple.com/documentation/accelerate/blas/)
  • arpack-ng (commit 569a385)
n-claes added the bug label Dec 10, 2023

n-claes commented Dec 10, 2023

Checked this on master and develop and I'm not able to reproduce it on either one: the solver converges after only 2 iterations and the tearing mode looks fine.
Is it only with "wall_weak" as boundary condition, or does it throw a segfault for the "wall" boundary as well (or for other equilibria/solvers)?

I am running gfortran 12.2.0 instead of 13 though, so maybe there is an issue there. Can you check if the matrices look okay without actually solving the problem (i.e. solver = "none")?
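
For reference, that check only needs the solvelist changed (other namelists as in the example above):

&solvelist
  solver = "none"
/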

Here is how the matrices look in my case, same setup as the example above but with 5 gridpoints:

[screenshot: matrix structure for the 5-gridpoint setup]


jordidj commented Dec 11, 2023

Segfaults also appear for the wall BC and other equilibria (not for other solvers as far as I have seen). The matrices look fine, though there are some minor differences with yours (on the order of machine precision, it seems).

[screenshot: matrices]


jordidj commented Dec 12, 2023

I ran the tests locally and the first inverse iteration unit test [Test]: inverse iteration (AX = wX, sigma = -4.5 - 0.2i) failed. I am unsure if this is related to this problem, but 3 regression tests also failed:

FAILED test_mri_accretion.py::TestMRI_AccretionQRCholesky::test_derived_eigenfunction[w=-0.00202800+0.62772200j]
FAILED test_mri_accretion.py::TestMRI_AccretionQRCholesky::test_derived_eigenfunction[w=-0.00186038+0.58049560j]
FAILED test_mri_accretion.py::TestMRI_AccretionQRCholesky::test_derived_eigenfunction[w=-0.00173970+0.54439670j]

[failed-diff plots for the three derived eigenfunction cases above]


n-claes commented Dec 14, 2023

The matrices look fine at first glance; there are some minor differences at machine-precision level, but that should be a non-issue.

> I ran the tests locally and the first inverse iteration unit test [Test]: inverse iteration (AX = wX, sigma = -4.5 - 0.2i) failed. I am unsure if this is related to this problem, but 3 regression tests also failed

That only those regression tests fail, for the derived eigenfunctions of that specific case, is something I can accept; the differences you show above are quite small. What's the RMS difference between the actual and stored solutions? We should double-check whether it's due to numerical differences, and if so we can tweak the tolerance a bit for that particular case.

The failing unit test is a red flag though: those solutions are well-defined and the tolerances are more than acceptable. Since it really should NOT fail, this points to a deeper issue, so I'll look into that.

What are your FC and CC environment variables set to? Are you indeed using the Clang compiler, or actually using gfortran? CMake should print this info if you do a clean build; you're looking for these lines:

-- The C compiler identification is GNU 12.2.0
-- The CXX compiler identification is GNU 12.2.0
-- The Fortran compiler identification is GNU 12.2.0

I'll bump our testing suite to include gfortran-13 first and see what that gives; I'm hoping it's compiler-specific.

n-claes self-assigned this Dec 14, 2023
n-claes added this to the Legolas 2.1 milestone Dec 14, 2023

n-claes commented Dec 14, 2023

Tests are fine with gfortran-13 in #158, so at least that's something I guess... I'll see if I can include a macOS build for testing as well (it's about time we had that). It might be a local issue too; can you double-check the environment variables for the compilers (CC, FC, CXX, etc.)?


jordidj commented Dec 16, 2023

I was using AppleClang

-- The C compiler identification is AppleClang 15.0.0.15000100
-- The CXX compiler identification is AppleClang 15.0.0.15000100
-- The Fortran compiler identification is GNU 13.2.0

but the problem persists after updating CC and CXX to

-- The C compiler identification is GNU 13.2.0
-- The CXX compiler identification is GNU 13.2.0
-- The Fortran compiler identification is GNU 13.2.0

The inverse iteration test fails due to a segmentation fault.


n-claes commented Dec 19, 2023

Does it crash during the inverse iteration process itself, or before actually calling the solver?
I'd really like to know 1) in which code module/part the crash occurs and 2) what is being referenced. I'm unable to reproduce this, so if you could narrow down the line where the segfault occurs, that would be super helpful!


jordidj commented Dec 20, 2023

The logging output looks like this before it throws a segmentation fault:

[screenshot: logging output]


jordidj commented Dec 20, 2023

Line 151 in smod_inverse_iteration.f08 is the issue:

ev = zdotc(N, x, 1, s, 1) / zdotc(N, x, 1, r, 1)
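! annotation (not in the source): zdotc(n, zx, incx, zy, incy) returns
! sum(conjg(zx)*zy), so ev here is the eigenvalue update (x^H s) / (x^H r)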

I added debug statements to reach this conclusion:
[screenshots: debug output around the zdotc call]


n-claes commented Jan 6, 2024

The more I look into this the more confused I get...

The problematic line is nothing more than the dot product of two vectors, which are well-defined and well-allocated 1D arrays, and zdotc itself is a simple level-1 BLAS vector operation. The r and s vectors used in that product originate from the matrix-vector products zhbmv and zgbmv, respectively, both level-2 BLAS routines.

As far as I can see there are 3 possibilities here:

  1. One of the dot products somehow throws a segfault. Could you check which one of the two (or both) it is?
  2. Something is going wrong when r or s are calculated. Could you check these vectors for length, contents, etc? Though since both these routines exit normally I don't suspect problems here.
  3. There is an issue with the x-vector. This seems very unlikely, since this is still the first iteration, where x holds only unit values...

You can always double-check by quickly writing a dot-product routine with some sanity checks on the input variables, and calling that instead of BLAS; a minimal sketch follows.
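
Something like this (illustrative only, not the actual Legolas code; assumes a module procedure so the assumed-shape interface is explicit, and uses complex(8) for brevity):

! checked_zdotc is a hypothetical stand-in for BLAS zdotc with a sanity check
complex(8) function checked_zdotc(x, y)
  complex(8), intent(in) :: x(:), y(:)

  if (size(x) /= size(y)) error stop "checked_zdotc: length mismatch"
  checked_zdotc = dot_product(x, y)  ! = sum(conjg(x) * y), same result as zdotc
end function checked_zdotc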
Sorry I can't be of more help here; I have tried a ton of things, but I'm unable to reproduce this issue in the first place (either locally on macOS or in the CI/CD pipeline on Linux).


jordidj commented Jan 8, 2024

It seems Apple's vecLib framework is the issue, as described here. As they suggest, adding -ff2c to the CMakeLists.txt file in the target directory, like so,

get_filename_component(Fortran_COMPILER_NAME ${CMAKE_Fortran_COMPILER} NAME)
if (${Coverage})
    message(STATUS "====================")
    message(STATUS "Building with code coverage enabled.")
    message(STATUS "Default Fortran flags are disabled, optimisation disabled.")
    message(STATUS "====================")
    set(CMAKE_Fortran_FLAGS "--coverage -O0 -g -cpp -ff2c")
else()
    set(CMAKE_Fortran_FLAGS "-fcheck=all -fbounds-check -Wall \
                             -Wextra -Wconversion -pedantic -fbacktrace -cpp -ff2c")
endif()

fixes the issue. This flag affects all complex computations though, so I am not sure if we want this as the permanent solution.

Since there are no issues when installing openblas and lapack through Homebrew, I suggest just making this the recommended way to install on macOS (and indicating this on the website).
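
For reference, that install route is a one-liner (current Homebrew formula names):

brew install openblas lapack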


n-claes commented Jan 9, 2024

Ooh, this is subtle. It's really the function zdotc specifically, when BLAS from the vecLib framework is used in combination with a GNU compiler... Good catch! (The likely mechanism: f2c-convention BLAS returns complex function results via a hidden argument, while gfortran by default returns them by value, so calling Accelerate's zdotc from gfortran mismatches the ABI.)
Enabling the -ff2c flag by default is indeed probably not what we want:

> However, if your code exports any functions returning single-precision or (single- or double-precision) complex results, then any code that calls those functions will be forced to follow F2C conventions as well.

This raises some concerns, because I think this applies to our code base.

I'll see if it is possible to configure CMake in such a way that it detects this particular combo and only then enables the flag (along the lines of the sketch below). Making Homebrew the default way to get openblas and lapack is a good idea; we can put it as such in the installation guide.
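
A possible sketch of such a guard (untested; it assumes FindBLAS sets BLAS_LIBRARIES to something matching the Accelerate/vecLib path when the framework is picked up):

if(APPLE AND CMAKE_Fortran_COMPILER_ID STREQUAL "GNU"
        AND BLAS_LIBRARIES MATCHES "Accelerate|vecLib")
    set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -ff2c")
endif()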

jordidj closed this as completed Dec 17, 2024