Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Disable building with OpenMP on Apple Silicon #637

Merged
merged 2 commits into from
Jan 28, 2025

Conversation

mmuetzel
Copy link
Contributor

Building with OpenMP causes some test failures on macOS, see #634 (comment). The affected tests currently are: H1BasisEvaluation and SD_H1BasisEvaluation.

For the time being, disable building with OpenMP on macOS in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse package from Homebrew. So, also disable building with CHOLMOD.

The logic to build with OpenMP is left in place. So, it should be easy to re-activate building with it in case the underlying issues will be fixed.

Note that the tests with OpenMP fail for optimization level -O3 but succeed with -O2.

@mmuetzel
Copy link
Contributor Author

With these changes, the following test (still) fails on macOS 14 (Apple Silicon):

The following tests FAILED:
	833 - pointload2 (Failed)                               serial
Errors while running CTest

For the macOS 13 runner (Intel CPU):

The following tests FAILED:
Errors while running CTest
	484 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	486 - SD_LinearFormsAssembly (Failed)                   benchmark serendipity serial threaded
Error: Process completed with exit code 8.

So, H1BasisEvaluation and SD_H1BasisEvaluation no longer fail on the runner with Apple Silicon when building without OpenMP.
SD_H1BasisEvaluation and SD_LinearFormsAssembly still fail on the Intel CPU. So, no changes with respect to the "baseline" before the changes from this PR for that runner.
pointload2 still fails on Apple Silicon.

Afaict, the remaining test failures in this PR cannot be related to OpenMP. After all, no implementation of OpenMP is installed on the runners now at all.

@mmuetzel
Copy link
Contributor Author

Since disabling OpenMP doesn't make a difference for the failing tests on Intel CPU, I updated this PR to build without OpenMP only on Apple Silicon. That way, we retain the CI coverage for OpenMP on macOS at least for Intel CPU.

@mmuetzel mmuetzel changed the title Disable building with OpenMP on macOS runners. CI: Disable building with OpenMP on Apple Silicon Jan 27, 2025
@juharu
Copy link
Contributor

juharu commented Jan 27, 2025 via email

@mmuetzel mmuetzel force-pushed the ci-macos branch 2 times, most recently from 608008c to 9b6bbff Compare January 27, 2025 10:54
@mmuetzel
Copy link
Contributor Author

Looking at the CI log, the file TEST.PASSED doesn't exist indeed:

Content of fem/tests/SD_H1BasisEvaluation
  ---- Files ----
  total 488
  drwxr-xr-x  5 runner  staff     160 Jan 27 09:30 CMakeFiles
  -rw-r--r--  1 runner  staff    1840 Jan 27 09:30 CTestTestfile.cmake
  -rw-r--r--  1 runner  staff      16 Jan 27 09:24 ELMERSOLVER_STARTINFO
  -rw-r--r--  1 runner  staff   55226 Jan 27 09:24 H1BasisEvaluation.F90
  -rwxr-xr-x  1 runner  staff  143824 Jan 27 09:40 H1BasisEvaluation.dylib
  -rw-r--r--  1 runner  staff   10358 Jan 27 09:30 Makefile
  -rw-r--r--  1 runner  staff    1481 Jan 27 09:30 cmake_install.cmake
  drwxr-x---  6 runner  staff     192 Jan 27 09:46 cube
  -rw-r--r--  1 runner  staff     449 Jan 27 09:24 cube.grd
  -rw-r--r--  1 runner  staff     658 Jan 27 09:29 h1basistest.sif
  -rw-r--r--  1 runner  staff       0 Jan 27 09:46 test-stderr_1.log
  -rw-r--r--  1 runner  staff   10178 Jan 27 09:46 test-stdout_1.log

Could there be a situation where the test passes, but that file is still not being created?

@juharu
Copy link
Contributor

juharu commented Jan 27, 2025 via email

@mmuetzel mmuetzel force-pushed the ci-macos branch 2 times, most recently from 836161e to b3b6f1d Compare January 27, 2025 11:10
@mmuetzel
Copy link
Contributor Author

Trying to follow your changes on the act_test branch, it looks like switching the optimization level from -O3 to -O2 seems to have "fixed" the test failure.

From GCC's documentation, the additional optimization passes that are activated with -O3 are:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:

-fgcse-after-reload
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-loops
-fsplit-paths
-ftree-loop-distribution
-ftree-partial-pre
-funswitch-loops
-fvect-cost-model=dynamic
-fversion-loops-for-strides

Would it make sense to try and find out which of these flags are causing the issue? Are you suspicious it could be one of those options specifically?

@juharu
Copy link
Contributor

juharu commented Jan 27, 2025 via email

@mmuetzel
Copy link
Contributor Author

Yes, well, I have no idea which of those, one or more, could be the culprit. Could also be some combination, i guess. My take would be to downgrade to -O2, also for macos-14 and ARM64, and then you could also re-activate OpenMP.

Ideally, we could disable only one (or more) optimization only in the compilation units that are causing these issues.
But I understand that tracking that down is very tedious if the issues can only be reproduced with the CI runners. So, I agree that the "next best thing" might be to work around the issue only in CI (and hope that users will make sure that they won't run into the same issue).

@mmuetzel
Copy link
Contributor Author

I left the logic for adding runners without OpenMP in place to make it easier in case we would ever want to add them at some point in the future. Currently, the build matrix that doesn't contain that configuration though.

Like you proposed, I changed the build type to "RelWithDebInfo" for the Apple Silicon and Intel CPU runners.
Let's see if that avoids these pesky test errors.

@mmuetzel
Copy link
Contributor Author

In the latest iteration (with OpenMP, with -O2), the following tests are failing on macOS 14 (Apple Silicon) again:

The following tests FAILED:
	238 - H1BasisEvaluation (Failed)                        benchmark serial
	484 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	833 - pointload2 (Failed)                               serial
Error: Process completed with exit code 8.

Maybe, will still need to build without OpenMP on Apple Silicon and can revert to -O3 for that runner.

The macOS 13 runner (Intel CPU) passed all tests in the test suite now. So, we could probably keep building with OpenMP and -O2 for that one.

@mmuetzel
Copy link
Contributor Author

The only failing test is for macOS 14 (Apple Silicon):

The following tests FAILED:
	833 - pointload2 (Failed)                               serial
Error: Process completed with exit code 8.

That is the same test that also failed for the Ubuntu on ARM runner.

It didn't matter whether ElmerFEM was built with or without OpenMP or which optimization level was used. That particular test always failed on macOS 14 (Apple Silicon).

@mmuetzel
Copy link
Contributor Author

What is really strange about the failing pointload2 test:
According to the CI log, the file TEST.PASSED exists after that test was run:

Content of fem/tests/pointload2
  ---- Files ----
  total 88
  drwxr-xr-x  4 runner  staff    128 Jan 28 07:41 CMakeFiles
  -rw-r--r--  1 runner  staff   1733 Jan 28 07:41 CTestTestfile.cmake
  -rw-r--r--  1 runner  staff      9 Jan 28 07:39 ELMERSOLVER_STARTINFO
  -rw-r--r--  1 runner  staff   7834 Jan 28 07:41 Makefile
  -rw-r--r--  1 runner  staff      2 Jan 28 07:55 TEST.PASSED
  -rw-r--r--  1 runner  staff   1669 Jan 28 07:41 case.sif
  -rw-r--r--  1 runner  staff   1461 Jan 28 07:41 cmake_install.cmake
  drwxr-x---  6 runner  staff    192 Jan 28 07:55 square
  -rw-r--r--  1 runner  staff    637 Jan 28 07:39 square.grd
  -rw-r--r--  1 runner  staff      0 Jan 28 07:55 test-stderr_1.log
  -rw-r--r--  1 runner  staff  10705 Jan 28 07:55 test-stdout_1.log

If I understood correctly, that file is only supposed to be created for succeeding tests. But it is still marked as failing.

Is something different for that test compared to the others?

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 28, 2025

Never mind. If the file exists, it depends on its content whether the test is counted as passing. (The file content must be 1.)

The test seems to be failing due to an accuracy issue:

  WARNING:: CompareToReferenceSolution: Solver 1 FAILED:  Norm = 7.25882728E-01  RefNorm = 7.25476270E-01
  CompareToReferenceSolution: Relative Error to reference norm: 5.602639E-04

(In my opinion, the deviation doesn't look too bad. But tbh, I can't tell what is an acceptable deviation.)

@raback
Copy link
Contributor

raback commented Jan 28, 2025

Thanx for looking into this! I checked the pointload2 case. It is related to a point-like volumetric source that can be given to a flow equation. The issue is that it was very sensitive. I activated a "stokes" mode, decreased resolution and took up iterative solvers to fight the sensitivity. In the original case even MUMPS and Umfpack gave a little different solutions.

Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.
The tests `SD_H1BasisEvaluation` and `SD_LinearFormsAssembly` are
failing on macOS 13 (Intel CPU) at optimization level `-O3`.
They are passing when ElmerFEM is built with optimization level `-O2`.

Switch the build type to "RelWithDebInfo" for that runner as a
work-around.
@mmuetzel
Copy link
Contributor Author

Thanks!

I rebased this PR on top of your change. Let's see if that helps for the M1 runner.

@juharu juharu merged commit 7596fa9 into ElmerCSC:devel Jan 28, 2025
10 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants