Skip to content

superenv: handle formulae with runtime CPU detection#11608

Merged
carlocab merged 1 commit intoHomebrew:masterfrom
carlocab:runtime-arch
Jul 1, 2021
Merged

superenv: handle formulae with runtime CPU detection#11608
carlocab merged 1 commit intoHomebrew:masterfrom
carlocab:runtime-arch

Conversation

@carlocab
Copy link
Member

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

Some formulae are able to detect the features of the runtime CPU, and
execute code accordingly. This typically entails 1) the detection of
features of the build-time CPU in order to determine the targets that
the compiler can generate code for, and 2) generating code for the
targets that the compiler can support.

Our filtering of optimization flags can cause misdetection of compiler
features, leading to failed builds [1], and miscompilation even when the
build does not fail [2].

Let's try to fix this by allowing formulae to declare
ENV.runtime_cpu_detection which skips the filtering of -march and
related flags.

I've also skipped the filtering of the optimisation
level, since it seems to me that if upstream maintainers have gone to
the lengths of writing code that detects runtime hardware, they probably
also know better about appropriate -O flags to use.

This is a partial list of formulae that should make use of this feature:

  1. apache-arrow
  2. fftw
  3. gromacs
  4. open-mpi
  5. openblas

Partially resolves Homebrew/homebrew-core#76537.

[1] open-mpi/ompi#8306 and linked issues/PRs
[2] Homebrew/homebrew-core#76537

@BrewTestBot
Copy link
Contributor

Review period will end on 2021-06-29 at 00:00:00 UTC.

@BrewTestBot BrewTestBot added the waiting for feedback Merging is blocked until sufficient time has passed for review label Jun 27, 2021
@carlocab carlocab changed the title superenv: fix for formulae with runtime CPU detection superenv: handle formulae with runtime CPU detection Jun 27, 2021
@BrewTestBot BrewTestBot added waiting for feedback Merging is blocked until sufficient time has passed for review and removed waiting for feedback Merging is blocked until sufficient time has passed for review labels Jun 28, 2021
Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach but think it's too broad. If you have runtime CPU detection: we still shouldn't allow passing -march but -mtune/-mcpu/ should all be fine. I'd also rather see the existing O methods used over making this also allow that.

More specifically on the bug:

it seems Open MPI also used to pass -march=skylake-avx512 to the compiler as a way of detecting CPU features, and it looks like Arrow is doing the same thing here

Is this something that we could then scope just to e.g. configure and avoid using during e.g. make?

@carlocab
Copy link
Member Author

I like this approach but think it's too broad. If you have runtime CPU detection: we still shouldn't allow passing -march but -mtune/-mcpu/ should all be fine.

Allowing -march flags is actually the most important change here, as that's the one that fixes apache-arrow.

It should be safe to allow a build to pass -march flags to generate e.g. AVX512 instructions whenever the software does runtime CPU detection, as the AVX512 code is not run when the machine doesn't support it. (Of course, this is predicated on their runtime detection not having any significant bugs, but this seems no different from any other important feature of any other formula we ship.)

I'd also rather see the existing O methods used over making this also allow that.

The difficulty with O methods is that they're too broad: they apply to everything the compiler builds in a given make invocation when in practice a build system could be much more calibrated than that.

I just think that if upstream have invested a lot of work into optimising their build, it seems wiser to rely on the work they've done rather than try to duplicate it (poorly) ourselves. However, if you'd rather not defer to upstream on this issue, I'm fine with leaving this out of the PR.

More specifically on the bug:

it seems Open MPI also used to pass -march=skylake-avx512 to the compiler as a way of detecting CPU features, and it looks like Arrow is doing the same thing here

Is this something that we could then scope just to e.g. configure and avoid using during e.g. make?

Scoping flag refurbishment to make but avoiding it at configure is exactly what broke open-mpi, unfortunately. I'm also not aware of a way to do this with cmake.

When we filter these flags, we start violating the build system's assumptions about the code it generates and that leads to broken software. This is especially bad when we do one thing at configure and another thing at make. This isn't an issue when the build doesn't do runtime detection, because the build is typically mostly agnostic about the details of the generated code.

@BrewTestBot BrewTestBot removed the waiting for feedback Merging is blocked until sufficient time has passed for review label Jun 29, 2021
@BrewTestBot
Copy link
Contributor

Review period ended.

@MikeMcQuaid
Copy link
Member

It should be safe to allow a build to pass -march flags to generate e.g. AVX512 instructions whenever the software does runtime CPU detection, as the AVX512 code is not run when the machine doesn't support it. (Of course, this is predicated on their runtime detection not having any significant bugs, but this seems no different from any other important feature of any other formula we ship.)

It should be safe but I'm not completely concerned it's worth the risk (in the abstract). These sort of bugs are a bit of a nightmare to reproduce.

I just think that if upstream have invested a lot of work into optimising their build, it seems wiser to rely on the work they've done rather than try to duplicate it (poorly) ourselves. However, if you'd rather not defer to upstream on this issue, I'm fine with leaving this out of the PR.

Yes, I'd rather not defer to upstream on build flags when it doesn't actively break things (which it doesn't seem to with O).

Scoping flag refurbishment to make but avoiding it at configure is exactly what broke open-mpi, unfortunately. I'm also not aware of a way to do this with cmake.

Yeh, I've always been a bit dubious by this approach.

When we filter these flags, we start violating the build system's assumptions about the code it generates and that leads to broken software. This is especially bad when we do one thing at configure and another thing at make. This isn't an issue when the build doesn't do runtime detection, because the build is typically mostly agnostic about the details of the generated code.

I agree.

Sorry for you needing to talk me through on this so much: can you explain what we're filtering right now, how that breaks things (and on what formula) and what filtering this would avoid/fix (and on what formula)?

Thanks @carlocab ❤️

@carlocab
Copy link
Member Author

It should be safe but I'm not completely concerned it's worth the risk (in the abstract). These sort of bugs are a bit of a nightmare to reproduce.

There are two kinds of bugs here: one is when the upstream get their runtime detection code wrong, and another which we introduce via flag filtering. I'd argue the former is easier to diagnose and fix because it's something upstream will be able to reproduce in their own builds.

If we really are keen on filtering flags, we probably have to be much more aggressive about it, but this might take expertise we don't have and can be a bit brittle (see below).

I also think that if we want to restrict the code that the compiler generates, we should probably be doing so through an API they expose (e.g. configure flags and environment variables) -- at least for complicated build systems that are designed to generate code for different hardware targets.

Yes, I'd rather not defer to upstream on build flags when it doesn't actively break things (which it doesn't seem to with O).

Sure, will drop this.

can you explain what we're filtering right now, how that breaks things (and on what formula) and what filtering this would avoid/fix (and on what formula)?

The primary thing seems to be the filtering of -march, and I think that is summarised most concisely in this comment. From superenv logs in that comment:

clang++ called with: -DCXX_SUPPORTS_AVX512 -march=skylake-avx512 -mbmi2 -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -MD -MT CMakeFiles/cmTC_8a09b.dir/src.cxx.o -MF CMakeFiles/cmTC_8a09b.dir/src.cxx.o.d -o CMakeFiles/cmTC_8a09b.dir/src.cxx.o -c /tmp/apache-arrow-20210504-8671-1nk57rs/apache-arrow-4.0.0/build/CMakeFiles/CMakeTmp/src.cxx
superenv removed:  -march=skylake-avx512 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
superenv added:    -pipe -w -Os -march=nehalem -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk --sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -isystem/usr/local/include -isystem/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Headers -I/usr/local/opt/llvm/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/openblas/include -I/usr/local/opt/readline/include -I/usr/local/opt/sqlite/include
superenv executed: clang++ -pipe -w -Os -march=nehalem -DCXX_SUPPORTS_AVX512 -mbmi2 -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -MD -MT CMakeFiles/cmTC_8a09b.dir/src.cxx.o -MF CMakeFiles/cmTC_8a09b.dir/src.cxx.o.d -o CMakeFiles/cmTC_8a09b.dir/src.cxx.o -c /tmp/apache-arrow-20210504-8671-1nk57rs/apache-arrow-4.0.0/build/CMakeFiles/CMakeTmp/src.cxx -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk --sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -isystem/usr/local/include -isystem/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Headers -I/usr/local/opt/llvm/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/openblas/include -I/usr/local/opt/readline/include -I/usr/local/opt/sqlite/include

clang++ called with: -march=skylake-avx512 -mbmi2 -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/cmTC_8a09b.dir/src.cxx.o -o cmTC_8a09b
superenv removed:  -march=skylake-avx512 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
superenv added:    -pipe -w -Os -march=nehalem -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk --sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -isystem/usr/local/include -isystem/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Headers -I/usr/local/opt/llvm/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/openblas/include -I/usr/local/opt/readline/include -I/usr/local/opt/sqlite/include -L/usr/local/opt/openssl@1.1/lib -L/usr/local/opt/openblas/lib -L/usr/local/opt/readline/lib -L/usr/local/opt/sqlite/lib -L/usr/local/lib -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Libraries
superenv executed: clang++ -pipe -w -Os -march=nehalem -mbmi2 -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/cmTC_8a09b.dir/src.cxx.o -o cmTC_8a09b -isysroot/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk --sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -isystem/usr/local/include -isystem/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Headers -I/usr/local/opt/llvm/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/openblas/include -I/usr/local/opt/readline/include -I/usr/local/opt/sqlite/include -L/usr/local/opt/openssl@1.1/lib -L/usr/local/opt/openblas/lib -L/usr/local/opt/readline/lib -L/usr/local/opt/sqlite/lib -L/usr/local/lib -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenGL.framework/Versions/Current/Libraries -Wl,-headerpad_max_install_names

So the miscompilation seems to occur when we switch out the -march=skylake-avx512 flag, but not also the other related flags:

  • -DCXX_SUPPORTS_AVX512
  • -mavx512f
  • -mavx512cd
  • -mavx512vl
  • -mavx512dq
  • -mavx512bw

This actively breaks apache-arrow (but only at runtime), used to break open-mpi (upstream have since worked around this, but I still suspect we pass mixed flags somewhere, leading to things being subtly broken and difficult to reproduce).

In particular, the apache-arrow build gets broken because it still produces AVX512 instructions, but then gets the runtime hardware detection wrong and attempts to run AVX512 instructions on hardware that doesn't support it (possibly because it finds code that was generated with -march=nehalem and doesn't notice that it was also generated with -mbmi2 or -mavx512f and friends).

In light of this, I think our options here are:

  1. Don't filter this flag when the software supports runtime detection.
  2. Filter flags much more aggressively.

I find the latter a bit fragile. We'd need to know about all the compiler flags we want to avoid--these are woefully documented and change over time. This might also be software-dependent because of things like custom macros (e.g. -DCXX_SUPPORTS_AVX512 above).

@MikeMcQuaid
Copy link
Member

It should be safe but I'm not completely concerned it's worth the risk (in the abstract). These sort of bugs are a bit of a nightmare to reproduce.

There are two kinds of bugs here: one is when the upstream get their runtime detection code wrong, and another which we introduce via flag filtering. I'd argue the former is easier to diagnose and fix because it's something upstream will be able to reproduce in their own builds.

Well, there's the third potential bug that we're opening with this: upstream set march to a value based on the build machine that is too new for the machines that we run bottles on.

This is something we used to set a lot of and this filtering pretty much solved the problem.

  • -DCXX_SUPPORTS_AVX512

Feels like this is what we should be filtering out in this specific case. We don't want to use these instructions. We shouldn't be setting march to a value newer than the machines we expect this to be shipped on because the compiler can decide at any point to use instructions that machines we distribute the bottles on don't support. In this case: perhaps that's scoped entirely to code that's run at runtime but the added complexity here doesn't seem to outweigh just disabling AVX512 support here for this particular formula.

@carlocab
Copy link
Member Author

carlocab commented Jun 29, 2021

Well, there's the third potential bug that we're opening with this: upstream set march to a value based on the build machine that is too new for the machines that we run bottles on.

That's not a bug -- that is intentional. Build systems that do runtime detection will set march to values based on the build machine, but scopes this code to run only when the runtime hardware supports it. That's what the apache-arrow build does, and it successfully avoids running code generated for newer processors at runtime when you build it outside of brew. Trying to build it with brew, however, breaks this runtime detection.

We could try to disable AVX512 for specific formulae, but, even if we do, I think we would still need the changes here because everything I've said above about builds that do runtime detection is still true. (e.g. they could try to generate code that runs on processors older than Nehalem, but then our flag filtering interferes with that and could break things) I'm also not convinced that doing this is simpler than what I propose here.

I understand the concern about breaking things needlessly, but I am confident that this will not break things and any risk that we do face is worth it: it fixes a formula with currently broken bottles, allows us to ship bottles for other formulae that work better for users with newer machines than the oldest we support, and removes the risk of miscompiled bottles from inconsistent optimisation flags.

Just to emphasise how low the risk of this is: we've been shipping fftw, gromacs, and openblas bottles with AVX instructions for a while now (1-2 years, at least), and there have been zero complaints about them being broken, despite Nehalem having no AVX support. This change helps make sure things stay that way for these formulae, and allows us to enable better optimisations (e.g. AVX2/AVX512).

If it helps, I'm happy to slowly roll this out to formulae and monitor them closely to make sure nothing's broken. It seems risky, but the truth is that anything that we do here is risky, and in my view the least risky thing to do is to minimise interference with a build system that is designed to produce something that runs on a variety of hardware targets.

@MikeMcQuaid
Copy link
Member

@carlocab Thanks again for your explanations. Personally, I'm still not sure that this is worth it. While the ENV.runtime_cpu_detection accurately describes what we're using it for what it translates to is ENV.allow_march_mcpu_mtune. My concern is that this ends up being used in more formulae than the original intent in this PR because "upstream wants to use their own values".

If this is scoped to only be what we need it for today (i.e. -march only) and we add an audit with an allowlist: fine with me. If it starts getting misused, though, expect me to start asking for a revert 😁.

Do you have a sense of how many formulae you want to use this on?

@carlocab
Copy link
Member Author

carlocab commented Jun 30, 2021

No thanks are necessary!

While the ENV.runtime_cpu_detection accurately describes what we're using it for what it translates to is ENV.allow_march_mcpu_mtune.

I think the name is useful because it's suggestive enough for any maintainer reviewing a PR that adds this to ask the right questions about it.

If this is scoped to only be what we need it for today (i.e. -march only)

This sounds good to me. Now that I think about it, I don't think we run the risk of miscompilation from filtering out -mtune and -mcpu, so I'm happy restrict the scope here.

and we add an audit with an allowlist: fine with me. If it starts getting misused, though, expect me to start asking for a revert 😁.

This is good to me too, but I think I can do you one better: we can audit the use of ENV.runtime_cpu_detection in addition to having an allowlist using the following heuristic. We disassemble the TEXT sections of the MachO files in a bottle and check for a cpuid instruction. If it doesn't have that, then it is unlikely that the software actually does runtime detection.

I hope this can help allay your concern about upstream trying to abuse this.

Do you have a sense of how many formulae you want to use this on?

All the formulae I named in my initial post definitely do runtime detection and would be improved by the change here. Any formula used for computation-intensive tasks [1] is a good candidate too, but I'll need to check them carefully to make sure.

[1] e.g. vtk, pcl, opencv, tbb -- these all have cpuid instructions in their bottles too.

@MikeMcQuaid
Copy link
Member

This is good to me too, but I think I can do you one better: we can audit the use of ENV.runtime_cpu_detection in addition to having an allowlist using the following heuristic. We disassemble the TEXT sections of the MachO files in a bottle and check for a cpuid instruction. If it doesn't have that, then it is unlikely that the software actually does runtime detection.

I hope this can help allay your concern about upstream trying to abuse this.

Yes, that would completely allay my concerns!

I think the name is useful because it's suggestive enough for any maintainer reviewing a PR that adds this to ask the right questions about it.

Ok. Don't feel strongly on the name (particularly if audits for it appear at some stage too).

@carlocab
Copy link
Member Author

Great; thanks for hashing this out with me. I'm pleased with where we've landed here.

@carlocab carlocab marked this pull request as draft June 30, 2021 15:17
Some formulae are able to detect the features of the runtime CPU, and
execute code accordingly. This typically entails 1) the detection of
features of the build-time CPU in order to determine the targets that
the compiler can generate code for, and 2) generating code for the
targets that the compiler can support.

Our filtering of optimization flags can cause misdetection of compiler
features, leading to failed builds [1], and miscompilation even when the
build does not fail [2].

Let's try to fix this by allowing formulae to declare
`ENV.runtime_cpu_detection` which skips the filtering of `-march` and
related flags.

I've also skipped the filtering of the optimisation
level, since it seems to me that if upstream maintainers have gone to
the lengths of writing code that detects runtime hardware, they probably
also know better about appropriate `-O` flags to use.

This is a partial list of formulae that should make use of this feature:
1. apache-arrow
2. fftw
3. gromacs
4. open-mpi
5. openblas

Partially resolves Homebrew/homebrew-core#76537.

[1] open-mpi/ompi#8306 and linked issues/PRs
[2] Homebrew/homebrew-core#76537
@carlocab
Copy link
Member Author

carlocab commented Jul 1, 2021

I initially planned to add the audit here, but I'll do that in a separate PR.

@github-actions github-actions bot added the outdated PR was locked due to age label Aug 1, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

outdated PR was locked due to age

Projects

None yet

Development

Successfully merging this pull request may close these issues.

apache-arrow segfaults on intel

3 participants