Add linux-aarch64 and cuda 12.8#438

Merged
h-vetinari merged 18 commits into
conda-forge:mainfrom
dslarm:linux-aarch64-clean
Sep 10, 2025

@dslarm
Contributor

@dslarm dslarm commented Aug 1, 2025

This PR is a cleaned and updated branch to bring in both aarch64 and also cuda 12.8. This replaces the PR #426.

Nutshell of #426 learnings:

  • cross-compilation was a fool's errand - there are too many issues to resolve, we must therefore use native builds.
  • cuda support requires version 12.8 - as cuda 12.6 can't handle Arm NEON - and even 12.8 requires pinning to an earlier gcc (11)

Recent learnings:

  • gcc 14 will error where gcc 13 was happy to warn - we need to pin to gcc <= 13 across all platforms
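The compiler constraints listed above can be summarized as a small lookup. This is only an illustrative sketch in Python: the function name `max_gcc_major` and the platform strings are assumptions made for the example, and the real pins live in the recipe's build configuration, not in code like this.

```python
def max_gcc_major(platform: str, cuda: bool) -> int:
    """Return the newest GCC major version the notes above allow.

    Hypothetical helper, for illustration only.
    """
    # CUDA 12.8 on aarch64 needs the older GCC 11 (Arm NEON handling).
    if platform == "linux-aarch64" and cuda:
        return 11
    # Everywhere else: GCC 14 errors where GCC 13 only warned, so cap at 13.
    return 13


for platform, cuda in [
    ("linux-64", False),
    ("linux-64", True),
    ("linux-aarch64", False),
    ("linux-aarch64", True),
]:
    variant = "cuda" if cuda else "cpu"
    print(platform, variant, "-> gcc <=", max_gcc_major(platform, cuda))
```

Only the aarch64 + CUDA combination gets the stricter cap; every other variant shares the GCC 13 ceiling.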

Next

  • reviewers: please review

Outstanding questions - may or may not need me to do something:

  • what conda-forge.yml should be - I know I will need to native-build the binaries for a while, as we're still waiting for native linux-aarch64 builds via CI
  • and just curious.. how CI works for this monster.. it's 45 mins on a 192 core x86 or aarch64 box per combo..

@conda-forge-admin
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/16682556960. Examine the logs at this URL for more detail.

@dslarm dslarm mentioned this pull request Aug 1, 2025
@hmaarrfk
Contributor

hmaarrfk commented Aug 5, 2025

Thank you for troubleshooting some GCC + CUDA incompatibilities.

reviewers: please review

I simply cannot review anything that doesn't address the broken status of this recipe for other platforms.

Conda-forge is where it is today because we strive for compatibility with many platforms.

x86 on linux and mac builds must be addressed first, before I can spend time thinking about aarch64.

My recommendation remains the same:

  • Create your own channel
  • tell users to add it, in addition to conda-forge's
  • Upload your packages there while we have time to address the other breakages.

@dslarm
Contributor Author

dslarm commented Aug 5, 2025

...

x86 on linux and mac builds must be addressed first, before I can spend time thinking about aarch64.
...

The Linux x86 builds work - I have been through every single one of them manually with build-locally.py, and done the same for linux-aarch64

The problem is that these builds are taking 24 hours (fwiw, they take 30-ish minutes on a 192 core box) .. and every one of them is timing out - eg:

[linux_64_cuda_compiler_version12.8microarch_level1python3.10.____cpython](https://github.com/conda-forge/tensorflow-feedstock/actions/runs/16682556539/job/47224442583)
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s

I haven't tried to build osx manually - I don't have the right machine to do the x86 versions.. and osx-arm64 is just not the same as what the build system uses.

@hmaarrfk
Contributor

hmaarrfk commented Aug 5, 2025

Thank you for that clarification.

Isuru and I merged a "cleanup" PR to help alleviate the build matrix. Do you want to try to rebase on that? It should also speed up your ARM builds, allowing you to build the big stuff "once", effectively as close to a 4x speedup as we can get.

@hmaarrfk
Contributor

hmaarrfk commented Aug 5, 2025

rebasing might be hard for the build script. i agree that it might be easier to just "redo" the changes to that one manually.

@dslarm
Contributor Author

dslarm commented Aug 5, 2025

rebasing might be hard for the build script. i agree that it might be easier to just "redo" the changes to that one manually.

I'll take a look, my changes to the build script were - in the end - fairly minor..

@dslarm dslarm force-pushed the linux-aarch64-clean branch from a7da5db to e253ee2 Compare August 5, 2025 15:36
@conda-forge-admin
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I was trying to look for recipes to lint for you, but it appears we have a merge conflict. Please try to merge or rebase with the base branch to resolve this conflict.

Please ping the 'conda-forge/core' team (using the @ notation in a comment) if you believe this is a bug.

@dslarm dslarm force-pushed the linux-aarch64-clean branch from e253ee2 to ffb5957 Compare August 5, 2025 15:49
@conda-forge-admin
Contributor

conda-forge-admin commented Aug 5, 2025

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/17578969045. Examine the logs at this URL for more detail.

@dslarm dslarm closed this Aug 5, 2025
@dslarm dslarm deleted the linux-aarch64-clean branch August 5, 2025 15:59
@dslarm dslarm restored the linux-aarch64-clean branch August 5, 2025 16:06
@dslarm dslarm reopened this Aug 5, 2025
@dslarm
Contributor Author

dslarm commented Aug 5, 2025

@conda-forge-admin, please rerender

dslarm added 2 commits August 7, 2025 13:07
* obvious changes to build.sh (noting NVIDIA's aarch64 is 'sbsa')
* replace custom bazel toolchain with the gen-bazel-toolchain package
* add patches:
    - 0031-bump-h5py-req.patch - h5py and psutil are not available as a binary for
      the versions of h5py in tensorflow 2.18's spec, but the
      system won't try to build it.  Bump the version so we get ones that
      do exist.
    - 0032-gpu_prim-error.patch - as per openxla's pull 16095 - but extended to also fix the Store methods, needed for cuda 12.8/12.9
…pull 393)

* The vendored xnnpack in tensorflow 2.18 is incompatible with gcc14, so we pin to gcc13
* Hold back compiler versions for aarch64 to gcc-11 with cuda. cuda 12.8 only
  handles aarch64 neon if gcc is < 12
@dslarm dslarm force-pushed the linux-aarch64-clean branch from 0020c69 to f7b55fb Compare August 7, 2025 12:10
@dslarm
Contributor Author

dslarm commented Sep 5, 2025

Could someone review this please - it is ready for that - AIUI CI fails are normal for this package. What built before builds now via build-locally.py, along with cuda 12.8 (as this is the earliest version that will build tensorflow for Arm) and the arm builds. The new aarch64 support is now using cross-compilation.

@h-vetinari
Member

@dslarm, unfortunately your odyssey isn't finished yet. There was no CUDA 12.8 job for aarch here yet, and now that I've added it, it runs into:

 [13 / 26] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /home/conda/feedstock_root/build_artifacts/tensorflow-split_1757278331355/_build_env/share/bazel/a5889192f1201e14ae645981f8e2d4ca/external/double_conversion/BUILD:9:11: Compiling double-conversion/strtod.cc [for tool] failed: undeclared inclusion(s) in rule '@double_conversion//:double-conversion':
this rule is missing dependency declarations for the following files included by 'double-conversion/strtod.cc':

…5.09.07.20.31.15

Other tools:
- conda-build 25.7.0
- rattler-build 0.47.0
- rattler-build-conda-compat 1.4.6
@h-vetinari
Member

h-vetinari commented Sep 8, 2025

OK, the intention seems to have been to downgrade to GCC 11 on aarch; I did that now. But it still runs into

 ERROR: $BUILD_PREFIX/share/bazel/c073781af00280756e7165719a85b5b2/external/llvm-project/llvm/BUILD.bazel:224:11: Compiling llvm/lib/Demangle/RustDemangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/RustDemangle.cpp':
  '$BUILD_PREFIX/aarch64-conda-linux-gnu/sysroot/usr/include/stdc-predef.h'
  '$BUILD_PREFIX/lib/gcc/aarch64-conda-linux-gnu/11.4.0/include/c++'
  [...]

which looks like bazel not understanding the cross-compiler setup, and related to rebuilding the vendored LLVM. Interestingly, this doesn't happen in the CPU build, which likely explains why we haven't seen this for osx-arm64 (which is also cross-compiled). Perhaps there are further places where we need to patch in the correct compiler setup.
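Bazel reports "undeclared inclusion(s)" when a compile action reads headers from directories that are neither declared dependencies nor listed as built-in include directories of the C++ toolchain. One way to see which directories a given GCC searches by default is to run it with `-E -x c++ - -v` and read the list printed between `#include <...> search starts here:` and `End of search list.`. The Python sketch below parses such output. The function name `builtin_include_dirs` and the sample paths are hypothetical, and it is an assumption, not a confirmed diagnosis, that undeclared built-in directories are what is missing in this build.

```python
def builtin_include_dirs(verbose_output: str) -> list[str]:
    """Extract the <...> include search directories from `gcc -v` output."""
    dirs = []
    in_list = False
    for line in verbose_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            in_list = True
        elif line.startswith("End of search list."):
            break
        elif in_list and line.strip():
            # GCC indents each directory with a single space.
            dirs.append(line.strip())
    return dirs


# Hypothetical sample; real output comes from the cross-compiler itself.
SAMPLE = """\
#include "..." search starts here:
#include <...> search starts here:
 /opt/example/lib/gcc/aarch64-conda-linux-gnu/11.4.0/include
 /opt/example/aarch64-conda-linux-gnu/sysroot/usr/include
End of search list.
"""

for directory in builtin_include_dirs(SAMPLE):
    print(directory)
```

Comparing this list against what the generated toolchain declares would show whether the sysroot and GCC include paths from the error are covered.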

@h-vetinari
Member

h-vetinari commented Sep 8, 2025

Sigh

 The following packages are incompatible
├─ tensorflow-avx2 =2.18.0 cpu_py313hf8d5db8_51 is not installable because it requires
│  └─ tensorflow ==2.18.0 cpu_py313h1234567_51, which does not exist (perhaps a missing channel);
                                   ^^^^^^^^
                                   dummy hash

reminds me of conda/conda-build#5571.

Edit: unrelated though

@dslarm
Contributor Author

dslarm commented Sep 8, 2025

[..]

which looks like bazel not understanding the cross-compiler setup, and related to rebuilding the vendored LLVM. Interestingly, this doesn't happen in the CPU build, which likely explains why we haven't seen this for osx-arm64 (which is also cross-compiled). Perhaps there are further places where we need to patch in the correct compiler setup.

Thanks - that must be where I lost all hope last time with cross-compile of the CUDA side.. I'll try again.. there's already one dirty patch that was needed for CPU side, so another dirty patch may be in order ..

@dslarm
Contributor Author

dslarm commented Sep 8, 2025

Is currently failing early(ish) with:

Repository rule cuda_configure defined at:
  /home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl:553:33: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 520, column 38, in _cuda_autoconf_impl
		_create_local_cuda_repository(repository_ctx)
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 446, column 35, in _create_local_cuda_repository
		cuda_config = _get_cuda_config(repository_ctx)
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 219, column 53, in _get_cuda_config
		compute_capabilities = _compute_capabilities(repository_ctx),
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 180, column 33, in _compute_capabilities
		_auto_configure_fail("Invalid compute capability: %s" % capability)
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 59, column 9, in _auto_configure_fail
		fail("\n%sCuda Configuration Error:%s %s\n" % (red, no_color, msg))
Error in fail: 
Cuda Configuration Error: Invalid compute capability: sm_100

this seems at odds with what cuda 12.8 claims (it should support sm_100)
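Per the traceback, the message is raised by `_compute_capabilities` in TensorFlow's `cuda_configure.bzl`, not by the CUDA toolkit, so the rejection does not by itself say what CUDA 12.8 supports. One possible mechanism, a guess that has not been checked against the TensorFlow 2.18 source, is a validator written when every architecture number had two digits. The Python sketch below illustrates that hypothetical check; `is_valid_capability` and the two-digit rule are assumptions for the example only.

```python
import re

# Hypothetical rule: "sm_" or "compute_" followed by exactly two digits.
_TWO_DIGIT_CAPABILITY = re.compile(r"(sm|compute)_\d{2}")


def is_valid_capability(capability: str) -> bool:
    """Return True if the string passes the assumed two-digit check."""
    return _TWO_DIGIT_CAPABILITY.fullmatch(capability) is not None


for cap in ["sm_80", "compute_90", "sm_100"]:
    status = "ok" if is_valid_capability(cap) else "Invalid compute capability"
    print(cap, "->", status)
```

Under such a rule `sm_100` is rejected purely on string shape, regardless of what the installed toolkit can compile for.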

@h-vetinari
Member

this seems at odds with what cuda 12.8 claims (it should support sm_100)

Yeah, but 80c6778 was mostly for completeness (i.e. if we bump the CUDA version, we should match the capabilities). Feel free to remove the newer arches for now (or revert that commit), it's not the key part of this PR.

@h-vetinari
Member

So, this now passes on everything except aarch+CUDA. We could take the aarch+CPU support and take this as an intermediate win that the other PRs can build on top of. Thoughts @conda-forge/tensorflow?

@hmaarrfk
Contributor

hmaarrfk commented Sep 10, 2025

Does it pass with aarch native compilation? dslarm was willing to do that and invoke CFEP03.

I'm personally too burned out to care about anything that doesn't "unpin" abseil on tensorflow.

It's causing me real solving issues, where a lot of the stack is getting downgraded to add tensorflow to an environment.

But you two got this to a good place, so maybe this is good to merge.

@dslarm
Contributor Author

dslarm commented Sep 10, 2025

I'd take the win of linux-aarch64 - without CUDA - and using cross-compilation - can we move to merge this (any outstanding reviews etc)?

Although I'm happy to run the native case ('CFEP03' to get everything) - I think I may be able to get cross-compiled cuda done when I can give a bit more time to it - hence take the win on cross-compiled non-cuda for now.

FWIW, I think there's an issue with the bazel-toolchain for cross-compile with CUDA, and it's just painful to debug (not a bazel expert..) - I assume (but haven't done the debug yet) the Rust.* compile that fails when cross-compiling for cuda+aarch64 is also done when plain aarch64 cross-compiles. If that's the case, then it must surely be in the cuda scripts 'crosstool_wrapper_driver_is_not_gcc' etc that appear to get invoked only in the cuda case and that would be missing the aarch64 system include directories.

h-vetinari added a commit that referenced this pull request Sep 10, 2025
@h-vetinari h-vetinari merged commit a463708 into conda-forge:main Sep 10, 2025
8 of 12 checks passed
@hmaarrfk
Contributor

Ok. Feel free to upload the logs here for cfep03

@hmaarrfk
Contributor

Thanks all!!!

@h-vetinari
Member

Ok. Feel free to upload the logs here for cfep03

FWIW, since macos builds didn't change appreciably here, I'm not planning to do any CFEP 03 builds for this. If @dslarm can get the aarch+CUDA builds unblocked, I'd prefer a separate PR for that, even if it's built locally.

@dslarm
Contributor Author

dslarm commented Sep 10, 2025

Thanks both for your help - I'll continue to try more on cuda shortly..

@h-vetinari
Member

Hm, I definitely thought this, but apparently didn't write it down - better late than never I hope:

Thanks so much @dslarm for the patience and persistence in shepherding this PR to completion! This one was particularly tricky and took a very long time, sorry about that. 🙃

4 participants