Add linux-aarch64 and cuda 12.8 by dslarm · Pull Request #438 · conda-forge/tensorflow-feedstock

dslarm · 2025-08-01T18:36:31Z

This PR is a cleaned-up and updated branch that brings in both aarch64 and cuda 12.8. It replaces PR #426.

Nutshell of #426 learnings:

- cross-compilation was a fool's errand - there are too many issues to resolve, so we must use native builds.
- cuda support requires version 12.8 - as cuda 12.6 can't handle Arm NEON - and even 12.8 requires pinning to an earlier gcc (11)

Recent learnings:

- gcc 14 will error where gcc 13 was happy to warn - we need to pin to gcc <= 13 across all platforms
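
The pinning rules above can be summarized in a few lines. This is only a sketch of the reasoning, not the feedstock's actual configuration: `max_gcc_major` is a hypothetical helper, and the platform strings are assumed to follow conda's `linux-64` / `linux-aarch64` naming.

```python
def max_gcc_major(target_platform: str, cuda: bool) -> int:
    """Highest GCC major version reported as working in this PR (hypothetical helper)."""
    # cuda 12.8 on aarch64 needs the older compiler (gcc 11).
    if target_platform == "linux-aarch64" and cuda:
        return 11
    # Everywhere else: gcc 14 errors where gcc 13 only warned, so cap at 13.
    return 13


for platform in ("linux-64", "linux-aarch64"):
    for cuda in (False, True):
        print(f"{platform:14} cuda={cuda!s:5} -> gcc <= {max_gcc_major(platform, cuda)}")
```

Only the aarch64 + cuda combination ends up on gcc 11; every other combination is capped at gcc 13.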

reviewers: please review

Outstanding questions - may or may not need me to do something:

- what conda-forge.yml should be - I know I will need to native-build the binaries for a while, as we're still waiting for native linux-aarch64 builds via CI
- and just curious.. how CI works for this monster.. it's 45 mins on a 192 core x86 or aarch64 box per combo..

conda-forge-admin · 2025-08-01T18:38:08Z

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/16682556960. Examine the logs at this URL for more detail.

hmaarrfk · 2025-08-05T13:23:17Z

Thank you for troubleshooting some GCC + CUDA incompatibilities.

> reviewers: please review

I simply cannot review anything that doesn't address the broken status of this recipe for other platforms.

Conda-forge is where it is today because we strive for compatibility with many platforms.

x86 on linux and mac builds must be addressed first, before I can spend time thinking about aarch64.

My recommendation remains the same:

- Create your own channel
- Tell users to add it, in addition to conda-forge's
- Upload your packages there while we have time to address the other breakages.

dslarm · 2025-08-05T13:37:18Z

> ...
>
> x86 on linux and mac builds must be addressed first, before I can spend time thinking about aarch64.
> ...

The Linux x86 builds work - I have been through every single one of them manually with build-locally.py, and done the same for linux-aarch64

The problem is that these builds are taking 24 hours (fwiw, they take 30-ish minutes on a 192 core box) .. and every one of them is timing out - eg:

[linux_64_cuda_compiler_version12.8microarch_level1python3.10.____cpython](https://github.com/conda-forge/tensorflow-feedstock/actions/runs/16682556539/job/47224442583)
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s

I haven't tried to build osx manually - I don't have the right machine to do the x86 versions.. and osx-arm64 is just not the same as what the build system uses.

hmaarrfk · 2025-08-05T14:00:08Z

Thank you for that clarification.

Isuru and I merged a "cleanup" PR to help alleviate the build matrix. Do you want to try to rebase on that? It should also speed up your builds of ARM, allowing you to build the big stuff "once", effectively as close to a 4x speedup as we can get.

hmaarrfk · 2025-08-05T14:04:00Z

rebasing might be hard for the build script. i agree that it might be easier to just "redo" the changes to that one manually.

dslarm · 2025-08-05T14:04:49Z

> rebasing might be hard for the build script. i agree that it might be easier to just "redo" the changes to that one manually.

I'll take a look, my changes to the build script were - in the end - fairly minor..

conda-forge-admin · 2025-08-05T15:38:18Z

Hi! This is the friendly automated conda-forge-linting service.

I was trying to look for recipes to lint for you, but it appears we have a merge conflict. Please try to merge or rebase with the base branch to resolve this conflict.

Please ping the 'conda-forge/core' team (using the @ notation in a comment) if you believe this is a bug.

conda-forge-admin · 2025-08-05T15:50:57Z

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/17578969045. Examine the logs at this URL for more detail.

dslarm · 2025-08-05T17:13:34Z

@conda-forge-admin, please rerender

* obvious changes to build.sh (noting NVIDIA's aarch64 is 'sbsa')
* replace custom bazel toolchain with the gen-bazel-toolchain package
* add patches:
  - 0031-bump-h5py-req.patch - h5py and psutil are not available as a binary for the versions of h5py in tensorflow 2.18's spec, but the system won't try to build it. Bump the version so we get ones that do exist.
  - 0032-gpu_prim-error.patch - as per openxla's pull 16095 - but extended to also fix the Store methods, needed for cuda 12.8/12.9

…pull 393)
* The vendored xnnpack in tensorflow 2.18 is incompatible with gcc14, so we pin to gcc13
* Hold back compiler versions for aarch64 to gcc-11 with cuda. cuda 12.8 only handles aarch64 neon if gcc is < 12

dslarm · 2025-09-05T18:53:34Z

Could someone review this please - it is ready for that - AIUI CI fails are normal for this package. What built before builds now via build-locally.py, along with cuda 12.8 (as this is the earliest version that will build tensorflow for Arm) and the arm builds. The new aarch64 support is now using cross-compilation.

h-vetinari · 2025-09-08T06:45:27Z

@dslarm, unfortunately your odyssey isn't finished yet. There was no CUDA 12.8 job for aarch here yet, and now that I've added it, it runs into:

 [13 / 26] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /home/conda/feedstock_root/build_artifacts/tensorflow-split_1757278331355/_build_env/share/bazel/a5889192f1201e14ae645981f8e2d4ca/external/double_conversion/BUILD:9:11: Compiling double-conversion/strtod.cc [for tool] failed: undeclared inclusion(s) in rule '@double_conversion//:double-conversion':
this rule is missing dependency declarations for the following files included by 'double-conversion/strtod.cc':

h-vetinari · 2025-09-08T11:15:54Z

OK, the intention seems to have been to downgrade to GCC 11 on aarch; I did that now. But it still runs into

 ERROR: $BUILD_PREFIX/share/bazel/c073781af00280756e7165719a85b5b2/external/llvm-project/llvm/BUILD.bazel:224:11: Compiling llvm/lib/Demangle/RustDemangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/RustDemangle.cpp':
  '$BUILD_PREFIX/aarch64-conda-linux-gnu/sysroot/usr/include/stdc-predef.h'
  '$BUILD_PREFIX/lib/gcc/aarch64-conda-linux-gnu/11.4.0/include/c++'
  [...]

which looks like bazel not understanding the cross-compiler setup, and related to rebuilding the vendored LLVM. Interestingly, this doesn't happen in the CPU build, which likely explains why we haven't seen this for osx-arm64 (which is also cross-compiled). Perhaps there are further places where we need to patch in the correct compiler setup.
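
To see at a glance which headers Bazel is complaining about, the error text can be parsed with a short script. This is only a sketch: `undeclared_inclusions` and `SAMPLE_LOG` are names made up for illustration, the sample is adapted from the excerpt quoted above, and the regular expressions assume the message layout shown there.

```python
import re

# Adapted from the excerpt above; the real log continues after "[...]".
SAMPLE_LOG = """\
ERROR: $BUILD_PREFIX/share/bazel/c073781af00280756e7165719a85b5b2/external/llvm-project/llvm/BUILD.bazel:224:11: Compiling llvm/lib/Demangle/RustDemangle.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Demangle':
this rule is missing dependency declarations for the following files included by 'llvm/lib/Demangle/RustDemangle.cpp':
  '$BUILD_PREFIX/aarch64-conda-linux-gnu/sysroot/usr/include/stdc-predef.h'
  '$BUILD_PREFIX/lib/gcc/aarch64-conda-linux-gnu/11.4.0/include/c++'
  [...]
"""


def undeclared_inclusions(log: str):
    """Return (rule, source_file, [paths]) for each 'undeclared inclusion(s)' error."""
    results = []
    lines = log.splitlines()
    for i, line in enumerate(lines):
        rule_match = re.search(r"undeclared inclusion\(s\) in rule '([^']+)'", line)
        if not rule_match:
            continue
        source, paths = None, []
        # Walk the lines that follow the error until the quoted path list ends.
        for follow in lines[i + 1:]:
            src_match = re.search(r"included by '([^']+)'", follow)
            if src_match:
                source = src_match.group(1)
                continue
            path_match = re.fullmatch(r"\s*'([^']+)'\s*", follow)
            if path_match:
                paths.append(path_match.group(1))
                continue
            break
        results.append((rule_match.group(1), source, paths))
    return results


found = undeclared_inclusions(SAMPLE_LOG)
for rule, source, paths in found:
    print(rule, source)
    for path in paths:
        print("   ", path)
```

In the excerpt, every reported path sits under `$BUILD_PREFIX`, in the cross-compiler's sysroot or GCC include directories.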

h-vetinari · 2025-09-08T14:05:45Z

Sigh

 The following packages are incompatible
├─ tensorflow-avx2 =2.18.0 cpu_py313hf8d5db8_51 is not installable because it requires
│  └─ tensorflow ==2.18.0 cpu_py313h1234567_51, which does not exist (perhaps a missing channel);
                                   ^^^^^^^^
                                   dummy hash

reminds me of conda/conda-build#5571.

Edit: unrelated though

dslarm · 2025-09-08T16:28:58Z

> [..]
>
> which looks like bazel not understanding the cross-compiler setup, and related to rebuilding the vendored LLVM. Interestingly, this doesn't happen in the CPU build, which likely explains why we haven't seen this for osx-arm64 (which is also cross-compiled). Perhaps there are further place where we need to patch in the correct compiler setup.

Thanks - that must be where I lost all hope last time with cross-compile of the CUDA side.. I'll try again.. there's already one dirty patch that was needed for CPU side, so another dirty patch may be in order ..

dslarm · 2025-09-08T17:15:31Z

Is currently failing early(ish) with:

Repository rule cuda_configure defined at:
  /home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl:553:33: in <toplevel>
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 520, column 38, in _cuda_autoconf_impl
		_create_local_cuda_repository(repository_ctx)
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 446, column 35, in _create_local_cuda_repository
		cuda_config = _get_cuda_config(repository_ctx)
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 219, column 53, in _get_cuda_config
		compute_capabilities = _compute_capabilities(repository_ctx),
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 180, column 33, in _compute_capabilities
		_auto_configure_fail("Invalid compute capability: %s" % capability)
	File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1757347641253/_build_env/share/bazel/e02d69176749ddd5fdfadcb233e1dff7/external/local_tsl/third_party/gpus/cuda/hermetic/cuda_configure.bzl", line 59, column 9, in _auto_configure_fail
		fail("\n%sCuda Configuration Error:%s %s\n" % (red, no_color, msg))
Error in fail: 
Cuda Configuration Error: Invalid compute capability: sm_100

this seems at odds with what cuda 12.8 claims (it should support sm_100)

h-vetinari · 2025-09-08T17:24:34Z

> this seems at odds with what cuda 12.8 claims (it should support sm_100)

Yeah, but 80c6778 was mostly for completeness (i.e. if we bump the CUDA version, we should match the capabilities). Feel free to remove the newer arches for now (or revert that commit), it's not the key part of this PR.
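
A minimal sketch of what "remove the newer arches" could look like when applied to a comma-separated capability list. Everything here is illustrative: `drop_new_arches` is a hypothetical helper, the cutoff of 100 is an assumption based on the `sm_100` error above, and the example list is made up rather than copied from the feedstock.

```python
import re


def drop_new_arches(capabilities: str, first_dropped: int = 100) -> str:
    """Drop sm_* entries at or above `first_dropped`; keep everything else in order."""
    kept = []
    for cap in capabilities.split(","):
        match = re.fullmatch(r"sm_(\d+)[a-z]?", cap)
        if match and int(match.group(1)) >= first_dropped:
            continue  # rejected by the configure step in this build, so leave it out
        kept.append(cap)
    return ",".join(kept)


# Hypothetical input; not the feedstock's real list.
example = "sm_60,sm_70,sm_80,sm_90,sm_100,sm_120,compute_90"
print(drop_new_arches(example))
```

Entries with the `compute_` prefix pass through untouched, so a final `compute_*` entry stays in the list.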

h-vetinari · 2025-09-10T10:09:21Z

So, this now passes on everything except aarch+CUDA. We could take the aarch+CPU support and take this as an intermediate win that the other PRs can build on top of. Thoughts @conda-forge/tensorflow?

hmaarrfk · 2025-09-10T12:00:40Z

Does it pass with aarch native compilation? dslarm was willing to do that and invoke CFEP03.

I'm personally too burned out to care about anything that doesn't "unpin" abseil on tensorflow.

It's causing me real solving issues, where a lot of the stack is getting downgraded to add tensorflow to an environment.

But you two got this to a good place, so maybe this is good to merge.

dslarm · 2025-09-10T15:10:14Z

I'd take the win of linux-aarch64 - without CUDA - and using cross-compilation - can we move to merge this (any outstanding reviews etc)?

Although I'm happy to run the native case ('CFEP03' to get everything) - I think I may be able to get cross-compiled cuda done when I can give a bit more time to it - hence take the win on cross-compiled non-cuda for now.

FWIW, I think there's an issue with the bazel-toolchain for cross-compile with CUDA, and it's just painful to debug (not a bazel expert..) - I assume (but haven't done the debug yet) the Rust.* compile that fails when cross-compiling for cuda+aarch64 is also done when plain aarch64 cross-compiles. If that's the case, then it must surely be in the cuda scripts 'crosstool_wrapper_driver_is_not_gcc' etc that appear to get invoked only in the cuda case and that would be missing the aarch64 system include directories.

hmaarrfk · 2025-09-10T16:40:06Z

Ok. Feel free to upload the logs here for cfep03

hmaarrfk · 2025-09-10T16:40:29Z

Thanks all!!!

h-vetinari · 2025-09-10T16:50:14Z

> Ok. Feel free to upload the logs here for cfep03

FWIW, since macos builds didn't change appreciably here, I'm not planning to do any CFEP 03 builds for this. If @dslarm can get the aarch+CUDA builds unblocked, I'd prefer a separate PR for that, even if it's built locally.

dslarm · 2025-09-10T20:20:01Z

Thanks both for your help - I'll continue to try more on cuda shortly..

h-vetinari · 2025-09-19T19:32:31Z

Hm, I definitely thought this, but apparently didn't write it down - better late than never I hope:

Thanks so much @dslarm for the patience and persistence in shepherding this PR to completion! This one was particularly tricky and took a very long time, sorry about that. 🙃

dslarm requested review from FarhanTejani, ghego, h-vetinari, hajapy, hmaarrfk, jschueller, ngam, njzjz, waitingkuo, wolfv and xhochy as code owners August 1, 2025 18:36

dslarm mentioned this pull request Aug 1, 2025

add linux_aarch64 #426

Closed

dslarm force-pushed the linux-aarch64-clean branch from a7da5db to e253ee2 Compare August 5, 2025 15:36

dslarm force-pushed the linux-aarch64-clean branch from e253ee2 to ffb5957 Compare August 5, 2025 15:49

dslarm closed this Aug 5, 2025

dslarm deleted the linux-aarch64-clean branch August 5, 2025 15:59

dslarm restored the linux-aarch64-clean branch August 5, 2025 16:06

dslarm reopened this Aug 5, 2025

dslarm added 2 commits August 7, 2025 13:07

dslarm force-pushed the linux-aarch64-clean branch from 0020c69 to f7b55fb Compare August 7, 2025 12:10

h-vetinari mentioned this pull request Sep 3, 2025

How to adapt the software version 2.10.0 to the linux-aarhc64-aarhc64 #448

Open

h-vetinari added 5 commits September 7, 2025 17:41

trigger CI

86833a1

remove obsolete skip

8e1ce3f

fix cuda128 migrator

2c27251

don't override python in CBC

76db284

MNT: Re-rendered with conda-smithy 3.52.1 and conda-forge-pinning 2025.09.04.02.19.31

Other tools: - conda-build 25.7.0 - rattler-build 0.47.0 - rattler-build-conda-compat 1.4.6

f983ef5

h-vetinari added 3 commits September 8, 2025 08:55

downgrade GCC on linux+aarch

66b4076

MNT: Re-rendered with conda-smithy 3.52.1 and conda-forge-pinning 2025.09.07.20.31.15

Other tools: - conda-build 25.7.0 - rattler-build 0.47.0 - rattler-build-conda-compat 1.4.6

db059e1

fix pins in tensorflow-{cpu,gpu}; build tensorflow-{sse3,avx2} per py-ver

34bd5e7

h-vetinari added 2 commits September 8, 2025 16:08

skip 3.13 builds for tensorflow-{sse3,avx2,avx512}

38d70e8

update HERMETIC_CUDA_COMPUTE_CAPABILITIES

80c6778

dslarm and others added 2 commits September 8, 2025 18:43

remove sm > sm_100 - it doesnt work here

87b2c3e

reinstate final compute_* arch

a463708

h-vetinari added a commit that referenced this pull request Sep 10, 2025

Merge pull request #438 from dslarm/linux-aarch64-clean

1d65c30

Add linux-aarch64 and cuda 12.8

h-vetinari merged commit a463708 into conda-forge:main Sep 10, 2025
8 of 12 checks passed


4 participants
