Rebuild for CUDA 12.9 #393

Merged
h-vetinari merged 20 commits into conda-forge:main from h-vetinari:cuda129
Jul 2, 2025

@h-vetinari
Member

This is intended both as a mergeable PR and as a test/proof of conda-forge/conda-forge-pinning-feedstock#7476

Due to an issue with conda-smithy's merging logic, this is rerendered with conda-forge/conda-smithy#2335

Also partially based on #391 (the channel/artefact changes), so that we can try to work around conda/infrastructure#1159

@conda-forge-admin
Contributor

conda-forge-admin commented Jun 13, 2025

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/15996521093. Examine the logs at this URL for more detail.

@h-vetinari h-vetinari force-pushed the cuda129 branch 2 times, most recently from ccd4694 to 67e7fd3 (June 13, 2025 04:56)
Contributor

@mgorny mgorny left a comment

LGTM

+ set(arch_name ${CMAKE_MATCH_1})
+ endif()
+ if(arch_name MATCHES "^([0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
+ if(arch_name MATCHES "^(1?[0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
Member Author

This is the only actually relevant change in 0d06933
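
To make the effect of that one-character-class change concrete, here is a minimal sketch that translates the two `if(arch_name MATCHES ...)` patterns quoted above into Python's `re` syntax. The translation is an assumption for illustration (the `\\.` in the CMake string is taken to mean the regex `\.`), and the helper name `accepted` is hypothetical; it is not part of the patch.

```python
import re

# Python translations of the two CMake patterns quoted in the review comment.
# Assumption: "\\." inside the CMake string corresponds to the regex "\.".
OLD_PATTERN = re.compile(r"^([0-9]\.[0-9](\([0-9]\.[0-9]\))?)$")
NEW_PATTERN = re.compile(r"^(1?[0-9]\.[0-9](\([0-9]\.[0-9]\))?)$")


def accepted(pattern, arch_name):
    """Return True if arch_name is accepted by the given pattern (hypothetical helper)."""
    return pattern.match(arch_name) is not None


# Compare both patterns on single-digit and two-digit major versions.
for arch in ["8.6", "9.0", "10.0", "12.0"]:
    print(arch, "old:", accepted(OLD_PATTERN, arch), "new:", accepted(NEW_PATTERN, arch))
```

Under this translation, the old pattern rejects architecture names with a two-digit major version such as 10.0 and 12.0, while the new pattern with the optional leading `1` accepts them.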

@mgorny
Contributor

mgorny commented Jun 13, 2025

Well, the good news here is that the failures we've been seeing with 2.7.1 aren't regressions in 2.7.1 but external regressions.

@h-vetinari
Member Author

Well, the good news here is that the failures we've been seeing with 2.7.1 aren't regressions in 2.7.1 but external regressions.

I'm certain it's due to numpy 2.3. The py39/py310 jobs, where that version doesn't exist, aren't failing. I wanted to check for upstream patches before adding a cap for ~4 tests, but ran out of steam yesterday. Looking now.

@h-vetinari
Member Author

So the should_error portion of test_norm_matrix_degenerate_shapes was removed in pytorch/pytorch@ff7b6d6, but that looks waaaaaay too big for a backport.

Given that it's really a corner case (that results of norm for degenerate shapes match numpy), and that AFAICT upstream pytorch 2.7.0 has no bound on numpy either, let's just skip those tests.

Comment thread recipe/build.sh Outdated
12.[0-6])
export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"
12.[89])
export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0;10.0;12.0+PTX"
Member Author

@conda-forge/cuda @conda-forge/pytorch-cpu @isuruf
Feedback on what would be a good (sub)set of CUDA architectures here would be much appreciated. If we just keep adding architectures, we will keep further increasing the build time, which is already extreme.

FWIW, upstream uses a much-reduced set, which drops <7.5 as well as 8.9 compared to what we have now:

TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0

Contributor

I agree with dropping "older than 7.5". I might even go as far as saying to drop "20 series" cards but that might be extreme.

https://developer.nvidia.com/cuda-gpus


There could be "alternatives" to explore:

  • Such as ABI3 builds, which should greatly reduce the build matrix right?

Personally, I would prefer to drop "pythons" rather than to drop "compute capabilities".

The reason is:

  • Those that are on old pythons, are likely on old hardware.
  • Their ability to install "python" with a non-confusing support matrix will remain

"pytorch + cuda from conda-forge used to work for my hardware, and now it will continue to work"

I would go as far as dropping Python 3.9 AND 3.10, leaving only 3.11 and 3.12.

@bdice bdice Jun 14, 2025

TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0

Currently, RAPIDS builds this set plus 7.0 (Volta). (reference)

RAPIDS ships PTX for the latest arch only (12.0), to allow forward compatibility.

Compute capabilities older than 7.5 are deprecated as of CUDA 12.8. (reference)

Member Author

Thanks for the inputs. I know there's always the next thing to cut, but for now I'm mostly concerned with not blowing up our CI pipelines further. I like the idea of going with the RAPIDS set:

TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0

which would still drop a net two architectures (minus 5.0;6.0;6.1;8.9, plus 10.0;12.0). I did see somewhere that CUDA 12.9 also has 10.1 and 12.1 - are those relevant here?
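
The "minus/plus" bookkeeping can be checked with a few lines of set arithmetic. This is only a sketch: the two strings are copied from the `12.[0-6]` branch of recipe/build.sh and from the RAPIDS-style proposal in this comment, while the helper names `parse_arch_list` and `by_version` are hypothetical.

```python
def parse_arch_list(value):
    """Split a TORCH_CUDA_ARCH_LIST string into a set of capabilities, ignoring a +PTX suffix."""
    return {item.replace("+PTX", "") for item in value.split(";") if item}


def by_version(archs):
    """Sort capability strings numerically, so that 10.0 sorts after 9.0."""
    return sorted(archs, key=lambda a: tuple(int(part) for part in a.split(".")))


# List from the existing 12.[0-6] branch of recipe/build.sh.
current = parse_arch_list("5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX")
# RAPIDS-style set proposed in the comment.
proposed = parse_arch_list("7.0;7.5;8.0;8.6;9.0;10.0;12.0")

dropped = by_version(current - proposed)
added = by_version(proposed - current)
net_change = len(proposed) - len(current)

print("dropped:", dropped)
print("added:", added)
print("net change:", net_change)
```

The result is four architectures dropped (5.0, 6.0, 6.1, 8.9), two added (10.0, 12.0), and a net change of minus two, matching the numbers in the comment.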

For python that's a separate discussion, but ultimately my position there is that we should match upstream support unless extreme circumstances prevent us from doing so.

Contributor

Doesn't dropping 8.9 effectively kill the entire 40 series lineup?
https://developer.nvidia.com/cuda-gpus

Member Author

I assume they would then use the 8.6 instructions; not sure how big of an improvement the delta between 8.9 and 8.6 provides, hence why I'm asking for inputs in choosing the set of architectures.
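
A rough way to reason about that fallback is sketched below. It assumes the CUDA binary-compatibility rule that code compiled for compute capability X.y can run on devices of capability X.z with z >= y (same major version only). PTX JIT compilation is deliberately not modeled, and the function names are hypothetical.

```python
def parse_cc(text):
    """Turn a capability string such as '8.6' into a (major, minor) tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def sass_fallback(device_cc, arch_list):
    """Return the highest built architecture whose binary code can run on device_cc.

    Assumption: binaries for X.y run on devices X.z with z >= y, same major only.
    Returns None if no built architecture qualifies (PTX JIT is not modeled).
    """
    dev_major, dev_minor = parse_cc(device_cc)
    candidates = [
        cc
        for cc in (parse_cc(arch) for arch in arch_list)
        if cc[0] == dev_major and cc[1] <= dev_minor
    ]
    if not candidates:
        return None
    major, minor = max(candidates)
    return f"{major}.{minor}"


rapids_set = "7.0;7.5;8.0;8.6;9.0;10.0;12.0".split(";")
print(sass_fallback("8.9", rapids_set))  # an 8.9 device with 8.9 dropped from the build
print(sass_fallback("6.1", rapids_set))  # a 6.x device with all 6.x dropped from the build
```

Under these assumptions, an 8.9 device falls back to the 8.6 binaries, whereas a 6.x device has no usable binaries at all in that set. How much performance is lost between 8.9 and 8.6 is exactly the open question in this thread and is not something the sketch can answer.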


I think the comment above re 8.9 is correct. Upstream has never built for 8.9 specifically I believe. In the absence of evidence of there being significant performance or support differences, 8.9 should be dropped.

FWIW, upstream uses a much-reduced set, which drops <7.5 as well as 8.9 compared to what we have now:

It's important to note that upstream ships multiple sets of wheels. The CUDA 11.8 config isn't relevant anymore for conda-forge, but the CUDA 12.6 (default for PyPI still) and 12.8 (hosted separately) are:

  • 12.6: TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
  • 12.8/12.9: TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases

Just to make sure I understand this PR: it will result in only building with CUDA 12.9, right? There is no separate 12.6 or lower build that keeps support for 5.0/6.0?

If so, that seems a bit aggressive to me. That drops support for all Quadro M, Quadro P and GeForce GTX cards. The latter would impact me, I have a GTX 1080Ti in my main dev machine. It's getting old, but for development purposes it has been fine until now. Looking at the legacy compute capability table, I'd say 6.0 is at least as important as 7.0, so tacking 7.0 as the one legacy version onto the CUDA 12.9 config may not be justified.

Is it possible to split the builds like upstream does? CUDA 12.6 with legacy support, and 12.8 or 12.9 without it? On the one hand that'd be even more build time, on the other hand it's then possible to keep a large range of support and perhaps trim the architectures per build even further (e.g., 2x 5 architectures rather than the 10 in a single build that there are now).

Member Author

Fair point about the upstream split into 12.6 / 12.8. Indeed we would only do the CUDA build once, and it's clearly preferable to me to have one build with more architectures, rather than a 12.6 and a 12.9 build with different subsets.

I'd be equally fine to do

TORCH_CUDA_ARCH_LIST=5.0;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0

which would trade 6.1 & 8.9 for 10.0 & 12.0. I can even imagine building the full set (as currently committed) if there are good reasons to do so - I only wanted to make sure that we choose the set of architectures deliberately, rather than just by inertia.

@h-vetinari
Member Author

Interesting build failure for the CUDA builds:

$BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/usr/lib/../lib/crti.o: in function `_init':
(.init+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `__gmon_start__'
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `at::cuda::(anonymous namespace)::initDeviceProperty(signed char)':
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x2d): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `memset@@GLIBC_2.2.5' defined in .text section in $BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/lib64/libc.so.6
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x39): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `cudaGetDeviceProperties_v2@@libcudart.so.12' defined in .text section in $PREFIX/lib/libcudart.so
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x5a): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool)' defined in .text section in lib/libc10_cuda.so
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x74): relocation truncated to fit: R_X86_64_PC32 against `.bss'
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x7a): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `memcpy@@GLIBC_2.14' defined in .text section in $BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/lib64/libc.so.6
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x9d): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `__stack_chk_fail@@GLIBC_2.4' defined in .text section in $BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/lib64/libc.so.6
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `std::vector<cudaDeviceProp, std::allocator<cudaDeviceProp> >::~vector()':
CUDAContext.cpp:(.text._ZNSt6vectorI14cudaDevicePropSaIS0_EED2Ev[_ZNSt6vectorI14cudaDevicePropSaIS0_EED5Ev]+0x14): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `operator delete(void*, unsigned long)' defined in .text section in $PREFIX/lib/libcufile.so
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `std::deque<c10::once_flag, std::allocator<c10::once_flag> >::~deque()':
CUDAContext.cpp:(.text._ZNSt5dequeIN3c109once_flagESaIS1_EED2Ev[_ZNSt5dequeIN3c109once_flagESaIS1_EED5Ev]+0x36): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `operator delete(void*, unsigned long)' defined in .text section in $PREFIX/lib/libcufile.so
CUDAContext.cpp:(.text._ZNSt5dequeIN3c109once_flagESaIS1_EED2Ev[_ZNSt5dequeIN3c109once_flagESaIS1_EED5Ev]+0x52): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `operator delete(void*, unsigned long)' defined in .text section in $PREFIX/lib/libcufile.so
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `at::cuda::getCUDADeviceAllocator()':
CUDAContext.cpp:(.text._ZN2at4cuda22getCUDADeviceAllocatorEv+0x3): additional relocation overflows omitted from the output
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/detail/CUDAHooks.cpp.o: in function `at::cuda::detail::CUDAHooks::nvrtc() const':
CUDAHooks.cpp:(.text._ZNK2at4cuda6detail9CUDAHooks5nvrtcEv+0x43): failed to convert GOTPCREL relocation against '_ZN2at4cuda6detail9lazyNVRTCE'; relink with --no-relax
$BUILD_PREFIX/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: final link failed
collect2: error: ld returned 1 exit status

@h-vetinari
Member Author

Also @conda-forge/cuda, any inputs about the linker errors with CUDA 12.9 would be much appreciated 🙏

CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x2d): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol

@Tobias-Fischer
Contributor

Do we know why the Windows jobs don't run @h-vetinari?

@h-vetinari
Member Author

I've been trying to figure out the OOM issues and lack of parallelism with the people from prefix and cirun. We have no answer for the parallelism question yet, and I don't know why things run OOM - could be a machine health issue in principle (though hard to imagine with a cloud machine that just gets spun up on demand). The only other explanation would be that nvcc 12.9 needs more memory than 12.6, and we were close enough to the max. memory limit before this PR that we're now over it.

In any case, the recommendation was to try a larger runner (which is enabled for this feedstock already), but those runners don't seem to start at all. I'm waiting for a response from the cirun folks on whether there's a config issue at play (given that, I believe, this is the first time that this is being used in conda-forge).

@h-vetinari
Member Author

For those following along, it seems the vCPU quota for the runners in the Azure subscription by prefix had been lowered to 15 (unclear how; Wolf had said nothing changed), which meant that only one 2xl runner (=8 vCPU) could run at once, and no 4xl runner (=16 vCPU) at all. We've bumped that limit up now, so we're back to having parallel builds on Windows. 🥳
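
The arithmetic behind that observation, as a small sketch. The runner sizes and the limit of 15 come from the comment above; the function name and the simplifying assumption that all concurrent runners are the same size are illustrative only.

```python
def concurrent_runners(vcpu_quota, vcpus_per_runner):
    """Number of same-sized runners that fit into a vCPU quota (integer division)."""
    return vcpu_quota // vcpus_per_runner


QUOTA = 15  # lowered limit mentioned in the comment
RUNNER_VCPUS = {"2xl": 8, "4xl": 16}  # sizes as stated in the comment

for name, vcpus in RUNNER_VCPUS.items():
    print(name, concurrent_runners(QUOTA, vcpus))
```

With a quota of 15, only one 8-vCPU runner fits and no 16-vCPU runner fits, which is consistent with the loss of parallelism and with the larger runners never starting.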

Thanks Amit! 🙏

Now it's fingers crossed the bigger runners will not run into OOM anymore 🤞

@h-vetinari
Member Author

h-vetinari commented Jul 1, 2025

Now it's fingers crossed the bigger runners will not run into OOM anymore 🤞

Welllllll, it's still OOM-ing. I guess it's down to CUDA 12.9 after all, or perhaps the new architectures (which also caused issues elsewhere, e.g. compilation errors for xformers). Will try 12.8, but I kinda doubt that'll be much better.

CC @conda-forge/cuda

Edit: the xformers issues are real, but I had misremembered: 9.0 isn't a new architecture; also, that happened with CUDA 12.6

@h-vetinari
Member Author

Will try 12.8, but I kinda doubt that'll be much better.

Looks like I jinxed it in the positive sense - CUDA 12.8 seems to be doing fine where 12.9 was OOM-ing. 🥳

On the CPU side, we're down to a single failure

 ================================== FAILURES ===================================
____________________ TestMkldnnCPU.test_batch_norm_2d_cpu _____________________
[gw1] win32 -- Python 3.12.11 %PREFIX%\python.exe

self = <test_mkldnn.TestMkldnnCPU testMethod=test_batch_norm_2d_cpu>

    def test_batch_norm_2d(self):
        N = torch.randint(3, 10, (1,)).item()
        C = torch.randint(3, 100, (1,)).item()
        x = torch.randn(N, C, 35, 45, dtype=torch.float32) * 10
        self._test_batch_norm_base(dim=2, channels=C, input=x)
>       self._test_batch_norm_train_base(dim=2, channels=C, input=x)

test\test_mkldnn.py:1005: 
[...]
 >       return torch.batch_norm(
            input,
            weight,
            bias,
            running_mean,
            running_var,
            training,
            momentum,
            eps,
            torch.backends.cudnn.enabled,
        )
E       RuntimeError: could not execute a primitive

I vaguely remember seeing this failure before already, but there seems to be no further context. I'll see if adding some --retries will help...

h-vetinari added a commit that referenced this pull request Jul 2, 2025
@h-vetinari h-vetinari merged commit 693dd49 into conda-forge:main Jul 2, 2025
25 of 27 checks passed
@h-vetinari h-vetinari deleted the cuda129 branch July 2, 2025 00:28
@h-vetinari
Member Author

h-vetinari commented Jul 2, 2025

We're still running into conda/infrastructure#1159 on windows, and the work-around with the artefact persistence doesn't work (yet), because it runs into

 conda-build directory does not exist
Error: Process completed with exit code 1.

due to conda-forge/conda-smithy#2345

dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Jul 30, 2025
dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Jul 31, 2025
dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Aug 1, 2025
conda-forge/pytorch-cpu-feedstock#393)
* The vendored xnnpack in tensorflow 2.18 is incompatible with gcc14, so we pin to gcc13
* Hold back compiler versions for aarch64 to gcc-11 with cuda. cuda 12.8 only
  handles aarch64 neon if gcc is < 12
dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Aug 5, 2025
conda-forge/pytorch-cpu-feedstock#393)
* The vendored xnnpack in tensorflow 2.18 is incompatible with gcc14, so we pin to gcc13
* Hold back compiler versions for aarch64 to gcc-11 with cuda. cuda 12.8 only
  handles aarch64 neon if gcc is < 12
@h-vetinari
Member Author

CUDA 12.8 seems to be doing fine where 12.9 was OOM-ing. 🥳

For reference, this is a regression in nvcc, more info in pytorch/pytorch#156181

@h-vetinari
Member Author

BTW @dslarm, please don't refer to PRs in commit messages during development. Every single pushed iteration of such a commit (even if force-pushed away later) remains a permanent reference that clutters the timeline of the referenced PR.

Either only reference PRs in the final iteration, or better, refer to the respective commits rather than the associated PRs.

9 participants