Rebuild for CUDA 12.9 by h-vetinari · Pull Request #393 · conda-forge/pytorch-cpu-feedstock

h-vetinari · 2025-06-13T04:26:44Z

This is intended both as a mergeable PR and as a test/proof of conda-forge/conda-forge-pinning-feedstock#7476

Due to an issue with conda-smithy's merging logic, this is rerendered with conda-forge/conda-smithy#2335

Also partially based on #391 (the channel/artefact changes), so that we can try to work around conda/infrastructure#1159

conda-forge-admin · 2025-06-13T04:28:30Z

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/15996521093. Examine the logs at this URL for more detail.

mgorny

LGTM

h-vetinari · 2025-06-13T05:44:40Z

 +      set(arch_name ${CMAKE_MATCH_1})
 +    endif()
-+    if(arch_name MATCHES "^([0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
+    if(arch_name MATCHES "^(1?[0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")


This is the only actually relevant change in 0d06933
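
To illustrate what this one-line regex change does, here is a minimal Python sketch. It assumes that Python's `re` module treats these particular patterns the same way CMake's `MATCHES` does (they only use character classes, escaped literals and optional groups); the helper name `accepts` and the sample values are illustrative and not part of the patch.

```python
import re

# Patterns transcribed from the patch; CMake's doubled backslashes ("\\.")
# become single ones in a Python raw string.
OLD = re.compile(r"^([0-9]\.[0-9](\([0-9]\.[0-9]\))?)$")
NEW = re.compile(r"^(1?[0-9]\.[0-9](\([0-9]\.[0-9]\))?)$")


def accepts(pattern, arch_name):
    """Return True if the architecture name matches the whole pattern."""
    return pattern.match(arch_name) is not None


# Compare how both patterns treat single- and double-digit major versions.
for arch in ["8.6", "9.0", "10.0", "12.0"]:
    print(arch, accepts(OLD, arch), accepts(NEW, arch))
```

The optional leading `1` is what lets two-digit major versions such as 10.0 and 12.0 through, while single-digit versions keep matching as before.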

mgorny · 2025-06-13T13:50:24Z

Well, the good news here is that the failures we've been seeing with 2.7.1 aren't regressions in 2.7.1 but external regressions.

h-vetinari · 2025-06-13T19:44:23Z

> Well, the good news here is that the failures we've been seeing with 2.7.1 aren't regressions in 2.7.1 but external regressions.

I'm certain it's due to numpy 2.3. The py39/py310 jobs aren't failing, where that version doesn't exist. I wanted to check for upstream patches before adding a cap for ~4 tests, but ran out of steam yesterday. Looking now.

h-vetinari · 2025-06-13T20:10:00Z

So the should_error portion of test_norm_matrix_degenerate_shapes was removed in pytorch/pytorch@ff7b6d6, but that looks waaaaaay too big for a backport.

Given that it's really a corner case (that results of norm for degenerate shapes match numpy), and that AFAICT upstream pytorch 2.7.0 has no bound on numpy either, let's just skip those tests.
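
As a rough sketch of what skipping such tests can look like, the snippet below builds a pytest `-k` expression that deselects tests by name. How the feedstock actually wires up its skips is not shown in this thread; the helper `build_k_expression` is hypothetical, and only the test name comes from the comment above.

```python
def build_k_expression(skipped_tests):
    """Build a pytest -k expression that deselects the given test names."""
    if not skipped_tests:
        return ""
    return "not (" + " or ".join(skipped_tests) + ")"


# Test name taken from the discussion; any further entries would be added here.
skips = ["test_norm_matrix_degenerate_shapes"]

# The resulting argument list could be handed to a pytest invocation.
print(["pytest", "-k", build_k_expression(skips)])
```

Since `-k` matches on substrings of test names, a single entry also deselects parametrized or device-specific variants whose names contain it.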

h-vetinari · 2025-06-14T07:40:24Z

-        12.[0-6])
-            export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"
+        12.[89])
+            export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0;10.0;12.0+PTX"


@conda-forge/cuda @conda-forge/pytorch-cpu @isuruf
Feedback on what would be a good (sub)set of CUDA architectures here would be much appreciated. If we just keep adding architectures, we will keep further increasing the build time, which is already extreme.

FWIW, upstream uses a much-reduced set, which drops <7.5 as well as 8.9 compared to what we have now:

TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0

I agree with dropping "older than 7.5". I might even go as far as saying to drop "20 series" cards but that might be extreme.

https://developer.nvidia.com/cuda-gpus

There could be "alternatives" to explore:

Such as ABI3 builds, which should greatly reduce the build matrix right?

Personally, I would prefer to drop "pythons" rather than to drop "compute capabilities".

The reason is:

Those that are on old pythons, are likely on old hardware.

Their ability to install "python" with a non-confusing support matrix will remain

"pytorch + cuda from conda-forge used to work for my hardware, and now it will continue to work"

I would go as far as dropping python 3.9, AND 3.10 leaving only 3.11 and 3.12.

TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0

Currently, RAPIDS builds this set plus 7.0 (Volta). (reference)

RAPIDS ships PTX for the latest arch only (12.0), to allow forward compatibility.

Compute capabilities older than 7.5 are deprecated as of CUDA 12.8. (reference)

Thanks for the inputs. I know there's always the next thing to cut, but for now I'm mostly concerned with not blowing up our CI pipelines further. I like the idea of going with the rapids set

TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0

which would still drop a net two architectures (minus 5.0;6.0;6.1;8.9, plus 10.0;12.0). I did see somewhere that CUDA 12.9 also has 10.1 and 12.1 - are those relevant here?

For python that's a separate discussion, but ultimately my position there is that we should match upstream support unless extreme circumstances prevent us from doing so.

Doesn't dropping 8.9 effectively kill the entire 40 series lineup?
https://developer.nvidia.com/cuda-gpus

I assume they would then use the 8.6 instructions; not sure how big of an improvement the delta between 8.9 and 8.6 provides, hence why I'm asking for inputs in choosing the set of architectures.

I think the comment above re 8.9 is correct. Upstream has never built for 8.9 specifically I believe. In the absence of evidence of there being significant performance or support differences, 8.9 should be dropped.

> FWIW, upstream uses a much-reduced set, which drops <7.5 as well as 8.9 compared to what we have now:

It's important to note that upstream ships multiple sets of wheels. The CUDA 11.8 config isn't relevant anymore for conda-forge, but the CUDA 12.6 (default for PyPI still) and 12.8 (hosted separately) are:

12.6: TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"

12.8/12.9: TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases

Just to make sure I understand this PR: it will result in only building with CUDA 12.9, right? There is no separate 12.6 or lower build that keeps support for 5.0/6.0?

If so, that seems a bit aggressive to me. That drops support for all Quadro M, Quadro P and GeForce GTX cards. The latter would impact me, I have a GTX 1080Ti in my main dev machine. It's getting old, but for development purposes it has been fine until now. Looking at the legacy compute capability table, I'd say 6.0 is at least as important as 7.0, so tacking 7.0 as the one legacy version onto the CUDA 12.9 config may not be justified.

Is it possible to split the builds like upstream does? CUDA 12.6 with legacy support, and 12.8 or 12.9 without it? On the one hand that'd be even more build time, on the other hand it's then possible to keep a large range of support and perhaps trim the architectures per build even further (e.g., 2x 5 architectures rather than the 10 in a single build that there are now).

Fair point about the upstream split into 12.6 / 12.8. Indeed we would only do the CUDA build once, and it's clearly preferable to me to have one build with more architectures, rather than a 12.6 and a 12.9 build with different subsets.

I'd be equally fine to do

TORCH_CUDA_ARCH_LIST=5.0;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0

which would trade 6.1 & 8.9 for 10.0 & 12.0. I can even imagine building the full set (as currently committed) if there are good reasons to do so - I only wanted to make sure that we choose the set of architectures deliberately, rather than just by inertia.
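
The trade-offs discussed in this thread can be checked mechanically. The following is a minimal sketch that compares the architecture list from before this PR with the two candidate sets mentioned above; the function and constant names are illustrative, and the parsing only handles the `major.minor` and `+PTX` forms that appear in this discussion.

```python
def parse_arch_list(value):
    """Split a TORCH_CUDA_ARCH_LIST string into a set of 'major.minor' strings."""
    return {item.replace("+PTX", "").strip() for item in value.split(";") if item.strip()}


def sort_key(arch):
    """Sort numerically so that 10.0 comes after 9.0."""
    major, minor = arch.split(".")
    return (int(major), int(minor))


def compare(old, new):
    """Return (dropped, added) architectures when moving from old to new."""
    old_set, new_set = parse_arch_list(old), parse_arch_list(new)
    dropped = sorted(old_set - new_set, key=sort_key)
    added = sorted(new_set - old_set, key=sort_key)
    return dropped, added


# Lists copied from the discussion above.
CURRENT = "5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"
RAPIDS_LIKE = "7.0;7.5;8.0;8.6;9.0;10.0;12.0"
WITH_LEGACY = "5.0;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0"

for name, candidate in [("rapids-like", RAPIDS_LIKE), ("with legacy", WITH_LEGACY)]:
    dropped, added = compare(CURRENT, candidate)
    print(name, "dropped:", dropped, "added:", added)
```

The comparison says nothing about build time or runtime performance per architecture; it only makes explicit which compute capabilities each proposal gains or loses.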

h-vetinari · 2025-06-14T09:48:43Z

Interesting build failure for the CUDA builds:

$BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/usr/lib/../lib/crti.o: in function `_init':
(.init+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `__gmon_start__'
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `at::cuda::(anonymous namespace)::initDeviceProperty(signed char)':
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x2d): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `memset@@GLIBC_2.2.5' defined in .text section in $BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/lib64/libc.so.6
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x39): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `cudaGetDeviceProperties_v2@@libcudart.so.12' defined in .text section in $PREFIX/lib/libcudart.so
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x5a): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool)' defined in .text section in lib/libc10_cuda.so
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x74): relocation truncated to fit: R_X86_64_PC32 against `.bss'
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x7a): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `memcpy@@GLIBC_2.14' defined in .text section in $BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/lib64/libc.so.6
CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x9d): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `__stack_chk_fail@@GLIBC_2.4' defined in .text section in $BUILD_PREFIX/bin/../x86_64-conda-linux-gnu/sysroot/lib64/libc.so.6
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `std::vector<cudaDeviceProp, std::allocator<cudaDeviceProp> >::~vector()':
CUDAContext.cpp:(.text._ZNSt6vectorI14cudaDevicePropSaIS0_EED2Ev[_ZNSt6vectorI14cudaDevicePropSaIS0_EED5Ev]+0x14): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `operator delete(void*, unsigned long)' defined in .text section in $PREFIX/lib/libcufile.so
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `std::deque<c10::once_flag, std::allocator<c10::once_flag> >::~deque()':
CUDAContext.cpp:(.text._ZNSt5dequeIN3c109once_flagESaIS1_EED2Ev[_ZNSt5dequeIN3c109once_flagESaIS1_EED5Ev]+0x36): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `operator delete(void*, unsigned long)' defined in .text section in $PREFIX/lib/libcufile.so
CUDAContext.cpp:(.text._ZNSt5dequeIN3c109once_flagESaIS1_EED2Ev[_ZNSt5dequeIN3c109once_flagESaIS1_EED5Ev]+0x52): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol `operator delete(void*, unsigned long)' defined in .text section in $PREFIX/lib/libcufile.so
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/CUDAContext.cpp.o: in function `at::cuda::getCUDADeviceAllocator()':
CUDAContext.cpp:(.text._ZN2at4cuda22getCUDADeviceAllocatorEv+0x3): additional relocation overflows omitted from the output
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/detail/CUDAHooks.cpp.o: in function `at::cuda::detail::CUDAHooks::nvrtc() const':
CUDAHooks.cpp:(.text._ZNK2at4cuda6detail9CUDAHooks5nvrtcEv+0x43): failed to convert GOTPCREL relocation against '_ZN2at4cuda6detail9lazyNVRTCE'; relink with --no-relax
$BUILD_PREFIX/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: final link failed
collect2: error: ld returned 1 exit status
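
For readers unfamiliar with this class of error: relocation types such as `R_X86_64_PC32` and `R_X86_64_GOTPCREL` store a signed 32-bit offset, so the linker reports "relocation truncated to fit" when the computed offset falls outside roughly ±2 GiB. The sketch below illustrates only that range check for the PC-relative case (S + A - P). The addresses are made up, and nothing in this thread establishes what pushed this particular link over the limit.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1
GIB = 2**30


def pc_relative_value(symbol, addend, place):
    """Value stored by a PC-relative relocation: S + A - P."""
    return symbol + addend - place


def fits_in_32_bits(value):
    """True if the value can be encoded as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# Hypothetical addresses: one target about 1 GiB away, one about 3 GiB away.
near = pc_relative_value(symbol=1 * GIB, addend=-4, place=0x1000)
far = pc_relative_value(symbol=3 * GIB, addend=-4, place=0x1000)

print(fits_in_32_bits(near), fits_in_32_bits(far))
```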

h-vetinari · 2025-06-14T20:24:22Z

Also @conda-forge/cuda, any inputs about the linker errors with CUDA 12.9 would be much appreciated 🙏

CUDAContext.cpp:(.text._ZN2at4cuda12_GLOBAL__N_118initDevicePropertyEa+0x2d): relocation truncated to fit: R_X86_64_GOTPCRELX against symbol


Tobias-Fischer · 2025-07-01T00:45:43Z

Do we know why the Windows jobs don't run @h-vetinari?

h-vetinari · 2025-07-01T01:24:21Z

I've been trying to figure out the OOM issues and lack of parallelism with the people from prefix and cirun. We have no answer for the parallelism question yet, and I don't know why things run OOM - could be a machine health issue in principle (though hard to imagine with a cloud machine that just gets spun up on demand). The only other explanation would be that nvcc 12.9 needs more memory than 12.6, and we were close enough to the max. memory limit before this PR that we're now over it.

In any case, the recommendation was to try a larger runner (which is enabled for this feedstock already), but those runners don't seem to start at all. I'm waiting for a response from the cirun folks on whether there's a config issue at play (given that - I believe - this is the first time that this is being used in conda-forge).

h-vetinari · 2025-07-01T06:40:23Z

For those following along, it seems the vCPU count for the runners in the azure subscription by prefix had been lowered (unclear how, Wolf had said nothing changed) to 15, which meant that only one 2xl runner (=8 vCPU) could run at once, and no 4xl runner (=16 vCPU) at all. We've bumped that limit up now, so we're back to having parallel builds on windows. 🥳
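
The effect of the lowered quota follows from simple integer division. A minimal sketch, assuming the quota is shared by runners of a single size at a time (the function name is illustrative; the numbers are the ones quoted above):

```python
def max_concurrent_runners(vcpu_quota, vcpus_per_runner):
    """Number of runners of one size that fit into the vCPU quota."""
    return vcpu_quota // vcpus_per_runner


QUOTA = 15  # lowered vCPU limit mentioned above
RUNNER_SIZES = {"2xl": 8, "4xl": 16}  # vCPUs per runner size

for name, vcpus in RUNNER_SIZES.items():
    print(name, max_concurrent_runners(QUOTA, vcpus))
```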

Thanks Amit! 🙏

Now it's fingers crossed the bigger runners will not run into OOM anymore 🤞

h-vetinari · 2025-07-01T10:10:20Z

> Now it's fingers crossed the bigger runners will not run into OOM anymore 🤞

Welllllll, it's still OOM-ing. I guess it's down to CUDA 12.9 after all, ~~or perhaps the new architectures (which also caused issues elsewhere, e.g. compilation errors for xformers)~~. Will try 12.8, but I kinda doubt that'll be much better.

CC @conda-forge/cuda

Edit: the xformers issues are real, but I had misremembered: 9.0 isn't a new architecture; also, that happened with CUDA 12.6

h-vetinari · 2025-07-01T20:36:29Z

> Will try 12.8, but I kinda doubt that'll be much better.

Looks like I jinxed it in the positive sense - CUDA 12.8 seems to be doing fine where 12.9 was OOM-ing. 🥳

On the CPU side, we're down to a single failure

 ================================== FAILURES ===================================
____________________ TestMkldnnCPU.test_batch_norm_2d_cpu _____________________
[gw1] win32 -- Python 3.12.11 %PREFIX%\python.exe

self = <test_mkldnn.TestMkldnnCPU testMethod=test_batch_norm_2d_cpu>

    def test_batch_norm_2d(self):
        N = torch.randint(3, 10, (1,)).item()
        C = torch.randint(3, 100, (1,)).item()
        x = torch.randn(N, C, 35, 45, dtype=torch.float32) * 10
        self._test_batch_norm_base(dim=2, channels=C, input=x)
>       self._test_batch_norm_train_base(dim=2, channels=C, input=x)

test\test_mkldnn.py:1005: 
[...]
 >       return torch.batch_norm(
            input,
            weight,
            bias,
            running_mean,
            running_var,
            training,
            momentum,
            eps,
            torch.backends.cudnn.enabled,
        )
E       RuntimeError: could not execute a primitive

I vaguely remember seeing this failure before already, but there seems to be no further context. I'll see if adding some --retries will help...

h-vetinari · 2025-07-02T20:58:36Z

We're still running into conda/infrastructure#1159 on windows, and the work-around with the artefact persistence doesn't work (yet), because it runs into

 conda-build directory does not exist
Error: Process completed with exit code 1.

due to conda-forge/conda-smithy#2345

conda-forge/pytorch-cpu-feedstock#393) * The vendored xnnpack in tensorflow 2.18 is incompatible with gcc14, so we pin to gcc13 * Hold back compiler versions for aarch64 to gcc-11 with cuda. cuda 12.8 only handles aarch64 neon if gcc is < 12

h-vetinari · 2025-08-06T19:53:01Z

> CUDA 12.8 seems to be doing fine where 12.9 was OOM-ing. 🥳

For reference, this is a regression in nvcc, more info in pytorch/pytorch#156181

h-vetinari · 2025-08-07T02:29:04Z

BTW @dslarm, please don't refer to PRs in commit messages during development. Every single pushed iteration of such a commit (even if force-pushed away later) remains a permanent reference that clutters the timeline of the referenced PR.

Either only reference PRs in the final iteration, or better, refer to the respective commits rather than the associated PRs.

h-vetinari requested review from Tobias-Fischer, baszalmstra, beckermr, benjaminrwilson, hmaarrfk, jeongseok-meta, mgorny and sodre as code owners June 13, 2025 04:26

h-vetinari mentioned this pull request Jun 13, 2025

Migrate for CUDA 12.9 conda-forge/conda-forge-pinning-feedstock#7476

Merged

h-vetinari force-pushed the cuda129 branch 2 times, most recently from ccd4694 to 67e7fd3 Compare June 13, 2025 04:56

mgorny approved these changes Jun 13, 2025

h-vetinari mentioned this pull request Jun 13, 2025

add linux_aarch64 conda-forge/tensorflow-feedstock#426

Closed

Tobias-Fischer mentioned this pull request Jun 15, 2025

support for cuda 12.8 VSLAM-LAB/VSLAM-LAB#18

Closed

weiji14 mentioned this pull request Jun 16, 2025

flash-attn v2.8.0.post2 conda-forge/flash-attn-feedstock#36

Closed

h-vetinari mentioned this pull request Jun 17, 2025

pytorch 2.7.1; switch label on windows; turn on artefact persistence #391

Merged

h-vetinari added 6 commits June 19, 2025 07:29

add CUDA 12.9 migrator

e8a0608

update TORCH_CUDA_ARCH_LIST

216502d

drop obsolete numpy2 migrator

bbb8a0b

remove obsolete skip for CUDA 11.8

f532e06

Revert "push windows builds to different label"

1f1bdee

This reverts commit 5917d8c.

bump build number

0ceb9ba

h-vetinari added 2 commits June 28, 2025 19:59

back to vs2019 on windows

846da4d

use bigger machine on windows; vs2022 again

52b2181

use CUDA 12.8 on windows

693dd49

h-vetinari added a commit that referenced this pull request Jul 2, 2025

Merge pull request #393 from h-vetinari/cuda129

b9aa810

h-vetinari merged commit 693dd49 into conda-forge:main Jul 2, 2025
25 of 27 checks passed

h-vetinari deleted the cuda129 branch July 2, 2025 00:28

h-vetinari mentioned this pull request Jul 2, 2025

Failing to upload large windows packages from cirun queue (apparent timeout?) conda/infrastructure#1159

Open

This was referenced Jul 3, 2025

Fix artefact persistence on windows #395

Merged

Prune torch cuda arch list to match upstream #306

Closed

Adding CUDA 12.8 Support #390

Closed

Fix dot on MKL builds #399

Merged

h-vetinari mentioned this pull request Jul 25, 2025

[main] Upgrade to CUDA 12.9 #404

Closed

dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Jul 30, 2025

Added cuda 12.9 migrations (copy of conda-forge/pytorch-cpu-feedstock#393)

c44d439

dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Jul 31, 2025

Added cuda 12.9 migrations (copy of conda-forge/pytorch-cpu-feedstock#393)

01488c2

dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Jul 31, 2025

Added cuda 12.9 migrations (copy of conda-forge/pytorch-cpu-feedstock#393)

679a6a9

dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Jul 31, 2025

Added cuda 12.9 migrations (copy of conda-forge/pytorch-cpu-feedstock#393) Incompatible with gcc14, so we pin to gcc13

31f8a61

dslarm added a commit to dslarm/tensorflow-feedstock that referenced this pull request Aug 1, 2025

Added cuda 12.9 migrations (copy of conda-forge/pytorch-cpu-feedstock#393) Incompatible with gcc14, so we pin to gcc13

0af4189
