Rebuild for CUDA 12.9 #393
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR ( I do have some suggestions for making it better though... For recipe/meta.yaml:
This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/15996521093. Examine the logs at this URL for more detail.
ccd4694 to 67e7fd3
+ set(arch_name ${CMAKE_MATCH_1})
+ endif()
+ if(arch_name MATCHES "^([0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
+ if(arch_name MATCHES "^(1?[0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
This is the only actually relevant change in 0d06933
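To make the effect of that one-character-class change concrete, here is a minimal sketch that transcribes the two `MATCHES` patterns into Python's `re` syntax (CMake's doubled backslashes collapsed to single ones). The helper name `accepted` and the sample architecture names are illustrative only, and reading the `1?` variant as the updated pattern is an assumption, since the diff markers did not survive.

```python
import re

# Patterns transcribed from the two CMake lines above; only the optional
# leading "1" differs between them.
SINGLE_DIGIT_MAJOR = re.compile(r"^([0-9]\.[0-9](\([0-9]\.[0-9]\))?)$")
OPTIONAL_LEADING_ONE = re.compile(r"^(1?[0-9]\.[0-9](\([0-9]\.[0-9]\))?)$")


def accepted(pattern, names):
    """Return the architecture names that the given pattern matches."""
    return [name for name in names if pattern.match(name)]


# Hypothetical inputs chosen to exercise both patterns.
names = ["5.0", "8.6", "9.0", "8.6(8.0)", "10.0", "12.0"]
print("single-digit major:", accepted(SINGLE_DIGIT_MAJOR, names))
print("optional leading 1:", accepted(OPTIONAL_LEADING_ONE, names))
```

Under this reading, the single-digit pattern rejects `10.0` and `12.0`, which would explain why the change is needed once those architectures enter the list.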
Well, the good news here is that the failures we've been seeing with 2.7.1 aren't regressions in 2.7.1 but an external regression.
I'm certain it's due to numpy 2.3. The py39/py310 jobs, where that version doesn't exist, aren't failing. I wanted to check for upstream patches before adding a cap for ~4 tests, but ran out of steam yesterday. Looking now.
So the Given that it's really a corner case (that results of norm for degenerate shapes match numpy), and that AFAICT upstream pytorch 2.7.0 has no bound on numpy either, let's just skip those tests.
12.[0-6])
    export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"
12.[89])
    export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0;10.0;12.0+PTX"
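For readers less used to shell `case` globs, the two branches shown above can be sketched as a lookup in Python. This is only an illustration: `ARCH_LISTS` and `arch_list_for` are made-up names, only the two branches visible in the diff are reproduced, and any other branches of the real build script are not known from this excerpt.

```python
from fnmatch import fnmatchcase

# Glob pattern -> architecture list, copied from the two case branches above.
ARCH_LISTS = [
    ("12.[0-6]", "5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"),
    ("12.[89]", "5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0;10.0;12.0+PTX"),
]


def arch_list_for(cuda_version):
    """Return the arch list of the first matching branch, or None if none match."""
    for pattern, arch_list in ARCH_LISTS:
        if fnmatchcase(cuda_version, pattern):
            return arch_list
    return None


for version in ("12.6", "12.9", "11.8"):
    print(version, "->", arch_list_for(version))
```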
@conda-forge/cuda @conda-forge/pytorch-cpu @isuruf
Feedback on what would be a good (sub)set of CUDA architectures here would be much appreciated. If we just keep adding architectures, we will keep further increasing the build time, which is already extreme.
FWIW, upstream uses a much-reduced set, which drops <7.5 as well as 8.9 compared to what we have now:
TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0
I agree with dropping "older than 7.5". I might even go as far as saying to drop "20 series" cards but that might be extreme.
https://developer.nvidia.com/cuda-gpus
There could be "alternatives" to explore:
- Such as ABI3 builds, which should greatly reduce the build matrix right?
Personally, I would prefer to drop "pythons" rather than to drop "compute capabilities".
The reason is:
- Those who are on old pythons are likely on old hardware.
- Their ability to install "python" with a non-confusing support matrix will remain:
"pytorch + cuda from conda-forge used to work for my hardware, and now it will continue to work"
I would go as far as dropping python 3.9 AND 3.10, leaving only 3.11 and 3.12.
Thanks for the inputs. I know there's always the next thing to cut, but for now I'm mostly concerned with not blowing up our CI pipelines further. I like the idea of going with the rapids set:
TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0
which would still drop a net two architectures (minus 5.0;6.0;6.1;8.9, plus 10.0;12.0). I did see somewhere that CUDA 12.9 also has 10.1 and 12.1 - are those relevant here?
For python that's a separate discussion, but ultimately my position there is that we should match upstream support unless extreme circumstances prevent us from doing so.
Doesn't dropping 8.9 effectively kill the entire 40 series lineup?
https://developer.nvidia.com/cuda-gpus
I assume they would then use the 8.6 instructions; not sure how big of an improvement the delta between 8.9 and 8.6 provides, hence why I'm asking for inputs in choosing the set of architectures.
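The fallback assumed here can be sketched with a simplified model of CUDA's compatibility rules: native code built for X.y runs on devices X.z with z >= y, and PTX can be JIT-compiled for any device at or above its target. The function names below are hypothetical, and this is not how PyTorch or the CUDA driver is actually implemented; it only illustrates which entry in an arch list a given device could fall back to.

```python
def parse_arch_list(spec):
    """Split 'a.b;c.d+PTX' into (native, ptx) lists of (major, minor) tuples."""
    native, ptx = [], []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        has_ptx = entry.endswith("+PTX")
        name = entry[: -len("+PTX")] if has_ptx else entry
        major, minor = (int(part) for part in name.split("."))
        native.append((major, minor))  # "+PTX" entries also get native code
        if has_ptx:
            ptx.append((major, minor))
    return native, ptx


def pick_code(device, spec):
    """Pick the best usable entry for a device under the simplified rules above."""
    native, ptx = parse_arch_list(spec)
    # Native code: same major version, minor no newer than the device.
    usable = [a for a in native if a[0] == device[0] and a[1] <= device[1]]
    if usable:
        return ("sass", max(usable))
    # Otherwise fall back to JIT-compiling PTX that targets an older arch.
    jit = [a for a in ptx if a <= device]
    if jit:
        return ("ptx-jit", max(jit))
    return None


upstream_128 = "7.5;8.0;8.6;9.0;10.0;12.0+PTX"
print(pick_code((8, 9), upstream_128))  # an 8.9 device
print(pick_code((6, 1), upstream_128))  # a 6.1 device
```

In this model an 8.9 device lands on the 8.6 native code, while a 6.1 device finds nothing usable in that list, because the only PTX entry targets 12.0.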
I think the comment above re 8.9 is correct. Upstream has never built for 8.9 specifically I believe. In the absence of evidence of there being significant performance or support differences, 8.9 should be dropped.
> FWIW, upstream uses a much-reduced set, which drops <7.5 as well as 8.9 compared to what we have now:
It's important to note that upstream ships multiple sets of wheels. The CUDA 11.8 config isn't relevant anymore for conda-forge, but the CUDA 12.6 (default for PyPI still) and 12.8 (hosted separately) are:
- 12.6:
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
- 12.8/12.9:
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
Just to make sure I understand this PR: it will result in only building with CUDA 12.9, right? There is no separate 12.6 or lower build that keeps support for 5.0/6.0?
If so, that seems a bit aggressive to me. That drops support for all Quadro M, Quadro P and GeForce GTX cards. The latter would impact me, I have a GTX 1080Ti in my main dev machine. It's getting old, but for development purposes it has been fine until now. Looking at the legacy compute capability table, I'd say 6.0 is at least as important as 7.0, so tacking 7.0 as the one legacy version onto the CUDA 12.9 config may not be justified.
Is it possible to split the builds like upstream does? CUDA 12.6 with legacy support, and 12.8 or 12.9 without it? On the one hand that'd be even more build time, on the other hand it's then possible to keep a large range of support and perhaps trim the architectures per build even further (e.g., 2x 5 architectures rather than the 10 in a single build that there are now).
Fair point about the upstream split into 12.6 / 12.8. Indeed we would only do the CUDA build once, and it's clearly preferable to me to have one build with more architectures, rather than a 12.6 and a 12.9 build with different subsets.
I'd be equally fine to do
TORCH_CUDA_ARCH_LIST=5.0;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0
which would trade 6.1 & 8.9 for 10.0 & 12.0. I can even imagine building the full set (as currently committed) if there are good reasons to do so - I only wanted to make sure that we choose the set of architectures deliberately, rather than just by inertia.
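The trade-offs between the proposals in this thread can be checked mechanically. The sketch below compares the two candidate lists against the pre-PR list; it assumes that "what we have now" is the `12.[0-6]` branch shown in the diff, and the names `compare`, `RAPIDS_LIKE` and `WITH_LEGACY` are labels invented for this example.

```python
def parse(spec):
    """Return the architectures in an arch-list string, ignoring any +PTX suffix."""
    return [entry.replace("+PTX", "") for entry in spec.split(";") if entry]


def as_number(arch):
    """Sort key so that '10.0' orders after '9.0' rather than before '5.0'."""
    major, minor = arch.split(".")
    return (int(major), int(minor))


def compare(current, proposed):
    """Return (dropped, added, net change in number of architectures)."""
    cur, new = set(parse(current)), set(parse(proposed))
    dropped = sorted(cur - new, key=as_number)
    added = sorted(new - cur, key=as_number)
    return dropped, added, len(new) - len(cur)


CURRENT = "5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"    # assumed pre-PR list
RAPIDS_LIKE = "7.0;7.5;8.0;8.6;9.0;10.0;12.0"          # first proposal
WITH_LEGACY = "5.0;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0"  # second proposal

for label, proposal in [("rapids-like", RAPIDS_LIKE), ("with legacy", WITH_LEGACY)]:
    dropped, added, net = compare(CURRENT, proposal)
    print(f"{label}: dropped={dropped} added={added} net={net:+d}")
```

This reproduces the counts given in the comments: the first proposal is a net reduction of two architectures, the second keeps the count unchanged.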
Interesting build failure for the CUDA builds:
Also @conda-forge/cuda, any input about the linker errors with CUDA 12.9 would be much appreciated 🙏
Do we know why the Windows jobs don't run, @h-vetinari?
I've been trying to figure out the OOM issues and lack of parallelism with the people from prefix and cirun. We have no answer for the parallelism question yet, and I don't know why things run OOM - could be a machine health issue in principle (though hard to imagine with a cloud machine that just gets spun up on demand). The only other explanation would be that nvcc 12.9 needs more memory than 12.6, and we were close enough to the maximum memory limit before this PR that we're now over it. In any case, the recommendation was to try a larger runner (which is enabled for this feedstock already), but those runners don't seem to start at all. I'm waiting for a response from the cirun folks on whether there's a config issue at play (given that - I believe - this is the first time that this is being used in conda-forge).
For those following along, it seems the vCPU count for the runners in the azure subscription by prefix had been lowered (unclear how, Wolf had said nothing changed) to 15, which meant that only one 2xl runner (=8 vCPU) could run at once, and no 4xl runner (=16 vCPU) at all. We've bumped that limit up now, so we're back to having parallel builds on windows. 🥳 Thanks Amit! 🙏 Now it's fingers crossed the bigger runners will not run into OOM anymore 🤞
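The quota arithmetic behind that observation is just integer division, sketched here with a hypothetical helper `max_concurrent` and the figures quoted in the comment (a 15 vCPU quota, 8 vCPU for a 2xl runner, 16 vCPU for a 4xl runner).

```python
def max_concurrent(vcpu_quota, vcpus_per_runner):
    """Number of runners of a given size that fit into a vCPU quota."""
    return vcpu_quota // vcpus_per_runner


QUOTA = 15  # the lowered vCPU limit mentioned above
for name, vcpus in [("2xl", 8), ("4xl", 16)]:
    print(f"{name}: {max_concurrent(QUOTA, vcpus)} runner(s) at once")
```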
Welllllll, it's still OOM-ing. I guess it's down to CUDA 12.9 after all, CC @conda-forge/cuda
Edit: the xformers issues are real, but I had misremembered: 9.0 isn't a new architecture; also, that happened with CUDA 12.6
Looks like I jinxed it in the positive sense - CUDA 12.8 seems to be doing fine where 12.9 was OOM-ing. 🥳
On the CPU side, we're down to a single failure
I vaguely remember seeing this failure before already, but there seems to be no further context. I'll see if adding some
We're still running into conda/infrastructure#1159 on windows, and the work-around with the artefact persistence doesn't work (yet), because it runs into
conda-forge/pytorch-cpu-feedstock#393) Incompatible with gcc14, so we pin to gcc13
conda-forge/pytorch-cpu-feedstock#393) * The vendored xnnpack in tensorflow 2.18 is incompatible with gcc14, so we pin to gcc13 * Hold back compiler versions for aarch64 to gcc-11 with cuda. cuda 12.8 only handles aarch64 neon if gcc is < 12
For reference, this is a regression in nvcc; more info in pytorch/pytorch#156181
BTW @dslarm, please don't refer to PRs in commit messages during development. Every single pushed iteration of such a commit (even if force pushed away later) remains a permanent reference that clutters the timeline of the referenced PR. Either only reference PRs in the final iteration, or better, refer to the respective commits rather than the associated PRs.
This is intended both as a mergeable PR and as a test/proof of conda-forge/conda-forge-pinning-feedstock#7476
Due to an issue with conda-smithy's merging logic, this is rerendered with conda-forge/conda-smithy#2335
Also partially based on #391 (the channel/artefact changes), so that we can try to work around conda/infrastructure#1159