
Sort values for cuda_compute_capabilities templates #5144

Open
Flamefire wants to merge 4 commits into easybuilders:develop from Flamefire:sort-cuda-ccs

Conversation

@Flamefire
Contributor

@Flamefire Flamefire commented Mar 10, 2026

There have been instances where the order matters, so sort the values.
Even if that doesn't solve every issue, the result will at least be consistent.

E.g. --cuda-compute-capabilities=7.5,8.0 should be the same as --cuda-compute-capabilities=8.0,7.5

This is also important to avoid issues with Blackwell, which introduced CUDA CC 10.x: when sorted naively (as strings), 10.x incorrectly ends up as the "lowest" CC.
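
A minimal sketch of the two points above, not the code from this PR: the helper names `cc_key` and `normalize_ccs` are hypothetical, and the sketch assumes plain `major.minor` values without suffixes.

```python
def cc_key(cc):
    """Numeric sort key for a plain 'major.minor' compute capability string."""
    major, minor = cc.split(".")
    return int(major), int(minor)


def normalize_ccs(option_value):
    """Split a comma-separated option value, drop duplicates, sort numerically.

    Hypothetical helper for illustration only.
    """
    unique = {cc.strip() for cc in option_value.split(",") if cc.strip()}
    return sorted(unique, key=cc_key)


# Both orderings normalize to the same list.
print(normalize_ccs("7.5,8.0") == normalize_ccs("8.0,7.5"))

# Numeric comparison keeps 10.0 after 9.0 ...
print(normalize_ccs("10.0,9.0,8.0"))
# ... while a plain string sort puts "10.0" first.
print(sorted(["10.0", "9.0", "8.0"]))
```

Comparing integer tuples rather than strings is what keeps two-digit major versions in the right place.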

See companion easyblock PR:

@boegel
Member

boegel commented Apr 7, 2026

Although this makes sense in general, strictly speaking this is a backwards-incompatible change.

Don't we have easyblocks that make assumptions about the order in which CUDA compute capabilities are specified, where the first one listed is the preferred one to use (there are cases where using multiple ones isn't possible)?

@Flamefire
Contributor Author

Flamefire commented Apr 7, 2026

IIRC we only sort and take the lowest or highest.
See the companion PR for easyblocks, where I replaced most usages, including the sorted assumption.

@boegel
Member

boegel commented Apr 8, 2026

@casparvl Thoughts on this?

@casparvl
Contributor

casparvl commented Apr 9, 2026

I like the idea of this PR. Right now, it's up to the easyblock to do something sensible with this list of compute capabilities. E.g. the LAMMPS easyblock does some sorting https://github.com/easybuilders/easybuild-easyblocks/blob/b01ea7afd303b0456f860588f9e1c6acc44509a6/easybuild/easyblocks/l/lammps.py#L334 but indeed seems to suffer from the issue mentioned by @Flamefire: it calls sorted but doesn't specify a LooseVersion key. Thus, a list like ["9.0a", "10.0f", "8.0", "12.0f"] will be sorted to ['10.0f', '12.0f', '8.0', '9.0a'] (I think).
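
The sorting behavior described above can be checked with a small standalone sketch. It deliberately avoids LooseVersion so that it runs without EasyBuild installed; the key function `cc_sort_key` and its handling of suffixes (a bare value sorts before the same value with a suffix) are assumptions for illustration, not the ordering implemented in this PR.

```python
import re

# 'major.minor' with an optional single-letter suffix such as 'a' or 'f'.
CC_PATTERN = re.compile(r"^(\d+)\.(\d+)([a-z]?)$")


def cc_sort_key(cc):
    """Hypothetical version-aware sort key: (major, minor, suffix)."""
    match = CC_PATTERN.match(cc)
    if match is None:
        raise ValueError(f"Unrecognized compute capability: {cc!r}")
    major, minor, suffix = match.groups()
    return int(major), int(minor), suffix


ccs = ["9.0a", "10.0f", "8.0", "12.0f"]

# Plain string sort: compares character by character, so "1..." < "8..." < "9...".
naive = sorted(ccs)
# Version-aware sort: compares the numeric parts as integers.
version_aware = sorted(ccs, key=cc_sort_key)

print(naive)
print(version_aware)
```

With this key the plain `sorted` call yields the order quoted above, while the version-aware call places 8.0 first and 12.0f last.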

It is much nicer if the framework takes care of this. If this is documented properly (which is done in this PR), the EasyBlock knows what to expect, and doesn't need to do any sorting itself (reducing the potential for error on this).

That being said, @boegel is also right that this is, at least potentially, a breaking change. After merging this PR and easybuilders/easybuild-easyblocks#4092, LAMMPS would be built with 12.0f instead of 9.0a in the above case. Then again, using 9.0a in the above could simply be considered a bug - in which case a change in behavior is NOT a bad thing.
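
A sketch of that before/after difference, under the assumption that an easyblock picks the "highest" CC as the last element of a sorted list (the actual LAMMPS easyblock code is not reproduced here, and `cc_key` is a hypothetical helper):

```python
import string


def cc_key(cc):
    """Hypothetical key: numeric (major, minor) plus any trailing letter suffix."""
    numeric = cc.rstrip(string.ascii_lowercase)
    major, minor = numeric.split(".")
    return int(major), int(minor), cc[len(numeric):]


requested = ["9.0a", "10.0f", "8.0", "12.0f"]

# Before: last element of a plain string sort.
before = sorted(requested)[-1]
# After: highest value under a version-aware comparison.
after = max(requested, key=cc_key)

print(before, after)
```

The two selections differ only when the list mixes single-digit and two-digit major versions, which is why the change goes unnoticed for CC lists below 10.x.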

I haven't made the analysis for all the other easyblocks, i.e. what would be the before vs after situation.

We should also realize that this likely only affects software that can only be built for a single CUDA CC, since in that case some choice has to be made from that list. Sites that use a list of CUDA CCs probably build in some common prefix, and use that on node types with diverse GPU archs - but that usage is already broken today for this subset of software, since it simply can't be built for multiple targets. So it's very well possible these sites have already made exceptions for this, and build these particular packages in an architecture-specific prefix. So yes: in theory this could lead to a breaking change. But it would surprise me if anyone is really "hit" by this in practice.

In summary: I would consider this acceptable, provided we document this very clearly in the release notes.

Contributor

@casparvl casparvl left a comment


We should probably also change the help description for --cuda-compute-capabilities and state that the order in which CCs are specified does not matter - just to make this behavior explicit.

@boegel
Member

boegel commented Apr 9, 2026

Just to be clear: it's game over for EasyBuild v5.3.0, this will need to wait until the release after that...

@Flamefire
Contributor Author

@casparvl Done. Also replaced "List" by "Set" to emphasize this.

@Flamefire Flamefire requested a review from casparvl April 27, 2026 09:07
@casparvl
Contributor

casparvl commented Apr 28, 2026

Don't have time for a full review now, but one additional remark:

> provided we document this very clearly in the release notes.

By this I meant that (ideally) we should also try to find out which software this could potentially affect. I.e. search for places in EasyBlocks & EasyConfigs in which the CUDA-related templates are used, check if they rely on a certain order, and if so, report those in the release notes. This way it should be very clear to people: "oh, I'm building software XYZ, the behavior of how that is built may have changed in this release".

I'm not sure how many blocks/configs use these templates, and whether that is feasible to list completely. If it's O(10s) I think we should do it, if it's O(100s) it's probably intractable.

@Flamefire
Contributor Author

Flamefire commented Apr 28, 2026

For easyconfigs using templates this shouldn't change anything, as the order shouldn't matter there: the values are usually translated directly to nvcc flags through e.g. CMAKE_CUDA_ARCHITECTURES.
One issue with the order I've seen was caused by a bug in DeepSpeed: it had undefined behavior that wasn't visible with sorted CUDA CCs, while with unsorted ones it broke the software in mysterious ways.
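
To illustrate the CMAKE_CUDA_ARCHITECTURES point, here is a rough sketch of such a translation. The helper `to_cmake_cuda_architectures` is hypothetical and only handles plain `major.minor` values; how the framework templates actually format these values is not shown in this thread.

```python
def to_cmake_cuda_architectures(ccs):
    """Hypothetical helper: turn ['7.5', '8.0'] into the CMake list '75;80'."""
    return ";".join(cc.replace(".", "") for cc in ccs)


sorted_value = to_cmake_cuda_architectures(["7.5", "8.0"])
unsorted_value = to_cmake_cuda_architectures(["8.0", "7.5"])

print(f"-DCMAKE_CUDA_ARCHITECTURES={sorted_value}")
print(f"-DCMAKE_CUDA_ARCHITECTURES={unsorted_value}")

# Both values name the same set of target architectures; only the order differs.
print(set(sorted_value.split(";")) == set(unsorted_value.split(";")))
```

Since both strings request the same set of architectures, sorting the input should only change the order of the generated flags, which is the argument made above.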

Our easyblocks are fixed in easybuilders/easybuild-easyblocks#4092: they sorted the values themselves where it mattered, usually taking the highest or lowest value. So this PR fixes the 12.x issue for those. I doubt many will notice, as most will not have 12.x enabled, or will be using it as the only CC.

Or did you mean to report all software (easyblocks & easyconfigs) that use those templates at all, to report something like: "Although unlikely, the behavior of this software may change if you didn't pass a sorted list of CCCs already: ...."?


4 participants