{ai}[foss/2024a] PyTorch v2.7.1 w/ CUDA 12.6.0 #23923
boegel merged 15 commits into easybuilders:develop from
Conversation
|
Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @boegel |
|
|
Test report by @boegel |
|
Seemingly changed by mistake. Fixed |
|
Test report by @boegel |
|
Test report by @boegel |
|
The H100 failures are mostly from
With the (now) default of 10 allowed failures, that should be enough to pass.
As for the V100: I already had more failures on A100, suggesting they don't test on "older" GPUs anymore...
If you can attach the log of the test step, I'll take a look at the failures. |
|
Test report by @Flamefire SUCCESS on rerun but upload failed due to expired token: |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @boegel |
|
4 (of 8) failures are in test_cpu_select_algorithm and test_select_algorithm, which I assume have the same cause. However, the errors are not in the gist, so I can't tell. Is it possibly this one?
Then I have a patch for that. In any case: I removed the allowed failures = 6 setting, so the default of 10 now applies, which would make your run pass. |
|
Test report by @boegel |
Mostly |
|
@Flamefire Found this in the log:
I'll share the whole log with you (via Slack). |
|
Weird, it looks like that failure was written to the XML file: But still:
So it seems the parser didn't pick up the failure. Almost all the rest seem to be specific to V100 and its lack of BF16 support (I reported it at pytorch/pytorch#172085):
If we can fix the missed
It would still be good to know why the parser didn't find it, to avoid similar issues. It could be related though: the faulty close might cause writing the XML file to fail. Patching the failing tests is also possible if this does indeed return false for V100s (could use a pip-installed pytorch to test): |
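To cross-check what a report file actually contains, independent of the easyblock's parser, something like the following could be used. This is only a minimal sketch that assumes the usual JUnit-style layout (`testcase` elements with `failure` or `error` children). The function name and the sample XML are made up for illustration and are not taken from the easyblock or from the real reports.

```python
import xml.etree.ElementTree as ET


def failed_testcases(xml_text):
    """Return (classname, name, kind) for every testcase with a failure or error child.

    Hypothetical helper for manual cross-checking, not the easyblock's parser.
    """
    root = ET.fromstring(xml_text)
    failed = []
    # iter() walks the whole tree, so both <testsuites> and <testsuite> roots work
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            if case.find(kind) is not None:
                failed.append((case.get("classname"), case.get("name"), kind))
                break
    return failed


# Made-up sample report, only to show the expected input shape
SAMPLE = """<testsuites>
  <testsuite name="example" tests="3" failures="1" errors="0">
    <testcase classname="ExampleSuite" name="test_ok"/>
    <testcase classname="ExampleSuite" name="test_bad"><failure message="boom"/></testcase>
    <testcase classname="ExampleSuite" name="test_skipped"><skipped/></testcase>
  </testsuite>
</testsuites>"""

print(failed_testcases(SAMPLE))
```

If the failure shows up here but not in the summary, the problem is on the parsing side. If it does not show up here either, the XML file itself is incomplete.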
|
Test report by @Flamefire |
|
Test report by @boegelbot |
|
Test report by @Flamefire |
Maybe we should make this a warning only |
Warning by default, but with a way to make it a hard error perhaps? @Flamefire In any case, I don't think we need to block this PR any further, what do you think? |
An EC option
In my report, the cause is a timeout, after which the test process gets killed without writing an XML entry. However, the test has "rerun" entries, so we could use that: if a test only shows up as "rerun" but not as "success", it is an error.
Do we want to increase the allowed failures to allow your previous build to pass? Or let people see those errors for old-ish GPUs? |
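The "rerun but never success" idea could look roughly like this. It is a sketch under assumptions: the `(test_id, status)` input format and the status strings are hypothetical and do not reflect how the easyblock stores parsed results.

```python
def rerun_only_tests(results):
    """Return the sorted IDs of tests whose only recorded status is "rerun".

    `results` is an iterable of (test_id, status) pairs; both the format and
    the status names are assumptions made for this illustration.
    """
    statuses = {}
    for test_id, status in results:
        statuses.setdefault(test_id, set()).add(status)
    # A test that was rerun and later succeeded, or that has an explicit
    # failure entry, is already accounted for; only "rerun"-only tests are missed.
    return sorted(test_id for test_id, seen in statuses.items() if seen == {"rerun"})


# Made-up example data
results = [
    ("test_a", "success"),
    ("test_b", "rerun"),
    ("test_b", "rerun"),
    ("test_c", "rerun"),
    ("test_c", "success"),
    ("test_d", "rerun"),
    ("test_d", "failure"),
]
print(rerun_only_tests(results))
```

Tests returned by such a check could then be counted as errors (or reported as a warning, depending on how strict the check should be).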
@Flamefire I'm in favor of allowing some more failures by default, maybe even up to
The issues about not finding the result of a test should be less fatal too, but that's work for the easyblock, so it doesn't need to block this PR. |
|
Test report by @Flamefire |
|
My builds are currently on day 7+ of running PyTorch tests. Do you have any suggestions to make them run faster? Should I just always build PyTorch on a full node? |
|
7 days is certainly too much. With 2.9.1 I identified an issue that caused an infinite hang. But that exact issue is not present in 2.7. Maybe check if any sub-process has been hanging for days or if the tests are just very slow on your machine. I do indeed use a full node. |
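One way to look for sub-processes that have been running for days is to filter the `ps` output by elapsed time. The sketch below assumes a Linux system with procps-ng (for the `etimes` column). The function names, the one-day threshold and the `python` keyword are arbitrary choices for illustration.

```python
import subprocess


def parse_long_running(ps_output, min_seconds, keyword="python"):
    """Parse lines of "PID ELAPSED_SECONDS COMMAND" and keep the long-running ones."""
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip empty or malformed lines
        pid, elapsed, cmd = int(parts[0]), int(parts[1]), parts[2]
        if elapsed >= min_seconds and keyword in cmd:
            hits.append((pid, elapsed, cmd))
    return hits


def long_running_processes(min_seconds=24 * 3600, keyword="python"):
    """Query ps (procps-ng assumed) and return matching long-running processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_long_running(out, min_seconds, keyword)


# Made-up ps output, only to show the parsing step
SAMPLE = (
    "  101  604800 python run_test.py --verbose\n"
    "  202      30 python run_test.py\n"
    "  303  700000 bash\n"
)
print(parse_long_running(SAMPLE, min_seconds=24 * 3600))
```

Calling `long_running_processes()` on the build node lists the candidates. Whether a given process is actually hung or just slow still has to be checked by hand, e.g. by looking at its CPU usage.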
|
Oh, now the first one has finished. So this is an issue for the EasyBlock, I guess. The slow test is probably which has been running since Jan 13 in another build. Anyway, I'm happy with the state of this PR, so go ahead and merge when you are happy with it. |
|
This here is an issue worth checking:
Can you attach the full log and ideally the |
boegel
left a comment
lgtm
It's high time that we get this merged.
There will probably be follow-up PRs (especially for the PyTorch easyblock), but this has been proven to be mature across a variety of systems.
@Flamefire Thanks a lot for all the effort on this!
|
Going in, thanks @Flamefire! |
|
@VRehnberg If the excessively long run time that you observed needs further attention, please open an issue for that, we can't keep track of things in merged PRs... |
|
From the log:
So for some reason the process gets killed and doesn't have a chance to write the XML result, which is why we miss it. |
(created using eb --new-pr)
Requires:
Bundle generic easyblock to support use of post-install patches easybuild-easyblocks#3887
I included the easyconfigs here for convenience