@Flamefire
Contributor

@Flamefire Flamefire commented Jan 22, 2026

(created using eb --new-pr)

This relaxes the PyTorch test evaluation a bit. The easyblock parses the XML report files and compares them against the summary that the test command prints to stdout. There are 2 cases in which the two can disagree:

1: There are more failures in the XML files than in the summary -> PyTorch didn't consider something as failed that we do. Very weird and might be an issue with the XML parser.
However, this is only a minor issue, as we counted more failures (from the XML files) than might actually be present. So if the allowed-test-failure-count check still succeeds, we can ignore this, at least for users.

2: The summary shows a failure we have not found in the XML files -> The XML report might be missing because the test crashed or otherwise didn't write its results.
This is an issue because one test ("suite") might contain 100s of test cases, many of which could have failed without us counting any of those failures.
Of course there might be only a single failure, but we cannot know for sure, hence we fail.

I added 2 options for those 2 cases: allow_extra_failures (case 1) and allow_missing_failures (case 2).

They can be set to True/False but also to a maximum number.
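
To make the two cases concrete, here is a minimal sketch of how such a check could work. It is not the easyblock's actual code: the function names, the use of sets of failed suite names, and the reading of a numeric value as "tolerate at most this many mismatches" are assumptions made for illustration. Only the option names allow_extra_failures and allow_missing_failures come from this PR.

```python
from typing import List, Set, Union

# An option value is either a bool (allow all / allow none) or a maximum count.
Limit = Union[bool, int]


def within_limit(count: int, limit: Limit) -> bool:
    """Return True if `count` mismatches are acceptable under `limit`."""
    # Check bool first: in Python, bool is a subclass of int.
    if isinstance(limit, bool):
        return limit or count == 0
    return count <= limit


def evaluate_mismatch(xml_failed: Set[str], summary_failed: Set[str],
                      allow_extra_failures: Limit = False,
                      allow_missing_failures: Limit = False) -> List[str]:
    """Compare failures found in XML reports with those in the stdout summary.

    Hypothetical helper: returns a list of error messages, empty if acceptable.
    """
    # Case 1: failed according to the XML files, but not listed in the summary.
    extra = xml_failed - summary_failed
    # Case 2: failed according to the summary, but no failure found in the XML files.
    missing = summary_failed - xml_failed

    errors = []
    if not within_limit(len(extra), allow_extra_failures):
        errors.append("Failures only in XML reports: %s" % ", ".join(sorted(extra)))
    if not within_limit(len(missing), allow_missing_failures):
        errors.append("Failures only in summary: %s" % ", ".join(sorted(missing)))
    return errors


if __name__ == "__main__":
    # One suite failed per the summary, but its XML report is missing (case 2).
    print(evaluate_mismatch({"test_nn"}, {"test_nn", "test_cuda"},
                            allow_missing_failures=1))
```

Under these assumptions, a numeric value of 0 behaves like False, and True accepts any number of mismatches. In an easyconfig the options would presumably be set like any other easyconfig parameter, for example `allow_missing_failures = 1`.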

@boegel boegel added this to the next release (5.2.1?) milestone Jan 28, 2026
@boegel
Member

boegel commented Jan 28, 2026

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16
EB_ARGS="--installpath /tmp/$USER/pr4052-PyTorch-2.7.1-CUDA PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb"

@boegelbot

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=4052 EB_ARGS="--installpath /tmp/$USER/pr4052-PyTorch-2.7.1-CUDA PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb" EB_CONTAINER= EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_4052 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9510

Test results coming soon (I hope)...

Details

- notification for comment with ID 3809792319 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Jan 28, 2026

@boegelbot please test @ jsc-zen3
CORE_CNT=16
EB_ARGS="--installpath /tmp/$USER/pr4052-PyTorch-2.6.0 PyTorch-2.6.0-foss-2024a.eb"

@boegelbot

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=4052 EB_ARGS="--installpath /tmp/$USER/pr4052-PyTorch-2.6.0 PyTorch-2.6.0-foss-2024a.eb" EB_CONTAINER= EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_4052 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9520

Test results coming soon (I hope)...

Details

- notification for comment with ID 3811375867 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-2.6.0-foss-2024a.eb

Build succeeded for 1 out of 1 (total: 47 hours 36 mins 39 secs) (1 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.23
See https://gist.github.com/boegelbot/890f3c9807ae1e02967f8c74b4c8d5a8 for a full test report.

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb

Build succeeded for 1 out of 1 (total: 52 hours 4 mins 5 secs) (1 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 590.44.01, Python 3.9.23
See https://gist.github.com/boegelbot/3833453c591f0f8716185274bcee7d7f for a full test report.

@boegel
Member

boegel commented Feb 10, 2026

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16
EB_ARGS="PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb"

@boegelbot

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=4052 EB_ARGS="PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb" EB_CONTAINER= EB_REPO=easybuild-easyblocks EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_4052 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9623

Test results coming soon (I hope)...

Details

- notification for comment with ID 3880190685 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Feb 13, 2026

PyTorch/2.7.1-foss-2024a-CUDA-12.6.0 should almost be back @ jsc-zen3...

[boegelbot@jsczen3l1 ~]$ q
             JOBID PARTITION                                               NAME     USER    STATE       TIME TIME_LIMI  NODES   CPUS  NODELIST(REASON) MIN_MEMORY
...
              9623  jsczen3g                                       test_PR_4052 boegelbo  RUNNING 1-12:24:19 4-04:00:00      1     16         jsczen3g1 3800M
