consistently use patch to correctly detect Slurm job environment for jax v0.6.2 and v0.7.0 #24447

Merged
boegel merged 5 commits into easybuilders:develop from pavelToman:patch-25
Dec 29, 2025

@pavelToman
Collaborator

The patch from: jax-ml/jax#32799
Requires:

github-actions bot added the change label on Nov 4, 2025
Thyre added the 2024a and 2025a labels on Nov 4, 2025
@Thyre
Collaborator

Thyre commented Dec 15, 2025

@pavelToman, can you sync the PR with develop? Now that the PR with the patch is merged, the CI should hopefully work.

@boegel
Member

boegel commented Dec 15, 2025

> @pavelToman, can you sync the PR with develop? Now that the PR with the patch is merged, the CI should hopefully work.

done

boegel added this to the next release (5.2.0) milestone on Dec 15, 2025
boegel added the bug fix label and removed the change label on Dec 15, 2025

@boegel
Member

boegel left a comment

lgtm

@boegel
Member

boegel commented Dec 15, 2025

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24447 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24447 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9158

Test results coming soon (I hope)...

Details

- notification for comment with ID 3654706479 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Dec 15, 2025

I've manually cancelled job 9158 at jsc-zen3...

@boegel
Member

boegel commented Dec 15, 2025

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24447 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24447 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9169

Test results coming soon (I hope)...

Details

- notification for comment with ID 3655767109 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 2 (total: 1 secs) (2 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21
See https://gist.github.com/boegelbot/d29d48ca6556d240f80b39470a7e6b3f for a full test report.

@boegel
Member

boegel commented Dec 15, 2025

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (total: 1 hour 38 mins 56 secs) (2 easyconfigs in total)
node4246.shinx.os - Linux RHEL 9.6, x86_64, AMD EPYC 9654 96-Core Processor (zen4), Python 3.9.21
See https://gist.github.com/boegel/f2509aa65a8ddf757380a94c8dd5cc34 for a full test report.

@Thyre
Collaborator

Thyre commented Dec 15, 2025

> Test report by @boegelbot
> FAILED
> Build succeeded for 0 out of 2 (total: 1 secs) (2 easyconfigs in total)
> jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21
> See https://gist.github.com/boegelbot/d29d48ca6556d240f80b39470a7e6b3f for a full test report.

Locks were still present.

@boegel
Member

boegel commented Dec 15, 2025

> > Test report by @boegelbot
> > FAILED
> > Build succeeded for 0 out of 2 (total: 1 secs) (2 easyconfigs in total)
> > jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21
> > See https://gist.github.com/boegelbot/d29d48ca6556d240f80b39470a7e6b3f for a full test report.
>
> Locks were still present.

Locks removed...

@boegel
Member

boegel commented Dec 15, 2025

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24447 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24447 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9180

Test results coming soon (I hope)...

Details

- notification for comment with ID 3656878771 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

boegel changed the title from "Add patch for SLURM envs to jax v0.6.2 and v0.7.0" to "consistently use patch to correctly detect Slurm job environment for jax v0.6.2 and v0.7.0" on Dec 15, 2025

@boegel
Member

boegel left a comment

lgtm

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 2 (total: 0 secs) (2 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21
See https://gist.github.com/boegelbot/bfcdbe7216f52fd462155244470c82a0 for a full test report.

@boegel
Member

boegel commented Dec 15, 2025

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24447 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24447 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9189

Test results coming soon (I hope)...

Details

- notification for comment with ID 3657838860 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 2 (total: 0 secs) (2 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21
See https://gist.github.com/boegelbot/662e2fd57f00b7e3ad7baa0bde73d106 for a full test report.

@pavelToman
Collaborator Author

@boegelbot please test @ jsc-zen3
EB_ARGS="--ignore-locks"
CORE_CNT=16

@boegelbot
Collaborator

@pavelToman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24447 EB_ARGS="--ignore-locks" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24447 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9194

Test results coming soon (I hope)...

Details

- notification for comment with ID 3659315197 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Dec 16, 2025

I don't think the problem is the locks...

I'm testing an interactive installation in the bot account to figure out what's going on.

@boegel
Member

boegel commented Dec 16, 2025

> I don't think the problem is the locks...
>
> I'm testing an interactive installation in the bot account to figure out what's going on.

Looks like running the jax tests crashes the node... :-/


==      testing...
  >> running shell command:
        export PYTHONPATH=/tmp/eb-fe5khz9a/tmpwfyjo4eh/lib/python3.12/site-packages:$PYTHONPATH && export PATH=/tmp/eb-fe5khz9a/tmpwfyjo4eh/bin:$PATH &&   /project/def-maintainers/boegelbot/rocky9/zen3/software/Python/3.12.3-GCCcore-13.3.0/bin/python -m pip install --prefix=/tmp/eb-fe5khz9a/tmpwfyjo4eh  --no-deps --ignore-installed --no-build-isolation .
        [started at: 2025-12-16 08:57:57]
        [working dir: /dev/shm/boegelbot/jax/0.6.2/gfbf-2024a/jax/jax-jax-v0.6.2]
        [output and state saved to /tmp/eb-fe5khz9a/run-shell-cmd-output/export-hmvc0wc3]
  >> command completed: exit 0, ran in 00h00m04s
  >> running shell command:
        export PYTHONPATH=/tmp/eb-fe5khz9a/tmpwfyjo4eh/lib/python3.12/site-packages:$PYTHONPATH && export PATH=/tmp/eb-fe5khz9a/tmpwfyjo4eh/bin:$PATH &&   pytest -n 16 tests --deselect=tests/api_test.py::BackendsTest::test_no_backend_warning_on_cpu_if_platform_specified --deselect=tests/version_test.py || pytest -n 16 tests --deselect=tests/api_test.py::BackendsTest::test_no_backend_warning_on_cpu_if_platform_specified --deselect=tests/version_test.py --last-failed
        [started at: 2025-12-16 08:58:03]
        [working dir: /dev/shm/boegelbot/jax/0.6.2/gfbf-2024a/jax/jax-jax-v0.6.2]
        [output and state saved to /tmp/eb-fe5khz9a/run-shell-cmd-output/export-muxezfm9]
srun: error: Node failure on jsczen3c2
srun: error: Node failure on jsczen3c2
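
The test command in the log above chains two `pytest` runs with `||`, so the second run (with `--last-failed`) only starts if the first one exits with a non-zero status. A minimal sketch of that pattern follows; `run_then_retry` and `fake_tests` are hypothetical names used for illustration only and are not part of EasyBuild, pytest, or the jax easyconfigs. A retry of this kind cannot help when the node itself goes down, as happened here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command; if it fails, run it once more
# with one extra argument appended (for example --last-failed).
run_then_retry() {
    local retry_arg=$1
    shift
    "$@" || "$@" "$retry_arg"
}

# Stand-in for a test runner (an assumption for this sketch):
# it fails unless --last-failed is among its arguments.
fake_tests() {
    local arg
    for arg in "$@"; do
        if [[ $arg == "--last-failed" ]]; then
            echo "rerun of failed tests passed"
            return 0
        fi
    done
    echo "first run failed"
    return 1
}

run_then_retry --last-failed fake_tests -n 4 tests
echo "exit code: $?"
```

With the stand-in runner, the first call fails, the retry succeeds, and the overall exit code is 0.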

@Thyre
Collaborator

Thyre commented Dec 19, 2025

Test report by @Thyre
FAILED
Build succeeded for 0 out of 1 (total: 54 secs) (1 easyconfigs in total)
jrc0900.jureca - Linux Rocky Linux 9.6, AArch64, ARM UNKNOWN (neoverse_v2), 1 x NVIDIA NVIDIA GH200 480GB, 580.95.05, Python 3.9.21
See https://gist.github.com/Thyre/f087dfb7063cc1158063befcdfd209dd for a full test report.

@SebastianAchilles
Member

I canceled job 9194, since the jobs crashed nodes.

It looks like 4 GB of RAM per core is insufficient for the 'jax' test suite. Using fewer cores but more RAM per core seems to work in my tests.
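
To make the trade-off concrete, the arithmetic can be sketched as below. This assumes one pytest worker per allocated core (the log earlier in this thread shows `pytest -n 16` for a 16-task job), takes "4 GB" to mean 4096 MB, and uses 14000 MB per core as in the `SLURM_ARGS` of the next test request; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Total memory (in MB) of an allocation with a fixed amount of memory per core.
alloc_total_mb() {
    local cores=$1 mem_per_core_mb=$2
    echo $(( cores * mem_per_core_mb ))
}

# Memory (in MB) per test worker if the allocation is shared evenly.
mem_per_worker_mb() {
    local cores=$1 mem_per_core_mb=$2 workers=$3
    echo $(( $(alloc_total_mb "$cores" "$mem_per_core_mb") / workers ))
}

# 16 cores at an assumed 4096 MB each, 16 workers
echo "16 cores: $(mem_per_worker_mb 16 4096 16) MB per worker, $(alloc_total_mb 16 4096) MB total"
# 4 cores at 14000 MB each, 4 workers
echo "4 cores: $(mem_per_worker_mb 4 14000 4) MB per worker, $(alloc_total_mb 4 14000) MB total"
```

Under these assumptions the 4-core job has less memory in total (56000 MB versus 65536 MB), but each worker gets roughly 3.4 times as much, which matches the observation that fewer cores with more RAM per core works.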

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3
CORE_CNT=4
EB_ARGS="--parallel 4"
SLURM_ARGS="--mem-per-cpu=14000M"
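
Judging by the test commands the bot echoes back in this thread, `CORE_CNT` ends up as `--ntasks`, `SLURM_ARGS` is passed on to `sbatch`, and `EB_ARGS` is set in the environment of the job script. A rough sketch of that mapping is shown below. It is not the bot's actual implementation: `build_sbatch_cmd` and `job_script.sh` are hypothetical names, and the default of 8 tasks is only inferred from the first test command in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) an sbatch command line for a
# bot-style test request, based on CORE_CNT, EB_ARGS and SLURM_ARGS.
build_sbatch_cmd() {
    local pr=$1
    local core_cnt=${CORE_CNT:-8}   # assumed default, see lead-in
    local cmd="EB_PR=$pr EB_ARGS=\"${EB_ARGS:-}\" sbatch --job-name test_PR_$pr --ntasks=$core_cnt"
    if [[ -n "${SLURM_ARGS:-}" ]]; then
        cmd+=" $SLURM_ARGS"          # extra scheduler options, passed through as-is
    fi
    echo "$cmd job_script.sh"
}

CORE_CNT=4 EB_ARGS="--parallel 4" SLURM_ARGS="--mem-per-cpu=14000M" build_sbatch_cmd 24447
```

The function only prints the command so that the mapping can be inspected without submitting a job.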

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=24447 EB_ARGS="--parallel 4" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_24447 --ntasks="4" "--mem-per-cpu=14000M" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9265

Test results coming soon (I hope)...

Details

- notification for comment with ID 3677752683 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (total: 5 hours 35 mins 26 secs) (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.23
See https://gist.github.com/boegelbot/e99af7bf7cd0c52c5b4166021c46a04b for a full test report.

@boegel
Member

boegel commented Dec 29, 2025

Going in, thanks @pavelToman!

@boegel boegel merged commit 10fca68 into easybuilders:develop Dec 29, 2025
5 checks passed