{lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + add missing patches for TensorFlow v2.15.1 + NCCL v2.18.3 #20358

Merged
boegel merged 13 commits into easybuilders:develop from yqshao:20240413152217_new_pr_TensorFlow2151
Aug 20, 2024

@yqshao
Contributor

@yqshao yqshao commented Apr 13, 2024

@yqshao

This comment was marked as resolved.

@yqshao
Contributor Author

yqshao commented Apr 15, 2024

Test report by @yqshao
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-10 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 2 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/yqshao/fad950f321c4e87bccd8f3a4369e8bf9 for a full test report.

@migueldiascosta migueldiascosta added this to the 4.x milestone Apr 16, 2024
@tiwoe

tiwoe commented Apr 16, 2024

Thank you for the effort. Can you add the patch files to your PR: TensorFlow-2.15.1_remove-duplicate-gpu-tests.patch and TensorFlow-2.15.1_fix-cuda_build_defs.patch?

@yqshao
Contributor Author

yqshao commented Apr 16, 2024

Sorry, missed that. There's also a rebased-on-dependencies version at yqshao/easybuild-easyconfigs@tf-2.15.1-cuda, but I'll wait a bit (until the deps are merged) before force-pushing here...

@casparvl
Contributor

casparvl commented Jun 5, 2024

#20191 is now merged. From your previous comment here, I think you wanted to make some more changes in this PR? Let me know once those are done, then we can also start reviewing/testing this one again :)

@yqshao
Contributor Author

yqshao commented Jun 7, 2024

Test report by @yqshao
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
alvis1-04 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 2 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/yqshao/2a41b12be92022e0171b420282dd028b for a full test report.

@yqshao
Contributor Author

yqshao commented Jun 7, 2024

Hi, I checked the PR again and there is not much added compared to the CPU version. However, I have to admit that I let some patches slip through which should still be relevant (sorry for the hindsight). I added the following back to both the CPU and CUDA configs, but those are not tested on our system, so I would appreciate cross-checks. @casparvl @Flamefire

@Flamefire
Contributor

Makes sense

disable-avx512-extensions: though I did not seem to reproduce the issue with our build without the patch on Skylake CPUs;

I can check if this is still required on Skylake and Cascade Lake, but I'm pretty sure it is.

@casparvl
Contributor

Hm, good point, I should probably also have tested that PR for the CPU version on our GPU nodes, since they have AVX512 capabilities... For now at least, I'll upload test reports for this full PR from our GPU nodes and another one for the CPU version from our CPU nodes. The build is running right now, so test reports should appear later this afternoon...
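
For anyone cross-checking the disable-avx512-extensions patch on their own hardware, here is a minimal sketch (not part of this PR) of how one could check whether a node advertises AVX512 at all. It assumes an x86 Linux system where `/proc/cpuinfo` contains a `flags` line; the helper name `avx512_flags` is made up for illustration.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX512-related CPU flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 Linux, CPU features are listed on lines whose key is "flags".
        if sep and key.strip() == "flags":
            found.update(flag for flag in value.split() if flag.startswith("avx512"))
    return sorted(found)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            flags = avx512_flags(handle.read())
    except OSError:
        flags = []  # not Linux, or /proc is unavailable
    print("AVX512 flags:", " ".join(flags) if flags else "none found")
```

An empty result only says the kernel does not report AVX512 on that node; it says nothing about whether the patch is still needed on nodes that do report it.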

@casparvl
Contributor

@boegelbot please test @ generoso

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13690

Test results coming soon (I hope)...

Details

- notification for comment with ID 2157944314 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/5e92fce2e6c07a0eacfafce52d978280 for a full test report.

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/52894745607e7457540cdc9bd83633a7 for a full test report.

@casparvl
Contributor

Oh, silly, I forgot to instruct boegelbot to use the new easyblock. So... we can ignore that failure.

Regarding my own tests: the GPU build on my GPU node failed. It has failing tests:

[  FAILED  ] 4 tests, listed below:
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul256x128x64, where TypeParam = float
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul1x256x256, where TypeParam = float
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul256x256x1, where TypeParam = float
[  FAILED  ] Test/FusedMatMulWithBiasOpTest/0.MatMul256x128x64WithActivation, where TypeParam = float

 4 FAILED TESTS

The output is all looking similar to this:

tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (2.6044831275939941 not close to 2.6047005653381348)
Expected: true
i = 0 Tx[i] = 2.6044831275939941 Ty[i] = 2.6047005653381348
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (2.9819025993347168 not close to 2.9816701412200928)
Expected: true
i = 1 Tx[i] = 2.9819025993347168 Ty[i] = 2.9816701412200928
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (2.4911799430847168 not close to 2.491544246673584)
Expected: true
i = 2 Tx[i] = 2.4911799430847168 Ty[i] = 2.491544246673584
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (1.9187320470809937 not close to 1.9185069799423218)
Expected: true
i = 4 Tx[i] = 1.9187320470809937 Ty[i] = 1.9185069799423218
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (1.6246750354766846 not close to 1.6246215105056763)
Expected: true
i = 7 Tx[i] = 1.6246750354766846 Ty[i] = 1.6246215105056763
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (0.22215569019317627 not close to 0.22175788879394531)
Expected: true
i = 8 Tx[i] = 0.22215569019317627 Ty[i] = 0.22175788879394531
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (1.4460333585739136 not close to 1.4467771053314209)
Expected: true
i = 11 Tx[i] = 1.4460333585739136 Ty[i] = 1.4467771053314209
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (0.91000175476074219 not close to 0.90951979160308838)
Expected: true
i = 13 Tx[i] = 0.91000175476074219 Ty[i] = 0.90951979160308838
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (3.9072332382202148 not close to 3.907407283782959)
Expected: true
i = 14 Tx[i] = 3.9072332382202148 Ty[i] = 3.907407283782959
tensorflow/core/framework/tensor_testutil.cc:184: Failure
Value of: IsClose(Tx[i], Ty[i], typed_atol, typed_rtol)
  Actual: false (0.46104967594146729 not close to 0.46095246076583862)
Expected: true
i = 15 Tx[i] = 0.46104967594146729 Ty[i] = 0.46095246076583862
tensorflow/core/framework/tensor_testutil.cc:187: Failure
Expected: (num_failures) < (max_failures), actual: 10 vs 10
Too many mismatches (atol = 1.0000000000000001e-05 rtol = -1), giving up.

In other words: the numbers are close, but not close enough to meet the tolerance. My bet is this is another example of tolerances that are exceeded as a result of the TF32 datatype. Do these ring a bell @Flamefire ? Didn't you at some point have a patch to increase those tolerances (or was that for PyTorch...)?
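
To put numbers on that, here is a small sketch in plain Python (no TensorFlow needed) that compares the mismatches from the log above against the reported atol of 1e-05. The value pairs are copied from the output; only the absolute tolerance is modelled, and how the test itself treats `rtol = -1` is not. The `2**-10` reference assumes TF32's 10 explicit mantissa bits and is only meant as a rough scale, since rounding errors can accumulate over the inner dimension of the matmul.

```python
# (Tx[i], Ty[i]) pairs copied from the failing test output above
PAIRS = [
    (2.6044831275939941, 2.6047005653381348),
    (2.9819025993347168, 2.9816701412200928),
    (2.4911799430847168, 2.491544246673584),
    (1.9187320470809937, 1.9185069799423218),
    (1.6246750354766846, 1.6246215105056763),
    (0.22215569019317627, 0.22175788879394531),
    (1.4460333585739136, 1.4467771053314209),
    (0.91000175476074219, 0.90951979160308838),
    (3.9072332382202148, 3.907407283782959),
    (0.46104967594146729, 0.46095246076583862),
]

ATOL = 1e-05                # absolute tolerance reported in the log
TF32_SPACING = 2.0 ** -10   # assumption: rough relative spacing of a 10-bit mantissa


def summarize(pairs, atol=ATOL):
    """Return (abs_diff, rel_diff, within_atol) for each value pair."""
    rows = []
    for x, y in pairs:
        abs_diff = abs(x - y)
        rel_diff = abs_diff / max(abs(x), abs(y))
        rows.append((abs_diff, rel_diff, abs_diff <= atol))
    return rows


if __name__ == "__main__":
    for abs_diff, rel_diff, ok in summarize(PAIRS):
        print(f"abs={abs_diff:.2e}  rel={rel_diff:.2e}  within_atol={ok}")
    print(f"TF32 relative spacing, for scale: {TF32_SPACING:.2e}")
```

Every pair misses the 1e-05 absolute tolerance, while all absolute differences stay below 1e-03, which is at least consistent with reduced-precision accumulation rather than a logic error.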

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"
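
For reproducing a comparable run by hand, here is a hedged sketch of how the corresponding `eb` command line could be assembled. `--from-pr` and `--include-easyblocks-from-pr` are existing EasyBuild options; what the boegelbot script actually runs is not shown in this thread, so this is an approximation, and the helper name `eb_command` is made up for illustration.

```python
import shlex


def eb_command(easyconfigs_pr, easyblocks_pr=None, easyconfigs=()):
    """Build an eb command that tests an easyconfigs PR, optionally with easyblocks from another PR."""
    cmd = ["eb", "--from-pr", str(easyconfigs_pr)]
    if easyblocks_pr is not None:
        # Pull in the easyblocks from the given easybuild-easyblocks PR.
        cmd += ["--include-easyblocks-from-pr", str(easyblocks_pr)]
    # Optionally restrict the build to specific easyconfig files from the PR.
    cmd += list(easyconfigs)
    return cmd


if __name__ == "__main__":
    cmd = eb_command(20358, easyblocks_pr=3303)
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing an easyconfig file name as the last argument limits the build to that file.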

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13705

Test results coming soon (I hope)...

Details

- notification for comment with ID 2158634583 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl
Contributor

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/b9f276db34528d9451a2f190d11309ba for a full test report.

@casparvl
Contributor

@boegelbot please test @ jsc-zen3
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4356

Test results coming soon (I hope)...

Details

- notification for comment with ID 2159025567 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/c0e6165b8a4c45d2b58f1c9ff7fb971f for a full test report.

@boegel
Member

boegel commented Aug 2, 2024

@casparvl Do you think this is ready to go now?

@casparvl
Contributor

Yeah, I've been hesitant to pull the trigger on this one, but @Flamefire's failing build was in one of the dependencies. I've asked @laraPPr to upload a test report from her system, and I'll also trigger boegelbot again for a final set of tests. If successful, I say we merge, since it probably works for the majority of people (and we can tackle any remaining issues in follow-up PRs).

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14057

Test results coming soon (I hope)...

Details

- notification for comment with ID 2286064510 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 1 out of 3 (3 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/ddfe8bcfef27486b4af9bb587be27070 for a full test report.

@casparvl
Contributor

@boegelbot please test @ jsc-zen3
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4668

Test results coming soon (I hope)...

Details

- notification for comment with ID 2286423224 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl
Contributor

Ah...

Kenneth Hoste (boegel)
  4:02 PM
FYI: I’m doing a forced rebuild of Python/3.11.3-GCCcore-12.3.0 on jsc-zen3, it got messed up by a pip install command that is run via setup.py, see https://github.com/Juniper/py-junos-eznc/issues/1318 + https://github.com/easybuilders/easybuild-easyconfigs/pull/21166
edit: same problem on generoso (edited) 

So that explains the generoso failure...

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14059

Test results coming soon (I hope)...

Details

- notification for comment with ID 2286482354 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/edf15f9663f507ee4ef7379411c8e0a5 for a full test report.

@casparvl
Contributor

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14074

Test results coming soon (I hope)...

Details

- notification for comment with ID 2288508553 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@laraPPr
Contributor

laraPPr commented Aug 14, 2024

Test report by @laraPPr
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
node4009.donphan.os - Linux RHEL 8.8 (Ootpa), x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 1 x NVIDIA NVIDIA A2, 545.23.08, Python 3.11.3
See https://gist.github.com/laraPPr/f0a871731a5f0138e53fd6e6454d1000 for a full test report.

@laraPPr
Contributor

laraPPr commented Aug 14, 2024

The third one failed because of a lock. I will clean it up and retrigger the one that failed later.

@boegel
Member

boegel commented Aug 14, 2024

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/242c392c61a7c8765863dc76fd7b4eb3 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/92f838f6a845e5eb7d22abf5cafc7b9d for a full test report.

@boegel
Member

boegel commented Aug 15, 2024

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3302.joltik.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/43f6d9edffd144a580247e8a04247699 for a full test report.

@boegel
Member

boegel commented Aug 15, 2024

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3303 TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb"

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=20358 EB_ARGS="--include-easyblocks-from-pr 3303 TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20358 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14082

Test results coming soon (I hope)...

Details

- notification for comment with ID 2291702091 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3303
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/b0da64e1ba8b233116b68db4cbf9ba3c for a full test report.

@boegel boegel changed the title from "{lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1" to "{lib}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + add missing patches for TensorFlow v2.15.1 + NCCL v2.18.3" Aug 20, 2024
Member

@boegel boegel left a comment


lgtm

@boegel boegel modified the milestones: 4.x, release after 4.9.2 Aug 20, 2024
@boegel
Member

boegel commented Aug 20, 2024

Going in, thanks @yqshao!

@boegel boegel merged commit b688717 into easybuilders:develop Aug 20, 2024
@yqshao yqshao deleted the 20240413152217_new_pr_TensorFlow2151 branch August 20, 2024 13:50