`pack_size != -1` "Memory access fault" on Frontier #115

pgrete · 2024-09-09T11:57:42Z

While running some tests on Frontier I noticed the following issue:

$ srun -N 1 -n 8 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/pgrete/src/athenapk/build-bump-parth/bin/athenaPK -i ./linear_wave3d.in parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 parthenon/mesh/nx1=1024 parthenon/mesh/nx2=1024 parthenon/mesh/nx3=1024 parthenon/time/nlim=20 parthenon/time/integrator=rk2 parthenon/mesh/pack_size=4
Memory access fault by GPU node-8 (Agent handle: 0x61253f0) on address 0x7ff7f2522000. Reason: Unknown.
srun: error: frontier08577: task 0: Aborted
srun: Terminating StepId=2345368.15
slurmstepd: error: *** STEP 2345368.15 ON frontier08577 CANCELLED AT 2024-09-06T06:38:27 ***
^[[A^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=2345368.15 tasks 1-7: running
srun: StepId=2345368.15 task 0: exited abnormally
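
For reference, the block and pack counts implied by the overrides in the command above can be sketched as follows. This is only illustrative arithmetic, assuming a uniform mesh, an even distribution of meshblocks across the 8 ranks, and that `pack_size` is the number of meshblocks per pack (with a value below 1 meaning one pack per rank). The helper name `blocks_and_packs` is hypothetical and not part of Parthenon or AthenaPK.

```python
# Illustrative arithmetic only; blocks_and_packs is a hypothetical helper,
# not a Parthenon or AthenaPK API.
def blocks_and_packs(mesh_nx, block_nx, n_ranks, pack_size):
    """Return (total_blocks, blocks_per_rank, packs_per_rank) for a uniform mesh."""
    total_blocks = 1
    for mesh_n, block_n in zip(mesh_nx, block_nx):
        if mesh_n % block_n != 0:
            raise ValueError("mesh size must be a multiple of the meshblock size")
        total_blocks *= mesh_n // block_n

    # Assumption: blocks are spread evenly over the ranks.
    if total_blocks % n_ranks != 0:
        raise ValueError("this sketch assumes blocks divide evenly among ranks")
    blocks_per_rank = total_blocks // n_ranks

    # Assumption: pack_size < 1 puts all blocks of a rank into a single pack.
    if pack_size < 1:
        packs_per_rank = 1
    else:
        packs_per_rank = -(-blocks_per_rank // pack_size)  # ceiling division
    return total_blocks, blocks_per_rank, packs_per_rank


# Values taken from the srun command: 1024^3 mesh, 256^3 meshblocks, 8 ranks.
print(blocks_and_packs((1024, 1024, 1024), (256, 256, 256), n_ranks=8, pack_size=4))
```

Under these assumptions the failing run has 64 meshblocks in total, 8 per rank, split into 2 packs of 4 per rank, whereas `pack_size=-1` would give a single pack of 8 per rank.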

It should be confirmed whether this is Frontier-specific or a more general AthenaPK or Parthenon issue.


pgrete · 2024-09-09T13:40:55Z

It works as expected on GH200, so the "Memory access fault" seems to be one of the standard Frontier/LUMI/MI250X/Cray errors.

BenWibking · 2024-09-09T13:47:09Z

It's probably an LLVM AMDGPU compiler bug. It's been known for years, but AMD has not been able to fix it: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811

The bug is triggered when register pressure is high and there is conditional execution (as in virtually all of our kernels). The compiler can then produce incorrect machine code for restoring registers that were spilled to memory (because the kernel ran out of hardware registers), and that code trashes the registers that hold memory addresses. Then, boom: memory error and crash.

So far we've only seen it with reaction networks (which use thousands of registers). But as the AMD engineer says in the thread, it's not predictable when it happens, it cannot be verified that any given kernel is compiled correctly, and the bug is difficult to spot even when manually inspecting the generated machine code.

BenWibking · 2024-09-09T13:49:30Z

Here's another example of this kind of compiler bug: llvm/llvm-project#96353

pgrete · 2024-09-09T13:57:31Z

yikes... I guess we'll wait and see then.

BenWibking · 2024-10-01T02:09:28Z

The PR that was expected to fix (all?) of these kinds of bugs was just merged into LLVM: llvm/llvm-project#93526.

It may be possible to build a working compiler on Frontier using Spack with `spack install llvm@main target=zen3,amdgpu`.
