Non-serial versions of tests using 5x5_amazon failing RUN #2423

Closed
glemieux opened this issue Mar 14, 2024 · 12 comments
Labels
bug something is working incorrectly

Comments

@glemieux
Collaborator

glemieux commented Mar 14, 2024

Brief summary of bug

mpibind seems to have an issue with the 5x5_amazon resolution when run with full MPI (i.e., not mpi-serial) since ctsm5.1.dev173. Originally posted at NCAR/mpibind#5.

General bug information

CTSM version you are using: ctsm5.1.dev173

Does this bug cause significantly incorrect results in the model's science? [Yes / No] Run fails so no assessment possible

Details of bug

This was discovered when running the FatesColdSeedDispersal test while generating new fates baselines for the dev173 update. I was also able to replicate this failure using a non-serial MPI version of the hillslope clm-only test. The run fails immediately, producing a cesm.log entry with a note that one of the core selections is invalid (see below). It also produced an mpibind.log, which I hadn't noticed before.

This prompted me to compare dev172 and dev173 runs for non-serial MPI versions of the hillslope test that use 5x5_amazon. The dev172 version passes, but I noticed that the preview_run output is different:

dev172:

    MPIRUN (job=case.test):
      mpiexec  --label  --line-buffer  -n 5 /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope/bld/cesm.exe   >> cesm.log.$LID 2>&1 

dev173:

    MPIRUN (job=case.test):
      mpibind  --label  --line-buffer  --  /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe   >> cesm.log.$LID 2>&1 

What is odd to me is that mpibind was brought in with dev172 via ccs_config_cesm0.0.92, so why is the call not activated for that tag? Why is it only being invoked with dev173?
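
For anyone who wants to check which launcher a given tag produces without reading the preview_run output by eye, here is a minimal Python sketch. It is not part of CIME; the function name `mpirun_launcher` is hypothetical, and it assumes the layout shown above (an `MPIRUN` header line followed by the command on the next non-blank line). The `/path/to/...` paths are placeholders, not the real case directories.

```python
def mpirun_launcher(preview_text):
    """Return the launcher (first token) of the command under an 'MPIRUN' header.

    Assumes a line starting with 'MPIRUN' is followed by the command on the
    next non-blank line. Returns None if no such stanza is found.
    """
    lines = preview_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("MPIRUN"):
            for candidate in lines[i + 1:]:
                tokens = candidate.split()
                if tokens:
                    return tokens[0]
    return None


# Abbreviated copies of the two outputs; /path/to/... is a placeholder path.
DEV172 = """MPIRUN (job=case.test):
  mpiexec  --label  --line-buffer  -n 5 /path/to/bld/cesm.exe   >> cesm.log.$LID 2>&1
"""
DEV173 = """MPIRUN (job=case.test):
  mpibind  --label  --line-buffer  --  /path/to/bld/cesm.exe   >> cesm.log.$LID 2>&1
"""

if __name__ == "__main__":
    print(mpirun_launcher(DEV172), mpirun_launcher(DEV173))
```

Feeding it the saved preview_run text from each case directory should make the mpiexec/mpibind switch between tags easy to spot.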

Important details of your setup / configuration so we can reproduce the bug

You can view the SRCROOT_GIT_STATUS files for both dev173 and dev172 hillslope runs here, respectively:
/glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173
/glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope

Important output or errors that show the problem

cesm.log

    dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
    dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
    dec0417.hsn.de.hpc.ucar.edu 4:
    dec0417.hsn.de.hpc.ucar.edu 4: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--localalloc | -l] command args ...
    dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--show | -s]
    dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--hardware | -H]
    dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--strict | -t]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
    dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --file | -f <tmpfsfile>
    dec0417.hsn.de.hpc.ucar.edu 4:                [--huge | -u] [--touch | -T]
    dec0417.hsn.de.hpc.ucar.edu 4:                memory policy [--dump | -d] [--dump-nodes | -D]
    dec0417.hsn.de.hpc.ucar.edu 4:
    dec0417.hsn.de.hpc.ucar.edu 4: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
    dec0417.hsn.de.hpc.ucar.edu 4: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
    dec0417.hsn.de.hpc.ucar.edu 4: Instead of a number a node can also be:
    dec0417.hsn.de.hpc.ucar.edu 4:   netdev:DEV the node connected to network device DEV
    dec0417.hsn.de.hpc.ucar.edu 4:   file:PATH  the node the block device of path is connected to
    dec0417.hsn.de.hpc.ucar.edu 4:   ip:HOST    the node of the network device host routes through
    dec0417.hsn.de.hpc.ucar.edu 4:   block:PATH the node of block device path
    dec0417.hsn.de.hpc.ucar.edu 4:   pci:[seg:]bus:dev[:func] The node of a PCI device
    dec0417.hsn.de.hpc.ucar.edu 4: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
    dec0417.hsn.de.hpc.ucar.edu 4: all ranges can be inverted with !
    dec0417.hsn.de.hpc.ucar.edu 4: all numbers and ranges can be made cpuset-relative with +
    dec0417.hsn.de.hpc.ucar.edu 4: the old --cpubind argument is deprecated.
    dec0417.hsn.de.hpc.ucar.edu 4: use --cpunodebind or --physcpubind instead
    dec0417.hsn.de.hpc.ucar.edu 4: use --balancing | -b to enable Linux kernel NUMA balancing
    dec0417.hsn.de.hpc.ucar.edu 4: for the process if it is supported by kernel
    dec0417.hsn.de.hpc.ucar.edu 4: <length> can have g (GB), m (MB) or k (KB) suffixes
    dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
    dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
    dec0417.hsn.de.hpc.ucar.edu 3:
    dec0417.hsn.de.hpc.ucar.edu 3: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--localalloc | -l] command args ...
    dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--show | -s]
    dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--hardware | -H]
    dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--strict | -t]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
    dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --file | -f <tmpfsfile>
    dec0417.hsn.de.hpc.ucar.edu 3:                [--huge | -u] [--touch | -T]
    dec0417.hsn.de.hpc.ucar.edu 3:                memory policy [--dump | -d] [--dump-nodes | -D]
    dec0417.hsn.de.hpc.ucar.edu 3:
    dec0417.hsn.de.hpc.ucar.edu 3: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
    dec0417.hsn.de.hpc.ucar.edu 3: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
    dec0417.hsn.de.hpc.ucar.edu 3: Instead of a number a node can also be:
    dec0417.hsn.de.hpc.ucar.edu 3:   netdev:DEV the node connected to network device DEV
    dec0417.hsn.de.hpc.ucar.edu 3:   file:PATH  the node the block device of path is connected to
    dec0417.hsn.de.hpc.ucar.edu 3:   ip:HOST    the node of the network device host routes through
    dec0417.hsn.de.hpc.ucar.edu 3:   block:PATH the node of block device path
    dec0417.hsn.de.hpc.ucar.edu 3:   pci:[seg:]bus:dev[:func] The node of a PCI device
    dec0417.hsn.de.hpc.ucar.edu 3: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
    dec0417.hsn.de.hpc.ucar.edu 3: all ranges can be inverted with !
    dec0417.hsn.de.hpc.ucar.edu 3: all numbers and ranges can be made cpuset-relative with +
    dec0417.hsn.de.hpc.ucar.edu 3: the old --cpubind argument is deprecated.
    dec0417.hsn.de.hpc.ucar.edu 3: use --cpunodebind or --physcpubind instead
    dec0417.hsn.de.hpc.ucar.edu 3: use --balancing | -b to enable Linux kernel NUMA balancing
    dec0417.hsn.de.hpc.ucar.edu 3: for the process if it is supported by kernel
    dec0417.hsn.de.hpc.ucar.edu 3: <length> can have g (GB), m (MB) or k (KB) suffixes
    dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
    dec0417.hsn.de.hpc.ucar.edu: rank 0 died from signal 15
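
Most of that log is the numactl usage text repeated per rank; the informative lines are the `<A-B> is invalid` ones. A minimal Python sketch for pulling those out of a cesm.log follows. It is my own helper, not a CIME or mpibind utility; the name `invalid_bindings` is hypothetical, and the pattern assumes the `host rank: <lo-hi> is invalid` layout seen above, with or without a leading line number.

```python
import re

# Matches e.g. "dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid",
# optionally preceded by an editor line number.
INVALID_RE = re.compile(r"(\S+)\s+(\d+):\s+<(\d+)-(\d+)> is invalid")


def invalid_bindings(log_text):
    """Map each MPI rank to the (first, last) CPU range reported as invalid."""
    found = {}
    for line in log_text.splitlines():
        match = INVALID_RE.search(line)
        if match:
            rank = int(match.group(2))
            found[rank] = (int(match.group(3)), int(match.group(4)))
    return found


# A few lines copied from the log above.
SAMPLE = """\
dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
 33 dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
"""

if __name__ == "__main__":
    print(invalid_bindings(SAMPLE))
```

On the excerpt this reports ranks 4 and 3 with ranges 65-65 and 64-64, the same two ranks that the mpibind.log binding report places on cores 64 and 65.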

mpibind.log

Chunk info
  1:ncpus=5:mpiprocs=5:ompthreads=1:mem=230GB:Qlist=cpu:ngpus=0
-- -- -- --
MPI exec line:
  mpiexec --label --line-buffer -n 5 -ppn 5 --cpu-bind none -env OMP_NUM_THREADS=1 /glade/u/apps/opt/mpitools/mpibind/cpu_bind /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe 
-- -- -- --
Binding Report:
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
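
To cross-check a binding report like this against the CPUs a job may actually use, something along these lines could work. This is a hedged sketch, not mpibind's own logic: the helper `out_of_range_bindings` is hypothetical, and the allowed set `{0, 1, 2, 3, 4}` is an assumption chosen only to mirror the `ncpus=5` chunk request, not a value taken from the node.

```python
import re

BINDING_RE = re.compile(r"rank:\s*(\d+),\s*cores:\s*(\d+)-(\d+)")


def out_of_range_bindings(report, allowed_cpus):
    """Return {rank: [cpu, ...]} for ranks bound to CPUs outside allowed_cpus."""
    allowed = set(allowed_cpus)
    bad = {}
    for match in BINDING_RE.finditer(report):
        rank, first, last = (int(group) for group in match.groups())
        cores = set(range(first, last + 1))
        outside = sorted(cores - allowed)
        if outside:
            bad[rank] = outside
    return bad


# Binding report as posted above.
REPORT = """\
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
"""

# ASSUMPTION: the job may only use five CPUs, numbered 0-4 (illustrative only).
ASSUMED_ALLOWED = {0, 1, 2, 3, 4}

if __name__ == "__main__":
    print(out_of_range_bindings(REPORT, ASSUMED_ALLOWED))
```

On a Linux compute node, `os.sched_getaffinity(0)` returns the CPU set available to the calling process and could replace the assumed set when run inside the job.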
@glemieux glemieux changed the title Non-serial versions of tests using 5x5_amazon failing run Non-serial versions of tests using 5x5_amazon failing RUN Mar 14, 2024
@glemieux
Collaborator Author

@ekluzek given the feedback from NCAR/mpibind#5 (comment), should I make an issue in the ccs_config_cesm repo?

@ekluzek
Collaborator

ekluzek commented Mar 15, 2024

@glemieux yes go ahead and do that.

@glemieux
Collaborator Author

glemieux commented Mar 18, 2024

During the ctsm stand-up meeting today we came up with the following actions for the time being:

  • Add a non-serial 5x5_amazon test to aux_clm on derecho and to the expected failure list referencing this issue.
  • Temporarily convert the FatesColdSeedDisp testmod to run on f10

It was also noted that this doesn't seem to be an issue for izumi

@ekluzek
Collaborator

ekluzek commented Mar 22, 2024

@glemieux note this also relates to another problem I ran into:

#2427 (comment)

where the new use of mpibind required me to do something different for mksurfdata_esmf.

@ekluzek
Collaborator

ekluzek commented Mar 22, 2024

The ccs_config issue is here:

ESMCI/ccs_config_cesm#142

glemieux added a commit to glemieux/ctsm that referenced this issue Mar 25, 2024
This will be reverted once issue ESCOMP#2423 has been addressed
@glemieux
Collaborator Author

During the ctsm stand-up meeting today we came up with the following actions for the time being:

  • Add a non-serial 5x5_amazon test to aux_clm on derecho and to the expected failure list referencing this issue.
  • Temporarily convert the FatesColdSeedDisp testmod to run on f10

It was also noted that this doesn't seem to be an issue for izumi

Completed these action items per #2436.

@ekluzek ekluzek added the bug something is working incorrectly label Mar 29, 2024
@samsrabin
Collaborator

It seems like the non-serial 5x5_amazon test (SMS_D_Ld5.5x5_amazon.I1850Clm60Bgc.derecho_gnu.clm-HillslopeC) is now passing as of ctsm5.2.027. Should this issue be closed and that test removed from the expected failure list?

@samsrabin samsrabin added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Sep 20, 2024
@samsrabin
Collaborator

Actually, it would probably be worth checking whether the original test you noticed this with—the FatesColdSeedDispersal one—still fails.

@ekluzek
Collaborator

ekluzek commented Sep 27, 2024

@samsrabin good question on the removal of the MPI version of this test. The utility of the MPI test is to check that MPI works for a simple regional grid, as a way to make sure small regional cases work with MPI in general. It also makes sure you can use MPI for a grid that's only a fraction of a node.

At this point we also test the nldas2 grid, which is a larger regional grid, so we could call that sufficient.

The advantage here, though, is that 5x5_amazon is a simple, fast, small grid for testing. So I like the idea of keeping it for at least some of our testing, if not this specific test for fates seed dispersal.

@samsrabin
Collaborator

Thanks, Erik. I am indeed planning (#2434) to keep a serial 5x5_amazon hillslope test in the aux_clm suite, but I've moved the parallel version @glemieux added to just be in the special hillslope suite.

@wwieder wwieder removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Oct 31, 2024
@wwieder
Contributor

wwieder commented Oct 31, 2024

@glemieux will check on this to see if it's still an issue

@glemieux
Collaborator Author

I can confirm that the original issue is resolved. I reinstated the 5x5_amazon test that was changed via e9fb075 and the run passed without issue. I'll make another issue to reinstate the test.

Test case: /glade/u/home/glemieux/scratch/ctsm-tests/tests_1031-111717de
