
3.11.1 slurmctld core dumps with error message: double free or corruption (!prev) #6529

Open
gwolski opened this issue Nov 2, 2024 · 15 comments

gwolski commented Nov 2, 2024

Attempting to move to ParallelCluster 3.11.1. I have been using 3.9.1 (with its known bug since May) without this issue.
3.9.1 is a custom Rocky 8.9 image with a ParallelCluster 3.9.1 overlay.

The new setup is a custom Rocky 8.10 AMI on which I have overlaid ParallelCluster 3.11.1 with pcluster build.

It deployed just fine and ran for a week with a small number of jobs submitted.
Then I started banging on it a bit more, with hundreds of jobs/day. About every 24 hours, slurmctld dumps core; the final error message in /var/log/messages is:

Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: double free or corruption (!prev)
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 2977/UID 0).
Oct 30 11:04:30 ip-10-6-11-248 systemd-coredump[2978]: Process 1374 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 2531:#12#0

Reboot the machine and 24 hours later:

Oct 31 10:48:04 ip-10-6-11-248 slurmctld[1383]: corrupted double-linked list
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 728511/UID 0).
Oct 31 10:48:04 ip-10-6-11-248 systemd-coredump[728512]: Process 1383 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 728510:#12#0

The error messages right before this are of the form:
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-21:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-21"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-25:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-25"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
[duplicate messages not included, just different node names]

Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: slurmctld: agent/is_node_resp: node:sp-m7a-l-dy-sp-8-gb-2-cores-4 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication conne
[duplicate messages not included, just different node names]

Error messages in slurmctld.log from the same time as the second crash above:

[2024-10-31T10:47:04.008] error: unable to split forward hostlist
[2024-10-31T10:47:04.008] error: _thread_per_group_rpc: no ret_list given
[2024-10-31T10:47:05.134] error: slurm_receive_msg [10.6.2.57:50156]: Zero Bytes were transmitted or received
[2024-10-31T10:47:18.718] error: slurm_receive_msg [10.6.9.248:41134]: Zero Bytes were transmitted or received
[2024-10-31T10:47:20.862] error: slurm_receive_msg [10.6.14.229:53816]: Zero Bytes were transmitted or received
[2024-10-31T10:47:22.137] error: slurm_receive_msg [10.6.2.57:50996]: Zero Bytes were transmitted or received
[2024-10-31T10:48:04.000] cleanup_completing: JobId=4094 completion process took 134 seconds
[2024-10-31T10:48:04.000] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-37 not responding, setting DOWN
[2024-10-31T10:48:04.003] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-10:6818) failed: Name or service not known
[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"
[2024-10-31T10:48:04.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-11:6818) failed: Name or service not known
[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"
[2024-10-31T10:48:04.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-33:6818) failed: Name or service not known
[2024-10-31T10:48:04.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-33"
[2024-10-31T10:48:04.009] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-36:6818) failed: Name or service not known
[2024-10-31T10:48:04.009] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-36"
[2024-10-31T10:48:04.010] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores

Has anyone seen this? I'm going back to 3.10.1 and will attempt to deploy that version.

gwolski commented Nov 5, 2024

I have not seen this issue again since the last restart (four days ago). There was something wrong with my reverse DNS lookup that I fixed before the restart. I don't want to believe that is it, but isn't it always DNS? Nonetheless, I am increasing my testing, and I've also modified the systemd service file for slurmctld to give it the ability to restart itself on failure.

# /etc/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service remote-fs.target
Wants=network-online.target
ConditionPathExists=/opt/slurm/etc/slurm.conf
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=562930
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target

Four new lines added:
In the [Unit] section:
StartLimitIntervalSec=30
StartLimitBurst=2

and in the [Service] section:
Restart=on-failure
RestartSec=10s
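For completeness, applying this is just the usual systemd reload/restart on the head node, something like (nothing here is ParallelCluster-specific):

# reload unit files so the new Restart= settings take effect, then restart the controller
sudo systemctl daemon-reload
sudo systemctl restart slurmctld
# sanity-check that the unit picked up the new values
systemctl show slurmctld -p Restart,RestartUSec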

Maybe this is something you might want to consider adding to the standard distribution?

himani2411 commented Nov 5, 2024

Hi @gwolski

From the previous error logs it does look like a networking issue. Are you still facing the issue after the changes you made to DNS?

Thank you for the suggestion on self-restarting slurmctld; we will track it in our backlog.

gwolski commented Nov 6, 2024

My DNS server did not have a reverse lookup zone, so slurmctld could not find my submission hosts. I don't know why this would be a requirement, but once I added my submission hosts to the reverse lookup zone, things quieted down. So I am suspicious that this was the cause.
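For anyone hitting the same thing, the quick checks I'd suggest from the head node look something like this (the host name and IP below are placeholders, not my real ones):

host submit01.example.com      # forward lookup of a submission host
host 10.6.2.57                 # reverse (PTR) lookup of its IP - this is what my zone was missing
dig -x 10.6.2.57 +short        # same reverse check with dig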

I am now building a new 3.11.1 pcluster image with my latest custom AMI changes and will deploy in a new 3.11.1 cluster. Will run and then update this ticket.

I will add a feature request for the self-restarting slurmctld so it is tracked outside of this ticket.

@demartinofra
Contributor

Hi @gwolski,

Thank you for reporting the extra details and keeping us posted about the progress.

In order for us to better investigate what has caused Slurm to crash, would you mind sharing the following if they are still available:

  1. The ParallelCluster cluster configuration file.
  2. The full slurmctld logs.
  3. The processed core dump file. Could you please run the following gdb command and share the output?
    gdb -ex 't a a bt' -batch /opt/slurm/sbin/slurmctld /var/spool/slurm.state/core.NNN > /tmp/core.out
    
  4. The full Slurm configuration. You can easily dump it with the command scontrol show config.

Thank you, Francesco

gwolski commented Nov 7, 2024

Hi @demartinofra,
The entire cluster is now gone, as I have replaced it afresh.
I am about to write this off as a poor DNS setup on my end, with a lack of proper reverse records.
I have not yet enabled the next cluster the same way to allow submission nodes, so I'd like to not close this until I do that and have success. In the meantime I am debugging what I hope is a "me problem". Just sharing here in case it tickles something you might know of.

I'm starting srun jobs directly from the head node and I'm encountering random errors on one type of simulation job. I have some Tcl code that tries to write to stdout, and I get:

# transcript error: error writing "stdout": stale remote file handle
    while executing
"puts -nonewline stdout $s"

It's "random" as to why one job out of hundreds gets this error. (I'm running same regression over and over to stress the system). I can't imagine why I have a stale file handle. I'm using nfsv3 with AWS FSX Ontap storage and I never see this error elsewhere (I have a 3.9.1 cluster running this same code and it has had 10s of thousands of jobs go through).

I'm running a custom Rocky 8.10 AMI (with the latest updates) with ParallelCluster 3.11.1 overlaid.

I'll update once I figure out this problem and enable submission hosts to the cluster and have success (or failure).

@demartinofra
Contributor

Not something I've seen before. How are you mounting the AWS FSx for ONTAP storage on the compute nodes?

gwolski commented Nov 7, 2024

I'm mounting the filesystems with autofs.
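For context, the autofs maps are along these lines (server name, paths, and mount options below are illustrative, not my exact maps):

# /etc/auto.master.d/proj.autofs (illustrative)
/proj  /etc/auto.proj  --timeout=600

# /etc/auto.proj (illustrative) - NFSv3 mounts from the FSx for ONTAP SVM
tools  -fstype=nfs,vers=3,hard  svm-example.fsx.local:/vol_tools
work   -fstype=nfs,vers=3,hard  svm-example.fsx.local:/vol_work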
I spent some quality time following the tool flow and examining logs, and I'm presently convinced that we have multiple simulations running in the same directory (by mistake) and they are stomping on each other. Maybe the directory is even being removed once one simulation finishes, and then the Tcl code is failing. I've passed it on to someone else to debug.

I will be enabling my remote submission hosts tomorrow.

@demartinofra
Contributor

Thanks for the update. Let us know if you are able to reproduce the coredump issue again, and in that case please share the content of the coredump file with us so that we can look into it.

gwolski commented Nov 12, 2024

@demartinofra I had a crash during the night.

Reviewing the slurmctld.log file, I'm again seeing slurmctld unable to resolve compute node names. And now something clicked: I have my /etc/resolv.conf locked down, so it does not have an entry for my cluster_name.pcluster domain on the search line. I have now updated my /etc/resolv.conf search line to include "tsi4.pcluster" so compute names will resolve (though I have not modified it on my compute nodes). However, I've been running the same setup in ParallelCluster 3.9.1 for months with no crashes, and I do see the same occasional error message in those logs - the last one from Nov 6. So how on earth is Slurm resolving node names if I don't have my /etc/resolv.conf correct for the many days things "are just working"? Maybe this is the cause of the crash?
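Concretely, the change was just adding the cluster domain to the search line of /etc/resolv.conf on the head node; something like this (the nameserver IP and the second search domain are placeholders):

# /etc/resolv.conf (illustrative)
search tsi4.pcluster us-west-2.compute.internal
nameserver 10.6.0.2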

Right before the first coredump, my /var/log/messages has:

Nov 12 03:00:41 ip-10-6-2-5 slurmctld[9322]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-c7a-4xl-dy-sp-32-gb-16-cores-1:6818) failed: Name or service not known
Nov 12 03:00:41 ip-10-6-2-5 slurmctld[9322]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-c7a-4xl-dy-sp-32-gb-16-cores-1"
Nov 12 03:00:41 ip-10-6-2-5 slurmctld[9322]: double free or corruption (!prev)

And here's the start of the core dump info:

Nov 12 03:00:41 ip-10-6-2-5 systemd-coredump[3641170]: Process 9322 (slurmctld) of user 401 dumped core.#12#012Stack trace of thread 3641156:#12#0 0x0000149e3c4e952f __GI_raise (libc.so.6)#12#1 0x0000149e3c4bce65 __GI_abort (libc.so.6)#12#2 0x0000149e3c52a727 __libc_message (libc.so.6)#12#3 0x0000149e3c531a2c malloc_printerr (libc.so.6)#12#4 0x0000149e3c533abc _int_free (libc.so.6)#12#5 0x0000149e3d595823 slurm_xfree (libslurmfull.so)#12#6 0x0000149e3d527b1b slurm_free_node_alias_addrs_members (libslurmfull.so)#12#7 0x0000149e3d4d56b4 destroy_forward (libslurmfull.so)#12#8 0x0000000000426258 _thread_per_group_rpc (slurmctld)#12#9 0x

I have a bunch of core dumps in:

$ ls -ltr /var/lib/systemd/coredump/
total 7060
-rw-r----- 1 root root 2681757 Nov 12 03:00 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.9322.1731409241000000.lz4
-rw-r----- 1 root root 1076727 Nov 12 03:00 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3641181.1731409254000000.lz4
-rw-r----- 1 root root 1278469 Nov 12 03:01 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3641254.1731409267000000.lz4
-rw-r----- 1 root root 1116909 Nov 12 03:51 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3642879.1731412283000000.lz4
-rw-r----- 1 root root 1067271 Nov 12 03:51 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3643236.1731412295000000.lz4

You asked me to run a gdb command above, but I don't have a /var/spool/slurm.state/ directory. Maybe you want me to run it on one of these core dumps (note they are .lz4 files - I am guessing I have to uncompress them first?). Do you want one of the coredump files? It probably has confidential info - is there a way I can get it to you other than as an attachment?

Attached are all the files I believe you have asked for in these two tar files:
tsi4_config_files.tar.gz
tsi4_headnode_logs.tar.gz

Please let me know how to provide the core files and if I can provide anything else.

gwolski commented Nov 13, 2024

The team has root caused the "transcript error" message from my Nov 7 comment above and why it is happening with 3.11.1. It is due to our EDA tool having a five-minute timeout waiting for machines to come alive; if they do not come alive, it resubmits the job. So we were having two copies of the same job submitted, and they were stomping on each other. The five-minute timeout worked just fine in 3.9.3, as the machines mostly came up in under five minutes. In 3.11.1 I am seeing machine start-up times of over seven minutes. I just built a cluster with your generic rhel8 AMI today, and the first machine I created had a boot time of 7:08 minutes, i.e. from CF to RUNNING. It was a spot r7a.medium. See #6479.

I have been running 3.9.3 and I just saw this happen once last week, compared to roughly once every 100 simulations in 3.11.1. We run hundreds of sims per day through ParallelCluster right now.

@demartinofra
Contributor

So how on earth is slurm resolving node names if I don't have my /etc/resolv.conf correct for the many days things "are just working"? Maybe this is the cause of the crash?

We configure Slurm so that IP addresses are propagated by the controller. In particular, we explicitly set the IP address (NodeAddr) of each Slurm node as part of our instance provisioning process, so that Slurm does not need to perform any DNS resolution. The fact that slurmctld is crashing when DNS resolution is misconfigured sounds like an issue in how Slurm is handling it. Once we have the coredumps available we can work with SchedMD to nail down the issue.
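If you want to double-check what the controller is using for a given node, the NodeAddr/NodeHostName fields are visible with something like:

scontrol show node sp-c7a-4xl-dy-sp-32-gb-16-cores-1 | grep -E 'NodeAddr|NodeHostName'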

You asked me to run a gdb command above, but I don't have a /var/spool/slurm.state/ directory.

This is unexpected. The StateSaveLocation is set to /var/spool/slurm.state in Slurm config which means Slurm should be using that directory on the head node to store state information.

To process systemd coredumps (in case you don't have the coredumps in slurm.state directory) you can use coredumpctl, with something like:

coredumpctl debug --debugger-arguments="-ex 't a a bt' -batch" slurmctld
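You can also check which dumps the journal still has metadata for with:

coredumpctl list slurmctld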

Do you want one of the coredump files? It probably has confidential info - any way I can get it to you via another way rather than an attachment?

You are right. If you have the possibility to open an AWS case, you could do that and upload the file in the case. Otherwise I can reach out over email and provide you with a different way to safely upload the coredump.

I just built a cluster with your generic rhel8 AMI today and the first machine I created I saw a boot time of of 7:08 minutes today, i.e. from CF to RUNNING. It was a spot r7a.medium. See #6479.

I will ask the team to look into this.

gwolski commented Nov 13, 2024

You asked me to run a gdb command above, but I don't have a /var/spool/slurm.state/ directory.

This is unexpected. The StateSaveLocation is set to /var/spool/slurm.state in Slurm config which means Slurm should be using that directory on the head node to store state information.

I mistyped - I meant to say I have no core.NNN files in the /var/spool/slurm.state directory. Here is what I have in that directory:

$ ls
assoc_mgr_state      hash.5                node_state.old
assoc_mgr_state.old  hash.6                part_state
assoc_usage          hash.7                part_state.old
assoc_usage.old      hash.8                priority_last_decay_ran
clustername          hash.9                priority_last_decay_ran.old
fed_mgr_state        job_state             qos_usage
fed_mgr_state.old    job_state.old         qos_usage.old
hash.0               last_config_lite      resv_state
hash.1               last_config_lite.old  resv_state.old
hash.2               last_tres             trigger_state
hash.3               last_tres.old         trigger_state.old
hash.4               node_state

To process systemd coredumps (in case you don't have the coredumps in slurm.state directory) you can use coredumpctl, with something like:

coredumpctl debug --debugger-arguments="-ex 't a a bt' -batch" slurmctld

I'm running on Rocky 8 and coredumpctl does not support --debugger-arguments. I can run

 coredumpctl debug slurmctld

I have attached that output.

I have never used coredumpctl, and the last time I seriously used gdb was back in the '90s, so I'm really rusty there... I would need your help if you want me to do anything specific.
coredumpctl.log

I do not have AWS technical support right now. You can get my email from Pedro Gil or Allan Carter @cartalla in your company and send me a private note.

@demartinofra
Contributor

No worries at all, let me try to help :)

If the coredumpctl version does not support the --debugger-arguments option, I guess we can first dump the coredump and then process it separately with gdb. Can you try something like this:

coredumpctl dump slurmctld -1 --output=coredump
gdb -ex 't a a bt' -batch /opt/slurm/sbin/slurmctld coredump > slurmctld_coredump_analysis.txt

If that works, can you send both the coredump and slurmctld_coredump_analysis.txt to us? I reached out to you over email so that you don't need to share them here.

gwolski commented Nov 15, 2024

Hmm, trying the coredumpctl command today and not having success:

$ ls -al /var/lib/systemd/coredump
total 7064
drwxr-xr-x. 2 root root    4096 Nov 12 03:51 .
drwxr-xr-x. 6 root root      84 Sep 24 02:26 ..
-rw-r-----  1 root root 1076727 Nov 12 03:00 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3641181.1731409254000000.lz4
-rw-r-----  1 root root 1278469 Nov 12 03:01 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3641254.1731409267000000.lz4
-rw-r-----  1 root root 1116909 Nov 12 03:51 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3642879.1731412283000000.lz4
-rw-r-----  1 root root 1067271 Nov 12 03:51 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.3643236.1731412295000000.lz4
-rw-r-----  1 root root 2681757 Nov 12 03:00 core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.9322.1731409241000000.lz4
[root@ip-10-6-2-5 coredump]# coredumpctl list
No coredumps found.

Yet this worked a few days ago. journalctl --verify shows everything is "PASS", i.e. all OK. I tried googling for ideas as to why no core files get listed, but can't find anything. I can keep looking, but I'm hoping you have an idea...

gwolski commented Nov 18, 2024

It seems the journal files are being rotated, so coredumpctl can't find them anymore, but I do have the coredump files. I will package up all the coredump files for you to review.
I learned more about coredumpctl here: systemd/systemd#17910

I was able to run your gdb command on each individual coredump file after uncompressing. Details sent in private email on how you can retrieve the information.
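For reference, what I ran was roughly the following for each dump (shown for the first one; the output file name is my own choice):

# decompress the systemd coredump, then dump backtraces of all threads
lz4 -d core.slurmctld.401.7eb70c2c3f464dc4b2c16da5b4c79efa.9322.1731409241000000.lz4 core.slurmctld.9322
gdb -ex 't a a bt' -batch /opt/slurm/sbin/slurmctld core.slurmctld.9322 > core.9322.out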
