Nomad server doesn't accept connections after some time #8038

pznamensky · 2020-05-21T14:27:15Z

Nomad version

Nomad v0.11.2 (807cfeb)

Operating system and Environment details

CentOS 7.8

Issue

We have a Nomad cluster with 3 server nodes. Every few days, one of the Nomad servers stops accepting connections:

srv2~ $ nomad status
Error querying jobs: Get "https://127.0.0.1:4646/v1/jobs": net/http: TLS handshake timeout

The other servers mark that node as left:

~ $ nomad server members
Name         Address  Port  Status  Leader  Protocol  Build   Datacenter  Region
srv1.global  <ip>     4648  alive   false   2         0.11.2  staging     global
srv2.global  <ip>     4648  left    false   2         0.11.2  staging     global
srv3.global  <ip>     4648  alive   true    2         0.11.2  staging     global

I traced the hung Nomad process with strace, but the only system calls it made were epoll_pwait, nanosleep, sched_yield and futex.
The previous release (0.10.5) seems to work fine. Nomad agents have worked well so far.
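
For anyone trying to catch the hang as it happens, here is a minimal sketch of a probe that separates "port closed" from "TCP connects but the TLS handshake never completes", which is the symptom shown by `nomad status` above. The function name `probe_tls` and the returned status strings are made up for illustration and are not part of Nomad. Certificate verification is deliberately disabled because the probe only checks liveness, not identity.

```python
import socket
import ssl


def probe_tls(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify how a TLS endpoint responds.

    Returns one of: "ok", "refused", "connect timeout",
    "handshake timeout", or a "connect error: ..." / "tls error: ..." string.
    """
    ctx = ssl.create_default_context()
    # Assumption: liveness check only, so certificate checks are turned off.
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    # Step 1: plain TCP connect.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "connect timeout"
    except OSError as exc:
        return f"connect error: {exc}"

    # Step 2: TLS handshake on the connected socket, bounded by the same timeout.
    try:
        with sock:
            sock.settimeout(timeout)
            with ctx.wrap_socket(sock, server_hostname=host):
                return "ok"
    except socket.timeout:
        # TCP was accepted by the kernel, but the process never answered.
        return "handshake timeout"
    except (ssl.SSLError, OSError) as exc:
        return f"tls error: {exc}"
```

Calling `probe_tls("127.0.0.1", 4646)` on the affected server would be expected to report a handshake timeout while the process is hung, assuming the HTTP API listens on the default port 4646 as in the error message above.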

Reproduction steps

Set up a Nomad cluster and wait several days :)

Nomad Server config

datacenter = "staging"
data_dir = "/var/lib/nomad"
bind_addr = "::"
enable_syslog = true

server {
    enabled = true
    bootstrap_expect = 3

    retry_join = ["srv2:4648","srv3:4648"]
    retry_interval = "15s"
}

client {
    enabled = false
}

advertise {
    http = "<ip>:4646"
    rpc  = "<ip>:4647"
    serf = "<ip>:4648"
}

consul {
   server_auto_join = false
   client_auto_join = false
   token = "<token>"
}

tls {
   http = true
   rpc  = true

   ca_file   = "/etc/nomad.d/nomad-ca.crt"
   cert_file = "/etc/nomad.d/server.global.nomad.crt"
   key_file  = "/etc/nomad.d/server.global.nomad.private.key"

   verify_server_hostname = true
   verify_https_client    = false
}

acl {
   enabled = true
   token_ttl = "60s"
   policy_ttl = "60s"
}

log_level = "DEBUG"

Nomad Server logs

The last lines on the failed server

May 21 15:13:11 srv2 nomad: 2020-05-21T15:13:11.401+0300 [DEBUG] http: request complete: method=GET path=/v1/allocation/9c9729b0-c6fd-eafb-ffce-127742366060 duration=18.652021ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/56d54b1f-2462-1a5f-b493-ac873d35bc92 duration=17.652458ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/805e9147-c708-1fed-9e10-9b31a27c7ade duration=17.441663ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/30d5c473-3ffc-968a-31b6-5610b1e211fc duration=17.117029ms
May 21 15:13:11 srv2 nomad: 2020-05-21T15:13:11.402+0300 [DEBUG] http: request complete: method=GET path=/v1/allocation/082342be-41ad-12cb-5933-18b85c4a8c0c duration=18.67848ms
May 21 15:13:11 srv2 nomad: 2020-05-21T15:13:11.404+0300 [DEBUG] http: request complete: method=GET path=/v1/allocation/ec8c15f8-0d02-f143-77f6-f57e963ab6b9 duration=21.744965ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/node/c7533df9-ed3c-de50-db4d-a12e31cfeffe duration=9.68731ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/node/12cb5e69-dc66-1411-2f18-048edfb0c2ac duration=16.544546ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/0b5dd865-4e3b-b444-f26f-b9faa8c0241e duration=18.700859ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/9c9729b0-c6fd-eafb-ffce-127742366060 duration=18.652021ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/082342be-41ad-12cb-5933-18b85c4a8c0c duration=18.67848ms
May 21 15:13:11 srv2 nomad[22722]: http: request complete: method=GET path=/v1/allocation/ec8c15f8-0d02-f143-77f6-f57e963ab6b9 duration=21.744965ms

Corresponding logs on another server

May 21 15:13:20 srv1 nomad: 2020-05-21T15:13:20.225+0300 [DEBUG] nomad: memberlist: Failed ping: srv2.global (timeout reached)
May 21 15:13:20 srv1 nomad[22533]: nomad: memberlist: Failed ping: srv2.global (timeout reached)

I understand that this is probably not enough diagnostic information. I can provide more if you let me know what else would be useful.


zyclonite · 2020-05-25T14:13:14Z

Having a similar issue: it happens randomly after some days, and one CPU core goes up to 100%.
This happens with all 0.11.x versions, but I could not reproduce it on demand.

pznamensky · 2020-06-05T12:23:28Z

@schmichael any chance this behaviour will be fixed in 0.11.3?
I would say it's a critical bug.
We have had to roll back our cluster to 0.10 but still hope the issue will be fixed in 0.11.3.

pznamensky · 2020-06-10T06:14:55Z

After sending SIGABRT (09:03:54 in the log) to the unresponsive process, I got this log:
nomad.log

pznamensky · 2020-06-22T07:56:48Z

Can't reproduce on 0.11.3. See #8163 (comment)

github-actions · 2022-11-06T02:35:14Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

galeep added the stage/needs-investigation label May 27, 2020

schmichael added type/bug theme/core labels May 29, 2020

pznamensky mentioned this issue Jun 2, 2020

Nomad Client becomes unresponsive after time #8085

Closed

rkettelerij mentioned this issue Jun 4, 2020

Nomad node becoming un-responsoive #7987

Closed

shoenig self-assigned this Jun 10, 2020

pznamensky mentioned this issue Jun 15, 2020

Nomad server daemon hung with 100% CPU #8163

Closed

pznamensky closed this as completed Jun 22, 2020

github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022
