Reloading consul certs #17297

Open
juananinca opened this issue May 23, 2023 · 3 comments
Labels
stage/needs-verification, theme/consul, type/bug

Comments


juananinca commented May 23, 2023

Nomad version

Nomad v1.5.0
BuildDate 2023-03-01T10:11:42Z
Revision fc40c49

Operating system and Environment details

NAME="Oracle Linux Server"
VERSION="8.7"

Issue

I set up a Nomad cluster with Consul consisting of a few clients and a single server. Both Nomad and Consul are secured with mutual TLS certificates generated by Vault's PKI secrets engine and rotated with a TTL of 1h using consul-template on each node, just like in this tutorial: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-pki-nomad (the Vault service is not running within this cluster). After every rotation, consul-template sends a SIGHUP to both Nomad and Consul via systemctl reload.

While testing the cluster I found that one of the clients (let's call it CA) was unable to register any service into Consul, although if I ran the same job on another client (let's call it CB) there was no problem registering it. I also noticed in CA's Nomad log that it was unable to get Consul's checks:
{"@level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-23T00:52:25.791875+02:00","error":"Get \"https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}

After restarting Nomad on CA, all jobs run on that client are registered into Consul and the bad certificate error when fetching Consul checks is gone, until the next expiration time arrives and I am back at the starting point: unable to register new services from CA, with bad certificate errors in the Nomad logs. What is weird is that CB keeps registering services and communicating with Consul even after the certs are rotated, without any restart of the Nomad service.

I headed to the documentation (https://developer.hashicorp.com/nomad/docs/configuration#configuration-reload) and it made sense to me (kind of):

tls: note this only reloads the TLS configuration between Nomad agents (servers and clients), and not the TLS configuration for communication with Consul or Vault.

This perfectly explains CA's behaviour regarding Consul communication, but not CB's.

Who's right and who's wrong? Is CA acting as it is supposed to? And what about CB?

Note: I double-checked the certs' expiration time on CB by copying them (so the copies are never rotated) at the moment I restart Nomad, while they are still valid, and waiting until they expire. Once they are expired I use them to curl a Consul endpoint, for instance https://localhost:8500/v1/agent/checks, and I get a bad certificate error; yet Nomad keeps querying Consul without any error and without a restart of the service, only a SIGHUP sent via systemctl reload nomad.
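
For reference, a minimal Go equivalent of that curl check (a sketch only; the cert paths are assumed from the configs further down) would be:

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    // Trust the Consul CA and present the (possibly expired) client cert,
    // mirroring what curl does with --cacert/--cert/--key.
    caPEM, err := os.ReadFile("/opt/consul/ssl/consul-ca.pem")
    if err != nil {
        log.Fatal(err)
    }
    pool := x509.NewCertPool()
    pool.AppendCertsFromPEM(caPEM)

    cert, err := tls.LoadX509KeyPair("/opt/consul/ssl/server.pem", "/opt/consul/ssl/server-key.pem")
    if err != nil {
        log.Fatal(err)
    }

    client := &http.Client{
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                RootCAs:      pool,
                Certificates: []tls.Certificate{cert},
            },
        },
    }

    resp, err := client.Get("https://localhost:8500/v1/agent/checks")
    if err != nil {
        // With verify_incoming = true, an expired client cert is rejected:
        // "remote error: tls: bad certificate"
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}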

Reproduction steps

Expected Result

Not sure

Actual Result

Clients behaving differently under the same conditions

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:10.809608+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}
{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:12.837512+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}
{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:14.866487+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}
{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:18.923713+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}
{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:20.953657+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}
{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:22.983925+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}
{"@Level":"error","@message":"failed to retrieve check statuses","@module":"watch.checks","@timestamp":"2023-05-22T12:37:25.010972+02:00","error":"Get "https://127.0.0.1:8500/v1/agent/checks\": remote error: tls: bad certificate"}

lgfa29 (Contributor) commented Jun 3, 2023

Hi @juananinca 👋

I believe CA's behaviour is the expected one in this case. This is the code for the Nomad agent configuration reload:

nomad/command/agent/agent.go

Lines 1286 to 1355 in da9ec8c

// Reload handles configuration changes for the agent. Provides a method that
// is easier to unit test, as this action is invoked via SIGHUP.
func (a *Agent) Reload(newConfig *Config) error {
    a.configLock.Lock()
    defer a.configLock.Unlock()

    current := a.config.Copy()

    updatedLogging := newConfig != nil && (newConfig.LogLevel != current.LogLevel)

    if newConfig == nil || newConfig.TLSConfig == nil && !updatedLogging {
        return fmt.Errorf("cannot reload agent with nil configuration")
    }

    if updatedLogging {
        current.LogLevel = newConfig.LogLevel
        a.logger.SetLevel(log.LevelFromString(current.LogLevel))
    }

    // Update eventer config
    if newConfig.Audit != nil {
        if err := a.entReloadEventer(newConfig.Audit); err != nil {
            return err
        }
    }
    // Allow auditor to call reopen regardless of config changes
    // This is primarily for enterprise audit logging to allow the underlying
    // file to be reopened if necessary
    if err := a.auditor.Reopen(); err != nil {
        return err
    }

    fullUpdateTLSConfig := func() {
        // Completely reload the agent's TLS configuration (moving from non-TLS to
        // TLS, or vice versa)
        // This does not handle errors in loading the new TLS configuration
        current.TLSConfig = newConfig.TLSConfig.Copy()
    }

    if !current.TLSConfig.IsEmpty() && !newConfig.TLSConfig.IsEmpty() {
        // This is just a TLS configuration reload, we don't need to refresh
        // existing network connections

        // Reload the certificates on the keyloader and on success store the
        // updated TLS config. It is important to reuse the same keyloader
        // as this allows us to dynamically reload configurations not only
        // on the Agent but on the Server and Client too (they are
        // referencing the same keyloader).
        keyloader := current.TLSConfig.GetKeyLoader()
        _, err := keyloader.LoadKeyPair(newConfig.TLSConfig.CertFile, newConfig.TLSConfig.KeyFile)
        if err != nil {
            return err
        }

        current.TLSConfig = newConfig.TLSConfig
        current.TLSConfig.KeyLoader = keyloader
        a.config = current
        return nil
    } else if newConfig.TLSConfig.IsEmpty() && !current.TLSConfig.IsEmpty() {
        a.logger.Warn("downgrading agent's existing TLS configuration to plaintext")
        fullUpdateTLSConfig()
    } else if !newConfig.TLSConfig.IsEmpty() && current.TLSConfig.IsEmpty() {
        a.logger.Info("upgrading from plaintext configuration to TLS")
        fullUpdateTLSConfig()
    }

    // Set agent config to the updated config
    a.config = current
    return nil
}

As mentioned in the docs, only Nomad's own TLS configuration is reloaded.
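
To illustrate why rotated files are not picked up, here is a simplified sketch using the Consul API client directly (this is not Nomad's actual code path, just the general pattern): a client built this way loads the certificate files once when it is constructed, so certificates rotated on disk afterwards are not presented until the client is rebuilt, e.g. by restarting the agent.

package main

import (
    "fmt"
    "log"

    consulapi "github.com/hashicorp/consul/api"
)

// newConsulClient builds an HTTPS Consul client from static file paths
// (paths borrowed from the configs below). The key pair is loaded here, once.
func newConsulClient() (*consulapi.Client, error) {
    cfg := consulapi.DefaultConfig()
    cfg.Address = "127.0.0.1:8500"
    cfg.Scheme = "https"
    cfg.TLSConfig = consulapi.TLSConfig{
        CAFile:   "/opt/consul/ssl/consul-ca.pem",
        CertFile: "/opt/consul/ssl/server.pem",
        KeyFile:  "/opt/consul/ssl/server-key.pem",
    }
    return consulapi.NewClient(cfg)
}

func main() {
    client, err := newConsulClient()
    if err != nil {
        log.Fatal(err)
    }
    // After the cert on disk has been rotated, this call still presents the
    // certificate that was loaded at construction time, so a Consul agent with
    // verify_incoming = true answers "remote error: tls: bad certificate"
    // until a new client is built.
    checks, err := client.Agent().Checks()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(checks), "checks registered on the local agent")
}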

Are the CA and CB configurations identical, both for Nomad and Consul? I could imagine this happening if one of the agents is configured to ignore TLS certs.

juananinca (Author) commented

Sorry for the delay.

Yes, both configs are the same.
Here is the Consul config; the only differences between the clients are node_name and advertise_addr:

  "disable_update_check": false,
  "bootstrap": false,
  "server": false,
  "node_name": "NODE_NAME",
  "datacenter": "DATACENTER_NAME",
  "data_dir": "/opt/consul/data",
  "encrypt": "aaaaaaaaaaaaaaaa==",
  "disable_update_check": true,
  "bind_addr": "0.0.0.0",
  "advertise_addr": "10.10.10.10",
  "addresses": {
    "https": "0.0.0.0",
    "dns": "0.0.0.0"
  },
  "ports": {
    "https": 8500,
    "http": -1
  },
  "key_file": "/opt/consul/ssl/server-key.pem",
  "cert_file": "/opt/consul/ssl/server.pem",
  "ca_file": "/opt/consul/ssl/consul-ca.pem",
  "verify_incoming": true,
  "verify_outgoing": true,
  "retry_join": [
    "11.11.11.11"
  ],
  "log_file": "/var/log/consul/",
  "log_json": true,
  "log_rotate_max_files": 7,
  "limits": {
    "https_handshake_timeout": "10s",
    "http_max_conns_per_client": 1000,
    "rpc_handshake_timeout": "10s",
    "rpc_max_conns_per_client": 1000
  },
  "connect": {
    "enabled": true
  },
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "enable_token_persistence": true,
    "tokens": {
      "agent": "aaaaaaa-bbbb-cccc-ddddd-eeeeeeeeee"
    }
  }
}

And the Nomad config; in this case the only differences are the name and the advertise IPs:

name = "CLIENT_NAME"
log_level = "WARN"
leave_on_interrupt = true
leave_on_terminate = true
data_dir = "/var/nomad/data"
bind_addr = "0.0.0.0"
disable_update_check = true
limits {
        https_handshake_timeout   = "10s"
        http_max_conns_per_client = 400
        rpc_handshake_timeout     = "10s"
        rpc_max_conns_per_client  = 400
}
advertise {
    http = "10.10.10.10:4646"
    rpc = "10.10.10.10:4647"
    serf = "10.10.10.10:4648"
}
tls {
  http = true
  rpc  = true
  cert_file = "/opt/nomad/ssl/server.pem"
  key_file = "/opt/nomad/ssl/server-key.pem"
  ca_file = "/opt/nomad/ssl/nomad-ca.pem"
  verify_server_hostname = true
  verify_https_client    = true
}
log_file = "/var/log/nomad/"
log_json = true
log_rotate_max_files = 7
consul {
  address = "127.0.0.1:8500"
  server_service_name = "nomad-server"
  client_service_name = "nomad-client"
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true

  ssl = true
  ca_file = "/opt/consul/ssl/consul-ca.pem"
  cert_file = "/opt/consul/ssl/server.pem"
  key_file = "/opt/consul/ssl/server-key.pem"
  token = "aaaaaa-79bbbbb74-cccc-dddddd-eeeeeeee"
}
acl {
  enabled = true
}

vault {
    enabled = true
    address = "https://my.vault.addr:8200/"
    ca_file = "/opt/vault/ssl/vault-ca.pem"
    cert_file = "/opt/vault/ssl/client-vault.pem"
    key_file = "/opt/vault/ssl/client-vault-key.pem"
}
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

As you can see, verify_incoming is set to true in the Consul configuration file.

tgross added the stage/needs-verification label on Jun 24, 2024
deuspt commented Oct 9, 2024

I'm also experiencing a similar error on 1.8.1. It does not happen very often, given that cert rotation isn't frequent, but it manifests on some allocations and clients a few days after certificates are renewed (using consul-template for this).
It first starts with errors like these:

...
{"@level":"error","@message":"still unable to update services in Consul","@module":"consul.sync","@timestamp":"2024-10-09T16:05:45.945552Z","error":"failed to query Consul services: Get \"https://localhost:8501/v1/agent/services\": remote error: tls: expired certificate","failures":2400}
{"@level":"error","@message":"still unable to update services in Consul","@module":"consul.sync","@timestamp":"2024-10-09T16:15:46.366750Z","error":"failed to query Consul services: Get \"https://localhost:8501/v1/agent/services\": remote error: tls: expired certificate","failures":2420}
...

Then it evolves into some allocations failing (and not restarting) with errors in the Connect sidecar, like:

...
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-10-09T16:17:47.711518Z","alloc_id":"76b119e4-3641-a042-d4d6-d617a4b87f9c","failed":false,"msg":"envoy_version: error retrieving supported Envoy versions from Consul: Get \"https://localhost:8501/v1/agent/self\": remote error: tls: expired certificate","task":"connect-proxy-app1","type":"Task hook failed"}
{"@level":"error","@message":"prestart failed","@module":"client.alloc_runner.task_runner","@timestamp":"2024-10-09T16:17:47.713778Z","alloc_id":"76b119e4-3641-a042-d4d6-d617a4b87f9c","error":"prestart hook \"envoy_version\" failed: error retrieving supported Envoy versions from Consul: Get \"https://localhost:8501/v1/agent/self\": remote error: tls: expired certificate","task":"connect-proxy-app1"}
...

It recovers after Consul and Nomad are both restarted on the affected client. I haven't been able to figure out why...
