Template change_mode intermittently ignored after client restart #24140

Open

ygersie opened this issue Oct 7, 2024 · 6 comments

@ygersie
Contributor

ygersie commented Oct 7, 2024

Nomad version

1.8.2+ent

Issue

Every time we restart a large fleet of clients, some allocations end up with re-rendered dynamic certificates but are never signaled/restarted, despite having a proper change_mode set in the jobspec. This leads to production incidents where, from the workload's perspective, secrets have expired (the task never received a signal or restart) even though the certificates on the filesystem have been updated.

Reproduction steps

I haven't been able to reproduce this in an isolated manner, but it does happen every time on a few allocations across our entire cluster after client restarts. On the same node, there are many other allocations that get signaled/restarted just fine.

Logs

When issuing dynamic certificates from Vault through the template stanza, a new certificate is created on each client restart. This should always trigger a change_mode event, but occasionally it does not for some allocations, which leads to outages.
Here is an example from a client log snippet:

mynodename:~# zgrep '1f514495-e45b-1428-2ab6-b4f2c67e9a0e' /var/log/nomad.log-*.gz
/var/log/nomad.log-20241001.gz:    2024-09-30T10:51:24.428Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1f514495-e45b-1428-2ab6-b4f2c67e9a0e task=mytask type=Received msg="Task received by client" failed=false
/var/log/nomad.log-20241001.gz:    2024-09-30T10:51:24.934Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=1f514495-e45b-1428-2ab6-b4f2c67e9a0e task=mytask path=/data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/api.sock error="listen unix /data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/api.sock: bind: invalid argument"
/var/log/nomad.log-20241001.gz:    2024-09-30T10:51:25.119Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/client.pem"
/var/log/nomad.log-20241001.gz:    2024-09-30T10:51:25.135Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/1f514495-e45b-1428-2ab6-b4f2c67e9a0e/mytask/secrets/server.pem"

As you can see, the certificates were rendered correctly, but a re-render event was never triggered (or at least isn't shown in the log file). This workload has change_mode="restart" set, so we should have seen:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1f514495-e45b-1428-2ab6-b4f2c67e9a0e task=mytask type="Restart Signaled" msg="Template with change_mode restart re-rendered" failed=false

But it's not there.

@tgross
Member

tgross commented Oct 7, 2024

Hi @ygersie! One thing that complicates debugging this is that the template runner reports the "rendering" logs you show even if there's no change to the contents. See below for an example where I've updated a Nomad Variable with a no-op change:

    2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) receiving dependency nomad.var.block(nomad/jobs/[email protected])
    2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) initiating run
    2024-10-07T11:12:48.717-0400 [DEBUG] agent: (runner) checking template 99264c04aea457a2a707ad264732193f
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) diffing and updating dependencies
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) nomad.var.block(nomad/jobs/[email protected]) is still needed
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) watching 1 dependencies
    2024-10-07T11:12:48.718-0400 [DEBUG] agent: (runner) all templates rendered
    2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) received template "99264c04aea457a2a707ad264732193f" from quiescence
    2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) initiating run
    2024-10-07T11:12:53.718-0400 [DEBUG] agent: (runner) checking template 99264c04aea457a2a707ad264732193f
    2024-10-07T11:12:53.720-0400 [DEBUG] agent: (runner) rendering "(dynamic)" => "/var/nomad/data/alloc/d222bce3-4bd6-2029-a6a5-742f3e662595/http/local/test.txt"
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) diffing and updating dependencies
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) nomad.var.block(nomad/jobs/[email protected]) is still needed
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) watching 1 dependencies
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) all templates rendered
    2024-10-07T11:12:53.729-0400 [DEBUG] agent: (runner) enabling global quiescence for "99264c04aea457a2a707ad264732193f"

What's happening here is the following workflow (sketched in code after the list):

  • The template runner makes an initial query for the contents and renders it to disk.
  • The template runner makes a "blocking query" to Consul or Nomad APIs (Vault doesn't have blocking queries so it has to poll instead), which waits until the index of the table in state has changed.
  • The template runner's query unblocks (or gets a value from Vault after polling), and renders the template.
  • The template runner tells Nomad whether the template changed by issuing a render event.
  • Nomad reads from the render events channel (ref template.go#L302) and that triggers the change_mode.
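
In code, that workflow looks roughly like the sketch below. All function names and signatures here are hypothetical stand-ins for illustration, not Nomad's or consul-template's actual code:

```go
// Minimal sketch of the workflow above; all names are hypothetical
// stand-ins, not Nomad's or consul-template's actual code.
package main

import (
	"context"
	"time"
)

// blockingFetch models a Consul/Nomad blocking query: the request carries
// ?index=<lastIndex>&wait=5m and returns only once the watched table's
// index moves past lastIndex.
func blockingFetch(ctx context.Context, lastIndex uint64) (string, uint64, error) {
	// ... issue the blocking HTTP request here ...
	return "contents", lastIndex + 1, nil
}

// pollVault models the Vault case: Vault has no blocking queries, so the
// runner sleeps until the lease is due, then fetches a fresh secret.
func pollVault(ctx context.Context, lease time.Duration) (string, error) {
	select {
	case <-time.After(lease):
		return "fresh secret", nil // re-read (or re-issue) the secret
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// renderTemplate stands in for writing the rendered contents into the
// task's secrets dir.
func renderTemplate(data string) { /* write file to disk */ }

// watchLoop ties it together: fetch, render, then notify Nomad that a
// render event occurred. Note the notification carries no payload.
func watchLoop(ctx context.Context, renderedCh chan<- struct{}) {
	var index uint64
	for ctx.Err() == nil {
		data, newIndex, err := blockingFetch(ctx, index)
		if err != nil {
			return
		}
		index = newIndex
		renderTemplate(data)
		renderedCh <- struct{}{} // Nomad reads this to trigger change_mode
	}
}
```

The key point for what follows is the final send: the channel tells Nomad only that some render event occurred, not what it was.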

But assuming you know the template contents did change, there aren't currently any logs that could capture the difference between "we got the event but didn't trigger the change" and "we didn't get an event". That this is happening around client restart is suspicious, because that's where we check whether the template has previously been rendered.

Just in case there's a clue here... is the missing rendering happening immediately following client restart or some time afterwards?

@ygersie
Contributor Author

ygersie commented Oct 7, 2024

@tgross just so you're aware, I've been able to reproduce this after a ton of restarts in nonprod, and I've posted the TRACE logs along with my jobspec in the Enterprise support ticket. What's important to know here is that I'm using a dynamic PKI certificate. Each time you restart the Nomad agent, a new certificate is issued, since Nomad doesn't introspect the already rendered certificate on the filesystem to determine a new Vault lease time for renewal. A restart or a signal should therefore always be triggered for each allocation with a dynamic PKI cert. That is the problem here: it's not. Every so often a signal/restart is never triggered, even though the actual certificate on the filesystem has been renewed. We then end up with a workload that has a new certificate on the filesystem, while the one in runtime is never refreshed, so it expires.

@ygersie
Contributor Author

ygersie commented Oct 7, 2024

Also important to note: this seems to happen only after an agent restart. We have not seen it occur during "normal" operations, where a certificate is renewed every < 10 days.

tgross self-assigned this Oct 8, 2024
@tgross
Member

tgross commented Oct 8, 2024

I haven't yet been able to reproduce this, but I do have some very preliminary findings. In the logs you've provided, the allocation of interest is 80f6cc3d. The node is running several other allocations, which makes it challenging to disentangle the template runner logs, as they don't include the allocation ID. Fortunately, the allocation in question has at least one unique Vault path in its template, which makes it possible to pick it out. The allocation has two dependencies:

  • vault.read(credentials/data/ca/chain/default)
  • vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)

First we see that both the update hook and prestart hook run concurrently:

2024-10-07T15:46:10.067Z [TRACE] client.alloc_runner.task_runner: running update hooks: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test start="2024-10-07 15:46:10.067840601 +0000 UTC m=+10.315820980"
2024-10-07T15:46:10.067Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template start="2024-10-07 15:46:10.06789509 +0000 UTC m=+10.315875468"

All the template hook's top-level methods take a lock (e.g. template_hook.go#L116-L117; the pattern is sketched below the logs) so they can't interfere with each other. The prestart hook blocks until the first event is handled. In this case, we see that the update hook lost the race, because it exits last. So this appears to be a red herring:

2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.14306976 +0000 UTC m=+10.391050138" duration=75.17467ms
...
2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.143131204 +0000 UTC m=+10.391111585" duration=75.092735ms
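
As an aside, the hook locking mentioned above follows roughly this pattern (hypothetical types shown for illustration; the real code is in template_hook.go):

```go
// Sketch of the template hook's locking: every top-level method holds the
// same mutex, so Prestart and Update serialize even when invoked
// concurrently by the task runner. Hypothetical types, not Nomad's code.
package hooks

import "sync"

type templateHook struct {
	mu sync.Mutex
	// ... template manager state ...
}

func (h *templateHook) Prestart() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	// ... start the template manager and block until first render ...
	return nil
}

func (h *templateHook) Update() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	// ... handle updated alloc and tokens, re-render if needed ...
	return nil
}
```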

Next, I've extracted what looks to be the relevant bits of the logs for this template here:


2024-10-07T15:46:10.070Z [INFO] agent: (runner) starting
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) running initial templates
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) initiating run
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) missing data for 1 dependencies
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) missing dependency: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) add used dependency vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) to missing since isLeader but do not have a watcher
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) was not watching 1 dependencies
2024-10-07T15:46:10.070Z [DEBUG] agent: (watcher) adding vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
2024-10-07T15:46:10.070Z [TRACE] agent: (watcher) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) starting
2024-10-07T15:46:10.070Z [DEBUG] agent: (runner) checking template 7a1c81c5aeaad195d4d0bd825982c12f
2024-10-07T15:46:10.070Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) starting fetch
2024-10-07T15:46:10.071Z [TRACE] agent: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f): PUT /v1/pki/issue/nomad-cie-orch-ygersie-test?stale=true&wait=5m0s
2024-10-07T15:46:10.071Z [DEBUG] agent: (runner) missing data for 1 dependencies
2024-10-07T15:46:10.071Z [DEBUG] agent: (runner) missing dependency: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
2024-10-07T15:46:10.071Z [DEBUG] agent: (runner) missing data for 1 dependencies
...
2024-10-07T15:46:10.091Z [TRACE] agent: vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f): non-renewable secret, set sleep for 52m18.608832275s
2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) marking successful data response
2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) successful contact, resetting retries
2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) received data
2024-10-07T15:46:10.091Z [TRACE] agent: (view) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) starting fetch
2024-10-07T15:46:10.091Z [DEBUG] agent: (runner) receiving dependency vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f)
2024-10-07T15:46:10.091Z [DEBUG] agent: (runner) initiating run
...
2024-10-07T15:46:10.101Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.crt"
...
2024-10-07T15:46:10.102Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.key"
...
2024-10-07T15:46:10.112Z [INFO] agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.key"
2024-10-07T15:46:10.112Z [DEBUG] agent: (runner) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) is still needed
...
2024-10-07T15:46:10.112Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.crt"
...
2024-10-07T15:46:10.122Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/data/podman/root/nomad/data/alloc/80f6cc3d-0456-9a23-c01c-90063c2c6606/ygersie-test/secrets/client.key"
...
2024-10-07T15:46:10.142Z [DEBUG] agent: (runner) vault.write(pki/issue/nomad-cie-orch-ygersie-test -> 8eed479f) is still needed
...
2024-10-07T15:46:10.143Z [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=80f6cc3d-0456-9a23-c01c-90063c2c6606 task=ygersie-test name=template end="2024-10-07 15:46:10.14306976 +0000 UTC m=+10.391050138" duration=75.17467ms

What we can see from these logs is that we fetch the secret once from Vault, but then we have multiple rendering/rendered logs for the two destination files. The last "rendering" doesn't have a paired "rendered":

  • The "rendering" log comes from runner.go#L887
  • The "rendered" log comes after the conditional for result.DidRender at runner.go#L926.
  • Meanwhile, Nomad is polling on the TemplateRenderedCh at template.go#L314-L316. Just below that line you can see where we fire the on-rendered event if the rendered event is "dirty".
  • The rendering events are stored in a map keyed by template ID at runner.go#L638, which means that the last event received determines if the template state is dirty, regardless of how many events were actually produced.

So my hypothesis is the following (sketched in code after this list):

  • Nomad doesn't have any way of polling the template rendering events themselves from the channel, only the fact that an event occurred. It then has to look up the events in the map.
  • consul-template is recording the first event, sending the event notification, and then immediately recording a second event before Nomad can process the first one.
  • Nomad reads the second event, finds that the template state isn't dirty as a result, and doesn't trigger the change_mode.
  • Note that the channel is buffered, so at that point CT itself blocks, but that doesn't matter to Nomad's template hook so long as it has a number of events to process equal to the number of (de-duplicated) templates.
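
Here's a self-contained sketch of that race, with hypothetical, heavily simplified types (the real structures live in consul-template's runner and Nomad's template hook):

```go
// Demonstrates the hypothesized lost-event race: events are stored in a
// map keyed by template ID while the channel carries only a notification,
// so a second (non-dirty) event recorded before the consumer runs hides
// the first (dirty) one. Simplified stand-in types, not the real code.
package main

import (
	"fmt"
	"sync"
)

type event struct{ dirty bool } // dirty == file contents changed on disk

type runner struct {
	mu     sync.Mutex
	events map[string]event // keyed by template ID: last write wins
	ch     chan struct{}    // buffered notification channel, no payload
}

func (r *runner) record(id string, dirty bool) {
	r.mu.Lock()
	r.events[id] = event{dirty} // overwrites any unprocessed event
	r.mu.Unlock()
	r.ch <- struct{}{}
}

func main() {
	r := &runner{events: map[string]event{}, ch: make(chan struct{}, 2)}

	// Producer (consul-template): the first render changed the file, and a
	// second render of identical contents follows immediately.
	r.record("tmpl", true)
	r.record("tmpl", false)

	// Consumer (Nomad's template hook): by the time the first notification
	// is handled, the map already holds the second, non-dirty event.
	for i := 1; i <= 2; i++ {
		<-r.ch
		r.mu.Lock()
		ev := r.events["tmpl"]
		r.mu.Unlock()
		fmt.Printf("notification %d: dirty=%v\n", i, ev.dirty)
	}
}
```

Both notifications print dirty=false, even though the first render did change the file: the "we got the event but didn't trigger the change" case.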

The reason we're seeing this during client restart is that we never trigger the on-render events for the first time we've rendered the template, so we don't care if there are multiple events. I haven't yet determined why there are multiple events, or whether it's possible for them to appear outside of client restarts.

All of this also aligns with #15057, an issue reported back in October 2022 that was never verified.

I'm going to huddle up with the rest of the engineering team to figure out the best path forward toward a fix.

@ygersie
Contributor Author

ygersie commented Oct 9, 2024

@tgross since I can reproduce this (although sometimes it takes a while), do you have a recommendation for improving visibility into what's going on?

@tgross
Member

tgross commented Oct 9, 2024

> @tgross since I can reproduce this (although sometimes it takes a while), do you have a recommendation for improving visibility into what's going on?

I think we're good on more information from your environment at this point, but thanks for offering. Right now @schmichael is spending some time working on reproducing more reliably and will report back once we know more. Thanks!

tgross assigned schmichael and unassigned tgross Oct 9, 2024