Template change_mode intermittently ignored after client restart #24140
Hi @ygersie! One thing that complicates debugging this is that the template runner will report the "rendering" logs you show even if there's no change to the contents. See below for an example of where I've updated a Nomad Variable with a no-op change:
What's happening here is the following workflow:
But assuming you know the template contents did change, there currently aren't any logs that distinguish "we got the event but didn't trigger the change" from "we didn't get an event". That this is happening around client restart is suspicious, because we check whether the template has previously rendered. Just in case there's a clue here: is the missing rendering happening immediately following client restart or some time afterwards?
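To make the two cases above easier to talk about, here is a toy model of the decision being described. It is a sketch only, not Nomad's or consul-template's code: `renderEvent`, `changeTracker`, and the decision strings are hypothetical names invented for illustration. It assumes three outcomes per event: the contents did not change, the template rendered for the first time, or it re-rendered and `change_mode` should fire.

```go
package main

import "fmt"

// renderEvent is a hypothetical, simplified render notification.
// It is NOT a real Nomad or consul-template type.
type renderEvent struct {
	Dest      string // template destination path
	DidRender bool   // true if the contents on disk were actually rewritten
}

// decision names the outcome so each one can be logged distinctly.
type decision string

const (
	skipNoChange decision = "skip: contents unchanged"
	skipFirst    decision = "skip: first render"
	trigger      decision = "trigger change_mode"
)

// changeTracker remembers which destinations have rendered at least once.
type changeTracker struct {
	rendered map[string]bool
}

func newChangeTracker() *changeTracker {
	return &changeTracker{rendered: make(map[string]bool)}
}

// decide classifies an event. The first real render of a destination is
// skipped; only later renders that changed the file trigger change_mode.
func (c *changeTracker) decide(ev renderEvent) decision {
	if !ev.DidRender {
		return skipNoChange
	}
	if !c.rendered[ev.Dest] {
		c.rendered[ev.Dest] = true
		return skipFirst
	}
	return trigger
}

func main() {
	tracker := newChangeTracker()
	events := []renderEvent{
		{Dest: "secrets/cert.pem", DidRender: true},  // initial render
		{Dest: "secrets/cert.pem", DidRender: false}, // no-op update
		{Dest: "secrets/cert.pem", DidRender: true},  // real change
	}
	for _, ev := range events {
		fmt.Printf("%s -> %s\n", ev.Dest, tracker.decide(ev))
	}
}
```

In this model, if the "has rendered before" bookkeeping were ever empty when a genuinely new certificate arrives, the event would be classified as a first render and skipped. Whether anything like that happens in the real client is exactly what is unknown here; logging the decision per event is the kind of visibility that is currently missing.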
@tgross just so you're aware, I've been able to reproduce after a ton of restarts in nonprod and posted the TRACE logs along with my jobspec in the Enterprise support ticket. What's important to know here is that I'm using a dynamic PKI certificate. Each time you restart the Nomad agent this triggers a new certificate, as Nomad doesn't introspect the already rendered certificate on the filesystem to determine a new Vault lease time for renewal. It should therefore always trigger a restart or a signal for each allocation with a dynamic PKI cert. That is the problem here: it doesn't. Every so often a signal/restart is never triggered even though the actual certificate on the filesystem has been renewed. We then end up with a workload that has a new certificate on the filesystem, but the one in runtime is never refreshed, so it expires.
Also important to note, this seems to only happen after an agent restart. We have not seen this occurring during "normal" operations where a certificate is renewed every < 10 days.
I haven't yet been able to reproduce but I do have some very preliminary findings. In the logs you've provided, the allocation of interest is
First we see that both the update hook and prestart hook run concurrently:
All the template hook's top-level methods take a lock (ex.
Next, I've extracted what looks to be the relevant bits of the logs for this template here: logs
What we can see from these is that we fetch the secret once from Vault, but then we have multiple rendering/rendered logs for the two destination files. The last "rendering" doesn't have a paired "rendered":
So my hypothesis is that:
The reason we're seeing this during client restart is that we never trigger the on-render events the first time we've rendered the template, so we don't care if there are multiple events. I haven't yet determined why there are multiple events and whether it's possible for them to appear outside of client restarts. All of this also aligns with an issue reported back in October 2022, #15057, that was never verified. I'm going to huddle up with the rest of the engineering team to work out the best path towards a fix.
@tgross since I can reproduce (although sometimes it takes a while) do you have a recommendation to improve visibility on what's going on?
I think we're good on more information from your environment at this point, but thanks for offering. Right now @schmichael is spending some time working on reproducing more reliably and will report back once we know more. Thanks!
Nomad version
1.8.2+ent
Issue
Every time we restart a large fleet of clients, some allocations end up with rendered dynamic certificates but they haven't been signaled/restarted despite having a proper change_mode set in the jobspec. This leads to production incidents where secrets have expired from a workload perspective (didn't get a signal or restart) even though the certificates on the filesystem have been updated.
Reproduction steps
I haven't been able to reproduce this in an isolated manner. It does happen every time on just a few allocations across our entire cluster after client restarts. On the same node there are many other allocations that get signaled/restarted just fine.
Logs
When issuing dynamic certificates from Vault through the template stanza, a new certificate is created on each client restart. This should always trigger a change_mode event, but occasionally that does not happen for some allocations, which leads to outages.
Example here from a client log snippet:
As you can see, the certificate has been rendered correctly but a re-rendered event never triggered or isn't shown in the log file. For this workload change_mode="restart" is set and we should therefore have seen:
But it's not there.