Problems when using mod_md "clustered" using shared storage directory on NFS #292

Open
moschlar opened this issue Jul 11, 2022 · 9 comments

Comments

@moschlar
Contributor

We keep seeing problems with certificate renewal on our four-node Apache httpd "cluster" running mod_md (with ~1000 managed domains) and a shared storage directory on NFSv3.

Node -01 is configured with

export MDRenewWindow="33%"
export MDRenewMode="auto"

so it could be called "primary".

And the other nodes are configured with

export MDRenewWindow="30%"
export MDRenewMode="manual"

I suspect it happens when one of the secondary nodes reloads/restarts httpd (probably triggered by logrotate) before the primary node gets a chance to, and things somehow get messed up. (Or it might be that the logrotate runs happen too close together...)

From the Apache error logs, taking the domain https://crt.sh/?q=infosys.informatik.uni-mainz.de as an example:

Node -01:

/var/log/apache2/old/error.log-20220710.gz:[Sat Jul 09 17:32:50.738944 2022] [md:notice] [pid 3548953:tid 140380633847552] AH10059: The Managed Domain infosys.informatik.uni-mainz.de has been setup and changes will be activated on next (graceful) server restart.

Node -02:

/var/log/apache2/old/error.log-20220711:[Sun Jul 10 00:04:45.042386 2022] [md:error] [pid 4077496:tid 139884661714240] (17)File exists: AH10069: infosys.informatik.uni-mainz.de: error loading staged set

Now to the actual issue/request:
This does not show up on the md-status monitoring page - there, all domains are listed and seem fine...

Do you have any other ideas, or shall I look for additional clues?

@icing
Owner

icing commented Jul 11, 2022

In such a setup, with a shared fs, you are asking for trouble when you reload two or more cluster nodes at the same time. All reloading instances will try to activate the newly staged certificates and stumble over each other. Now, when I say "stumble", I mean error messages like the one you see in the log.

The activation of the new certificate set from staging to domains is done as atomically as possible, using directory moves in the file system. If two nodes try this at the same time, one might fail and log the error you reported. However, the directory in domains should still be fine and have the correct content. That is probably why you do not see any errors on the md-status page.

I'd recommend reloading one cluster node first and then the others; it does not matter which one. Alternatively, you can use MDMessageCmd to trigger a job with sufficient privileges to move the directories yourself and then reload the cluster.

Some people also live without a shared filesystem, using MDMessageCmd to copy files on renewal, and even prevent renewals from happening on all but a single cluster node.
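
For the variant without a shared filesystem, a handler wired up via MDMessageCmd /usr/local/bin/md-sync.sh could look roughly like this. Script name, node names and paths are purely illustrative, and I'm assuming no extra arguments are configured, so the reason arrives as $1 and the MD name as $2:

#!/bin/bash
# Illustrative MDMessageCmd handler (not part of mod_md): on "renewed",
# push the staged data to the other nodes and reload them one after another.
reason="$1"    # assumed: reason first, MD name second
domain="$2"
[ "$reason" = "renewed" ] || exit 0
for node in node-02 node-03 node-04; do            # hypothetical node names
    rsync -a /etc/apache2/md/staging/"$domain"/ \
        "$node":/etc/apache2/md/staging/"$domain"/
    ssh "$node" systemctl reload apache2
    sleep 30                                       # keep the reloads apart
done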

@moschlar
Contributor Author

Yeah, I had actually already taken many measures to spread out reloads in general; only when writing this up did it occur to me that it might be the logrotate jobs triggering these reloads so close to one another. I've spread them out now.

However, the directories in domains were really not OK - I've already seen ones that only had the fallback key and cert files, and today some still had the proper pubcert.pem but without the corresponding key. Sometimes there is no job.json, but given that in this state the final state according to job.json really doesn't match reality anyway, that's forgivable.

So maybe there is something you could tweak there after all ;-)

I don't really like the solutions that (ab)use MDMessageCmd for something other than notifications, though I'm not quite sure what my issue with that really is. I sense that this would be your recommendation for building a stable clustered solution - or would you go a totally different way?

@icing
Owner

icing commented Jul 14, 2022

Looking at my code in this light again, the overall strategy on a start/reload is:

  1. look if there is a staging/mydomain with all data needed
  2. copy over all data to tmp/mydomain by reading (parsing) and writing
  3. move domains/mydomain to archive/mydomain.n
  4. move tmp/mydomain to domains/mydomain

This is all nice on a single host and survives aborted restarts quite well.
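
In shell terms, roughly (paths illustrative only, not the actual mod_md code - step 2 really parses and rewrites the data rather than copying it):

MD_ROOT=/etc/apache2/md                         # illustrative store root
D=mydomain

# 1. is there a complete staged set?
test -s "$MD_ROOT/staging/$D/pubcert.pem" || exit 0

# 2. rewrite the staged data into tmp/mydomain (cp stands in for parse+write)
rm -rf "$MD_ROOT/tmp/$D"
cp -a "$MD_ROOT/staging/$D" "$MD_ROOT/tmp/$D"

# 3. archive the currently active set (".1" stands in for the next free number)
mv "$MD_ROOT/domains/$D" "$MD_ROOT/archive/$D.1"

# 4. promote the new set with a single rename
mv "$MD_ROOT/tmp/$D" "$MD_ROOT/domains/$D"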

However, on a cluster with a shared file system, several nodes may be working on the same tmp/mydomain, and when one node moves it, it may be incomplete because another node was messing with it.

The best approach here, without some cluster-wide locking, is probably to have tmp/mydomain on a local file system. That may still give trouble if steps 1-4 are interleaved on several cluster nodes, but at least the resulting domains/mydomain directory would be complete.

The only safe way I can think of would require some cluster-wide synchronization, like holding a lock while processing a staged domain - which leads us back to MDMessageCmd with a new pre-install action.

There is file locking in the Apache runtime, but I do not know if/how that translates to your NFSv3. Any ideas?
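
For what it's worth, a quick manual probe with flock(1) on the shared mount might tell us something (lock file path illustrative):

# node A: take and hold an exclusive lock on a file in the shared store
touch /etc/apache2/md/test.lock
flock -x /etc/apache2/md/test.lock -c 'echo "A holds the lock"; sleep 60'

# node B, while A holds it: should wait (and give up after 5s here)
# if the lock is actually visible across the NFS mount
flock -w 5 -x /etc/apache2/md/test.lock -c 'echo "B got the lock"' \
    || echo "lock held elsewhere - locking seems to propagate"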

@icing
Owner

icing commented Jul 15, 2022

It would be nice if you could try v2.4.18 with the new MDStoreLocks directive. If that works nicely in your setup, maybe we could also add such locking for renewal attempts.
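
For reference, the directive takes on/off or a maximum wait duration; a minimal example (the duration here is arbitrary):

MDStoreLocks 10s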

@icing
Owner

icing commented Jul 29, 2022

@moschlar maybe this escaped your notice. Could you test whether the new version addresses the restart problems in your cluster?

@moschlar
Contributor Author

moschlar commented Jul 29, 2022 via email

@whereisaaron

Apache 2.4.54 only ships with v2.4.17, but I would love to test MDStoreLocks when it drops. I am assuming the approach used is compatible with NFSv4's native file locking? I gather flock() in newer kernels supports NFSv4 file locks.

Q: From the docs I take it the MDStoreLocks time will potentially block a graceful restart, such that new requests will be blocked for (up to) that time? But that only affects simultaneously restarting nodes that do not gain the lock first?

Q: Would/should nodes that were not restarting, and thus did not activate the staged certificate, notice that the domains directory has a new cert they are not using (or notice a readable timestamp file), and issue an MDMessageCmd to that effect? The user could then implement some variety of random back-off to locally restart the node. Or, where there are only 2-3 nodes, each could just restart immediately, since one node has already finished restarting.

@icing
Owner

icing commented Sep 5, 2022

Apache 2.4.54 only ships with v2.4.17, but I would love to test MDStoreLocks when it drops. I am assuming the approach used is compatible with NFSv4's native file locking? I gather flock() in newer kernels supports NFSv4 file locks.

It sounds like it, but I cannot verify.

Q: From the docs I take it the MDStoreLocks time will potentially block a graceful restart, such that new requests will be blocked for (up to) that time? But that only affects simultaneously restarting nodes that do not gain the lock first?

Yes, that is how it is intended to work.

Q: Would/should nodes that were not restarting, and thus did not activate the staged certificate, notice that the domains directory has a new cert they are not using (or notice a readable timestamp file), and issue an MDMessageCmd to that effect? The user could then implement some variety of random back-off to locally restart the node. Or, where there are only 2-3 nodes, each could just restart immediately, since one node has already finished restarting.

I do not see a nice way to make that happen. The node has already read domains, and to detect that some other node changed it, it would need to assess it again (but when exactly?).

If you want a way to detect that your nodes are not all using the same certificate, you might want to query the md-status handler on each node. That can give you JSON data with information about the certificate in use.
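
For example, something along these lines (hypothetical node names; /md-status being wherever you configured SetHandler md-status):

for node in node-01 node-02 node-03 node-04; do
    echo "== $node =="
    curl -sk "https://$node/md-status"    # JSON; compare the certificate data per node
done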

@whereisaaron

Thank you for the answers @icing! I'm hoping for a way for nodes to detect and restart themselves purely by observing the shared filesystem, without knowledge of or network access to the other nodes.

I was thinking that if the node installing a new certificate also touched a world-readable empty file somewhere in the shared filesystem (e.g. /etc/apache2/md/last_install), then mod_md on the other nodes could periodically check whether the date/time of that file is newer than their own start time and, if so, invoke MDMessageCmd with a restart-required reason. The command invoked could do the same as it would for the renewed reason; it just wouldn't know which domain(s) were renewed.
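
Something like this from a cron job could approximate that idea today, outside of mod_md (marker path, stamp path and back-off are all made up for the example):

#!/bin/bash
# Reload this node if the shared marker is newer than our own reload stamp.
MARKER=/etc/apache2/md/last_install      # touched by whichever node installed
STAMP=/var/run/apache2-last-reload       # local file, re-touched on every reload
if [ "$MARKER" -nt "$STAMP" ]; then
    sleep $((RANDOM % 120))              # random back-off so nodes don't reload together
    systemctl reload apache2 && touch "$STAMP"
fi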

BTW, is the MDMessageCmd installed notification called during the MDStoreLocks locked period, or only after the lock is released?
