Add flock to prevent concurrent clustercheck runs using up connections #11

Open: wants to merge 1 commit into base: master

tomgidden

When one of our nodes got a bit tied up due to a disk space issue, clustercheck started filling up the ps list, waiting on mysql queries.

This change wraps the whole routine in the pattern recommended by flock(1):

       (
         flock -n 9 || exit 1
         # ... commands executed under lock ...
       ) 9>/var/lock/mylockfile

As a result of this extra nesting, the majority of the file has been indented.

I've also pulled the HTTP responses out into functions to avoid repetition. The Content-Length calculations might be slightly off: I'm not sure whether all the \r\ns are counted, so it just uses a string length check.
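On the Content-Length question: the header counts only the bytes of the message body, including any \r\n inside the body, and excludes the header lines and the blank line that separates them from the body. A minimal sketch of a response helper along those lines follows. The function name `http_response`, the header set and the body text are illustrative assumptions, not the code in this pull request.

```shell
#!/bin/sh
# Hypothetical helper (not the PR's actual function): print a complete HTTP
# response for a given status line and a one-line plain-text body.
http_response() {
    status="$1"
    body="$2"
    # Content-Length is the byte count of the body only. The body sent below
    # is "$body" followed by CRLF, so measure exactly that with wc -c.
    # The arithmetic expansion strips any padding wc may add to its output.
    length=$(( $(printf '%s\r\n' "$body" | wc -c) ))
    printf 'HTTP/1.1 %s\r\n' "$status"
    printf 'Content-Type: text/plain\r\n'
    printf 'Connection: close\r\n'
    printf 'Content-Length: %d\r\n' "$length"
    printf '\r\n'                # blank line: end of headers, not counted
    printf '%s\r\n' "$body"      # body: counted, including its CRLF
}

# Illustrative calls; the status texts and messages are placeholders.
http_response "200 OK" "node is synced"
http_response "503 Service Unavailable" "node is not synced"
```

Measuring with `wc -c` rather than a string length keeps the count correct if a message ever contains multi-byte characters.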

@olafz
Owner

olafz commented Mar 19, 2015

Thanks for the pull request. I have a question though. Did you see many clustercheck-processes? Because there already is a timeout of 10 seconds in the execution of the mysql command, after which it exits.

If the problem is the ps list filling up, this won't solve it:

flock -w $TIMEOUT 9 || report_fail "clustercheck is blocked up."

With or without this change, there should never be a clustercheck process running for more than 10 seconds. Instead of waiting for the mysql command, it will now wait for a file lock, so won't the ps list still grow?

@tomgidden
Author

As this is a production cluster, I was in a bit of a rush and didn't stop to investigate this incidental flaw, but there were ten to twenty clustercheck processes in ps that were apparently blocking on mysql commands, and they stayed that way for a lot longer than ten seconds. At the time, it was possible to connect to mysqld, but any query -- even a simple SHOW STATUS LIKE ... -- would block. Now, the fact that the node was so messed up that it blocked on such a straightforward query is a different matter entirely ;)

To be clear, the mysql commands were connecting to mysqld successfully (and instantly), so --connect-timeout was not relevant. And there was no query timeout set on those calls by default... which, admittedly, is another problem!
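One way to bound the query itself, independent of --connect-timeout, would be to run the client under the coreutils `timeout` command, which kills the child after a limit and exits with status 124. This is only a sketch of that idea, not something in this pull request: `QUERY_TIMEOUT` and `run_bounded` are made-up names, and `sleep 5` stands in for a mysql call that connects instantly and then blocks.

```shell
#!/bin/sh
# Hypothetical upper bound, in seconds, on the whole client invocation.
QUERY_TIMEOUT=1

# Run any command under the limit; timeout(1) exits 124 if the limit is hit.
run_bounded() {
    timeout "$QUERY_TIMEOUT" "$@"
}

# Stand-in for a blocked query. In the real script this would be the
# existing mysql invocation (SHOW STATUS LIKE ...), unchanged.
status=0
run_bounded sleep 5 || status=$?

if [ "$status" -eq 124 ]; then
    echo "query timed out"
fi
```

With a bound like this, a stuck query costs one connection for at most `QUERY_TIMEOUT` seconds instead of indefinitely.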

Suffice to say, it was a fairly screwed-up situation that shouldn't have happened, but the numerous clustercheck-launched mysqls all blocking was the problem here.

Anyway, flock -w $TIMEOUT with TIMEOUT equal to 10 should mean that a clustercheck process waits on the flock call for up to ten seconds, and then exits with 1 if it fails to acquire the lock. In this scenario, I'd still have one clustercheck blocking on the query, but at least I wouldn't be getting "Too many connections" just from clusterchecks alone.
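The behaviour described above can be reproduced in isolation. The sketch below assumes util-linux `flock` is installed; the temporary lock file, the shortened timings and the `sleep` standing in for a blocked query are all illustrative. The first subshell plays the stuck clustercheck holding the lock, and the second gives up after its wait expires without ever running its body.

```shell
#!/bin/sh
# Illustrative lock file; the real script would use a fixed path.
lockfile=$(mktemp)

# First run: takes the lock, then "blocks" (sleep stands in for a stuck query).
(
    flock -n 9 || exit 1
    sleep 4
) 9>"$lockfile" &
holder=$!

sleep 1   # give the first run time to acquire the lock

# Second run: waits up to 1 second for the lock, then reports and exits 1.
# Its body, where the mysql connection would be opened, is never reached.
second_status=0
(
    flock -w 1 9 || { echo "clustercheck is blocked up."; exit 1; }
    echo "lock acquired, running check"
) 9>"$lockfile" || second_status=$?

wait "$holder"
rm -f "$lockfile"
echo "second run exited with status $second_status"
```

The waiting process still shows up in ps for the length of the wait, but it holds no database connection while it does so.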
