fix a potential blocking issue when StreamHealth fails#3898
sougou merged 2 commits into vitessio:master
Conversation
Drop the lock on the healthCheckConn in the error handler for `StreamHealth` before calling `Close()` on the connection to avoid holding the lock for a potentially long network operation. Also change `GetConnection` to drop the lock on the HealthCheck object before taking the lock on the individual connection object. Finally add an Info log when the `StreamHealth` operation fails to help in future diagnosis of this kind of error. Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
sougou left a comment:
Similar change is needed in finalizeConn.
I'm tempted to wager that one top-level simple mutex would still perform well enough and would be much more readable, but we don't have enough data.
However, given that RW mutexes are mighty expensive, hcc.mu should definitely be a simple mutex. I'm tempted to clean this up...
    if hc.listener != nil {
        hc.listener.StatsUpdate(&ts)
    }
    conn.Close(ctx)
I think this should be, within the lock:

    hccConn := hcc.conn
    hcc.conn = nil

and then hccConn.Close(ctx) here.
I changed this as you suggested, but I can't think of any case in which conn != hcc.conn here and IMO this actually detracts from readability.
For the record I removed this change again because we discussed in Slack and agreed that the original version was just as safe and was cleaner.
        hc.mu.RUnlock()
        return nil
    }
    hc.mu.RUnlock()
This fix can cause problems because hcc.Conn can become nil after hc.mu.RUnlock, and GetConnection will end up returning nil.
Obtaining the read lock before releasing hc.mu will work. But then, we might as well leave it like before.
As discussed on Slack I don't actually think this changes the semantics of whether or not hcc.Conn is nil.
Thanks for noticing.
Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Force-pushed from 02b7b93 to 494c423
Last week Slack experienced a multi-minute Vitess outage where all vtgates went into a "query black hole" situation.
Shortly before the period started, one of our tablet replica hosts failed all monitoring checks, then seemed to come back briefly, then failed again. Then a period of time started when the vtgate mysql server was accepting incoming connections and query requests, but no queries were routed to any tablets. At the same time all of our metrics scraping against
/debug/vars were timing out. Notably, queries which could be handled entirely at the vtgate (show vitess_shards) were executed successfully during this time. The situation resolved for all vtgates at approximately the same time, at which point all the accepted queries executed (albeit with extremely long execution times), and normal operation continued.

After digging through the code (with @sougou's help), we believe the issue stems from the fact that the error handler that executes when StreamHealth fails calls Close on the connection while the healthCheckConn lock is held. As a result the lock could remain held for a long time if Close takes a while due to networking issues. Furthermore, the connection state is only set to non-serving after the Close operation completes, so the connection remains a viable target for query routing in the meantime.

In addition, HealthCheckImpl.GetConnection, which is used for query routing, first takes a lock on the HealthCheckImpl object and then takes a lock on the chosen healthCheckConn without first dropping the health check lock. This means that if any one connection lock is held for a long time, the whole HealthCheck object would also be locked, which could block queries to any tablet until the Close operation finally completes or times out.

Finally, the HealthCheck module publishes the HealthcheckConnections gauge via a callback function which first locks the healthcheck and then locks each connection in turn. This would also be blocked during any period when a connection (or the healthcheck itself) was locked.

Taken together, these issues plausibly explain the symptoms we experienced.
To fix this issue:
- Drop the lock on the healthCheckConn in the error handler before calling Close() on the connection.
- Change GetConnection to first drop the lock on the whole HealthCheck object before taking the lock on any individual connection object.
- Add an Info log when StreamHealth fails, which would have helped to diagnose this issue.