
fix a potential blocking issue when StreamHealth fails #3898

Merged
sougou merged 2 commits into vitessio:master from tinyspeck:fix-healthcheck-close-stall
May 6, 2018

Conversation

@demmer
Member

@demmer demmer commented May 5, 2018

Last week Slack experienced a multi-minute Vitess outage where all vtgates went into a "query black hole" situation.

Shortly before the outage began, one of our tablet replica hosts failed all monitoring checks, appeared to recover briefly, then failed again. A period then began during which the vtgate mysql server was accepting incoming connections and query requests, but no queries were routed to any tablets. At the same time, all of our metrics scraping against /debug/vars was timing out. Notably, queries that could be handled entirely at vtgate (e.g. show vitess_shards) still executed successfully. The situation resolved for all vtgates at approximately the same time, at which point all the accepted queries executed (albeit with extremely long execution times) and normal operation resumed.

After digging through the code (with @sougou's help), we believe the issue stems from the fact that the error handler that executes when StreamHealth fails calls Close on the connection while the healthCheckConn lock is held. As a result the lock could remain held for a long time if Close takes a while due to networking issues. Furthermore, the connection state is only set to non-serving after the Close operation completes, so that connection remains a viable target for query routing.

In addition, HealthCheckImpl.GetConnection, which is used for query routing, first takes a lock on the HealthCheckImpl object, and then takes a lock on the chosen healthCheckConn without first dropping the health check lock. This means that if any one connection lock is held for a long time, then the whole HealthCheck object would also be locked which could block queries to any tablet until the Close operation finally completes and times out.
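A rough sketch of this lock-ordering hazard and the corresponding fix, using hypothetical stand-ins for HealthCheckImpl and healthCheckConn (not the actual vitess types):

```go
package main

import (
	"fmt"
	"sync"
)

// conn and healthCheck are hypothetical stand-ins for
// healthCheckConn and HealthCheckImpl.
type conn struct {
	mu   sync.RWMutex
	addr string
}

type healthCheck struct {
	mu    sync.RWMutex
	conns map[string]*conn
}

// buggyGetConnection takes the per-connection lock while still
// holding the top-level lock: if any one connection's lock is stuck
// (e.g. held across a slow Close), callers pile up here while
// pinning hc.mu, stalling lookups for unrelated tablets too.
func (hc *healthCheck) buggyGetConnection(key string) string {
	hc.mu.RLock()
	defer hc.mu.RUnlock()
	c := hc.conns[key]
	if c == nil {
		return ""
	}
	c.mu.RLock() // nested acquisition: hc.mu is still held
	defer c.mu.RUnlock()
	return c.addr
}

// fixedGetConnection drops the top-level lock before touching the
// per-connection lock, matching the change described in this PR.
func (hc *healthCheck) fixedGetConnection(key string) string {
	hc.mu.RLock()
	c := hc.conns[key]
	hc.mu.RUnlock() // released before any per-connection locking
	if c == nil {
		return ""
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.addr
}

func main() {
	hc := &healthCheck{conns: map[string]*conn{"cell-0": {addr: "tablet-0"}}}
	fmt.Println(hc.fixedGetConnection("cell-0")) // tablet-0
}
```

With the fix, a stuck connection lock only blocks callers that actually route to that tablet; the top-level map stays available to everyone else.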

Finally, the HealthCheck module publishes the HealthcheckConnections gauge as a callback function which first locks the healthcheck and then locks each connection in turn. This would also be blocked during the period when any connection (or the healthcheck itself) was locked.

Taken together, these issues plausibly explain the symptoms we experienced.

To fix this issue:

  • Change the error handler on StreamHealth failure to set the tablet to be non-serving and unlock the lock before calling Close() on the connection.
  • Change GetConnection to first drop the lock on the whole HealthCheck object before taking the lock on any individual connection object.
  • Add an info log whenever StreamHealth fails, which would have helped to diagnose this issue.

Drop the lock on the healthCheckConn in the error handler for
`StreamHealth` before calling `Close()` on the connection to
avoid holding the lock for a potentially long network operation.

Also change `GetConnection` to drop the lock on the HealthCheck
object before taking the lock on the individual connection object.

Finally add an Info log when the `StreamHealth` operation fails to
help in future diagnosis of this kind of error.

Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Contributor

@sougou sougou left a comment


Similar change is needed in finalizeConn.

I'm tempted to wager that one top level simple mutex would still perform well enough and will be much more readable, but we don't have enough data.

However, given that RW mutexes are mighty expensive, hcc.mu should definitely be a simple mutex. I'm tempted to clean this up...

    if hc.listener != nil {
        hc.listener.StatsUpdate(&ts)
    }
    conn.Close(ctx)
Contributor


I think this should be, within the lock:

    hccConn := hcc.conn
    hcc.conn = nil

and `hccConn.Close(ctx)` here.

Member Author


I changed this as you suggested, but I can't think of any case in which conn != hcc.conn here and IMO this actually detracts from readability.

Member Author


For the record I removed this change again because we discussed in Slack and agreed that the original version was just as safe and was cleaner.

        hc.mu.RUnlock()
        return nil
    }
    hc.mu.RUnlock()
Contributor


This fix can cause problems because hcc.conn can become nil after hc.mu.RUnlock, and GetConnection will end up returning nil.

Obtaining the read lock before releasing hc.mu will work. But then, we might as well leave it like before.

Member Author


As discussed on Slack, I don't actually think this changes the semantics of whether or not hcc.conn is nil.

@demmer
Member Author

demmer commented May 6, 2018

Thanks for noticing finalizeConn -- I updated that as well.

Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
@demmer demmer force-pushed the fix-healthcheck-close-stall branch from 02b7b93 to 494c423 Compare May 6, 2018 18:32
@sougou sougou merged commit 5559dac into vitessio:master May 6, 2018