
fix a potential blocking issue when StreamHealth fails #3898

Merged
sougou merged 2 commits into vitessio:master from tinyspeck:fix-healthcheck-close-stall
May 6, 2018

Conversation

@demmer
Member

@demmer demmer commented May 5, 2018

Last week Slack experienced a multi-minute Vitess outage where all vtgates went into a "query black hole" situation.

Shortly before the outage began, one of our tablet replica hosts failed all monitoring checks, appeared to recover briefly, then failed again. A period then began during which the vtgate mysql server was accepting incoming connections and query requests, but no queries were routed to any tablets. At the same time, all of our metrics scraping against /debug/vars was timing out. Notably, queries that could be handled entirely at vtgate (e.g. show vitess_shards) still executed successfully. The situation resolved for all vtgates at approximately the same time, at which point all the accepted queries executed (albeit with extremely long execution times) and normal operation resumed.

After digging through the code (with @sougou's help), we believe the issue stems from the fact that the error handler that executes when StreamHealth fails calls Close on the connection while the healthCheckConn lock is held. As a result the lock could remain held for a long time if Close takes a while due to networking issues. Furthermore, the connection state is only set to non-serving after the Close operation completes, so that connection remains a viable target for query routing.

In addition, HealthCheckImpl.GetConnection, which is used for query routing, first takes a lock on the HealthCheckImpl object, and then takes a lock on the chosen healthCheckConn without first dropping the health check lock. This means that if any one connection lock is held for a long time, then the whole HealthCheck object would also be locked which could block queries to any tablet until the Close operation finally completes and times out.
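A rough sketch of this lock-ordering hazard and the corresponding fix, using hypothetical stand-ins for HealthCheckImpl and healthCheckConn (not the actual vitess types):

```go
package main

import (
	"fmt"
	"sync"
)

// conn and healthCheck are hypothetical stand-ins for
// healthCheckConn and HealthCheckImpl.
type conn struct {
	mu   sync.RWMutex
	addr string
}

type healthCheck struct {
	mu    sync.RWMutex
	conns map[string]*conn
}

// buggyGetConnection takes the per-connection lock while still
// holding the top-level lock: if any one connection's lock is stuck
// (e.g. held across a slow Close), callers pile up here while
// pinning hc.mu, stalling lookups for unrelated tablets too.
func (hc *healthCheck) buggyGetConnection(key string) string {
	hc.mu.RLock()
	defer hc.mu.RUnlock()
	c := hc.conns[key]
	if c == nil {
		return ""
	}
	c.mu.RLock() // nested acquisition: hc.mu is still held
	defer c.mu.RUnlock()
	return c.addr
}

// fixedGetConnection drops the top-level lock before touching the
// per-connection lock, matching the change described in this PR.
func (hc *healthCheck) fixedGetConnection(key string) string {
	hc.mu.RLock()
	c := hc.conns[key]
	hc.mu.RUnlock() // released before any per-connection locking
	if c == nil {
		return ""
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.addr
}

func main() {
	hc := &healthCheck{conns: map[string]*conn{"cell-0": {addr: "tablet-0"}}}
	fmt.Println(hc.fixedGetConnection("cell-0")) // tablet-0
}
```

With the fix, a stuck connection lock only blocks callers that actually route to that tablet; the top-level map stays available to everyone else.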

Finally, the HealthCheck module publishes the HealthcheckConnections gauge as a callback function which first locks the healthcheck and then locks each connection in turn. This would also be blocked during the period when any connection (or the healthcheck itself) was locked.

Taken together, these issues plausibly explain the symptoms we experienced.

To fix this issue:

  • Change the error handler on StreamHealth failure to set the tablet to be non-serving and unlock the lock before calling Close() on the connection.
  • Change GetConnection to first drop the lock on the whole HealthCheck object before taking the lock on any individual connection object.
  • Add an info log whenever StreamHealth fails, which would have helped to diagnose this issue.

Drop the lock on the healthCheckConn in the error handler for
`StreamHealth` before calling `Close()` on the connection to
avoid holding the lock for a potentially long network operation.

Also change `GetConnection` to drop the lock on the HealthCheck
object before taking the lock on the individual connection object.

Finally add an Info log when the `StreamHealth` operation fails to
help in future diagnosis of this kind of error.

Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
Contributor

@sougou sougou left a comment


Similar change is needed in finalizeConn.

I'm tempted to wager that one top level simple mutex would still perform well enough and will be much more readable, but we don't have enough data.

However, given that RW mutexes are mighty expensive, hcc.mu should definitely be a simple mutex. I'm tempted to clean this up...

    if hc.listener != nil {
        hc.listener.StatsUpdate(&ts)
    }
    conn.Close(ctx)
Contributor


I think this should be, within the lock:

    hccConn := hcc.conn
    hcc.conn = nil

and `hccConn.Close(ctx)` here.

Member Author


I changed this as you suggested, but I can't think of any case in which conn != hcc.conn here and IMO this actually detracts from readability.

Member Author


For the record I removed this change again because we discussed in Slack and agreed that the original version was just as safe and was cleaner.

        hc.mu.RUnlock()
        return nil
    }
    hc.mu.RUnlock()
Contributor


This fix can cause problems because hcc.conn can become nil after hc.mu.RUnlock, and GetConnection will end up returning nil.

Obtaining the read lock before releasing hc.mu will work. But then, we might as well leave it like before.

Member Author


As discussed on Slack, I don't actually think this changes the semantics of whether or not hcc.conn is nil.

@demmer
Member Author

demmer commented May 6, 2018

Thanks for noticing finalizeConn -- I updated that as well.

Signed-off-by: Michael Demmer <mdemmer@slack-corp.com>
@demmer demmer force-pushed the fix-healthcheck-close-stall branch from 02b7b93 to 494c423 Compare May 6, 2018 18:32
@sougou sougou merged commit 5559dac into vitessio:master May 6, 2018