healthcheck: refactor to use less locking #3919

Merged
sougou merged 5 commits into vitessio:master from sougou:healthcheck
Nov 18, 2018

Conversation

Contributor

@sougou sougou commented May 11, 2018

This change gets rid of all locks in hcc. Instead we use the approach
of "sharing by communicating". Whenever there is a change in state,
hcc communicates the change to hc, which then performs the necessary
updates and handling.

Now that hc updates are trivial, the lock has been changed to
a simple Mutex, which is more efficient than RWMutex.

Also, instead of looping through every stream, this change
makes every stream goroutine perform its own timeout check.
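A minimal sketch of the "sharing by communicating" idea described above, not the code in this PR. The names `statsUpdate`, `owner`, and `collect` are hypothetical. Each stream goroutine sends its state change over a channel, and a single owner applies the update under a plain `sync.Mutex`.

```go
package main

import (
	"fmt"
	"sync"
)

// statsUpdate is a hypothetical message sent by a stream goroutine
// whenever the state of its tablet changes.
type statsUpdate struct {
	addr    string
	serving bool
}

// owner is the only component that touches the map. Updates are short,
// so a plain Mutex is enough.
type owner struct {
	mu     sync.Mutex
	latest map[string]bool
}

func (o *owner) apply(u statsUpdate) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.latest[u.addr] = u.serving
}

func (o *owner) serving(addr string) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	return o.latest[addr]
}

// collect starts one goroutine per address. Each goroutine reports over
// the shared channel instead of locking shared state itself.
func collect(addrs []string) *owner {
	o := &owner{latest: make(map[string]bool)}
	updates := make(chan statsUpdate)
	var wg sync.WaitGroup
	for _, a := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			updates <- statsUpdate{addr: addr, serving: true}
		}(a)
	}
	// Close the channel once every sender has finished.
	go func() {
		wg.Wait()
		close(updates)
	}()
	for u := range updates {
		o.apply(u)
	}
	return o
}

func main() {
	o := collect([]string{"tablet-1", "tablet-2"})
	fmt.Println(o.serving("tablet-1"), o.serving("tablet-2"))
}
```

The stream goroutines hold no locks at all in this sketch; only the owner's short critical sections do.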

@sougou sougou requested a review from demmer May 11, 2018 01:04

Member

@demmer demmer left a comment

After an initial (non-exhaustive) pass on this change one high level reaction is that it seems potentially confusing to have two TabletStats structs for each downstream tablet, one in the tabletHealth data structure and another for each healthCheckConn.

I wonder if it would be possible to remove the one from healthCheckConn since that's no longer used for making routing decisions and then we can accomplish the goals of the refactor without the duplication.

Member

It would be helpful to have some comments indicating what each of these fields does. In particular I think this used to be called streamCancelFunc which made it (somewhat) clearer what it did.

Also it's not clear from an initial read what the relationship is between the tabletStats in this struct vs the one in healthCheckConn.

IMO renaming to something like latestTabletStats would make it clearer.

Contributor Author

cancelFunc wasn't renamed. It was moved from hcc into here. The streamCancelFunc that was in hcc is now a local variable within the goroutine that runs the healthcheck.

Explained the relationship between hcc and th in the comments.

Renamed tabletStats->latestTabletStats

Member

As we found out the hard way, under certain circumstances this could take a long time, so I think it's cleaner to update the stats to mark the connection as not serving before closing the connection, which is how I did it before.

Contributor Author

Good catch. Done.
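The ordering requested here can be sketched as follows. This is an illustration only: `finalizeConn`, `publish`, and `recorder` are hypothetical names, and the real code works with Vitess types rather than `io.Closer`.

```go
package main

import (
	"fmt"
	"io"
)

// finalizeConn (hypothetical) publishes "not serving" first, so routing
// stops using the tablet at once, and only then closes the connection,
// which may block for a long time.
func finalizeConn(publish func(serving bool), conn io.Closer) error {
	publish(false)
	return conn.Close()
}

// recorder stands in for a connection and records the order of events.
type recorder struct{ events []string }

func (r *recorder) Close() error {
	r.events = append(r.events, "close")
	return nil
}

func main() {
	r := &recorder{}
	err := finalizeConn(func(serving bool) {
		r.events = append(r.events, fmt.Sprintf("serving=%v", serving))
	}, r)
	fmt.Println(r.events, err)
}
```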

Member

I think it would be clearer if serving was declared inside the watcher goroutine and not outside.

It's correct as written, but since it's only used in the context of the watcher, I think that's clearer.

Contributor Author

Done.

Member

timedout is set here inside the watcher goroutine but is then read outside in the main thread.

This seems like a data race to me since there's no synchronization.

Contributor Author

Nice catch. I've changed timedout to use atomic operations.

Member

Although very unlikely in practice, it seems like bad things would happen if the watcher goroutine stalled and the servingStatus channel actually filled up. Instead of 5 slots, would it make more sense to have just one and then select here in the unlikely event that it's blocked?

Contributor Author

Makes sense.

Member

Instead of calling copy() on the way in, it seems more defensive to copy it here.

Contributor Author

Doing the copy here will reduce some duplication, but it may not separate the concerns correctly. It will introduce a requirement that healthCheckConn should not change the contents of ts while updateHealth is executing.

But I could be convinced otherwise.
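The trade-off can be illustrated with a reduced example. The types here are hypothetical stand-ins, not the real `TabletStats` or `updateHealth`. When the sender copies on the way in, the receiver never looks at memory the sender may still modify.

```go
package main

import "fmt"

// tabletStats is a hypothetical, trimmed-down stand-in for TabletStats.
type tabletStats struct {
	Serving bool
	Tags    map[string]string
}

// copyStats returns a deep copy, so sender and receiver share no memory.
func (ts *tabletStats) copyStats() *tabletStats {
	c := *ts
	c.Tags = make(map[string]string, len(ts.Tags))
	for k, v := range ts.Tags {
		c.Tags[k] = v
	}
	return &c
}

type health struct {
	latest *tabletStats
}

// updateHealth keeps the pointer it is given and relies on the caller
// having copied. Copying here instead would require that the caller
// does not modify ts while this function runs.
func (h *health) updateHealth(ts *tabletStats) {
	h.latest = ts
}

func main() {
	h := &health{}
	ts := &tabletStats{Serving: true, Tags: map[string]string{"cell": "a"}}
	h.updateHealth(ts.copyStats()) // copy on the way in

	// Later changes by the sender do not leak into the stored value.
	ts.Serving = false
	ts.Tags["cell"] = "b"
	fmt.Println(h.latest.Serving, h.latest.Tags["cell"])
}
```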

Contributor Author

@sougou sougou left a comment

Most comments addressed, and some responses inline.

On the high level approach of tabletHealth vs healthCheckConn: The separation was needed to simplify the locking scheme. In the new world, healthCheckConn is internal to the checkConn goroutine. For each tabletHealth, a checkConn is launched, which is responsible for providing health updates. So, tabletHealth only cares about the goroutine, and knows nothing of healthCheckConn.

I've added a TODO for moving healthCheckConn and the checkConn to their own file.

Contributor Author

sougou commented May 26, 2018

I had to move the serving assignment back out of the goroutine. Keeping it inside introduced a data race.

Instead of looping through every stream, this change
makes every stream goroutine perform its own timeout check.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>

This change gets rid of all locks in hcc. Instead we use the approach
of "sharing by communicating". Whenever there is a change in state,
hcc communicates the change to hc, which then performs the necessary
updates and handling.

Also, now that hc updates are trivial, the lock has been changed to
a simple Mutex, which is more efficient than RWMutex.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>

It turns out that moving the serving assignment within the
goroutine introduces a data race. I've moved it back out.

Also found another incidental data race: did you know that
functions like t.Fatalf should not be called from goroutines?

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Member

@demmer demmer left a comment

I made another pass through this after it being in the queue for a long time...

In general I think this refactor is a great step in the right direction, but there are a number of suggestions to improve things even more and one possibly dangerous bug lurking in here that I think should be fixed.

if conn == nil {
var err error
conn, err = tabletconn.GetDialer()(hcc.tabletStats.Tablet, grpcclient.FailFast(true))
if hcc.conn == nil {

Member

This was already an issue before this change, but it's hard to reason about why some methods are attached to the healthCheckConn (taking the HealthCheckImpl as a parameter), while others are attached to the impl and take the conn as a parameter.

My intuition is that this would all be a lot clearer if it was consistent -- any functionality that refers to a single connection (checkConn, finalizeConn, stream, etc) would be defined on the conn struct, while functionality related to the overall data structure (UpdateHealth, AddTablet, etc) would be defined on the containing impl struct.

That would also help when you split out healthcheck_conn into a separate file.

Contributor Author

I just tried making this change, but I encountered a readability problem: checkConn calls hc.connsWG.Done(), which belongs to HealthCheckImpl. This would be confusing if we moved this to a different file. Also, checkConn references multiple other things in hc. So, we'll probably need to do a deeper refactor to clean up this mess.

But returns are diminishing.

// later.
timedout := sync2.NewAtomicBool(false)
serving := hcc.tabletStats.Serving
go func() {

Member

I think this may need to bump up the overall threads running wg and then bump down when the goroutine exits?

Contributor Author

If you see below, after the stream call, we call streamCancel which will ensure this goroutine will eventually exit. There's no need for this goroutine to exit before HC is closed. We just have to make sure we don't leak them.

Contributor Author

But I found a deeper problem: an inherent race between Close and the other calls such as AddTablet. I've now changed Close to set addrToHealth to nil as a guard, so that no more changes are accepted after HC is closed.
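A reduced sketch of that guard, assuming a much smaller struct than the real HealthCheckImpl. The `healthCheck` type and the boolean return value of `AddTablet` are hypothetical; only the idea of a nil `addrToHealth` marking the closed state comes from the comment above.

```go
package main

import (
	"fmt"
	"sync"
)

// healthCheck is a hypothetical, reduced stand-in for HealthCheckImpl.
type healthCheck struct {
	mu           sync.Mutex
	addrToHealth map[string]bool // nil once Close has run
}

func newHealthCheck() *healthCheck {
	return &healthCheck{addrToHealth: make(map[string]bool)}
}

// AddTablet reports false and changes nothing once the health check
// has been closed.
func (hc *healthCheck) AddTablet(addr string) bool {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	if hc.addrToHealth == nil {
		return false
	}
	hc.addrToHealth[addr] = true
	return true
}

// Close drops the map under the lock. Every later call sees nil and
// backs off instead of writing to a closed health check.
func (hc *healthCheck) Close() {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	hc.addrToHealth = nil
}

func main() {
	hc := newHealthCheck()
	fmt.Println(hc.AddTablet("tablet-1")) // accepted
	hc.Close()
	fmt.Println(hc.AddTablet("tablet-2")) // rejected after Close
}
```

Without the nil check, the second `AddTablet` would panic on a write to a nil map, so the guard must be checked under the same lock that Close takes.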

Also found a race between Close and other calls that modify state.
That case is also fixed.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>

2 participants