healthcheck: refactor to use less locking #3919
sougou merged 5 commits into vitessio:master from sougou:healthcheck
Conversation
demmer
left a comment
After an initial (non-exhaustive) pass on this change, one high-level reaction is that it seems potentially confusing to have two TabletStats structs for each downstream tablet: one in the tabletHealth data structure and another for each healthCheckConn.
I wonder if it would be possible to remove the one from healthCheckConn since that's no longer used for making routing decisions and then we can accomplish the goals of the refactor without the duplication.
go/vt/discovery/healthcheck.go
It would be helpful to have some comments indicating what each of these fields does. In particular I think this used to be called streamCancelFunc which made it (somewhat) clearer what it did.
Also it's not clear from an initial read what the relationship is between the tabletStats in this struct vs the one in healthCheckConn.
IMO renaming to something like latestTabletStats would make it clearer.
cancelFunc wasn't renamed. It was moved from hcc into here. The streamCancelFunc that was in hcc is now a local variable within the goroutine that runs the healthcheck.
Explained the relationship between hcc and th in the comments.
Renamed tabletStats->latestTabletStats
go/vt/discovery/healthcheck.go
As we found out the hard way, under certain circumstances this could take a long time. So I think it's cleaner to update the stats to mark the connection as not serving before closing the connection, which is how I did it before.
go/vt/discovery/healthcheck.go
I think it would be clearer if serving was declared inside the watcher goroutine and not outside.
It's correct as written but since they're only used in the context of the watcher I think that's clearer.
go/vt/discovery/healthcheck.go
timedout is set here inside the watcher goroutine but is then read outside in the main thread.
This seems like a data race to me since there's no synchronization.
Nice catch. I've changed timedout to use atomic operations.
go/vt/discovery/healthcheck.go
Although very very unlikely in practice, it seems like bad things would happen if the watcher goroutine got stalled and the servingStatus channel does actually fill up. Instead of 5 slots would it make more sense to just have one and then select here in the unlikely event it's blocked?
go/vt/discovery/healthcheck.go
Instead of calling copy() on the way in, it seems more defensive to copy it here.
Doing the copy here will reduce some duplication, but it may not separate the concerns correctly. It will introduce a requirement that healthCheckConn should not change the contents of ts while updateHealth is executing.
But I could be convinced otherwise.
sougou
left a comment
Most comments addressed, and some responses inline.
On the high level approach of tabletHealth vs healthCheckConn: The separation was needed to simplify the locking scheme. In the new world, healthCheckConn is internal to the checkConn goroutine. For each tabletHealth, a checkConn is launched, which is responsible for providing health updates. So, tabletHealth only cares about the goroutine, and knows nothing of healthCheckConn.
I've added a TODO for moving healthCheckConn and the checkConn to their own file.
I had to move the serving assignment out of the goroutine. It introduced a data race.
Instead of looping through every stream, this change makes every stream goroutine perform its own timeout check. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
This change gets rid of all locks in hcc. Instead we use the approach of "sharing by communicating". Whenever there is a change in state, hcc communicates the change to hc, which then performs the necessary updates and handling. Also, now that hc updates are trivial, the lock has been changed to a simple Mutex, which is more efficient than RWMutex. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
It turns out that moving the serving assignment within the goroutine introduces a data race. I've moved it back out. Also found another incidental data race: did you know that functions like t.Fatalf should not be called from goroutines? Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
demmer
left a comment
I made another pass through this after it had been sitting in the queue for a long time...
In general I think this refactor is a great step in the right direction, but there are a number of suggestions to improve things even more and one possibly dangerous bug lurking in here that I think should be fixed.
```go
if conn == nil {
	var err error
	conn, err = tabletconn.GetDialer()(hcc.tabletStats.Tablet, grpcclient.FailFast(true))
if hcc.conn == nil {
```
This was already an issue before this change, but it's hard to reason about why some methods are attached to the healthCheckConn (taking the HealthCheckImpl as a parameter), while others are attached to the impl and take the conn as a parameter.
My intuition is that this would all be a lot clearer if it was consistent -- any functionality that refers to a single connection (checkConn, finalizeConn, stream, etc) would be defined on the conn struct, while functionality related to the overall data structure (UpdateHealth, AddTablet, etc) would be defined on the containing impl struct.
That would also help when you split out healthcheck_conn into a separate file.
I just tried making this change, but I encountered a readability problem: checkConn calls hc.connsWG.Done(), which belongs to HealthCheckImpl. This would be confusing if we moved this to a different file. Also, checkConn references multiple other things in hc. So, we'll probably need to do a deeper refactor to clean up this mess.
But returns are diminishing.
```go
// later.
timedout := sync2.NewAtomicBool(false)
serving := hcc.tabletStats.Serving
go func() {
```
I think this may need to bump up the overall threads running wg and then bump down when the goroutine exits?
If you see below, after the stream call, we call streamCancel which will ensure this goroutine will eventually exit. There's no need for this goroutine to exit before HC is closed. We just have to make sure we don't leak them.
But I found a deeper problem. An inherent race between Close and all the other AddTablet, etc. I've now changed Close to set addrToHealth to nil as a guard to ensure nobody accepts more changes after HC is closed.
Also found a race between Close and other calls that modify state. That case is also fixed. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
This change gets rid of all locks in hcc. Instead we use the approach
of "sharing by communicating". Whenever there is a change in state,
hcc communicates the change to hc, which then performs the necessary
updates and handling.
Now that hc updates are trivial, the lock has been changed to
a simple Mutex, which is more efficient than RWMutex.
Also, instead of looping through every stream, this change
makes every stream goroutine perform its own timeout check.