-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cassandra] Add Configuration.Close() to ensure TLS cert watcher is closed #4515
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Head branch was pushed to by a user without write access
The first push failed because of the test case in TestProxyBuilder in build_test.go file is calling fswatcher.go Close function twice. Added protection so that if the channel is already closed, it will not try to attempt to close the channel again. |
pkg/fswatcher/fswatcher.go
Outdated
close(w.stop) | ||
w.wg.Wait() | ||
} | ||
w.isClosed = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unprotected bool can be a source of race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, let me add the mutex lock to prevent the race conditions.
please make sure all commits are signed (easiest is to squash into one) |
Codecov ReportPatch coverage has no change and project coverage change:
Additional details and impacted files@@ Coverage Diff @@
## main #4515 +/- ##
==========================================
+ Coverage 97.03% 97.04% +0.01%
==========================================
Files 301 301
Lines 17817 17817
==========================================
+ Hits 17289 17291 +2
+ Misses 424 422 -2
Partials 104 104
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
eae8f07
to
274d6c4
Compare
I squashed and signed all my commits. Can you approve it again? |
pkg/fswatcher/fswatcher.go
Outdated
func (w *FSWatcher) Close() error { | ||
if w.isClosed.CompareAndSwap(false, true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking closer at this and I am not convinced this is all needed. When we call w.watcher.Close()
, it closes the Events
and Errors
channels, so our goroutine will exit naturally. Yes, there is a race condition if you're checking for its exit immediately, but that cannot be completely avoided, even with these changes the goroutine may go to sleep after L143 lowers the semaphore.
If you observed that the goroutine is still running for a long time after close, there may be something else going on.
We can still use the waitgroup introduced here, but without the channel/bool:
func (w *FSWatcher) Close() error {
// watcher.Close() takes care of multiple calls by returning "already closed" error
if err := w.watcher.Close(); err != nil {
return err
}
w.wg.Wait()
return nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. let me send out a new commit with these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate how you determined that the GR was still running? Can it be reproduced with a unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are right, the original code should work fine without any issues. I dig deeper into why this issue will happen in the first place. The main reason is that when you create a new Cassandra Cluster on L149, the code calls c.TLS.Config()
. The c.TLS.Config
call newCertWatcher
which then calls watchCertPair()
. watchCertPair()
call fswatcher.new()
which creates a new leaky go routine. After that in Cassandra, there is no method or function to Close the goroutine. In short, the code is working correctly, it is just that when creating the newCluster, the Close function for fswatcher was never called or the method was never implemented to close that go Routine.
https://github.com/jaegertracing/jaeger/blob/main/pkg/cassandra/config/config.go#L149
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/options.go#L99
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L69
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L97
https://github.com/jaegertracing/jaeger/blob/main/pkg/config/tlscfg/cert_watcher.go#L106
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose that in https://github.com/jaegertracing/jaeger/blob/main/pkg/cassandra/config/config.go we implement a function called Close that will allow us to close that goroutine.
func (c *Configuration) Close() error { return c.TLS.Close() }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
eb2efd2
to
e2b9f11
Compare
Signed-off-by: kennyaz <[email protected]>
ea618cb
to
b42bea8
Compare
@@ -163,6 +163,10 @@ func (c *Configuration) NewCluster(logger *zap.Logger) (*gocql.ClusterConfig, er | |||
return cluster, nil | |||
} | |||
|
|||
func (c *Configuration) Close() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this function did not exist before, nothing is calling it, so just adding it here is not going to address the issue in the ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I looked at the code base, the leaky go routine only happens when new Cassandra Cluster is created directly with no option to close. So, the leaky-go routine is happening on our side, not on your end. I am thinking about changing a ticket from bug to feature request so that users who created the new Cassandra Cluster will have the option to close the fswatcher leaky go routine.
…losed (jaegertracing#4515) Which problem is this PR solving? Resolves jaegertracing#4514 Short description of the changes Implement a Close Function in Cassandra to allow us to close the leaky-go routine. Signed-off-by: kennyaz <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]> Signed-off-by: KevinSchneider <[email protected]>
Which problem is this PR solving?
Resolves #4514
Short description of the changes
Implement a Close Function in Cassandra to allow us to close the leaky-go routine.