
Fix memory leak on closed watch channel #1263

Merged
merged 4 commits into knative:main
Apr 6, 2021

Conversation

ericmillin
Contributor

@ericmillin ericmillin commented Mar 15, 2021

Description

This fixes a serious bug in the wait logic. If the watcher's channel closes, the wait enters a tight loop that retries continuously until the command times out. This has three negative effects:

  1. The code retries even though the results channel is never recreated, meaning the wait will never succeed
  2. The CPU is pegged
  3. time.After creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

It's not clear to me why the results channel is closed, but it happens regularly in our testing. I believe some of the issues were introduced by the changes in November. The old logic appears to have recreated watchers on retries, which presumably included a new results channel.
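For illustration, a minimal sketch of the problematic shape (hypothetical names, not the exact code in pkg/wait): once the results channel is closed, every receive returns immediately with ok == false, so the loop spins, and each iteration's time.After allocates a fresh timer that lingers until it fires.

```go
package example

import (
	"errors"
	"time"
)

// leakyWait sketches the buggy wait loop described above (illustrative only).
// A closed events channel makes the receive return instantly with ok == false,
// so the loop retries without pause; time.After also allocates a new timer on
// every iteration, and those timers are only reclaimed when they expire.
func leakyWait(events <-chan struct{}, timeout time.Duration) error {
	for {
		select {
		case _, ok := <-events:
			if !ok {
				continue // closed channel: retries forever, pegging the CPU
			}
			return nil // event received
		case <-time.After(timeout):
			return errors.New("timeout waiting for event")
		}
	}
}
```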

Changes

Ultimately, the code needs a refactor, but the changes here fix all the bugs other than the permanently closed results channel. I also removed some cruft in waitForEvent::Wait.

/lint

@google-cla

google-cla bot commented Mar 15, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no Indicates the PR's author has not signed the CLA. label Mar 15, 2021
Contributor

@knative-prow-robot knative-prow-robot left a comment


@ericmillin: 0 warnings.

In response to this:

Description

This fixes a serious bug in the wait logic. If the watcher's channel closes, the wait enters a tight loop that retries continuously until the command times out. This has three negative effects:

  1. The code continuously retries waiting even though the results channel is never recreated, meaning the wait will never succeed
  2. The CPU is pegged
  3. time.After creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

It's not clear to me why the results channel is closed, but it happens regularly in our testing. I believe some of the issues were introduced by the changes in November. The old logic appears to have recreated watchers on retries, which presumably included a new results channel.

Changes

Ultimately, the code needs a refactor, but the changes here fix all the bugs other than the permanently closed results channel. I also removed some cruft in waitForEvent::Wait.

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot
Contributor

Welcome @ericmillin! It looks like this is your first PR to knative/client 🎉

@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 15, 2021
@knative-prow-robot
Contributor

Hi @ericmillin. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 15, 2021
@dsimansk
Contributor

Thanks for the contribution!
/ok-to-test

Could you please take a look?
/assign @rhuss

@knative-prow-robot knative-prow-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2021
@@ -236,15 +239,12 @@ func (w *waitForEvent) Wait(watcher watch.Interface, name string, options Option
timeout := options.timeoutWithDefault()
start := time.Now()
// channel used to transport the error
errChan := make(chan error)
Contributor Author


removed as cruft

return fmt.Errorf("timeout: %s '%s' not ready after %d seconds", w.kind, name, int(timeout/time.Second)), time.Since(start)
}

if retry {
// restart loop
// sleep to prevent CPU pegging and restart the loop
time.Sleep(pollInterval)
Contributor Author


This isn't necessary if there's no longer a chance the CPU can be pegged. However, unless you're 100% sure that can't happen, it should remain.

Contributor


This is not really clear to me. As I understand it, this is meant as a safeguard in case the retry loop goes wild again (like waitForReadyCondition returning immediately with retry == true, which until now happened when the closed watcher channel wasn't detected).

Looking at the code I can't see how the CPU could be pegged, but I have no proof. Adding a sleep here, however, would make it harder to detect such a situation, though of course it could be 'good enough'. So I'm a bit biased here :)

However, I would not couple this to pollInterval, which is used as the poll interval for a special kind of watcher (i.e. when the API does not support intrinsic watching, as is the case for CloudRun). Instead, I would probably use a shorter safeguard, maybe 200 or 500 ms (and a different constant, or even a literal value, here).
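A minimal sketch of that suggestion, assuming a dedicated constant (name and value are illustrative, not the final code):

```go
package example

import "time"

// retrySafeguard is a back-off used only on the retry path, deliberately
// decoupled from pollInterval (which belongs to the polling-style watcher).
const retrySafeguard = 500 * time.Millisecond

// sleepBeforeRetry gives a runaway retry loop a brief pause so it cannot peg
// the CPU, while staying short enough not to mask such a bug for long.
func sleepBeforeRetry() {
	time.Sleep(retrySafeguard)
}
```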

@@ -111,20 +111,20 @@ func NoopMessageCallback() MessageCallback {
func (w *waitForReadyConfig) Wait(watcher watch.Interface, name string, options Options, msgCallback MessageCallback) (error, time.Duration) {

timeout := options.timeoutWithDefault()
floatingTimeout := timeout
timeoutTimer := time.NewTimer(timeout)
Contributor Author


Reusing the same timer prevents the timer leak caused by time.After. It is also cleaner because the floatingTimeout math is no longer needed.
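A rough sketch of the single-timer pattern (hypothetical signature; the real Wait in pkg/wait takes more parameters): one timer is created up front and shared by every select iteration, so there is no per-iteration allocation and no floating-timeout arithmetic.

```go
package example

import (
	"errors"
	"time"
)

// waitWithSingleTimer sketches the fix: one time.Timer covers the whole wait
// instead of a fresh time.After timer per loop iteration.
func waitWithSingleTimer(events <-chan struct{}, timeout time.Duration) error {
	timeoutTimer := time.NewTimer(timeout)
	defer timeoutTimer.Stop() // release the timer even on early return
	for {
		select {
		case _, ok := <-events:
			if !ok {
				return errors.New("watch channel closed")
			}
			return nil // event received
		case <-timeoutTimer.C:
			return errors.New("timeout waiting for event")
		}
	}
}
```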


Contributor

@maximilien maximilien left a comment


Looks cleaner. I guess this is difficult to test?

/ok-to-test

@ericmillin
Contributor Author

ericmillin commented Mar 16, 2021

Looks cleaner. I guess this is difficult to test?

/ok-to-test

Yes, I tried with the mocks, but I couldn't get it working. You need to close the channel after the watcher starts. I didn't see a way to do that. Happy to take another crack at it if you see a way to make it all work.

Contributor

@rhuss rhuss left a comment


Thanks a lot, very good catch! I only have a minor question; otherwise we should merge this ASAP.

@maximilien I think this is really hard to test. Maybe one could simulate a watch function that closes the channel but also detects whether someone reads from that closed channel afterwards (waiting some time) and then fails the test if that is the case. Not sure if this is easy to implement; I would merge the PR without it, too.


@dsimansk
Contributor

Let me add a few thoughts after digging into the watch API a bit.

  1. The CPU is pegged
  2. time.After creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

That's a neat fix. Thanks for pointing that out. The Go docs are pretty clear there.

1. The code retries even though the results channel is never recreated, meaning the wait will never succeed

Indeed, you're correct here as well. With the fix for the watcher race condition we actually lost the ability to recreate the watcher.
Even though the wait loop checks for a closed channel (!ok), and that's the only place producing retry=true, the retry is ineffective without the watcher initialization code that was removed.

https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

When retrieving a collection of resources (either namespace or cluster scoped), the response from the server will contain a resourceVersion value that can be used to initiate a watch against the server. The server will return all changes (creates, deletes, and updates) that occur after the supplied resourceVersion. This allows a client to fetch the current state and then watch for changes without missing any updates.

As pointed out by @markusthoemmes, the resourceVersion can be used to avoid races like that; see the k8s docs above.

Therefore I think the right fix for the race condition is to leverage resourceVersion in our watch request. I've done a bit of testing and it seems to behave as expected; even with a synthetic timeout postponing the watcher creation, no events are lost. That would make room to bring back the correct retry behaviour when the channel is closed. As a good example and inspiration, client-go's RetryWatcher is implemented the same way.
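A hedged sketch of that idea (helper names are illustrative, not the eventual knative/client change): remember the last observed resourceVersion and recreate the watch from it whenever the result channel closes, so no intermediate events are missed; client-go's RetryWatcher in k8s.io/client-go/tools/watch is built around the same principle.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
)

// watchUntilReady recreates the watch from the last seen resourceVersion each
// time the server closes the result channel (newWatch and isReady are
// hypothetical callbacks standing in for the real wait logic).
func watchUntilReady(ctx context.Context,
	newWatch func(resourceVersion string) (watch.Interface, error),
	isReady func(watch.Event) bool) error {

	rv := "" // empty: start from the current state
	for {
		w, err := newWatch(rv)
		if err != nil {
			return err
		}
		for event := range w.ResultChan() {
			if acc, err := meta.Accessor(event.Object); err == nil {
				rv = acc.GetResourceVersion() // remember progress for restarts
			}
			if isReady(event) {
				w.Stop()
				return nil
			}
		}
		w.Stop()
		if err := ctx.Err(); err != nil {
			return err // overall deadline or cancellation reached
		}
		// result channel was closed by the server: re-watch from rv
	}
}
```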

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2021
@ericmillin
Contributor Author

We're doing more testing with the change I made and it's definitely not enough to get the correct behavior, though it does solve the CPU and memory issues. Seems like the watch really needs to be recreated on closure.

@dsimansk I've tried a different approach and reverted the November fix for the race. I then made some simpler changes to fix the issue. It works so far. Happy to share the changes. However, it sounds like resourceVersion may be the "right way" to address this.

@dsimansk
Contributor

@ericmillin I don't have the complete code yet, but you can take a look here. Hopefully I can put it together next week and have a PR ready to gather more thoughts on it.

Anyway we should get this PR merged once @rhuss questions are resolved. :)

@rhuss
Contributor

rhuss commented Apr 1, 2021

Hi @ericmillin, I think we should get this fix into the next release, which is scheduled for next Tuesday, and then we can continue to work on the resource version syncing. That's what we agreed on in the last client WG call.

Would you be able to resolve the conflicts? If not, no problem, we would then take over.

@ericmillin
Contributor Author

ericmillin commented Apr 5, 2021

@rhuss Sorry for going incommunicado; I was out of town. I can resolve the conflicts, but you may need to take over if you want this merged by Tuesday. I'm still awaiting approval for the CLA.

The version syncing sounds brilliant. While my changes resolved some of the issues, we were still seeing unexpected behavior, which is likely caused by the fact that the logic exits when the channel is closed rather than retrying. Because we had to release changes for an app using v0.19.x, I did a quick-and-dirty patch on a fork. The changes are a big improvement but still have problems accurately tracking state. @duglin has done a lot of manual testing and can provide more details if needed.

This is my quick-and-dirty patch which reverts to the pre-November logic and takes a different approach to solving the race: ericmillin@a523953

It uses a channel returned by the Wait interface to keep things in sync.

Waiting() <-chan bool
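Roughly, the fork's interface looks like this (a sketch based on the snippet above; watch.Interface, Options, and MessageCallback are the existing types used by pkg/wait, and the name Waiter is illustrative):

```go
// Waiter sketches the fork's synchronization idea.
type Waiter interface {
	// Wait blocks until the resource is ready or the timeout expires.
	Wait(watcher watch.Interface, name string, options Options, msgCallback MessageCallback) (error, time.Duration)

	// Waiting returns a channel that receives true once the wait loop is
	// actively consuming events, so the caller can safely trigger the change
	// it wants to observe.
	Waiting() <-chan bool
}
```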

@rhuss
Contributor

rhuss commented Apr 6, 2021

@ericmillin no worries, I will resolve the conflict now and then merge this PR for release 0.23

@dsimansk please pick this up after the release to improve it with the discussed idea of syncing on revisions.

@google-cla google-cla bot removed the cla: no Indicates the PR's author has not signed the CLA. label Apr 6, 2021
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Apr 6, 2021
@google-cla

google-cla bot commented Apr 6, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no Indicates the PR's author has not signed the CLA. and removed cla: yes Indicates the PR's author has signed the CLA. labels Apr 6, 2021

@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-client-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/wait/wait_for_ready.go 76.3% 76.1% -0.3

@rhuss
Contributor

rhuss commented Apr 6, 2021

@googlebot I consent.

@google-cla google-cla bot added cla: yes Indicates the PR's author has signed the CLA. and removed cla: no Indicates the PR's author has not signed the CLA. labels Apr 6, 2021
@rhuss
Contributor

rhuss commented Apr 6, 2021

/approve
/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhuss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2021
@knative-prow-robot knative-prow-robot merged commit 804895f into knative:main Apr 6, 2021