
Fix memory leak on closed watch channel #1263

Merged
merged 4 commits into knative:main
Apr 6, 2021

Conversation

ericmillin
Contributor

@ericmillin ericmillin commented Mar 15, 2021

Description

This fixes a serious bug in the wait logic. If the watcher's channel closes, the wait enters a tight loop that retries continuously until the command times out. This has three negative effects:

  1. The code retries even though the results channel is never recreated, meaning the wait will never succeed
  2. The CPU is pegged
  3. time.After creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

It's not clear to me why the results channel is closed, but it happens regularly in our testing. I believe some of the issues were introduced by the changes in November. The old logic appears to have recreated watchers on retries, which presumably included a new results channel.
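For illustration, a minimal sketch of the problematic shape (hypothetical names, not the exact code in pkg/wait): once the results channel is closed, every receive returns immediately with ok == false, so the loop spins, and each iteration's time.After allocates a fresh timer that lingers until it fires.

```go
package example

import (
	"errors"
	"time"
)

// leakyWait sketches the buggy wait loop described above (illustrative only).
// A closed events channel makes the receive return instantly with ok == false,
// so the loop retries without pause; time.After also allocates a new timer on
// every iteration, and those timers are only reclaimed when they expire.
func leakyWait(events <-chan struct{}, timeout time.Duration) error {
	for {
		select {
		case _, ok := <-events:
			if !ok {
				continue // closed channel: retries forever, pegging the CPU
			}
			return nil // event received
		case <-time.After(timeout):
			return errors.New("timeout waiting for event")
		}
	}
}
```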

Changes

Ultimately, the code needs a refactor, but the changes here fix all the bugs other than the permanently closed results channel. I also removed some cruft in waitForEvent::Wait.

/lint

@google-cla

google-cla bot commented Mar 15, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no Indicates the PR's author has not signed the CLA. label Mar 15, 2021
Contributor

@knative-prow-robot knative-prow-robot left a comment


@ericmillin: 0 warnings.

In response to this:

Description

This fixes a serious bug in the wait logic. If the watcher's channel closes, the wait enters a tight loop that retries continuously until the command times out. This has three negative effects:

  1. The code continuously retries waiting even though the results channel is never recreated, meaning the wait will never succeed
  2. The CPU is pegged
  3. time.After creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

It's not clear to me why the results channel is closed, but it happens regularly in our testing. I believe some of the issues were introduced by the changes in November. The old logic appears to have recreated watchers on retries, which presumably included a new results channel.

Changes

Ultimately, the code needs a refactor, but the changes here fix all the bugs other than the permanently closed results channel. I also removed some cruft in waitForEvent::Wait.

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot
Contributor

Welcome @ericmillin! It looks like this is your first PR to knative/client 🎉

@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 15, 2021
@knative-prow-robot
Contributor

Hi @ericmillin. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 15, 2021
@dsimansk
Contributor

Thanks for the contribution!
/ok-to-test

Could you please take a look?
/assign @rhuss

@knative-prow-robot knative-prow-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 15, 2021
@@ -236,15 +239,12 @@ func (w *waitForEvent) Wait(watcher watch.Interface, name string, options Option
timeout := options.timeoutWithDefault()
start := time.Now()
// channel used to transport the error
errChan := make(chan error)
Contributor Author


removed as cruft

return fmt.Errorf("timeout: %s '%s' not ready after %d seconds", w.kind, name, int(timeout/time.Second)), time.Since(start)
}

if retry {
// restart loop
// sleep to prevent CPU pegging and restart the loop
time.Sleep(pollInterval)
Contributor Author


This isn't necessary if there's no longer a chance the CPU can be pegged. However, unless you're 100% sure that can't happen, it should remain.

Contributor


This is not really clear to me. As I understand it, this is meant as a safeguard in case the retry loop goes wild again (like waitForReadyCondition returning immediately with retry == true, which until now happened when the closed watcher channel wasn't detected).

Looking at the code I can't see how the CPU could be pegged, but I have no proof. Adding a sleep here, however, would make it harder to detect such a situation, though of course it could be 'good enough'. So I'm a bit biased here :)

However, I would not couple this to pollInterval, which is used as the poll interval for a special kind of watcher (i.e. when the API does not support intrinsic watching, as is the case for CloudRun). Instead, I would probably use a shorter safeguard, maybe 200 or 500 ms (and a different constant, or even a literal value, here).
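A minimal sketch of that suggestion, assuming a dedicated constant (name and value are illustrative, not the final code):

```go
package example

import "time"

// retrySafeguard is a back-off used only on the retry path, deliberately
// decoupled from pollInterval (which belongs to the polling-style watcher).
const retrySafeguard = 500 * time.Millisecond

// sleepBeforeRetry gives a runaway retry loop a brief pause so it cannot peg
// the CPU, while staying short enough not to mask such a bug for long.
func sleepBeforeRetry() {
	time.Sleep(retrySafeguard)
}
```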

@@ -111,20 +111,20 @@ func NoopMessageCallback() MessageCallback {
func (w *waitForReadyConfig) Wait(watcher watch.Interface, name string, options Options, msgCallback MessageCallback) (error, time.Duration) {

timeout := options.timeoutWithDefault()
floatingTimeout := timeout
timeoutTimer := time.NewTimer(timeout)
Contributor Author


Reusing the same timer prevents the timer leak caused by time.After. It is also cleaner because the floatingTimeout math is no longer needed.
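A rough sketch of the single-timer pattern (hypothetical signature; the real Wait in pkg/wait takes more parameters): one timer is created up front and shared by every select iteration, so there is no per-iteration allocation and no floating-timeout arithmetic.

```go
package example

import (
	"errors"
	"time"
)

// waitWithSingleTimer sketches the fix: one time.Timer covers the whole wait
// instead of a fresh time.After timer per loop iteration.
func waitWithSingleTimer(events <-chan struct{}, timeout time.Duration) error {
	timeoutTimer := time.NewTimer(timeout)
	defer timeoutTimer.Stop() // release the timer even on early return
	for {
		select {
		case _, ok := <-events:
			if !ok {
				return errors.New("watch channel closed")
			}
			return nil // event received
		case <-timeoutTimer.C:
			return errors.New("timeout waiting for event")
		}
	}
}
```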


Contributor

@maximilien maximilien left a comment


Looks cleaner. I guess this is difficult to test?

/ok-to-test

@ericmillin
Contributor Author

ericmillin commented Mar 16, 2021

Looks cleaner. I guess this is difficult to test?

/ok-to-test

Yes, I tried with the mocks, but I couldn't get it working. You need to close the channel after the watcher starts. I didn't see a way to do that. Happy to take another crack at it if you see a way to make it all work.

Contributor

@rhuss rhuss left a comment


Thanks a lot, very good catch! I only have a minor question; otherwise we should merge this ASAP.

@maximilien I think this is really hard to test. Maybe one could simulate a watch function that closes the channel but also detects whether someone reads from that closed channel afterwards (waiting some time) and then fails the test if that is the case. Not sure if this is easy to implement; I would merge the PR without it, too.


@dsimansk
Contributor

Let me add a few thoughts after digging into the watch API a bit.

  1. The CPU is pegged
  2. time.After creates a new timer on each invocation. The timers aren't garbage collected until they expire, resulting in a crippling memory leak.

That's a neat fix. Thanks for pointing that out. The Go docs are pretty clear there.

1. The code retries even though the results channel is never recreated, meaning the wait will never succeed

Indeed, you're correct here as well. With the fix for the watcher race condition we actually lost the ability to recreate the watcher.
Even though the wait loop checks for a closed channel (!ok), and that's the only place producing retry=true, the retry is ineffective without the watcher initialization code that was removed.

https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

When retrieving a collection of resources (either namespace or cluster scoped), the response from the server will contain a resourceVersion value that can be used to initiate a watch against the server. The server will return all changes (creates, deletes, and updates) that occur after the supplied resourceVersion. This allows a client to fetch the current state and then watch for changes without missing any updates.

As pointed out by @markusthoemmes, the resourceVersion can be used to avoid races like that; see the k8s docs above.

Therefore I think the right fix for the race condition is to leverage resourceVersion in our watch request. I've done a bit of testing and it seems to behave as expected; even with a synthetic timeout postponing the watcher creation, no events are lost. That would make room to bring back the correct retry behaviour when the channel is closed. As a good example and inspiration, client-go's RetryWatcher is implemented the same way.
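A hedged sketch of that idea (helper names are illustrative, not the eventual knative/client change): remember the last observed resourceVersion and recreate the watch from it whenever the result channel closes, so no intermediate events are missed; client-go's RetryWatcher in k8s.io/client-go/tools/watch is built around the same principle.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
)

// watchUntilReady recreates the watch from the last seen resourceVersion each
// time the server closes the result channel (newWatch and isReady are
// hypothetical callbacks standing in for the real wait logic).
func watchUntilReady(ctx context.Context,
	newWatch func(resourceVersion string) (watch.Interface, error),
	isReady func(watch.Event) bool) error {

	rv := "" // empty: start from the current state
	for {
		w, err := newWatch(rv)
		if err != nil {
			return err
		}
		for event := range w.ResultChan() {
			if acc, err := meta.Accessor(event.Object); err == nil {
				rv = acc.GetResourceVersion() // remember progress for restarts
			}
			if isReady(event) {
				w.Stop()
				return nil
			}
		}
		w.Stop()
		if err := ctx.Err(); err != nil {
			return err // overall deadline or cancellation reached
		}
		// result channel was closed by the server: re-watch from rv
	}
}
```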

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2021
@ericmillin
Contributor Author

We're doing more testing with the change I made and it's definitely not enough to get the correct behavior, though it does solve the CPU and memory issues. Seems like the watch really needs to be recreated on closure.

@dsimansk I've tried a different approach and reverted the November fix for the race. I then made some simpler changes to fix the issue. It works so far. Happy to share the changes. However, it sounds like resourceVersion may be the "right way" to address this.

@dsimansk
Contributor

@ericmillin I don't have the complete code yet, but you can take a look here. Hopefully I can put it together next week and have a PR ready to gather more thoughts on it.

Anyway we should get this PR merged once @rhuss questions are resolved. :)

@rhuss
Contributor

rhuss commented Apr 1, 2021

Hi @ericmillin, I think we should get this fix into the next release, which is scheduled for next Tuesday, and then we can continue to work on the resource version syncing. That's what we agreed on in the last client WG call.

Would you be able to resolve the conflicts? If not, no problem, we would then take over.

@ericmillin
Contributor Author

ericmillin commented Apr 5, 2021

@rhuss Sorry for going incommunicado; I was out of town. I can resolve the conflicts, but you may need to take over if you want this merged by Tuesday. I'm still awaiting approval for the CLA.

The version syncing sounds brilliant. While my changes resolved some of the issues, we were still seeing unexpected behavior, which is likely caused by the fact that the logic exits when the channel is closed rather than retrying. Because we had to release changes for an app using v0.19.x, I did a quick-and-dirty patch on a fork. The changes are a big improvement but still have problems accurately tracking state. @duglin has done a lot of manual testing and can provide more details if needed.

This is my quick-and-dirty patch which reverts to the pre-November logic and takes a different approach to solving the race: ericmillin@a523953

It uses a channel returned by the Wait interface to keep things in sync.

Waiting() <-chan bool
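Roughly, the fork's interface looks like this (a sketch based on the snippet above; watch.Interface, Options, and MessageCallback are the existing types used by pkg/wait, and the name Waiter is illustrative):

```go
// Waiter sketches the fork's synchronization idea.
type Waiter interface {
	// Wait blocks until the resource is ready or the timeout expires.
	Wait(watcher watch.Interface, name string, options Options, msgCallback MessageCallback) (error, time.Duration)

	// Waiting returns a channel that receives true once the wait loop is
	// actively consuming events, so the caller can safely trigger the change
	// it wants to observe.
	Waiting() <-chan bool
}
```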

@rhuss
Contributor

rhuss commented Apr 6, 2021

@ericmillin no worries, I will resolve the conflict now and then merge this PR for release 0.23

@dsimansk please pick this up after the release to improve it with the discussed idea of syncing on revisions.

@google-cla google-cla bot removed the cla: no Indicates the PR's author has not signed the CLA. label Apr 6, 2021
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Apr 6, 2021
@google-cla

google-cla bot commented Apr 6, 2021

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no Indicates the PR's author has not signed the CLA. and removed cla: yes Indicates the PR's author has signed the CLA. labels Apr 6, 2021

@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-client-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/wait/wait_for_ready.go 76.3% 76.1% -0.3

@rhuss
Contributor

rhuss commented Apr 6, 2021

@googlebot I consent.

@google-cla google-cla bot added cla: yes Indicates the PR's author has signed the CLA. and removed cla: no Indicates the PR's author has not signed the CLA. labels Apr 6, 2021
@rhuss
Contributor

rhuss commented Apr 6, 2021

/approve
/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2021
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhuss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2021
@knative-prow-robot knative-prow-robot merged commit 804895f into knative:main Apr 6, 2021