[bug]: No retry is performed when the ResourceWatcher fails to watch a resource #739
Comments
Thank you for bringing up this matter. WDYT @nachtjasmin?
@buehler Oh yeah, that's a bug. There's no reassignment to
@nachtjasmin Hi, I'm running into exactly the same issue: the watcher dies after some time. I would assume that, inside ResourceWatcher, you then also need to keep track of the latest ResourceVersion of the watched resource, so that the watch can be re-established from that specific resource version and the operator does not retrigger resources it has already processed?
Although I appreciate that y'all mention me in this issue, I'm not the primary maintainer. I'd love to fix the mentioned issue, but right now I don't have the capacity to do so. From my point of view, the patch below should fix the problem (and also #763). Without any integration tests, though, I can only assume that it's going to work.
diff --git a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
index 58ba7b1..447b298 100644
--- a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
+++ b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
@@ -128,9 +128,9 @@ internal class ResourceWatcher<TEntity>(
     private async Task WatchClientEventsAsync(CancellationToken stoppingToken)
     {
-        try
+        while (!stoppingToken.IsCancellationRequested)
         {
-            while (!stoppingToken.IsCancellationRequested)
+            try
             {
                 await foreach ((WatchEventType type, TEntity? entity) in client.WatchAsync<TEntity>(
                     settings.Namespace,
@@ -184,14 +184,14 @@ internal class ResourceWatcher<TEntity>(
                     }
                 }
             }
-        }
-        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
-        {
-            // Don't throw if the cancellation was indeed requested.
-        }
-        catch (Exception e)
-        {
-            await OnWatchErrorAsync(e);
+            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
+            {
+                // Don't throw if the cancellation was indeed requested.
+            }
+            catch (Exception e)
+            {
+                await OnWatchErrorAsync(e);
+            }
         }
     }
Thank you ;-)
Here's a PR to add the fix: #765. I tried to add an integration test for this case, but I couldn't figure out how to simulate the failure without either introducing a stub for the client to mock the failure, or interfering with other tests by breaking the network of the host. Any suggestions welcome.
@HappyCodeSloth @buehler So, in my opinion, a better solution would look more along the lines of this:
This approach keeps track of the resource version provided by the watch and resumes from the last known resource version in case the API server drops the connection. Any thoughts?
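For readers following along, here is a minimal, self-contained sketch of the resume-from-last-resource-version idea described above. It is not the code from this comment or from KubeOps: `FakeServer` and `ResumableWatch` are hypothetical names, and the "server" is simulated so that it drops the connection after every two events. A real API server may also reject a resource version that is too old (HTTP 410 Gone), which this sketch does not handle.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an API server watch endpoint (assumption, not KubeOps).
public static class FakeServer
{
    // Streams events with versions after `sinceVersion` up to `lastEvent`,
    // and simulates a dropped connection after two delivered events.
    public static async IAsyncEnumerable<(string Name, int ResourceVersion)> WatchAsync(
        int? sinceVersion, int lastEvent)
    {
        int delivered = 0;
        for (int version = (sinceVersion ?? 0) + 1; version <= lastEvent; version++)
        {
            if (delivered == 2)
            {
                throw new InvalidOperationException("connection dropped");
            }

            await Task.Yield();
            yield return ($"entity-{version}", version);
            delivered++;
        }
    }
}

public static class ResumableWatch
{
    // Re-establishes the watch after each failure, resuming from the last
    // resource version that was processed, so no event is handled twice.
    public static async Task<List<int>> RunAsync(int lastEvent, CancellationToken stoppingToken)
    {
        var processed = new List<int>();
        int? lastSeen = null;

        while (!stoppingToken.IsCancellationRequested && (lastSeen ?? 0) < lastEvent)
        {
            try
            {
                await foreach (var (name, version) in FakeServer.WatchAsync(lastSeen, lastEvent))
                {
                    Console.WriteLine($"processing {name}");
                    processed.Add(version);
                    lastSeen = version; // remember where to resume from
                }
            }
            catch (InvalidOperationException e)
            {
                Console.WriteLine($"watch failed ({e.Message}), resuming after version {lastSeen}");
            }
        }

        return processed;
    }
}

public static class Program
{
    public static async Task Main()
    {
        List<int> processed = await ResumableWatch.RunAsync(6, CancellationToken.None);
        Console.WriteLine(string.Join(", ", processed)); // prints "1, 2, 3, 4, 5, 6"
    }
}
```

The important detail is that `lastSeen` lives outside the `try` block, so it survives a failed watch and each event is processed exactly once despite the two simulated disconnects.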
I've given it a go here: https://github.com/duke-bartholomew/dotnet-operator-sdk I'm currently still testing this against a Kubernetes platform (on AKS), but it looks like it does the trick ...
Thanks for the investigation. Did you create a PR?
@duke-bartholomew @buehler thank you for your work on this. I also see this issue come up quite frequently, which makes our application not production-ready :( How is the progress on this issue looking?
We are facing the same issue and hoping an update will be released soon.
I'm currently at a music festival; I will have a look at this next week.
Describe the bug
Hello,
sometimes, when the operator starts, it fails to watch a resource type on the first try.
The image shows an example of the error that I sometimes get and that triggers the bug.
With version 8 everything was fine, because the watch was retried until it was successfully established.
With version 9 no retries are performed in case of an error: when this situation occurs, the affected resource type is not watched at all (a restart of the operator is necessary).
To reproduce
Configure an operator that watches a particular resource type.
Restart it multiple times, checking the logs for an error during the resource watch process.
When the error occurs, try to create or modify a resource of the watched type: the operator will not invoke the reconcile function.
Expected behavior
The operator should retry watching the resource if establishing the watch fails.
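As an illustration of the expected behavior, the following is a rough sketch of retrying watch establishment until it succeeds. It is not the KubeOps implementation: `WatchRetry`, `EstablishWithRetryAsync`, and the `tryEstablish` delegate are hypothetical, and the capped exponential backoff between attempts is an assumption made for the example, not something this issue specifies.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class WatchRetry
{
    // Calls `tryEstablish` until it succeeds or cancellation is requested.
    // Returns the number of attempts that were needed.
    public static async Task<int> EstablishWithRetryAsync(
        Func<int, Task> tryEstablish,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        CancellationToken stoppingToken)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            stoppingToken.ThrowIfCancellationRequested();
            try
            {
                await tryEstablish(attempt);
                return attempt;
            }
            catch (Exception e) when (e is not OperationCanceledException)
            {
                Console.WriteLine($"attempt {attempt} failed: {e.Message}");

                // Wait before retrying, then double the delay up to maxDelay.
                await Task.Delay(delay, stoppingToken);
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            }
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        // Simulated watch setup (assumption): fails twice, succeeds on the third try.
        Func<int, Task> flakyConnect = attempt =>
            attempt < 3
                ? Task.FromException(new InvalidOperationException("API server not reachable"))
                : Task.CompletedTask;

        int attempts = await WatchRetry.EstablishWithRetryAsync(
            flakyConnect,
            TimeSpan.FromMilliseconds(10),
            TimeSpan.FromMilliseconds(100),
            CancellationToken.None);

        Console.WriteLine($"watch established after {attempts} attempts"); // 3 attempts
    }
}
```

Cancellation is deliberately not treated as a retryable error, so a shutdown of the operator ends the loop instead of being swallowed.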
Screenshots
No response
Additional Context
No response