Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: No retry is performed when the ResourceWatcher fail to watch a resource #739

Closed
gabrieledemaglie opened this issue Mar 20, 2024 · 12 comments · Fixed by #780
Closed
Labels
bug Something isn't working

Comments

@gabrieledemaglie
Copy link

Describe the bug

Hello,

sometimes I see that when the operator starts, it doesn't succeed to watch a resource type at the first try.
In the image an example of the error that sometimes I get and that trigger the bug.

image

With version 8 everything was ok because there were retries until a watch was successfully established.
With version 9 no retries are performed in case of error: it means that when this situation occurs, the particular resource type is not watched at all (a restart of the operator is necessary).

To reproduce

Configure an operator that watch for particular resource type.
Restart it multiple times, checking the logs for an error during the resource watch process.
When it happens, try to create/modify a resource of the watched type: the operator will not start the reconcile function.

Expected behavior

The operator should retry to watch the resource in case of failure during the watch establishment.

Screenshots

No response

Additional Context

No response

@gabrieledemaglie gabrieledemaglie added the bug Something isn't working label Mar 20, 2024
@buehler
Copy link
Owner

buehler commented Apr 19, 2024

Hey @gabrieledemaglie

Thank you for bringing up this matter.
This may be related to @nachtjasmin's implementation of the async methods.

WDYT @nachtjasmin ?

@nachtjasmin
Copy link
Contributor

nachtjasmin commented Apr 22, 2024

@buehler Oh yeah, that's a bug. There's no reassignment to _eventWatcher being done. Once it throws, it terminates, therefore the watcher does not restart automatically. I think that wrapping whole logic of WatchClientEventsAsync inside of while (!stoppingToken.IsCancellationRequested) and therefore moving the try/catch into the loop should solve the issue.

@duke-bartholomew
Copy link

@nachtjasmin Hi, I'm running in exactly the same issue: watcher dies after some time.
Apparently it's normal for the KubeAPI to drop the websocket connection after some idle-time, but will simply recreating the watch be enough in this case?

I would assume that, inside ResourceWatcher, you then also need to keep track of the latest ResourceVersion of the resource that was watched so that the watch can be re-enabled using that specific resource version, in order to prevent retriggering resources we already processed in the operator?
And this would then also mean: properly handling a "resource too old" exception?

@nachtjasmin
Copy link
Contributor

Although I appreciate that y'all mention me in this issue, I'm not the primary maintainer. I'd love to fix the mentioned issue, but right now, I don't have the capacity to do so.

From my point of view, the patch below should fix the problem (and also #763). But again, I don't have the capacities to fix it now. So, without any integration tests and all, I can only assume that it's going to work.

diff --git a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
index 58ba7b1..447b298 100644
--- a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
+++ b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs
@@ -128,9 +128,9 @@ internal class ResourceWatcher<TEntity>(
 
     private async Task WatchClientEventsAsync(CancellationToken stoppingToken)
     {
-        try
+        while (!stoppingToken.IsCancellationRequested)
         {
-            while (!stoppingToken.IsCancellationRequested)
+            try
             {
                 await foreach ((WatchEventType type, TEntity? entity) in client.WatchAsync<TEntity>(
                                    settings.Namespace,
@@ -184,14 +184,14 @@ internal class ResourceWatcher<TEntity>(
                     }
                 }
             }
-        }
-        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
-        {
-            // Don't throw if the cancellation was indeed requested.
-        }
-        catch (Exception e)
-        {
-            await OnWatchErrorAsync(e);
+            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
+            {
+                // Don't throw if the cancellation was indeed requested.
+            }
+            catch (Exception e)
+            {
+                await OnWatchErrorAsync(e);
+            }
         }
     }

@buehler
Copy link
Owner

buehler commented May 22, 2024

Thank you ;-)
no worries, I know how it feels, I'll have a try at your suggestion as soon as I've the time to do so. Thanks for giving thought!

@HappyCodeSloth
Copy link

Here's a PR to add the fix: #765

I tried to add an Integration Test for the case but I couldn't figure out how to simulate the failure without introducing a stub for the client to mock the failure or interfere with other tests by breaking the network of the host.

Any suggestions welcome.

@duke-bartholomew
Copy link

duke-bartholomew commented May 23, 2024

@HappyCodeSloth @buehler
I'm not sure if the suggested fix will be enough ...
For one, it might go into a infinite failure loop as catching anything catch (Exception) could be any kind of failure.
Secondly, re-creating the watch will result in all Entities to be delivered again because the ResourceVersion is not taken into account.
As I noticed the API-Server consistently dropping the idle connections after 30 minutes (in my case), this approach would then lead to a full reprocessing of all the watched entities I already processed every 30 minutes.

So, in my opinion, a better solution would look more in line of this:

private async Task DoWatch(CancellationToken stoppingToken, string? resourceVersion = null) {
  var currentVersion = resourceVersion;
  while (!stoppingToken.IsCancellationRequested) {
    try  {
      await foreach ((WatchEventType type, TEntity? entity) in client.WatchAsync<TEntity>( resourceVersion: currentVersion, ... ) {
        currentVersion = entity.ResourceVersion();    // or rely on WatchEventType.Bookmark to update the resourceVersion ?
        await OnEventAsync(type, entity, stoppingToken);
      }
    }
    catch(KubernetesException cause) when (cause.Status.Code == 504 || cause.Status.Code == 410) {
      currentVersion = . . .   // regex extract the last resourceVersion from the exception message
                               // bail out when no resourceVersion could be extracted
    }
    catch(OperationCanceledException) when (stoppingToken.IsCancellationRequested) { }
    catch (HttpOperationException cause) when (cause.Response.StatusCode == HttpStatusCode.Conflict) {
      break;    // bail out of the watch
    }
    catch (HttpRequestException cause) { // when 'connection reset by peer'
      await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);  // wait a bit and retry the watch
    }
    catch ( ... ) // maybe other exception we can/need to catch to retry the current watch ?
    catch (Exception cause) {
      throw;  // I guess, because something really wrong happened ....
    }
  }
}
private async Task<(string? ResourceVersion, IList<TEntity> Entities)> DoList(CancellationToken stoppingToken) {
  ...
  var resultList = await client.ApiClient.CustomObjects.ListClusterCustomObjectAsync<EntityList<TEntity>>(...)
  return (resultList.Metadata.ResourceVersion, resultList.Items );
}
private async Task WatchClientEventsAsync(CancellationToken stoppingToken) {
        while (!stoppingToken.IsCancellationRequested) {
            _entityCache.Clear();

            var (resourceVersion, entities) = await DoList(stoppingToken);
            foreach (var entity in entities) {
                await OnEventAsync(WatchEventType.Added, entity, stoppingToken);
            }

            // start watching for changes starting from the known resource-version
            await DoWatch(resourceVersion, stoppingToken);
        }
}

This approach will keep track of the resource-version provided by the watch, and resume from the last known resource-version in case the API server drops the connection.
Any (known) exceptions leading to the watch not being usable anymore would trigger a full re-initialization (i.e. List() + Watch-From_ResourceVersion, as described in the kubernetes documentation)

Any thoughts?

@duke-bartholomew
Copy link

duke-bartholomew commented May 27, 2024

I've given it a go here: https://github.com/duke-bartholomew/dotnet-operator-sdk
This version will automatically recreate the watch when the API server drops the connection. The watch will be recreated taking the correct ResourceVersion into account so that previously processed events are not processed again.
(keep in mind: the KubernetesClient API got changed in this version)

I'm currently still testing this against a kubernetes platform (on AKS), but it looks like it does the trick ...
This should also fix #763

duke-bartholomew added a commit to duke-bartholomew/dotnet-operator-sdk that referenced this issue May 27, 2024
duke-bartholomew added a commit to duke-bartholomew/dotnet-operator-sdk that referenced this issue May 27, 2024
duke-bartholomew added a commit to duke-bartholomew/dotnet-operator-sdk that referenced this issue May 27, 2024
duke-bartholomew added a commit to duke-bartholomew/dotnet-operator-sdk that referenced this issue May 27, 2024
@buehler
Copy link
Owner

buehler commented May 29, 2024

Hey @duke-bartholomew

Thanks for the investigation. Did you create a PR?

@slacki123
Copy link
Contributor

@duke-bartholomew @buehler thank you for your work on this. I also see this issue come up quite frequently, which makes our application non production ready :(

How is the progress on this issue looking?

@rajaniraog
Copy link
Contributor

We are facing the same issue and hoping an update will be released soon.

@buehler
Copy link
Owner

buehler commented Jun 13, 2024

I'm currently at a music festival, I will have a look at the stuff next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants