Conversation
Adding this return can break scenarios where an orchestrator is not watching the process.
Can we keep the previous logging message and rely on the backoff.Do to prevent this from ever reaching the "broken state"?
backoff.Do won't recover; it will only surface errors every X seconds. To recover from such a broken state we would have to tear down the whole client, and I don't think we can do that without leaking memory and goroutines all over the place.
One mitigation would be to assign a retry quota, e.g. allow 3 reconnection attempts in 5 minutes, and crash if we exceed it. But this approach is worse for users running under an orchestrator, as the plugin won't fail fast.
Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.
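A minimal sketch of what that check could look like on the plugin side, assuming a hypothetical failFastEnabled helper; the name and placement are illustrative, not the actual implementation:

```go
package plugin

import (
	"os"
	"strconv"
)

// failFastEnabled is a hypothetical helper: it reports whether the
// TELEPORT_PLUGIN_FAIL_FAST environment variable is set to a value that
// strconv.ParseBool treats as true ("1", "t", "true", ...). Leaving the
// variable unset (or unparsable) keeps the old retry behaviour, which keeps
// the change backward compatible for non-chart users.
func failFastEnabled() bool {
	v, ok := os.LookupEnv("TELEPORT_PLUGIN_FAIL_FAST")
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	return err == nil && enabled
}
```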
> Another approach would be to configure a TELEPORT_PLUGIN_FAIL_FAST environment variable in the chart and check if the var is set in the plugin. This would be backward compatible for non-chart users.
Seems like a good idea 👍
I implemented the feature flag in f56a732
This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.
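Roughly, the behavioural difference looks like the sketch below. The function names and the simplified fixed-delay retry are assumptions for illustration; the real change lives in the watcherjob package and uses its own backoff helper:

```go
package plugin

import (
	"context"
	"time"
)

// runWatcherJob illustrates the two modes. watch stands in for the gRPC
// watch/consume call that fails when the underlying connection breaks.
// With failFast enabled, the first connection error is returned to the
// caller so the process can exit; otherwise the loop retries indefinitely,
// which is where a broken connection could previously hang the plugin.
func runWatcherJob(ctx context.Context, watch func(context.Context) error, failFast bool) error {
	for {
		err := watch(ctx)
		if err == nil || ctx.Err() != nil {
			return ctx.Err()
		}
		if failFast {
			// Fail fast: surface the error instead of silently swallowing it.
			return err
		}
		// Legacy behaviour: wait and retry (fixed delay here for brevity).
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}
```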
@hugoShaka See the table below for backport results.
… instead of retrying infinitely and hanging (#30039) * integrations/access: avoid infinite retry on broken connection This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors. * Add TELEPORT_PLUGIN_FAIL_FAST environment variable * fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable
… instead of retrying infinitely and hanging (#30039) (#30431) * integrations/access: avoid infinite retry on broken connection This commit changes the watcherjob retry behaviour. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors. * Add TELEPORT_PLUGIN_FAIL_FAST environment variable * fixup! Add TELEPORT_PLUGIN_FAIL_FAST environment variable
Fixes gravitational/teleport-plugins#871
This commit changes the watcherjob retry behaviour when TELEPORT_PLUGIN_FAIL_FAST is set. Instead of relying on gRPC's retry mechanism, the plugins will now fail fast when something happens to the connection. This means the plugin will exit in error more often, but it won't be stuck in a retry loop, silently swallowing connection errors.

Note for reviewers:
- this is a potentially disruptive change, as we will now exit in error in cases that might have been retriable
- this was gated behind a flag
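For completeness, a hedged sketch of how the flag could be wired at the process level so that an orchestrator (e.g. Kubernetes) sees a non-zero exit and restarts the plugin; the entrypoint and the run stub below are hypothetical, not the plugin's real main:

```go
package main

import (
	"context"
	"log"
	"os"
	"strconv"
)

func main() {
	// Hypothetical wiring: read the flag once at startup. When the watcher
	// fails, exiting non-zero lets an orchestrator restart the plugin instead
	// of the process hanging in a silent retry loop.
	failFast, _ := strconv.ParseBool(os.Getenv("TELEPORT_PLUGIN_FAIL_FAST"))

	if err := run(context.Background(), failFast); err != nil {
		log.Printf("watcher terminated: %v", err)
		os.Exit(1)
	}
}

// run stands in for the plugin's watch loop; see the earlier sketch for how
// the failFast flag changes its error handling.
func run(ctx context.Context, failFast bool) error {
	return nil
}
```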