Fix: misc issues with Jetstream management #3543
fullykubed wants to merge 8 commits into argoproj:master
Conversation
Signed-off-by: fullykubed <github@fullstackjack.io>
@whynowy There appears to be some nondeterminism in one of the unit tests. The CI runner is showing it as failing, but I am not able to reproduce the failure locally, and the test is not related to anything that this PR changes.
@whynowy For the linting, I am not sure what is wrong, and some assistance would be appreciated. When I run the linting locally, I do see errors reported. However, the vast majority of these files are not ones that this PR changes, nor are these the same errors that show up in the CI runner. Are there particular versions of the linting utilities that I should be installing?
@whynowy Anything I can add to help get this integrated? I have another patch for Jetstream sensor stability & availability improvements that I'd like to open a PR for, but it builds on the work here.
Some extra blank lines.
Hmmm... I don't think I am following. Looking at the diff, I don't believe we added any blank lines to the files that are failing the linter?
I still need more information about the outdated `resourceVersion` issue.
I think this article provides a nice summary. Consider the following scenario for the current state of affairs:

When we added the feature for propagating Jetstream updates, we noticed this other issue was also causing settings to become out-of-sync. There are two options: add retry logic, or implement SSA. We implemented SSA to tackle this specific issue, as retries are not very performant and are more brittle.
Would it be easier if we update the watch logic from generation changes to resourceVersion changes? Also, why do you think there's no retry logic? Every returned reconciliation error leads to a requeue; did I miss anything?
It's been a bit since I looked at it, but it wasn't retrying when it failed to apply the updated Deployment specs, which was most of the time in our cluster (unless the entire argo-events pod was restarted). If it was retrying, I'd expect to see either (a) a long/infinite loop of retries (as evidenced by the failure logs), since the update would never succeed because the code never updates the resource version, or (b) a successful reconcile. We observed neither, which is what started us putting together this fix. However, we didn't dig too deep, as it was pretty straightforward to just switch to server-side apply to address the issue.
Ultimately, how to proceed here is mostly a stylistic choice -- our only concern is 100% correctness. The k8s team has provided server-side apply to simplify operations for controller devs because it means that you don't need to worry about tracking resource versions at all (reasoning here). There are many more benefits beyond simply squashing this particular bug, so I'd recommend checking out that link. While the move from client-side to server-side apply is a change worth thinking critically about, it seems simpler to implement and maintain than adding more layers of tracking onto the client-side approach, especially since you can leverage the k8s API server to do that tracking and merging for you. I'd recommend that approach as the default unless you have a specific reason to avoid it. However, if you wanted to add the additional client-side tracking necessary to ensure reconciles always succeed, we'd definitely be happy to test it out in our environment, where we were noticing reconciliation failures.
Thanks for all the information! I tried it myself and did see the reconciliation error you mentioned (when the log option you added is enabled); those errors are swallowed by controller-runtime, which means there's no way to retry. In this case, I tend toward the server-side apply solution you proposed. Considering this is not a corner case, do you know if any other projects have had a similar problem, and what kind of solution they took?
I have not personally run into this issue with other controllers -- I believe most are defaulting to SSA nowadays since it has been available for a couple of years, so issues like this are becoming less common. Here is a blog post from the team at Google that was rolling this out in 2021, with some thoughts on best practices. AFAIK there is no downside to SSA for controller authors other than slightly increased load on the k8s API server, which would only come into play if you were continuously making many updates -- not the case here (or for the vast majority of controllers). There are some other use cases (e.g., CICD pipelines) where using client-side apply instead makes sense, but I don't think that is relevant here. Not exactly like this particular case, but here are a few example PRs from other projects that have been migrating, if you were interested in seeing specific reasons other projects have been switching to SSA for deploying resource updates in various scenarios:

Let me know if there is any more info I can provide.
// during the migration from client-side to server-side apply. Otherwise, users
// may end up in a state where the finalizer cannot be removed automatically.
patch := &v1alpha1.EventBus{
	Status: busCopy.Status,
Do we need to add Status (even though adding it won't make any difference, I think)? Since this is only supposed to update the finalizers.
This Pull Request is stale because it has been open for 60 days with no activity.
@whynowy Is there anything else you need from me here? Would love to get this merged in. Have some other patches that depend on this ready to open. |
@whynowy Hey, just checking in here. Anything you need on my end? |
This Pull Request is stale because it has been open for 60 days with no activity.
It would be cool to get this merged if possible!
whynowy left a comment
Can you also fix the lint error?
This Pull Request is stale because it has been open for 60 days with no activity.

Motivation
We use Argo Events with the Jetstream client and noticed some reliability issues, so we dug in and resolved a handful of bugs in our custom patch that we run in production. I am now attempting to upstream those changes so the community can benefit. Here are the fixes:
- `streamSettings` updates were supported on the EventBus, but those were not being propagated to streams that were already created. This patch adds this update capability and adds a watch that triggers an EventSource reconciliation anytime a connected EventBus's settings are updated.
- Reconciliation updates were failing on conflicting `resourceVersion` fields due to a stale cache. There was no retry logic, so we either had to (a) implement retry logic or (b) switch to server-side applies (SSA). Since SSA has been stable since Kubernetes 1.22 and is the recommended way for controllers to manage resources, we swapped out the client-side applies for SSA. The SSA isn't 100% optimized due to the CRDs not having all the required merge configurations, but it is more performant than the original implementation and resolves the `resourceVersion` conflict bug.
- `streamSettings` weren't getting logged out with their keys, so it was impossible to tell what settings were actually being applied without logging into NATS directly. Resolved.

Notes
- We removed the `needsUpdate` check for patching EventBus, Sensor, and EventSource resources in the reconciliation loop. In transitioning from client-side to server-side applies, the `FieldOwner` needs to be set on the finalizers even if the finalizer isn't updated, in order to allow the controller to delete the finalizers in future server-side applies. From our testing, this didn't have any significant impact since the reconciliation loop runs so infrequently -- we figured better safe than sorry here.

Testing
We have done the following testing:
- Verified that updating `streamSettings` on the `EventBus` resource now appropriately updates the stream configuration in NATS.

Future Work