Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use finalizers to spin up AppRepo clean-up jobs #6647

Merged
merged 23 commits into from
Aug 22, 2023

Conversation

antgamdia
Copy link
Contributor

@antgamdia antgamdia commented Aug 17, 2023

Description of the change

Even if the sync jobs were added a security context (by means of each AppRepo CRD), this information was not available for Cleanup jobs. This is mainly due to the fact that those jobs are spun up once a NotFound error is thrown when fetching an AppRepo.
However, Kubernetes does have a native approach for dealing with these scenarios: finalizers.

In #6605 we proposed a simplistic workaround based on adding more params to the controller... but as suggested in #6605 (comment), moving to finalizers is a better long-term solution.

Benefits

Cleanup jobs are now handled within an existing AppRepo context... meaning we have all the syncJobTemplate available to be used (ie, securityPolicies and so on).

Possible drawbacks

When dealing with finalizers in the past I often found it really annoying when they get stuck and prevent the resource to get deleted. I wonder if we should add some info in the FAQ on how to manually remove the finalizers.

Additionally, and this might be something important: for the AppRepo controller to be able to update AppRepos in other namespaces != kubeapps.... (to add the finalizer) it now needs extra RBAC. Before we were just granting ...-appprepos-read.. but now we would need to grant ...-write as well...and I'm not sure we really want to do so.
WDYT, @absoludity ?
Another idea is using an admission policy... but not sure again if we want to deal with that...

(I haven't modified the RBAC yet in this PR) Changes have been performed finally

Applicable issues

Additional information

This PR is based on top of #6646, but the main change to review is 6e70910
The rest is just moving code into separate files, mostly.

Also, I have been taking a look at kubebuilder to create a new controller, based on the sigs.k8s.io/controller-runtime rather than on the workqueues we currently have. While it is pretty easy to start with (see quickstart), I think it is adding too much complexity (using kustomize, adding rbac proxies, prometheus metrics, etc...
I also quickly tried the k8s codegen scripts, but ran into some issues with my setup... but perhaps it's the best option.

IMO, at some point we should start thinking about moving towards a new state-of-the-art k8s controller boilerplate.

@netlify
Copy link

netlify bot commented Aug 17, 2023

Deploy Preview for kubeapps-dev canceled.

Built without sensitive environment variables

Name Link
🔨 Latest commit eb46c13
🔍 Latest deploy log https://app.netlify.com/sites/kubeapps-dev/deploys/64e47c018a2645000875ba1f

Signed-off-by: Antonio Gamez Diaz <[email protected]>
…rrencyPolicy` configuration

Signed-off-by: Antonio Gamez Diaz <[email protected]>
Signed-off-by: Antonio Gamez Diaz <[email protected]>
Copy link
Contributor Author

@antgamdia antgamdia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some in-line comments. Most changes are just refactoring and separating the code into different files... it was becoming too messy to deal with 😅

Comment on lines 360 to 375
// if the object is not being deleted, check the finalizers
if apprepo.GetDeletionTimestamp().IsZero() {
// check if it contains the finalizer, if not, add it and update the object
if !containsFinalizer(apprepo, finalizerName) {
log.Infof("the AppRepository %q doesn't have a finalizer yet, adding one...", apprepo.GetName())

ok := addFinalizer(apprepo, finalizerName)
if !ok {
return fmt.Errorf("error adding finalizer to the AppRepository %q", apprepo.GetName())
}
_, err = c.apprepoclientset.KubeappsV1alpha1().AppRepositories(namespace).Update(context.TODO(), apprepo, metav1.UpdateOptions{})
if err != nil {
return fmt.Errorf("error updating the AppRepository %q: %v", apprepo.GetName(), err)
}
}
} else {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we need to have more privileges, an AppRepo update is required to add the finalizers.

Comment on lines 498 to 506
// if the object is being deleted and contains a finalizer
if !apprepo.GetDeletionTimestamp().IsZero() && containsFinalizer(apprepo, finalizerName) {
// if the object is being deleted and contains a finalizer and the event is not a deletion event,
// then handle the finalizer-derived clean-up tasks and remove the finalizer
if !shouldDelete {
log.Infof("Starting the clean-up tasks derived from the finalizer of the AppRepository %q", apprepo.GetName())

if apprepo.Spec.FilterRule.JQ != "" {
rulesJSON, err := json.Marshal(apprepo.Spec.FilterRule)
if err != nil {
log.Errorf("Unable to parse filter rules for %s: %v", apprepo.Name, err)
} else {
args = append(args, "--filter-rules", string(rulesJSON))
}
}
// start the clean-up
err = c.cleanUpAppRepo(apprepo, apprepo.GetNamespace(), apprepo.GetName())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it sounds a bit hacky/tricky... but when testing IRL (and then confirmed in the k8s repo), a deletion is triggering also update events... so I ended up handling it that way.
I did my best at explaining the code, but if you can think of a better solution, let me know :P

Comment on lines +89 to +91
c.Flags().Int32Var(&serveOpts.SuccessfulJobsHistoryLimit, "successful-jobs-history-limit", 3, "Number of successful finished jobs to retain")
c.Flags().Int32Var(&serveOpts.FailedJobsHistoryLimit, "failed-jobs-history-limit", 1, "Number of failed finished jobs to retain")
c.Flags().StringVar(&serveOpts.ConcurrencyPolicy, "concurrency-policy", "Replace", "How to treat concurrent executions of a Job. Valid values are: 'Allow', 'Forbid' and 'Replace'")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also added these extra flags just in case we wanted to configure the cronjob. For now, I left the existing default values, so as to the behavior doesn't actually change.

@absoludity
Copy link
Contributor

When dealing with finalizers in the past I often found it really annoying when they get stuck and prevent the resource to get deleted. I wonder if we should add some info in the FAQ on how to manually remove the finalizers.

Yeah - might be worth while. I think our finalizer shouldn't cause issues like others do (in that, it's not doing anything complex), but right, we all think that then something unexpected will stop it (DB connection or something). Updating the FAQ sounds like a good idea (once this is out).

Additionally, and this might be something important: for the AppRepo controller to be able to update AppRepos in other namespaces != kubeapps.... (to add the finalizer) it now needs extra RBAC. Before we were just granting ...-appprepos-read.. but now we would need to grant ...-write as well...and I'm not sure we really want to do so. WDYT, @absoludity ? Another idea is using an admission policy... but not sure again if we want to deal with that...

Hmm. I don't have a problem with our controller having write access to app repos in all namespaces? It's our CR, I don't think people generally have an issue with the controller having write access across all namespaces. Or did you mean something else?

@antgamdia
Copy link
Contributor Author

Hmm. I don't have a problem with our controller having write access to app repos in all namespaces? It's our CR, I don't think people generally have an issue with the controller having write access across all namespaces. Or did you mean something else?

Great, I just wanted to double-check just in case I was missing something. I agree, the CR-controller's SA should be able to manage the resources cluster-wide. I have performed the change at 30861d4. I created a separate ClusterRole because the apprepo-...-write was also granting read on secrets, which is something we don't wanna do cluster-wise.

Yeah - might be worth while.

Done at 1241da1 :)

Copy link
Contributor

@absoludity absoludity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreal - thanks Antonio for going the extra mile to do this properly so we don't accumulate more tech-debt!

kind: ClusterRole
metadata:
name: {{ template "kubeapps.apprepository.fullname" . }}
namespace: {{ .Release.Namespace | quote }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it matters much, but I don't think we need to specify a namespace for a clusterrole? (as in, it'll have no effect, I think?). Maybe we've specified them in our templates elsewhere and that's why it's here. Fine either way, just not sure if it could cause confusion. Same for the ClusterRoleBinding below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, I thought I had actually removed them. They don't have any effect, as you pointed out.
Ah, I know what happened... I copied and pasted the metadata so that all the objects had the same... and I forgot to remove the namespace again! Thanks for catching it!

@@ -261,18 +262,18 @@ func (c *Controller) processNextWorkItem() bool {
// Forget here else we'd go into a loop of attempting to
// process a work item that is invalid.
c.workqueue.Forget(obj)
runtime.HandleError(fmt.Errorf("expected string in workqueue but got %#v", obj))
runtime.HandleError(fmt.Errorf("Expected string in workqueue but got %#v", obj))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for capitalizing these? I know our codebase is not consistent, but if we were making it consistent, I'd go the other way. See https://google.github.io/styleguide/go/decisions.html#error-strings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason apart from internal consistency... having mixed-up capitalized log traces was killing me... so I just normalized them using the most common choice (in this piece of code). But I agree, log traces should be lower-cased instead.
If we start changing little by little, at some point we might end up with a consistent codebase :P, which will be helpful if we ever think of exporting metrics with OpenTelemetry or any other stack.

For the record, the regex (debug|warn|Error|Info)(f?)\(("|')\p{Lu} yields 549 upper-cased occurrences across 66 .go files. For now, I'm going to lower-case the ones in apprepo controller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note: I think that google golang style guide has error strings as lower case, while log lines as upper case (next section following the link), so I think the log lines should remain upper case, just the errors like the one here should be lower-cased?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, you're right... I was mixing up the error strings (fmt.Error....) and the log error traces (log.Errorf...), so:

  • log.* --> uppercased
  • fmt.Errorf --> lowercased

Reverting changes in 1696b83 then :P

"--database-url=" + config.DBURL,
"--database-user=" + config.DBUser,
"--database-name=" + config.DBName,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting that there's a lot of deletions in this file - guessing they'll be re-added below. EDIT: yep, see them in the job_utils. Mild preference for keeping moves of code like this to separate PRs, but realise it's not always possible when the code is changing too. Sometimes I do the code changes and refrain from moving/reorganising until a follow-up PR, just so it's easier to follow what has actually changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I should have done that too. In the beginning, I wasn't planning to move the code... but while coding I noticed the endless one-file was starting to be unbearable and I just splitted it apart. Sorry for the inconvenience when reviewing it 😅

}
})
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the new tests for the new functionality too.

image: docker.io/kubeapps/asset-syncer:latest
name: delete
EOF
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thanks for the doc update!

Signed-off-by: Antonio Gamez Diaz <[email protected]>
Signed-off-by: Antonio Gamez Diaz <[email protected]>
@antgamdia antgamdia force-pushed the 6545-add-seccontext-apprepo-2 branch from 7ceb5ce to eb46c13 Compare August 22, 2023 09:12
@antgamdia antgamdia merged commit 4e5ac7a into vmware-tanzu:main Aug 22, 2023
@antgamdia antgamdia deleted the 6545-add-seccontext-apprepo-2 branch August 22, 2023 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot configure securityContext of sync and cleanup jobs
3 participants