Excessive Secret resource generation with Starboard scanning #936
Comments
I'm seeing the same symptoms on EKS: hundreds of Secrets created in the starboard namespace. Environment: EKS (1.20)
It would be very helpful to see some logs streamed by the Starboard Operator's pod and minimal reproduction steps on an upstream K8s cluster. We have limited capacity to support managed platforms with custom configurations. In particular, I'd like to see what the root cause of the failing scan jobs is, which is probably what prevents us from cleaning up orphaned Secrets properly. I can only assume it's related to some PSP or admission control that prevents scan jobs from running successfully, but we need more details to advise. It's also very useful to look at the Events created in the namespace where the scan jobs run.
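For anyone hitting this, a minimal sketch of how to collect that information, assuming a default install where the operator and its scan jobs run in the starboard-operator namespace and the Deployment is named starboard-operator (adjust to your setup):

```shell
# Operator logs (deployment/namespace names assume a default install)
kubectl logs -n starboard-operator deployment/starboard-operator --tail=200

# Events in the namespace where the scan jobs run, newest last
kubectl get events -n starboard-operator --sort-by=.lastTimestamp

# Details of the scan jobs and any failed pods they left behind
kubectl describe jobs -n starboard-operator
kubectl get pods -n starboard-operator --field-selector=status.phase=Failed
```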
Today this killed the Secrets API in one of our clusters.
@danielpacak When deleting the starboard namespace I found that well over 20k Secrets had been created in 9 days.
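A quick way to gauge the scale without deleting the whole namespace — a sketch assuming the leftovers live in the starboard-operator namespace (adjust to your install):

```shell
# Total number of Secrets in the namespace used for scan jobs
kubectl get secrets -n starboard-operator --no-headers | wc -l

# Inspect how the leftover Secrets are labelled (useful for targeted cleanup later)
kubectl get secrets -n starboard-operator --show-labels | head
```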
Edit: So apparently I was not able to really disable Polaris. I just removed our ImageReference, which somehow delayed the error messages. I'm not sure how that works, but it explains why I initially didn't see Secrets being created.

In our case the issue was the "polaris" plugin, which kept failing. I let it run for a day, during which it produced 203 error logs in the starboard-operator and left 1569 Secrets behind. I'm not sure how these numbers correlate; maybe there are 7-8 retries on average? Removing the plugin stopped the errors and stopped leaving Secrets behind. The Secrets which are left behind contain the values … I'm unable to track the issue down further because the scan jobs die immediately and won't leave logs. Here is the log entry from the starboard-operator (reformatted for better readability):

{
"level": "error",
"ts": 1646135225.9482412,
"logger": "reconciler.configauditreport",
"msg": "Scan job container",
"job": "starboard-operator/scan-configauditreport-797f6d9d6d",
"container": "polaris",
"status.reason": "Error",
"status.message": "",
"stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*ConfigAuditReportReconciler).reconcileJobs.func1
/home/runner/work/starboard/starboard/pkg/operator/controller/configauditreport.go:363
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}

Versions we have been using: …
I found the parameter to disable Polaris and let it run for over an hour. So far there are no error logs regarding Polaris and no Secrets getting stuck. Alternatively, I tried switching to Conftest instead of Polaris, but I received different errors and abandoned the idea. Here is the parameter to disable the configAuditScanner, and therefore Polaris as well.
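A hedged sketch of how that toggle is usually applied — the Helm value and environment variable names below are assumptions based on common starboard-operator settings, so verify them against `helm show values` for your chart version:

```shell
# Via the Helm chart (value name is an assumption; check `helm show values aqua/starboard-operator`)
helm upgrade starboard-operator aqua/starboard-operator \
  --namespace starboard-operator \
  --reuse-values \
  --set operator.configAuditScannerEnabled=false

# Or by setting the corresponding environment variable on the operator Deployment
kubectl set env -n starboard-operator deployment/starboard-operator \
  OPERATOR_CONFIG_AUDIT_SCANNER_ENABLED=false
```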
Same issue, on a cluster with smart jobs (with a private registry).
We now have a working version with Polaris. The underlying issue was missing IAM permissions. We also needed to use Polaris 4.2 instead of 5.0. On every startup of the starboard-operator we received a "401 Unauthorized: Not Authorized" error for AWS images from ECR. E.g.:

{
"level": "error",
"ts": 1646123793.6945415,
"logger": "reconciler.vulnerabilityreport",
"msg": "Scan job container",
"job": "starboard-operator/scan-vulnerabilityreport-6cd9546b84",
"container": "fluent-bit",
"status.reason": "Error",
"status.message": "2022-03-01T08:36:33.091Z\t\u001b[31mFATAL\u001b[0m\tscanner initialize error: unable to initialize the docker scanner: 3 errors occurred:
* unable to inspect the image (906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
* unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory
* GET https://906394416424.dkr.ecr.eu-central-1.amazonaws.com/v2/aws-for-fluent-bit/manifests/2.21.5: unexpected status code 401 Unauthorized: Not Authorized",
"stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*VulnerabilityReportReconciler).reconcileJobs.func1
/home/runner/work/starboard/starboard/pkg/operator/controller/vulnerabilityreport.go:320
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}

These were the images which produced the error. The account IDs are from AWS, not from us: …
The solution for these images was to give Starboard the following permissions (as described by the 'Important' block here: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-policy-examples.html).
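For reference, a sketch of the standard ECR pull permissions involved; the role and policy names below are placeholders, and the exact statement needed for the cross-account AWS-owned repositories is the one described in the linked AWS docs:

```shell
# Grant the usual ECR pull actions to the IAM role assumed by the scan-job pods
# (placeholder role/policy names, e.g. an IRSA role on EKS).
aws iam put-role-policy \
  --role-name starboard-scan-jobs-role \
  --policy-name AllowEcrPull \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }]
  }'
```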
This probably would have been much easier to debug with proper error messages from Polaris...
We've marked this issue as won't fix because we merged #971, which performs configuration audits without creating Kubernetes Jobs and Secrets. We call it the built-in configuration audit scanner, and it will be enabled by default in the upcoming v0.15.0 release. Polaris and Conftest will be deprecated at some point. We'll keep this issue open until v0.15.0 is released.
Just a quick update: while we were able to fix the Secret creation and the errors on one cluster, another one keeps creating Secrets. We're not sure if this is a permission problem again, although we don't see any errors suggesting that. We will disable Polaris completely and wait for your v0.15.0 release to replace it.
What steps did you take and what happened:
Environment: OpenShift v4.7
Aqua v6.2.x
Aqua Enforcer installed in non-privileged mode
Kube Enforcer with Starboard installed
When we perform a scan using Starboard, it creates a scan job and a Secret, but when the scan fails the Secret doesn't get deleted. In the customer environment it is not deleted even when the scan is successful.
Because of this, around 80k Secrets were created in the customer environment.
What did you expect to happen:
Temporary Secrets should be auto-deleted whether the scan succeeds or fails.
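Until the cleanup behaves as expected, a stop-gap sketch for removing the leftovers manually — it assumes the scan-job Secrets live in the starboard-operator namespace and carry Starboard's management label, so verify the actual namespace and labels first to avoid deleting unrelated Secrets:

```shell
# Check what the leftover Secrets are labelled with before deleting anything
kubectl get secrets -n starboard-operator --show-labels | head

# Delete only the Secrets carrying the assumed Starboard management label
kubectl delete secrets -n starboard-operator \
  -l app.kubernetes.io/managed-by=starboard
```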