Excessive secret resources generation issue with starboard scanning #936

Open
gurugautm opened this issue Jan 31, 2022 · 10 comments
Labels: 🙅 wontfix This will not be worked on

Comments

@gurugautm

What steps did you take and what happened:

Environment: OpenShift v4.7
Aqua v6.2.x
Aqua Enforcer installed with non-privileged mode
Kube Enforcer with starboard installed

When we perform a scan using Starboard, it creates a scan job and a secret. When the scan fails, the secret does not get deleted; in the customer environment it is not deleted even when the scan succeeds.
Because of this, around 80k secrets accumulated in the customer environment.

What did you expect to happen:

Temporary secrets should be auto-deleted whether the scan succeeds or fails.

@shadowbreakerr

I'm seeing the same symptoms on EKS: hundreds of secrets created in the starboard namespace.

Environment: EKS (1.20)
Starboard-Operator 0.13.2

@danielpacak
Contributor

danielpacak commented Feb 15, 2022

It would be very helpful to see some logs streamed by the Starboard Operator's pod and minimal reproduction steps on an upstream K8s cluster. We have limited capacity to support managed platforms with custom configurations. In particular, I'd like to see the root cause of the scan jobs failing, which probably prevents us from cleaning up orphaned Secrets properly. I can only assume it's related to some PSP or admission control that prevents scan jobs from running successfully, but we need more details to advise.

It's also very useful to look at events created in the starboard-system namespace with kubectl get events -n starboard-system. (Sometimes pods do not have enough information under ContainerStatuses, but we can often figure out from events why certain pods failed.)
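
For reference, a minimal set of diagnostic commands along those lines; this is only a sketch and assumes the operator runs in the starboard-system namespace as a Deployment named starboard-operator (adjust names to your install):

# Events in the operator namespace, newest last
kubectl get events -n starboard-system --sort-by=.lastTimestamp

# Logs from the operator itself
kubectl logs -n starboard-system deployment/starboard-operator

# State of the scan jobs and their pods
kubectl get jobs,pods -n starboard-system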

danielpacak added the ⏳ additional info required label on Feb 15, 2022
@markussiebert

markussiebert commented Feb 24, 2022

Today this killed the Secrets API in one of our clusters...

kubectl get secrets -n starboard-operator | grep -c Opaque
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
7500

@danielpacak
In our case the root cause was the GitHub API rate limit while updating...

When deleting the starboard namespace, I found that far over 20k secrets had been created in 9 days.
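
A possible way to clean up the orphaned scan secrets in bulk is to delete them by label selector. This is only a sketch: it assumes the scan secrets carry Starboard's app.kubernetes.io/managed-by=starboard label, so check the labels on the leftover secrets before deleting anything.

# Inspect the labels on the leftover secrets first
kubectl get secrets -n starboard-operator --show-labels | head

# If the scan secrets are labeled as assumed above, delete them by selector
kubectl delete secrets -n starboard-operator -l app.kubernetes.io/managed-by=starboard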

@MPritsch

MPritsch commented Mar 2, 2022

Edit: So apparently I was not able to really disable Polaris. I just removed our ImageReference, which somehow delayed the error messages. I'm not sure how that works, but it explains why I initially didn't see secrets being created.

In our case the issue was the "polaris" plugin, which kept failing. I let it run for a day, during which it produced 203 error logs in the starboard-operator and left 1569 secrets behind. I'm not sure how these numbers correlate; maybe there are 7-8 retries on average? Removing the plugin stopped the errors and stopped secrets from being left behind.

The secrets which are left behind contain the values worker.password and worker.username

I'm unable to track the issue down further because the scan jobs die immediately and don't leave logs. Here is the log entry from the starboard-operator (reformatted for readability):

{
  "level": "error",
  "ts": 1646135225.9482412,
  "logger": "reconciler.configauditreport",
  "msg": "Scan job container",
  "job": "starboard-operator/scan-configauditreport-797f6d9d6d",
  "container": "polaris",
  "status.reason": "Error",
  "status.message": "",
  "stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*ConfigAuditReportReconciler).reconcileJobs.func1
	/home/runner/work/starboard/starboard/pkg/operator/controller/configauditreport.go:363
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}
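
Since the scan job pods exit before their logs can be pulled, pod events and a watch on the pods are about the only way to see what happens to them; a minimal sketch, assuming the starboard-operator namespace:

# Watch scan pods as they are created and terminated
kubectl get pods -n starboard-operator --watch

# Pod-level events often record why a container exited even when no logs remain
kubectl get events -n starboard-operator --field-selector involvedObject.kind=Pod --sort-by=.lastTimestamp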

Versions we have been using:
Environment: AWS EKS 1.21
Starboard-Operator: aquasec/starboard-operator:0.14.1
Starboard-Operator Helm Chart: 0.9.1
Trivy: aquasec/trivy:0.24.0
Trivy Helm Chart: 0.4.11
Polaris: fairwinds/polaris:5.0

@MPritsch

MPritsch commented Mar 3, 2022

I found the parameter to disable Polaris and let it run for over an hour. So far there are no error logs regarding Polaris and no stuck secrets. Alternatively, I tried switching to Conftest instead of Polaris, but received different errors and abandoned the idea.

Here is the parameter to disable the config audit scanner, and therefore Polaris as well.

operator: {
  configAuditScannerEnabled: false
}
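
The same setting can be applied when upgrading via Helm; this is only a sketch, assuming the chart repo was added under the alias aqua and the release is named starboard-operator in the starboard-operator namespace (adjust to your install):

helm upgrade starboard-operator aqua/starboard-operator \
  -n starboard-operator \
  --set operator.configAuditScannerEnabled=false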

@danielpacak
Contributor

Thank you for the feedback @MPritsch. We are actually working on a so-called built-in configuration audit scanner that is going to replace the Polaris and Conftest plugins in the upcoming release. It won't create Kubernetes Job objects or Secrets, and it will be much faster. See #971 for more details.

@cdesaintleger

cdesaintleger commented Mar 3, 2022

Same issue, on a cluster with smart jobs (with a private registry).
For example: a job is created, the scan begins, and the job is terminated and deleted before the scan finishes. The secret remains.

@MPritsch

MPritsch commented Mar 4, 2022

We now have a working version with Polaris. The underlying issue was missing IAM permissions. We also needed to use Polaris 4.2 instead of 5.0.

On every startup of the starboard-operator we received a "401 Unauthorized: Not Authorized" error for AWS images from ECR. E.g.:

{
  "level": "error",
  "ts": 1646123793.6945415,
  "logger": "reconciler.vulnerabilityreport",
  "msg": "Scan job container",
  "job": "starboard-operator/scan-vulnerabilityreport-6cd9546b84",
  "container": "fluent-bit",
  "status.reason": "Error",
  "status.message": "2022-03-01T08:36:33.091Z\t\u001b[31mFATAL\u001b[0m\tscanner initialize error: unable to initialize the docker scanner: 3 errors occurred:
	* unable to inspect the image (906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
	* unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory
	* GET https://906394416424.dkr.ecr.eu-central-1.amazonaws.com/v2/aws-for-fluent-bit/manifests/2.21.5: unexpected status code 401 Unauthorized: Not Authorized",
  "stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*VulnerabilityReportReconciler).reconcileJobs.func1
	/home/runner/work/starboard/starboard/pkg/operator/controller/vulnerabilityreport.go:320
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}

These were the images that produced the error. The account IDs belong to AWS, not to us:

602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.3.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2
602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.4-eksbuild.1
906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5

The solution for these images was granting the following permissions to Starboard (as described in the 'Important' block here: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-policy-examples.html):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage"
            ],
            "Resource": [
                "arn:aws:ecr:*:602401143452:repository/*",
                "arn:aws:ecr:*:906394416424:repository/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

This would probably have been much easier to debug with proper error messages from Polaris...
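
For anyone hitting the same 401s: one common way to attach a policy like the one above on EKS is IAM Roles for Service Accounts (IRSA), i.e. the policy goes on an IAM role and the operator's service account is annotated with that role's ARN. This is only a sketch with placeholder names; the service account and namespace depend on how the chart was installed, and granting the permissions via the node role is an equally valid approach.

# Annotate the operator's service account with the IAM role that carries the ECR policy
kubectl annotate serviceaccount starboard-operator -n starboard-operator \
  eks.amazonaws.com/role-arn=arn:aws:iam::<ACCOUNT_ID>:role/<STARBOARD_ECR_ROLE> --overwrite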

danielpacak added the 🙅 wontfix label and removed the ⏳ additional info required label on Mar 9, 2022
@danielpacak
Contributor

We've marked this issue as won't fix because we merged #971, which performs configuration audits without creating Kubernetes Jobs and Secrets. We call it the built-in configuration audit scanner, and it will be enabled by default in the upcoming v0.15.0 release. Polaris and Conftest will be deprecated at some point.

We'll keep this issue open until v0.15.0 is released.

@MPritsch

Just a quick update: while we were able to fix the secret creation and errors on one cluster, another one keeps creating secrets. We're not sure if this is a permission problem again, although we don't see any errors related to permissions. We will disable Polaris completely and wait for your v0.15.0 release to replace it.
