Vulnerability scanning encounters "etcdserver: request is too large" #899

Open
FrederikNJS opened this issue Jan 10, 2022 · 5 comments
Labels
🚀 enhancement New feature or request

FrederikNJS commented Jan 10, 2022

What steps did you take and what happened:

I installed Starboard-operator using the helm chart and allowed it to run on my entire cluster. Some of the vulnerability scan jobs get stuck and the starboard-operator is logging messages about "etcdserver: request is too large". Here's a complete log line:

{"level":"error","ts":1641839401.717973,"logger":"controller.job","msg":"Reconciler error","reconciler group":"batch","reconciler kind":"Job","name":"scan-vulnerabilityreport-68cbdf566b","namespace":"starboard-system","error":"etcdserver: request is too large","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

I suspect that I have some images with way too many vulnerabilities... So being able to store them so I can track them down would be really nice.
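For context, etcd rejects requests larger than its `--max-request-bytes` limit, which defaults to 1.5 MiB (1572864 bytes). A rough way to check whether a report is anywhere near that limit is to measure its serialized size. The following is only a minimal Python sketch: the helper names are hypothetical, it is not Starboard code, and the size the API server actually sends to etcd will differ somewhat from a plain JSON encoding.

```python
import json

# etcd's default --max-request-bytes (1.5 MiB). A cluster may override it.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def serialized_size(obj: dict) -> int:
    """Return the size in bytes of the compact JSON encoding of obj."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def fits_in_etcd(obj: dict, limit: int = ETCD_DEFAULT_MAX_REQUEST_BYTES) -> bool:
    """Rough estimate only: True if the encoded object is within the limit."""
    return serialized_size(obj) <= limit
```

Running this against the JSON of a report that fails to save would give an approximate figure for how far over the limit it is.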

What did you expect to happen:

I expected Starboard to be able to store the vulnerability reports properly.

Anything else you would like to add:

This seems to have already been discussed in #208, but it seems that only some information was stripped out of the VulnerabilityReport, and the issue was closed as being "too unlikely", even though the problem still occurs for me.

My complete values for the Helm chart are:

targetNamespaces: ""
trivy:
  githubToken: <REDACTED>
  resources:
    limits:
      memory: 1000Mi
    requests:
      memory: 1000Mi

Environment:

  • Helm chart version: 0.8.2
  • Starboard version (use starboard version): 0.13.2
  • Kubernetes version (use kubectl version): 1.19.7
@FrederikNJS
Author

Additionally, I can see that these stuck vulnerability scans count against the scanJobsConcurrentLimit, and the operator doesn't give up on them when the scanJobTimeout expires either.

My scanJobTimeout is set to the default of 5 minutes, and I have seen jobs stuck for more than an hour, clogging up the system and blocking other scans from starting.

@FrederikNJS
Author

As a workaround, I have tried limiting Trivy's severity to only include HIGH,CRITICAL, which of course cuts down on the number of vulnerabilities written into the VulnerabilityReport, and in turn makes the reports small enough to save to etcd. This seems to work nicely. It would, however, still be nice to be able to save all the vulnerabilities to get a complete overview.
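To illustrate why this workaround shrinks the report, here is a small Python sketch of a severity filter. Trivy applies the filter itself (through its `--severity` option), so this is not how Starboard does it. The function name and the shape of the vulnerability entries are assumptions made for the example.

```python
def filter_by_severity(vulnerabilities, allowed=("HIGH", "CRITICAL")):
    """Keep only the entries whose severity is in the allowed set.

    Each entry is assumed to be a dict with a "severity" key
    (hypothetical shape, for illustration only).
    """
    allowed_set = {s.upper() for s in allowed}
    return [
        v for v in vulnerabilities
        if str(v.get("severity", "")).upper() in allowed_set
    ]
```

The trade-off is the one described above: everything below the chosen severities is dropped from the stored report entirely.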

@danielpacak
Contributor

danielpacak commented Jan 19, 2022

👋 @FrederikNJS Thank you for the feedback. This is a well-known limitation of Starboard (and K8s with its default etcd storage) right now, and we do not implement any fallback strategy. Do you have any ideas about what we could do in such a case?

BTW, is it possible to share the image and image size or at least the number of all vulnerabilities found by Trivy that cause this error?

@danielpacak danielpacak added the 🚀 enhancement New feature or request label Jan 19, 2022
@Arabus
Contributor

Arabus commented Jan 24, 2022

In the specific case that the report is too large, I propose at least storing everything except the vulnerabilities list and adding an annotation such as starboard.aquasecurity.github.io/report-too-large=true that one could filter and monitor for.
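A minimal Python sketch of that fallback, under stated assumptions: the report is a dict with `metadata` and `report.vulnerabilities` / `report.summary` fields, the annotation key is the one proposed above (it does not exist in Starboard today), and the function name and the default limit (etcd's default request size) are hypothetical.

```python
import copy
import json

# Annotation key as proposed in this comment; hypothetical, not a Starboard feature.
REPORT_TOO_LARGE_ANNOTATION = "starboard.aquasecurity.github.io/report-too-large"


def truncate_if_too_large(report: dict, limit: int = 1_572_864) -> dict:
    """Return the report unchanged if it fits, else a trimmed, annotated copy."""
    size = len(json.dumps(report).encode("utf-8"))
    if size <= limit:
        return report

    trimmed = copy.deepcopy(report)  # leave the caller's object untouched
    # Drop only the vulnerabilities list; summary and metadata are kept.
    trimmed.setdefault("report", {})["vulnerabilities"] = []
    annotations = trimmed.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[REPORT_TOO_LARGE_ANNOTATION] = "true"
    return trimmed
```

With something like this, the summary counts would still be available, and the annotation would mark which reports need a closer look with another tool.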

More generally, I assume compressing the vulnerabilities field could do the trick (Helm went that way early on). Of course, this would require changes throughout the complete tooling stack.
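As a rough illustration of the compression idea, the Python sketch below gzips the vulnerabilities list and base64-encodes it so it can be stored as a string field, similar in spirit to how Helm stores release data. The function names and the choice of gzip plus base64 are assumptions for this example, not an existing Starboard format.

```python
import base64
import gzip
import json


def compress_vulnerabilities(vulnerabilities: list) -> str:
    """Encode the list as compact JSON, gzip it, and return base64 text."""
    raw = json.dumps(vulnerabilities, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_vulnerabilities(blob: str) -> list:
    """Reverse compress_vulnerabilities."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))
```

The cost is that the field is no longer readable with plain kubectl output, which is why every consumer of the report would have to learn the encoding.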

A more "advanced" change would be to allow storing the reports in a database, at least for the operator deployments. Maybe something memcache-compatible, considering that the reports can be ephemeral. To be honest, abusing the etcd resource store for this kind of data sounds like malpractice altogether.

Another way might be to provide a report consumer that collects and stores the reports and serves them on request, e.g., via a web interface.

@bgoareguer
Contributor

I like the current behavior of having the reports in the same namespace as the resource they relate to. It makes it possible to use RBAC to restrict access to those reports (which may contain sensitive information).

If the reports are stored outside of etcd, we need to make sure that a single set of credentials cannot access all the reports.

One idea would be to create a set of credentials in each namespace. Using the credentials from a namespace should only give access to the reports related to that namespace.
This kind of behavior could work well with MinIO, where we could have one bucket per namespace. The credentials stored in a namespace would then only give access to the bucket corresponding to that namespace.
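
To make the one-bucket-per-namespace idea concrete, here is a hedged Python sketch that builds an S3-style read-only policy document scoped to a single namespace's bucket. The bucket naming scheme (`starboard-reports-<namespace>`) and both function names are purely hypothetical, and how the policy and credentials would be provisioned is left open.

```python
def bucket_name_for_namespace(namespace: str, prefix: str = "starboard-reports-") -> str:
    """Hypothetical naming scheme: one bucket per Kubernetes namespace."""
    return f"{prefix}{namespace}"


def read_only_policy(namespace: str) -> dict:
    """Build an S3-style policy granting read access to one namespace's bucket only."""
    bucket = bucket_name_for_namespace(namespace)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }
```

Credentials bound to such a policy and stored as a Secret in the namespace would keep the access boundary roughly aligned with what RBAC provides today.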
