
Set resource limits for containers #68

Closed
bigg01 opened this issue May 29, 2020 · 15 comments · Fixed by #176
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@bigg01

bigg01 commented May 29, 2020

As a Platform Engineer, I need to control CPU and memory usage per container.
Please add resource limits:

resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    memory: 200Mi
    cpu: 400m

The AIDE pods had been running for a day and used 1.6 GB of memory for no apparent reason.

cheers

@jhrozek
Contributor

jhrozek commented Jun 2, 2020

Thank you for filing the issue. We'll look into it next sprint.
While the resource limits are something we wanted to set either way, we also want to see if we can find the root cause of the leak.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Oct 16, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label Nov 15, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@felixkrohn

Would it be possible to re-open this issue? After a week of running, the pods consume about 3 GiB of RAM each.
A workaround would be to set namespaced defaults, but I find this less elegant.
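For context, a minimal sketch of such a namespaced default via a LimitRange, assuming it is created in the namespace where the AIDE pods run (the namespace name and values here are illustrative, not taken from the thread):

apiVersion: v1
kind: LimitRange
metadata:
  name: aide-default-limits
  namespace: openshift-file-integrity
spec:
  limits:
    - type: Container
      default:            # applied as limits to containers that set none
        cpu: 400m
        memory: 200Mi
      defaultRequest:     # applied as requests to containers that set none
        cpu: 100m
        memory: 100Mi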

JAORMX reopened this Apr 8, 2021
@JAORMX
Contributor

JAORMX commented Apr 8, 2021

@felixkrohn what version are you using?

@felixkrohn

0.1.13 as distributed by Red Hat on OperatorHub (image: http://quay.io/file-integrity-operator/file-integrity-operator:0.1.13)

JAORMX changed the title from "Set ressoure limits for containers" to "Set resource limits for containers" Apr 8, 2021
@felixkrohn

felixkrohn commented Apr 12, 2021

Is there anything I can do to help you debug this? (We're not yet running it in production.)
[Screenshot: Prometheus time series showing the pods' memory usage]

@JAORMX
Contributor

JAORMX commented Apr 12, 2021

@felixkrohn we'll look into it.

JAORMX added the kind/bug label and removed the lifecycle/rotten label Apr 12, 2021
@mrogers950
Contributor

@felixkrohn would you be able to follow the steps outlined in https://mrogers950.gitlab.io/openshift/2021/04/12/fio-profile/ ?
It will enable pprof for the ds pods, but it requires a container build from source. If you can capture the heap data at a few points (for example, once a few days in and again the next week), that would be useful for us to take a look at. I've traced the same slow leak myself, and it would be good to have a comparison.
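As a rough illustration of that setup (assuming the rebuilt daemon exposes the standard net/http/pprof endpoint on port 6060, as is typical; the port, names, and paths below are assumptions rather than details from the linked post):

# Hypothetical excerpt of the AIDE daemonset container spec with pprof enabled
ports:
  - name: pprof
    containerPort: 6060
    protocol: TCP
# After port-forwarding to 6060, heap snapshots can be fetched from
# /debug/pprof/heap and saved (e.g. as .gz files) at different points in time.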

@felixkrohn

@mrogers950 Thanks for the great how-to 👍 I got it running and will send you the .gz files next week (don't hesitate to remind me should I forget...)

@felixkrohn

Did the traces help in any way?
Would it be OK to add memory limits (somewhere between 500M and 1000M) to the f-i-o deployment, or do you expect this could cause unwanted side effects or even reduce the reliability of the results?

@mrogers950
Contributor

@felixkrohn yes, thanks for your help! The pprof data shows what I expected: the daemon's actual heap usage is only a small percentage of the total reported by the cluster.
Here it is only about 7 MB in total:
[Screenshot: pprof heap profile]

This coincides with what I found about the reserved space used by the Go runtime, which I tried to outline briefly here: https://mrogers950.gitlab.io/golang/2021/03/12/wild-crazy-golang-mem/
So I believe the high usage will be addressed by golang/go#44167 (referenced by golang/go#43699).

But I think we can now support pod limits properly, because the daemon pods are more robust and should be able to handle an occasional restart by the OOM killer. I'll work on a PR for that.
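A hedged sketch of what such limits could look like on the daemon container (the container name is hypothetical, and the values simply echo the ones discussed earlier in this thread, not the contents of the merged PR):

# Illustrative excerpt of a daemon container spec, not the merged PR
containers:
  - name: daemon            # hypothetical container name
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi       # within the 500M to 1000M range suggested above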

@felixkrohn

Great news! Thanks for the update.
