OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts #9310
base: main
/test e2e-azure-ovn
/assign @patrickdillon
/test ?
@patrickdillon: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Wow. This e2e-azure run shows a dramatic 50%+ increase in machine provisioning time:
compared to two other CI runs I spot-checked:
[1] [2, took exactly the same amount of time]
Let's get some more samples.
/test e2e-azure-ovn
I think this issue escaped our notice with Terraform installs because all resource creation was wrapped in a single 30-minute timeout. In CAPI installs we break resource creation into two phases, each with a 15-minute timeout. That should be more than sufficient, so the split exposed this problem.
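To illustrate the timeout point above, here is a minimal sketch of how one 30-minute budget can hide a slow step that a 15-minute per-phase budget exposes. The phase names and durations are hypothetical, chosen only for illustration; they are not measurements from this PR or from the installer code.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    duration_min: float  # hypothetical observed duration, in minutes
    timeout_min: float   # timeout budget for this phase, in minutes


def timed_out(phases):
    """Return the names of phases whose duration exceeds their own timeout."""
    return [p.name for p in phases if p.duration_min > p.timeout_min]


# Hypothetical durations, for illustration only.
INFRA_MIN = 6.0
MACHINES_MIN = 18.0  # slow machine provisioning, e.g. waiting on growfs

# One 30-minute timeout wrapped around everything.
single = [Phase("all resources", INFRA_MIN + MACHINES_MIN, 30.0)]

# Two phases, each with its own 15-minute timeout.
split = [
    Phase("infrastructure", INFRA_MIN, 15.0),
    Phase("machines", MACHINES_MIN, 15.0),
]

print(timed_out(single))  # []
print(timed_out(split))   # ['machines']
```

With these assumed numbers the total (24 minutes) fits inside the single 30-minute budget, while the machine phase alone overruns its 15-minute budget.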
@jlebon can you check this implementation instead? I believe the current #9305 did not yield results, but early testing of this one is looking great (see above).

In response to #9305 (comment):

Yes, Azure has excessively large disks (1 TB) because that is the only way to guarantee IOPS on an OS disk in this wonderful cloud. So this is a significant improvement for Azure. On all other clouds, the control plane OS disk size is configurable, but we have no telemetry to tell us how common it is to increase it.

I think this is fine as a workaround. I believe you mentioned that a fix has already landed upstream. Perhaps we need to consider how the installer will remove this in the future: a Jira tied to a release?
@r4f4: The following test failed, say
/retitle OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts
@r4f4: This pull request references Jira Issue OCPBUGS-46144, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/lgtm
Yes, this works and should be quicker. The caveat is that it introduces a layout difference across platforms, but at the same time we do document that having a separate /var partition is required for large disks, so this is just us following our own advice.
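For readers unfamiliar with the workaround, the following is a rough sketch of what a separate /var partition looks like in an Ignition-style storage config, built as a plain Python dict. It is not the change made in this PR: the device path, start offset, size, and spec version are all assumptions chosen for illustration, and a real configuration would also need a systemd mount unit for /var, which is omitted here.

```python
import json


def var_partition_config(device, start_mib, size_mib):
    """Build an Ignition-style config that puts /var on its own partition.

    All argument values are supplied by the caller; nothing here is taken
    from the installer's actual implementation.
    """
    return {
        "ignition": {"version": "3.2.0"},  # assumed spec version
        "storage": {
            "disks": [
                {
                    "device": device,
                    "partitions": [
                        {
                            "label": "var",
                            "startMiB": start_mib,
                            "sizeMiB": size_mib,
                        }
                    ],
                }
            ],
            "filesystems": [
                {
                    "device": "/dev/disk/by-partlabel/var",
                    "path": "/var",
                    "format": "xfs",
                    "mountOptions": ["defaults", "prjquota"],
                }
            ],
        },
    }


# Hypothetical values: boot disk alias, a 25000 MiB offset that leaves a
# small root filesystem, and size 0, assumed here to mean "use the rest".
config = var_partition_config("/dev/disk/by-id/coreos-boot-disk", 25000, 0)
print(json.dumps(config, indent=2))
```

The idea is that the root filesystem stays small, so growing it on first boot is quick, while the bulk of the large disk goes to /var.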
Using the workaround of a separate /var partition until the issue is fixed in RHCOS.
c15d459 to a10520c
New changes are detected. LGTM label has been removed.
Update: addressed review comments.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.