Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts #9310

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

r4f4
Copy link
Contributor

@r4f4 r4f4 commented Dec 11, 2024

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

@r4f4
Copy link
Contributor Author

r4f4 commented Dec 11, 2024

/test e2e-azure-ovn

@r4f4
Copy link
Contributor Author

r4f4 commented Dec 11, 2024

/assign @patrickdillon

@patrickdillon
Copy link
Contributor

/test ?

Copy link
Contributor

openshift-ci bot commented Dec 12, 2024

@patrickdillon: The following commands are available to trigger required jobs:

/test altinfra-images
/test aro-unit
/test artifacts-images
/test e2e-agent-compact-ipv4
/test e2e-aws-ovn
/test e2e-aws-ovn-edge-zones-manifest-validation
/test e2e-aws-ovn-upi
/test e2e-azure-ovn
/test e2e-azure-ovn-upi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack-ovn
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi
/test gofmt
/test golint
/test govet
/test images
/test integration-tests
/test integration-tests-nodejoiner
/test openstack-manifests
/test shellcheck
/test terraform-images
/test terraform-verify-vendor
/test tf-lint
/test unit
/test verify-codegen
/test verify-vendor
/test yaml-lint

The following commands are available to trigger optional jobs:

/test altinfra-e2e-aws-custom-security-groups
/test altinfra-e2e-aws-ovn
/test altinfra-e2e-aws-ovn-fips
/test altinfra-e2e-aws-ovn-imdsv2
/test altinfra-e2e-aws-ovn-localzones
/test altinfra-e2e-aws-ovn-proxy
/test altinfra-e2e-aws-ovn-shared-vpc
/test altinfra-e2e-aws-ovn-shared-vpc-local-zones
/test altinfra-e2e-aws-ovn-shared-vpc-wavelength-zones
/test altinfra-e2e-aws-ovn-single-node
/test altinfra-e2e-aws-ovn-wavelengthzones
/test altinfra-e2e-azure-capi-ovn
/test altinfra-e2e-azure-ovn-shared-vpc
/test altinfra-e2e-gcp-capi-ovn
/test altinfra-e2e-gcp-ovn-byo-network-capi
/test altinfra-e2e-gcp-ovn-secureboot-capi
/test altinfra-e2e-gcp-ovn-xpn-capi
/test altinfra-e2e-ibmcloud-capi-ovn
/test altinfra-e2e-nutanix-capi-ovn
/test altinfra-e2e-openstack-capi-ccpmso
/test altinfra-e2e-openstack-capi-ccpmso-zone
/test altinfra-e2e-openstack-capi-dualstack
/test altinfra-e2e-openstack-capi-dualstack-upi
/test altinfra-e2e-openstack-capi-dualstack-v6primary
/test altinfra-e2e-openstack-capi-externallb
/test altinfra-e2e-openstack-capi-nfv-intel
/test altinfra-e2e-openstack-capi-ovn
/test altinfra-e2e-openstack-capi-proxy
/test altinfra-e2e-vsphere-capi-multi-vcenter-ovn
/test altinfra-e2e-vsphere-capi-ovn
/test altinfra-e2e-vsphere-capi-static-ovn
/test altinfra-e2e-vsphere-capi-zones
/test azure-ovn-marketplace-images
/test e2e-agent-4control-ipv4
/test e2e-agent-5control-ipv4
/test e2e-agent-compact-ipv4-appliance-diskimage
/test e2e-agent-compact-ipv4-none-platform
/test e2e-agent-compact-ipv6-minimaliso
/test e2e-agent-ha-dualstack
/test e2e-agent-sno-ipv4-pxe
/test e2e-agent-sno-ipv6
/test e2e-aws-default-config
/test e2e-aws-overlay-mtu-ovn-1200
/test e2e-aws-ovn-custom-iam-profile
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-heterogeneous
/test e2e-aws-ovn-imdsv2
/test e2e-aws-ovn-proxy
/test e2e-aws-ovn-public-ipv4-pool
/test e2e-aws-ovn-public-ipv4-pool-disabled
/test e2e-aws-ovn-public-subnets
/test e2e-aws-ovn-shared-vpc-custom-security-groups
/test e2e-aws-ovn-shared-vpc-edge-zones
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-techpreview
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-upi-proxy
/test e2e-azure-default-config
/test e2e-azure-ovn-resourcegroup
/test e2e-azure-ovn-shared-vpc
/test e2e-azure-ovn-techpreview
/test e2e-azurestack
/test e2e-azurestack-upi
/test e2e-crc
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-gcp-ovn-byo-vpc
/test e2e-gcp-ovn-heterogeneous
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-xpn
/test e2e-gcp-secureboot
/test e2e-gcp-upgrade
/test e2e-gcp-upi-xpn
/test e2e-gcp-user-provisioned-dns
/test e2e-ibmcloud-ovn
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-swapped-hosts
/test e2e-metal-ipi-ovn-virtualmedia
/test e2e-metal-single-node-live-iso
/test e2e-nutanix-ovn
/test e2e-openstack-ccpmso
/test e2e-openstack-ccpmso-zone
/test e2e-openstack-dualstack
/test e2e-openstack-dualstack-upi
/test e2e-openstack-externallb
/test e2e-openstack-nfv-intel
/test e2e-openstack-proxy
/test e2e-openstack-singlestackv6
/test e2e-powervs-capi-ovn
/test e2e-vsphere-multi-vcenter-ovn
/test e2e-vsphere-ovn-multi-network
/test e2e-vsphere-ovn-techpreview
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-ovn-zones-techpreview
/test e2e-vsphere-static-ovn
/test okd-scos-e2e-aws-ovn
/test okd-scos-images
/test tf-fmt

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-installer-master-altinfra-images
pull-ci-openshift-installer-master-aro-unit
pull-ci-openshift-installer-master-artifacts-images
pull-ci-openshift-installer-master-e2e-aws-ovn
pull-ci-openshift-installer-master-gofmt
pull-ci-openshift-installer-master-golint
pull-ci-openshift-installer-master-govet
pull-ci-openshift-installer-master-images
pull-ci-openshift-installer-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-installer-master-shellcheck
pull-ci-openshift-installer-master-tf-fmt
pull-ci-openshift-installer-master-tf-lint
pull-ci-openshift-installer-master-unit
pull-ci-openshift-installer-master-verify-codegen
pull-ci-openshift-installer-master-verify-vendor
pull-ci-openshift-installer-master-yaml-lint

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@patrickdillon
Copy link
Contributor

Wow. This e2e-azure run shows a dramatic %50+ increase in machine provisioning time:

time="2024-12-12T01:30:21Z" level=debug msg="                  Machine Provisioning: 4m30s"

compared to two other ci runs I spot checked:

time="2024-11-20T23:15:58Z" level=debug msg="                  Machine Provisioning: 10m46s"

[1] [2, took exactly the same amount of time]

time="2024-12-06T02:54:34Z" level=debug msg="                  Machine Provisioning: 10m16s"

[3]

Let's get some more samples.

/test e2e-azure-ovn
/test e2e-azure-default-config
/test e2e-azure-ovn-resourcegroup
/test e2e-azure-ovn-shared-vpc
/test e2e-azure-ovn-techpreview

@patrickdillon
Copy link
Contributor

This issue escaped our notice with Terraform installs, I think, because all resource creation was wrapped in a thirty-minute timeout. In CAPI installs we break up resource creation into two phases with 15 min timeouts, which should be more than sufficient, so it exposed this problem.

@patrickdillon
Copy link
Contributor

@jlebon can you check this implementation instead? I believe the current #9305 did not yield results, but early testing of this is looking great (see above).

In response to #9305 (comment)

Yes, Azure has excessively large disks (1TB) because that is the only way to guarantee iops on an os disk in this wonderful cloud. So this is a significant improvement for Azure. On all other clouds, control plane osdisk size is configurable, but we have no telemetry to help inform how common it is to increase. I think this is fine as a workaround--I believe you mentioned that a fix has already landed upstream.

Perhaps we need to consider how the installer will remove this in the future--a jira tied to a release?

Copy link
Contributor

openshift-ci bot commented Dec 12, 2024

@r4f4: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn c15d459 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@r4f4
Copy link
Contributor Author

r4f4 commented Dec 12, 2024

/retitle OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts

@openshift-ci openshift-ci bot changed the title azure: use separate /var to avoid growfs timeouts OCPBUGS-46144: azure: use separate /var to avoid growfs timeouts Dec 12, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Dec 12, 2024
@openshift-ci-robot
Copy link
Contributor

@r4f4: This pull request references Jira Issue OCPBUGS-46144, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jinyunma

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Using the workaround of a separate /var partition until the issue is fixed in RHCOS.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jinyunma December 12, 2024 08:53
Copy link
Contributor

@barbacbd barbacbd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2024
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this works and should be quicker. The caveat is that it introduces a layout difference across platforms, but at the same time we do document that having a separate /var partition is required for large disks, so this is just us following our own advice.

Docs: https://docs.openshift.com/container-platform/4.17/installing/installing_platform_agnostic/installing-platform-agnostic.html#installation-user-infra-machines-advanced_vardisk_installing-platform-agnostic

pkg/asset/ignition/node.go Outdated Show resolved Hide resolved
pkg/asset/ignition/node.go Show resolved Hide resolved
pkg/asset/ignition/node.go Outdated Show resolved Hide resolved
pkg/asset/ignition/node.go Outdated Show resolved Hide resolved
Using the workaround of a separate /var partition until the issue is
fixed in RHCOS.
@r4f4 r4f4 force-pushed the azure-avoid-growfs-var branch from c15d459 to a10520c Compare December 17, 2024 21:48
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 17, 2024
Copy link
Contributor

openshift-ci bot commented Dec 17, 2024

New changes are detected. LGTM label has been removed.

@r4f4
Copy link
Contributor Author

r4f4 commented Dec 17, 2024

Update: addressed review comments.

Copy link
Contributor

openshift-ci bot commented Dec 17, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from barbacbd. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants