
OCPBUGS-60772: Reuse instance groups #86

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from theobarberbany:tb/patch-xpn on Aug 26, 2025

Conversation

@theobarberbany commented Aug 14, 2025

This PR rewrites 49f5389.

Work around GCP internal load balancer restrictions for multi-subnet clusters.

GCP internal load balancers have specific restrictions that prevent
straightforward load balancing across multiple subnets:

  1. "Don't put a VM in more than one load-balanced instance group"
  2. Instance groups can "only select VMs that are in the same zone, VPC network, and subnet"
  3. "All VMs in an instance group must have their primary network interface in the same VPC network"
  4. Internal LBs can load balance to VMs in same region but different subnets

For clusters with nodes across multiple subnets, the previous implementation
would fail to create internal load balancers. This change implements a
two-pass approach:

  1. Find existing external instance groups (matching externalInstanceGroupsPrefix)
    that contain ONLY cluster nodes and reuse them for the backend service
  2. Create internal instance groups only for remaining nodes not covered by
    external groups

This ensures compliance with GCP restrictions while enabling multi-subnet
load balancing for Kubernetes clusters.
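
For illustration only, a minimal Go sketch of this two-pass selection (hypothetical names and simplified types, not the code in this change) might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// reuseOrCreateGroups is a minimal sketch of the two-pass approach described
// above. externalGroups maps an existing external instance-group name to the
// instance names it contains, clusterNodes is the set of node hostnames in the
// zone, and prefix stands in for externalInstanceGroupsPrefix. All identifiers
// here are illustrative, not the real cloud-provider-gcp ones.
func reuseOrCreateGroups(externalGroups map[string][]string, clusterNodes map[string]bool, prefix string) (reused, remaining []string) {
	covered := map[string]bool{}

	// Pass 1: reuse external instance groups whose members are ALL cluster nodes.
	for name, members := range externalGroups {
		if !strings.HasPrefix(name, prefix) {
			continue
		}
		allNodes := true
		for _, m := range members {
			if !clusterNodes[m] {
				allNodes = false
				break
			}
		}
		if allNodes {
			reused = append(reused, name)
			for _, m := range members {
				covered[m] = true
			}
		}
	}

	// Pass 2: nodes not covered by a reused group still need an internal
	// instance group created for them.
	for node := range clusterNodes {
		if !covered[node] {
			remaining = append(remaining, node)
		}
	}
	return reused, remaining
}

func main() {
	external := map[string][]string{
		"demo-ext-a": {"node-1", "node-2"},    // only cluster nodes: reusable
		"demo-ext-b": {"node-3", "bootstrap"}, // contains a non-node: skipped
	}
	nodes := map[string]bool{"node-1": true, "node-2": true, "node-3": true}
	reused, remaining := reuseOrCreateGroups(external, nodes, "demo-ext")
	fmt.Println("reused:", reused, "remaining:", remaining)
}
```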

References:

openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Aug 14, 2025
openshift-ci bot commented Aug 14, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@theobarberbany force-pushed the tb/patch-xpn branch 2 times, most recently from 8661974 to 796793d on August 14, 2025 17:15
@theobarberbany (Author)

/test unit

@miyadav (Member) commented Aug 18, 2025

@theobarberbany, this does resolve the internal LB creation issue we were getting for both XPN and non-XPN. That is tested successfully, but I haven't run regression yet.

@miyadav (Member) commented Aug 18, 2025

Regression looks good too.

/test unit

@theobarberbany (Author)

@miyadav Woop! Amazing news! Now just to clean the code up, and document it thoroughly so the next poor soul who comes across it isn't as confused :)

@theobarberbany marked this pull request as ready for review on August 19, 2025 18:15
@theobarberbany changed the title from "WIP: Reuse instance groups" to "Reuse instance groups" on Aug 19, 2025
openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Aug 19, 2025
openshift-ci bot requested review from RadekManak and nrb on August 19, 2025 18:16
@theobarberbany (Author)

/test regression-clusterinfra-cucushift-rehearse-gcp-ipi

openshift-ci bot commented Aug 19, 2025

@theobarberbany: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test fmt
/test images
/test okd-scos-images
/test security
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-gcp-ipi-ccm
/test verify-commits

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn
pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn-upgrade
pull-ci-openshift-cloud-provider-gcp-main-fmt
pull-ci-openshift-cloud-provider-gcp-main-images
pull-ci-openshift-cloud-provider-gcp-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-cloud-provider-gcp-main-okd-scos-images
pull-ci-openshift-cloud-provider-gcp-main-security
pull-ci-openshift-cloud-provider-gcp-main-unit
pull-ci-openshift-cloud-provider-gcp-main-verify-commits
pull-ci-openshift-cloud-provider-gcp-main-verify-deps

@theobarberbany (Author)

/test regression-clusterinfra-gcp-ipi-ccm

@theobarberbany force-pushed the tb/patch-xpn branch 2 times, most recently from 116fed0 to 7e86daa on August 19, 2025 23:44
@theobarberbany (Author)

/test regression-clusterinfra-cucushift-rehearse-gcp-ipi
/test regression-clusterinfra-gcp-ipi-ccm

@miyadav (Member) commented Aug 20, 2025

/test unit

@shellyyang1989

/test security

@miyadav (Member) commented Aug 20, 2025

I have ignored the security failures in Snyk, as these were all in test files. The other files reported were flagged due to the use of _token as a username:
https://github.com/openshift/cloud-provider-gcp/blob/main/pkg/gcpcredential/gcpcredential.go#L258
https://github.com/openshift/cloud-provider-gcp/blob/main/providers/gce/gcpcredential/gcpcredential.go#L120
which is also ignorable, since it is used for registry authentication.

`{  5d-497d-ac28-2440ff127d8f 
   Path: pkg/gcpcredential/gcpcredential.go, line 258 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Use of Hardcoded Credentials
   ID: 43e918b2-7111-4c27-ba18-e5ee72d9a7fe 
   Path: providers/gce/gcpcredential/gcpcredential.go, line 120 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Use of Hardcoded Credentials
   ID: 6dee5dec-e6dd-416e-b39d-3b349b8a603b 
   Path: pkg/credentialconfig/config_test.go, line 157 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Use of Hardcoded Credentials
   ID: b97ad2a4-ed5b-4a9b-9a13-99cc23182e23 
   Path: providers/gce/gcpcredential/registry_marshal_test.go, line 69 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Improper Certificate Validation
   ID: dc77ba76-1d96-4e66-b0b6-38c930dbb05f 
   Path: test/e2e/loadbalancer.go, line 404 
   Info: TrustManager might be too permissive: The client will accept any certificate and any host name in that certificate, making it susceptible to man-in-the-middle attacks.


✔ Test completed

Organization:      openshift-ci-internal
Test type:         Static code analysis
Project path:      /go/src/k8s.io/cloud-provider-gcp

Summary:

  8 Code issues found
  8 [Low] 

Code Report Complete

Your test results are available at:
https://app.snyk.io/org/openshift-ci-internal/project/aa2d60de-88a8-4053-b2b4-4527f58dd54f/history/13865f53-571b-4125-8186-87191b902fe6


Full vulnerabilities report is available at /logs/artifacts/snyk.sarif.json
snyk code scan failed
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2025-08-19T23:52:22Z"}
error: failed to execute wrapped command: exit status 1`

test should pass now.
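
As background on why the _token findings above are false positives: the literal string is a placeholder username for token-based registry authentication, and the real credential is fetched at runtime. A rough, hypothetical sketch of that pattern (type and field names are illustrative, not the exact gcpcredential code):

```go
package main

import "fmt"

// DockerConfigEntry mirrors the general shape of a registry credential entry;
// field names here are illustrative.
type DockerConfigEntry struct {
	Username string
	Password string
	Email    string
}

// entryForAccessToken shows the pattern Snyk flags: "_token" is a fixed,
// non-secret placeholder username, while the actual secret is the short-lived
// access token passed in as the password.
func entryForAccessToken(accessToken string) DockerConfigEntry {
	return DockerConfigEntry{
		Username: "_token",
		Password: accessToken,        // fetched from the metadata server at runtime
		Email:    "unused@example.com", // placeholder email
	}
}

func main() {
	fmt.Printf("%+v\n", entryForAccessToken("example-access-token"))
}
```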

@miyadav (Member) commented Aug 20, 2025

The unit test failure is fixed by a PR when tested locally, but it is still failing in the main CI. @theobarberbany, if time permits, have a look at whether we can add that in here as well.

edit - passed

@miyadav (Member) commented Aug 20, 2025

Other than the CCM cases, the XPN cluster looks good (the failing ones are due to spot instances; validated manually, they look good).
CCM cases on XPN (the pre-submit cases here run on non-XPN): the failed case passed on rerun.
Putting these here for the record as it doesn't have a Jira.

Non-XPN, non-CCM case results are good (failed cases were due to spot instances and passed when tested manually).

@nrb left a comment

I don't think I see the behavior from https://github.com/openshift/cloud-provider-gcp/pull/66/files captured here, though the pre-merge tests are passing.

Thanks for writing the test for this, and for cleaning up the logic code!


// If all instances in this external instance group are also in our zone's node list,
// we can reuse this instance group instead of creating our own internal instance group
shouldReuse = gceHostNamesInZone.HasAll(instanceNames.UnsortedList()...)

I think the logic here is relevant to https://github.com/openshift/cloud-provider-gcp/pull/66/files; the condition that was added was, in English, "or all instanceNames start with the set prefix."

If I'm reading this correctly, that "or" condition isn't here, or perhaps it's somewhere else in the PR?

@theobarberbany (Author) Aug 20, 2025

Ah yeah, I don't understand the motivation for Patrick's change 😢

I think, given that regression is passing and QE has not found problems, we're fine? Maybe?

Understanding the motivation for why things are this way, and the decisions behind it, is hard in general with this code.

Given we're installing the e2es / regression with CAPG, shouldn't this fail?


What would the additional check buy us?

Here, gceHostNamesInZone is every hostname of every node in the zone. instanceNames is... oh, weird. Maybe that should be simplified?

But still, why would "all instanceNames start with the set prefix" be relevant in addition to this?


Per Patrick's comment, the bootstrap node on install wasn't joining the proper instance groups, resulting in an attempt to use two IGs for one node.

@patrickdillon are you able to give this a look?


> Ah yeah, I don't understand the motivation for Patrick's change 😢
>
> I think, given that regression is passing and QE has not found problems, we're fine?

The relevant bug is OCPBUGS-35256 and it only occurs on private clusters, so you would want to make sure to test a private install to be confident.

> Here, gceHostNamesInZone is every hostname of every node in the zone. instanceNames is... oh, weird. Maybe that should be simplified?
>
> But still, why would "all instanceNames start with the set prefix" be relevant in addition to this?

Because the bootstrap instance is not a node in the cluster, if we only look at node hostnames, the bootstrap node will not get joined to the correct instance group.
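
For illustration, the extra "or" condition being discussed might look roughly like the sketch below; this is not the code in this PR, and the parameter names (instanceNames, nodeHostNames, prefix) are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldReuseGroup sketches the reuse decision with the additional prefix
// check from the earlier change. instanceNames are the members of an external
// instance group, nodeHostNames are the node hostnames in the zone, and prefix
// is the configured instance-group prefix.
func shouldReuseGroup(instanceNames, nodeHostNames []string, prefix string) bool {
	nodes := make(map[string]bool, len(nodeHostNames))
	for _, n := range nodeHostNames {
		nodes[n] = true
	}

	allAreNodes := true
	allHavePrefix := true
	for _, name := range instanceNames {
		if !nodes[name] {
			allAreNodes = false // e.g. the bootstrap instance, which is not a Node
		}
		if !strings.HasPrefix(name, prefix) {
			allHavePrefix = false
		}
	}

	// Reuse if every member is a known node, OR every member at least carries
	// the configured prefix, which covers the bootstrap instance during install.
	return allAreNodes || allHavePrefix
}

func main() {
	// During installation the group contains the bootstrap instance, which
	// never appears in the node list; the prefix check still allows reuse.
	fmt.Println(shouldReuseGroup(
		[]string{"demo-abcde-master-0", "demo-abcde-bootstrap"},
		[]string{"demo-abcde-master-0"},
		"demo-abcde",
	))
}
```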


> you would want to make sure to test a private install to be confident.

Private clusters without XPN, correct?


Correct. It may also affect private clusters with XPN--I'm not sure.



@theobarberbany changed the title from "Reuse instance groups" to "OCPBUGS-60772: Reuse instance groups" on Aug 21, 2025
@theobarberbany (Author)

/test regression-clusterinfra-cucushift-rehearse-gcp-ipi
/test regression-clusterinfra-gcp-ipi-ccm

@theobarberbany (Author)

/test unit

1 similar comment
@miyadav (Member) commented Aug 22, 2025

/test unit

@theobarberbany (Author)

The unit tests are very flaky, and should be helped by #88.

/override ci/prow/unit

openshift-ci bot commented Aug 22, 2025

@theobarberbany: Overrode contexts on behalf of theobarberbany: ci/prow/unit


UseMetadataServer            bool
AlphaFeatureGate             *AlphaFeatureGate
StackType                    string
ExternalInstanceGroupsPrefix string


Nit, if you left a blank line between this and the upstream code, wouldn't it prevent the indentation change and make future merges easier?

metricsCollector:             newLoadBalancerMetrics(),
projectsBasePath:             getProjectsBasePath(service.BasePath),
stackType:                    StackType(config.StackType),
externalInstanceGroupsPrefix: config.ExternalInstanceGroupsPrefix,


Same nit: what happens if you leave a blank line intentionally to minimise indentation changes?
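
For illustration, the suggestion (as I read it) would look roughly like this; the struct name is assumed and the upstream field set is abbreviated:

```go
package main

// AlphaFeatureGate is a stand-in type so the sketch compiles.
type AlphaFeatureGate struct{}

// ConfigGlobal sketches the nit: a blank line before the downstream-only
// field breaks gofmt's alignment group, so the long new field name does not
// force the upstream fields to be re-indented, and future merges touch fewer
// lines.
type ConfigGlobal struct {
	UseMetadataServer bool
	AlphaFeatureGate  *AlphaFeatureGate
	StackType         string

	// Downstream-only addition.
	ExternalInstanceGroupsPrefix string
}

func main() {}
```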

@nrb commented Aug 25, 2025

/lgtm
/approve

I think Joel's nits are relevant, but I also don't think we should hold up builds for it right now.

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Aug 25, 2025
openshift-ci bot commented Aug 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Aug 25, 2025
@nrb commented Aug 25, 2025

/label acknowledge-critical-fixes

openshift-ci bot commented Aug 25, 2025

@nrb: The label(s) /label acknowledge-critical-fixes cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, downstream-change-needed, rebase/manual, cluster-config-api-changed, run-integration-tests, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?


@nrb commented Aug 25, 2025

/label acknowledge-critical-fixes-only

openshift-ci bot added the acknowledge-critical-fixes-only label (Indicates if the issuer of the label is OK with the policy) on Aug 25, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d6b577a and 2 for PR HEAD e4f8177 in total

@shellyyang1989

/retest

@theobarberbany (Author)

/override e2e-gcp-ovn

we failed on deprovision.

openshift-ci bot commented Aug 26, 2025

@theobarberbany: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • e2e-gcp-ovn

Only the following failed contexts/checkruns were expected:

  • ci/prow/e2e-gcp-ovn
  • ci/prow/e2e-gcp-ovn-upgrade
  • ci/prow/fmt
  • ci/prow/images
  • ci/prow/okd-scos-e2e-aws-ovn
  • ci/prow/okd-scos-images
  • ci/prow/regression-clusterinfra-gcp-ipi-ccm
  • ci/prow/security
  • ci/prow/unit
  • ci/prow/verify-commits
  • ci/prow/verify-deps
  • pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn
  • pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-cloud-provider-gcp-main-fmt
  • pull-ci-openshift-cloud-provider-gcp-main-images
  • pull-ci-openshift-cloud-provider-gcp-main-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-cloud-provider-gcp-main-okd-scos-images
  • pull-ci-openshift-cloud-provider-gcp-main-regression-clusterinfra-gcp-ipi-ccm
  • pull-ci-openshift-cloud-provider-gcp-main-security
  • pull-ci-openshift-cloud-provider-gcp-main-unit
  • pull-ci-openshift-cloud-provider-gcp-main-verify-commits
  • pull-ci-openshift-cloud-provider-gcp-main-verify-deps
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.


@theobarberbany (Author)

/override ci/prow/e2e-gcp-ovn

openshift-ci bot commented Aug 26, 2025

@theobarberbany: Overrode contexts on behalf of theobarberbany: ci/prow/e2e-gcp-ovn


openshift-ci bot commented Aug 26, 2025

@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | e4f8177 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.


@theobarberbany (Author)

/tide refresh

openshift-merge-bot[bot] merged commit f940e72 into openshift:main on Aug 26, 2025 (11 of 12 checks passed)
@openshift-ci-robot

@theobarberbany: Jira Issue OCPBUGS-60772: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-60772 has been moved to the MODIFIED state.


@theobarberbany deleted the tb/patch-xpn branch on August 26, 2025 08:48

Labels

acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
