
OCPBUGS-60772: Reuse instance groups #86

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from theobarberbany:tb/patch-xpn on Aug 26, 2025

Conversation

@theobarberbany commented Aug 14, 2025

This PR rewrites 49f5389.

Work around GCP internal load balancer restrictions for multi-subnet clusters.

GCP internal load balancers have specific restrictions that prevent
straightforward load balancing across multiple subnets:

  1. "Don't put a VM in more than one load-balanced instance group"
  2. Instance groups can "only select VMs that are in the same zone, VPC network, and subnet"
  3. "All VMs in an instance group must have their primary network interface in the same VPC network"
  4. Internal LBs can load balance to VMs in same region but different subnets

For clusters with nodes across multiple subnets, the previous implementation
would fail to create internal load balancers. This change implements a
two-pass approach:

  1. Find existing external instance groups (matching externalInstanceGroupsPrefix)
    that contain ONLY cluster nodes and reuse them for the backend service
  2. Create internal instance groups only for remaining nodes not covered by
    external groups

This ensures compliance with GCP restrictions while enabling multi-subnet
load balancing for Kubernetes clusters.
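
For illustration only, a minimal Go sketch of this two-pass selection (hypothetical names and simplified types, not the code in this change) might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// reuseOrCreateGroups is a minimal sketch of the two-pass approach described
// above. externalGroups maps an existing external instance-group name to the
// instance names it contains, clusterNodes is the set of node hostnames in the
// zone, and prefix stands in for externalInstanceGroupsPrefix. All identifiers
// here are illustrative, not the real cloud-provider-gcp ones.
func reuseOrCreateGroups(externalGroups map[string][]string, clusterNodes map[string]bool, prefix string) (reused, remaining []string) {
	covered := map[string]bool{}

	// Pass 1: reuse external instance groups whose members are ALL cluster nodes.
	for name, members := range externalGroups {
		if !strings.HasPrefix(name, prefix) {
			continue
		}
		allNodes := true
		for _, m := range members {
			if !clusterNodes[m] {
				allNodes = false
				break
			}
		}
		if allNodes {
			reused = append(reused, name)
			for _, m := range members {
				covered[m] = true
			}
		}
	}

	// Pass 2: nodes not covered by a reused group still need an internal
	// instance group created for them.
	for node := range clusterNodes {
		if !covered[node] {
			remaining = append(remaining, node)
		}
	}
	return reused, remaining
}

func main() {
	external := map[string][]string{
		"demo-ext-a": {"node-1", "node-2"},    // only cluster nodes: reusable
		"demo-ext-b": {"node-3", "bootstrap"}, // contains a non-node: skipped
	}
	nodes := map[string]bool{"node-1": true, "node-2": true, "node-3": true}
	reused, remaining := reuseOrCreateGroups(external, nodes, "demo-ext")
	fmt.Println("reused:", reused, "remaining:", remaining)
}
```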

References:

openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Aug 14, 2025
openshift-ci bot commented Aug 14, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@theobarberbany force-pushed the tb/patch-xpn branch 2 times, most recently from 8661974 to 796793d on August 14, 2025 17:15
@theobarberbany (Author)

/test unit

@miyadav (Member) commented Aug 18, 2025

@theobarberbany, this does resolve the internal LB creation issue we were getting for both XPN and non-XPN. That is tested successfully, but I haven't run regression yet.

@miyadav (Member) commented Aug 18, 2025

Regression looks good too.

/test unit

@theobarberbany (Author)

@miyadav Woop! Amazing news! Now just to clean the code up, and document it thoroughly so the next poor soul who comes across it isn't as confused :)

@theobarberbany marked this pull request as ready for review on August 19, 2025 18:15
@theobarberbany changed the title from "WIP: Reuse instance groups" to "Reuse instance groups" on Aug 19, 2025
openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) on Aug 19, 2025
openshift-ci bot requested review from RadekManak and nrb on August 19, 2025 18:16
@theobarberbany (Author)

/test regression-clusterinfra-cucushift-rehearse-gcp-ipi

openshift-ci bot commented Aug 19, 2025

@theobarberbany: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test fmt
/test images
/test okd-scos-images
/test security
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-gcp-ipi-ccm
/test verify-commits

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn
pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn-upgrade
pull-ci-openshift-cloud-provider-gcp-main-fmt
pull-ci-openshift-cloud-provider-gcp-main-images
pull-ci-openshift-cloud-provider-gcp-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-cloud-provider-gcp-main-okd-scos-images
pull-ci-openshift-cloud-provider-gcp-main-security
pull-ci-openshift-cloud-provider-gcp-main-unit
pull-ci-openshift-cloud-provider-gcp-main-verify-commits
pull-ci-openshift-cloud-provider-gcp-main-verify-deps

@theobarberbany (Author)

/test regression-clusterinfra-gcp-ipi-ccm

@theobarberbany force-pushed the tb/patch-xpn branch 2 times, most recently from 116fed0 to 7e86daa on August 19, 2025 23:44
@theobarberbany (Author)

/test regression-clusterinfra-cucushift-rehearse-gcp-ipi
/test regression-clusterinfra-gcp-ipi-ccm

@miyadav (Member) commented Aug 20, 2025

/test unit

@shellyyang1989

/test security

@miyadav (Member) commented Aug 20, 2025

I have ignored the security failures in Snyk, as these were all in test files. The other files reported were flagged due to the use of _token as a username:
https://github.com/openshift/cloud-provider-gcp/blob/main/pkg/gcpcredential/gcpcredential.go#L258
https://github.com/openshift/cloud-provider-gcp/blob/main/providers/gce/gcpcredential/gcpcredential.go#L120
which is also ignorable, since it is used for registry authentication.

`{  5d-497d-ac28-2440ff127d8f 
   Path: pkg/gcpcredential/gcpcredential.go, line 258 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Use of Hardcoded Credentials
   ID: 43e918b2-7111-4c27-ba18-e5ee72d9a7fe 
   Path: providers/gce/gcpcredential/gcpcredential.go, line 120 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Use of Hardcoded Credentials
   ID: 6dee5dec-e6dd-416e-b39d-3b349b8a603b 
   Path: pkg/credentialconfig/config_test.go, line 157 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Use of Hardcoded Credentials
   ID: b97ad2a4-ed5b-4a9b-9a13-99cc23182e23 
   Path: providers/gce/gcpcredential/registry_marshal_test.go, line 69 
   Info: Do not hardcode credentials in code. Found hardcoded credential used in Username.

 ✗ [Low] Improper Certificate Validation
   ID: dc77ba76-1d96-4e66-b0b6-38c930dbb05f 
   Path: test/e2e/loadbalancer.go, line 404 
   Info: TrustManager might be too permissive: The client will accept any certificate and any host name in that certificate, making it susceptible to man-in-the-middle attacks.


✔ Test completed

Organization:      openshift-ci-internal
Test type:         Static code analysis
Project path:      /go/src/k8s.io/cloud-provider-gcp

Summary:

  8 Code issues found
  8 [Low] 

Code Report Complete

Your test results are available at:
https://app.snyk.io/org/openshift-ci-internal/project/aa2d60de-88a8-4053-b2b4-4527f58dd54f/history/13865f53-571b-4125-8186-87191b902fe6


Full vulnerabilities report is available at /logs/artifacts/snyk.sarif.json
snyk code scan failed
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2025-08-19T23:52:22Z"}
error: failed to execute wrapped command: exit status 1`

test should pass now.
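
As background on why the _token findings above are false positives: the literal string is a placeholder username for token-based registry authentication, and the real credential is fetched at runtime. A rough, hypothetical sketch of that pattern (type and field names are illustrative, not the exact gcpcredential code):

```go
package main

import "fmt"

// DockerConfigEntry mirrors the general shape of a registry credential entry;
// field names here are illustrative.
type DockerConfigEntry struct {
	Username string
	Password string
	Email    string
}

// entryForAccessToken shows the pattern Snyk flags: "_token" is a fixed,
// non-secret placeholder username, while the actual secret is the short-lived
// access token passed in as the password.
func entryForAccessToken(accessToken string) DockerConfigEntry {
	return DockerConfigEntry{
		Username: "_token",
		Password: accessToken,        // fetched from the metadata server at runtime
		Email:    "unused@example.com", // placeholder email
	}
}

func main() {
	fmt.Printf("%+v\n", entryForAccessToken("example-access-token"))
}
```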

@miyadav (Member) commented Aug 20, 2025

The unit test failure is fixed by a PR when tested locally, but it is still failing in the main CI. @theobarberbany, if time permits, have a look at whether we can add that in here as well.

edit - passed

@miyadav (Member) commented Aug 20, 2025

Other than the CCM cases, the XPN cluster looks good (the failing ones are due to spot instances; validated manually, they look good).
CCM cases on XPN (the pre-submit cases here run on non-XPN): the failed case passed on rerun.
Putting these here for the record as it doesn't have a Jira.

Non-XPN, non-CCM case results are good (failed cases were due to spot instances and passed when tested manually).

@nrb left a comment

I don't think I see the behavior from https://github.com/openshift/cloud-provider-gcp/pull/66/files captured here, though the pre-merge tests are passing.

Thanks for writing the test for this, and for cleaning up the logic code!


// If all instances in this external instance group are also in our zone's node list,
// we can reuse this instance group instead of creating our own internal instance group
shouldReuse = gceHostNamesInZone.HasAll(instanceNames.UnsortedList()...)

I think the logic here is relevant to https://github.com/openshift/cloud-provider-gcp/pull/66/files; the condition that was added was, in English, "or all instanceNames start with the set prefix."

If I'm reading this correctly, that "or" condition isn't here, or perhaps it's somewhere else in the PR?

@theobarberbany (Author) Aug 20, 2025

Ah yeah, I don't understand the motivation for Patrick's change 😢

I think, given that regression is passing and QE has not found problems, we're fine? Maybe?

Understanding the motivation for why things are this way, and the decisions behind it, is hard in general with this code.

Given we're installing the e2es / regression with CAPG, shouldn't this fail?


What would the additional check buy us?

Here, gceHostNamesInZone is every hostname of every node in the zone. instanceNames is... oh, weird. Maybe that should be simplified?

But still, why would "all instanceNames start with the set prefix" be relevant in addition to this?


Per Patrick's comment, the bootstrap node on install wasn't joining the proper instance groups, resulting in an attempt to use two IGs for one node.

@patrickdillon are you able to give this a look?


> Ah yeah, I don't understand the motivation for Patrick's change 😢
>
> I think, given that regression is passing and QE has not found problems, we're fine?

The relevant bug is OCPBUGS-35256 and it only occurs on private clusters, so you would want to make sure to test a private install to be confident.

> Here, gceHostNamesInZone is every hostname of every node in the zone. instanceNames is... oh, weird. Maybe that should be simplified?
>
> But still, why would "all instanceNames start with the set prefix" be relevant in addition to this?

Because the bootstrap instance is not a node in the cluster, if we only look at node hostnames, the bootstrap node will not get joined to the correct instance group.
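
For illustration, the extra "or" condition being discussed might look roughly like the sketch below; this is not the code in this PR, and the parameter names (instanceNames, nodeHostNames, prefix) are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldReuseGroup sketches the reuse decision with the additional prefix
// check from the earlier change. instanceNames are the members of an external
// instance group, nodeHostNames are the node hostnames in the zone, and prefix
// is the configured instance-group prefix.
func shouldReuseGroup(instanceNames, nodeHostNames []string, prefix string) bool {
	nodes := make(map[string]bool, len(nodeHostNames))
	for _, n := range nodeHostNames {
		nodes[n] = true
	}

	allAreNodes := true
	allHavePrefix := true
	for _, name := range instanceNames {
		if !nodes[name] {
			allAreNodes = false // e.g. the bootstrap instance, which is not a Node
		}
		if !strings.HasPrefix(name, prefix) {
			allHavePrefix = false
		}
	}

	// Reuse if every member is a known node, OR every member at least carries
	// the configured prefix, which covers the bootstrap instance during install.
	return allAreNodes || allHavePrefix
}

func main() {
	// During installation the group contains the bootstrap instance, which
	// never appears in the node list; the prefix check still allows reuse.
	fmt.Println(shouldReuseGroup(
		[]string{"demo-abcde-master-0", "demo-abcde-bootstrap"},
		[]string{"demo-abcde-master-0"},
		"demo-abcde",
	))
}
```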


> you would want to make sure to test a private install to be confident.

Private clusters without XPN, correct?


Correct. It may also affect private clusters with XPN--I'm not sure.



@theobarberbany changed the title from "Reuse instance groups" to "OCPBUGS-60772: Reuse instance groups" on Aug 21, 2025
@theobarberbany (Author)

/test regression-clusterinfra-cucushift-rehearse-gcp-ipi
/test regression-clusterinfra-gcp-ipi-ccm

@theobarberbany (Author)

/test unit

1 similar comment
@miyadav (Member) commented Aug 22, 2025

/test unit

@theobarberbany (Author)

The unit tests are very flaky, and should be helped by #88.

/override ci/prow/unit

openshift-ci bot commented Aug 22, 2025

@theobarberbany: Overrode contexts on behalf of theobarberbany: ci/prow/unit


UseMetadataServer            bool
AlphaFeatureGate             *AlphaFeatureGate
StackType                    string
ExternalInstanceGroupsPrefix string


Nit, if you left a blank line between this and the upstream code, wouldn't it prevent the indentation change and make future merges easier?

metricsCollector:             newLoadBalancerMetrics(),
projectsBasePath:             getProjectsBasePath(service.BasePath),
stackType:                    StackType(config.StackType),
externalInstanceGroupsPrefix: config.ExternalInstanceGroupsPrefix,


Same nit: what happens if you leave a blank line intentionally to minimise indentation changes?
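
For illustration, the suggestion (as I read it) would look roughly like this; the struct name is assumed and the upstream field set is abbreviated:

```go
package main

// AlphaFeatureGate is a stand-in type so the sketch compiles.
type AlphaFeatureGate struct{}

// ConfigGlobal sketches the nit: a blank line before the downstream-only
// field breaks gofmt's alignment group, so the long new field name does not
// force the upstream fields to be re-indented, and future merges touch fewer
// lines.
type ConfigGlobal struct {
	UseMetadataServer bool
	AlphaFeatureGate  *AlphaFeatureGate
	StackType         string

	// Downstream-only addition.
	ExternalInstanceGroupsPrefix string
}

func main() {}
```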

@nrb commented Aug 25, 2025

/lgtm
/approve

I think Joel's nits are relevant, but I also don't think we should hold up builds for it right now.

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Aug 25, 2025
openshift-ci bot commented Aug 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Aug 25, 2025
@nrb commented Aug 25, 2025

/label acknowledge-critical-fixes

openshift-ci bot commented Aug 25, 2025

@nrb: The label(s) /label acknowledge-critical-fixes cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, downstream-change-needed, rebase/manual, cluster-config-api-changed, run-integration-tests, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?


@nrb commented Aug 25, 2025

/label acknowledge-critical-fixes-only

openshift-ci bot added the acknowledge-critical-fixes-only label (Indicates if the issuer of the label is OK with the policy) on Aug 25, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d6b577a and 2 for PR HEAD e4f8177 in total

@shellyyang1989

/retest

@theobarberbany (Author)

/override e2e-gcp-ovn

we failed on deprovision.

openshift-ci bot commented Aug 26, 2025

@theobarberbany: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • e2e-gcp-ovn

Only the following failed contexts/checkruns were expected:

  • ci/prow/e2e-gcp-ovn
  • ci/prow/e2e-gcp-ovn-upgrade
  • ci/prow/fmt
  • ci/prow/images
  • ci/prow/okd-scos-e2e-aws-ovn
  • ci/prow/okd-scos-images
  • ci/prow/regression-clusterinfra-gcp-ipi-ccm
  • ci/prow/security
  • ci/prow/unit
  • ci/prow/verify-commits
  • ci/prow/verify-deps
  • pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn
  • pull-ci-openshift-cloud-provider-gcp-main-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-cloud-provider-gcp-main-fmt
  • pull-ci-openshift-cloud-provider-gcp-main-images
  • pull-ci-openshift-cloud-provider-gcp-main-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-cloud-provider-gcp-main-okd-scos-images
  • pull-ci-openshift-cloud-provider-gcp-main-regression-clusterinfra-gcp-ipi-ccm
  • pull-ci-openshift-cloud-provider-gcp-main-security
  • pull-ci-openshift-cloud-provider-gcp-main-unit
  • pull-ci-openshift-cloud-provider-gcp-main-verify-commits
  • pull-ci-openshift-cloud-provider-gcp-main-verify-deps
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.


@theobarberbany (Author)

/override ci/prow/e2e-gcp-ovn

openshift-ci bot commented Aug 26, 2025

@theobarberbany: Overrode contexts on behalf of theobarberbany: ci/prow/e2e-gcp-ovn


openshift-ci bot commented Aug 26, 2025

@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | e4f8177 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.


@theobarberbany (Author)

/tide refresh

openshift-merge-bot[bot] merged commit f940e72 into openshift:main on Aug 26, 2025 (11 of 12 checks passed)
@openshift-ci-robot

@theobarberbany: Jira Issue OCPBUGS-60772: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-60772 has been moved to the MODIFIED state.


@theobarberbany deleted the tb/patch-xpn branch on August 26, 2025 08:48

Labels

acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
