@kolluria
Contributor

@kolluria kolluria commented Apr 30, 2025

What this PR does / why we need it:
As part of #3234, I refactored the BeforeServe method. That refactoring introduced a regression in Vanilla deployments that leaves the CSI controller pods stuck in CrashLoopBackOff (CLBO):

root@k8s-control-157-1745853912:~# kubectl -n vmware-system-csi get pods
NAME                                      READY   STATUS             RESTARTS          AGE
vsphere-csi-controller-7bffffc495-24995   2/7     CrashLoopBackOff   2167 (82s ago)    38h
vsphere-csi-controller-7bffffc495-jg767   2/7     CrashLoopBackOff   178 (3m21s ago)   179m
vsphere-csi-controller-7bffffc495-kvzhm   2/7     CrashLoopBackOff   97 (3m47s ago)    86m
vsphere-csi-node-2h82g                    3/3     Running            10 (3h59m ago)    38h
vsphere-csi-node-9x2c5                    3/3     Running            13 (9h ago)       38h
vsphere-csi-node-b4jbd                    3/3     Running            7 (108m ago)      38h
vsphere-csi-node-d82x7                    3/3     Running            8 (29h ago)       38h
vsphere-csi-node-fmqrh                    3/3     Running            12 (13h ago)      38h
vsphere-csi-node-md486                    3/3     Running            11 (21h ago)      38h
vsphere-csi-node-sbmbt                    3/3     Running            1 (38h ago)       38h
vsphere-csi-node-vld5b                    3/3     Running            1 (38h ago)       38h
vsphere-csi-node-zcf62                    3/3     Running            11 (13h ago)      38h
vsphere-csi-node-zzdd8                    3/3     Running            17 (4h33m ago)    38h

The vsphere-csi-controller container logs show the following error:

2025-04-30T06:40:23.720Z	ERROR	service/driver.go:217	failed to run the driver. Err: +Cluster ID is present in vSphere Config Secret as well as in vsphere-csi-cluster-id ConfigMap. Please remove the cluster ID from vSphere Config Secret.	{"TraceId": "a558d9ac-e873-417c-a836-c15847c7f6a4"}
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run
	/build/pkg/csi/service/driver.go:217
main.main
	/build/cmd/vsphere-csi/main.go:96
runtime.main
	/usr/local/go/src/runtime/proc.go:271

The regression is due to the way cfg.Global.ClusterID is updated (here and here) and subsequently checked here.
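
As a rough illustration of how an update-then-check sequence on the same field can trip this error, here is a minimal Go sketch. It is not the driver's code: `globalConfig`, `resolveClusterID`, and `errDuplicateClusterID` are hypothetical names, and the sketch assumes the field is first filled in from the ConfigMap and then validated again as if the value had come from the Secret.

```go
package main

import (
	"errors"
	"fmt"
)

// globalConfig is a hypothetical stand-in for the cfg.Global section.
type globalConfig struct {
	ClusterID string
}

// errDuplicateClusterID paraphrases the error seen in the controller logs.
var errDuplicateClusterID = errors.New(
	"cluster ID is present in vSphere Config Secret as well as in vsphere-csi-cluster-id ConfigMap")

// resolveClusterID is a hypothetical helper. It rejects a config that
// already carries a cluster ID while the ConfigMap also has one, and
// otherwise copies the ConfigMap value into the config.
func resolveClusterID(cfg *globalConfig, configMapClusterID string) error {
	if cfg.ClusterID != "" && configMapClusterID != "" {
		return errDuplicateClusterID
	}
	if cfg.ClusterID == "" {
		cfg.ClusterID = configMapClusterID
	}
	return nil
}

func main() {
	// The Secret carries no cluster ID, so the config starts empty.
	cfg := &globalConfig{}

	// First pass: the ID is taken from the ConfigMap and stored in cfg.
	fmt.Println("first pass: ", resolveClusterID(cfg, "cluster-a"))

	// Second pass on the same, already-mutated cfg: the stored ID is now
	// indistinguishable from one supplied by the Secret, so the check fails.
	fmt.Println("second pass:", resolveClusterID(cfg, "cluster-a"))
}
```

In this sketch the second call fails even though the Secret never contained a cluster ID, because the check cannot tell where the stored value came from.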

This PR reverts the refactoring changes introduced in #3234.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes regression introduced in #3234

Testing done:
Verified that the CSI controller pods have been stable after applying the fix.

root@k8s-control-157-1745853912:~# alias kcsi='kubectl -n vmware-system-csi'
root@k8s-control-157-1745853912:~# kcsi get pods
NAME                                      READY   STATUS    RESTARTS         AGE
vsphere-csi-controller-69569bf449-992b9   7/7     Running   1 (11m ago)      77m
vsphere-csi-controller-69569bf449-9bcck   7/7     Running   1 (16m ago)      77m
vsphere-csi-controller-69569bf449-qw6mw   7/7     Running   5 (44m ago)      77m
vsphere-csi-node-2h82g                    3/3     Running   10 (6h3m ago)    40h
vsphere-csi-node-9x2c5                    3/3     Running   13 (11h ago)     40h
vsphere-csi-node-b4jbd                    3/3     Running   10 (119m ago)    40h
vsphere-csi-node-d82x7                    3/3     Running   8 (31h ago)      40h
vsphere-csi-node-fmqrh                    3/3     Running   12 (15h ago)     40h
vsphere-csi-node-md486                    3/3     Running   11 (23h ago)     40h
vsphere-csi-node-sbmbt                    3/3     Running   1 (40h ago)      40h
vsphere-csi-node-vld5b                    3/3     Running   1 (40h ago)      40h
vsphere-csi-node-zcf62                    3/3     Running   11 (15h ago)     40h
vsphere-csi-node-zzdd8                    3/3     Running   17 (6h37m ago)   40h

Special notes for your reviewer:

Release note:

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 30, 2025
@k8s-ci-robot
Contributor

Hi @kolluria. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 30, 2025
@kolluria kolluria force-pushed the fix-refactor-bug branch 2 times, most recently from 9d97c40 to 9b8e448 Compare April 30, 2025 08:44
@kolluria kolluria marked this pull request as ready for review April 30, 2025 08:58
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2025
@kolluria kolluria requested a review from vdkotkar April 30, 2025 19:22
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 5, 2025
@kolluria kolluria force-pushed the fix-refactor-bug branch from 35bf739 to 09dbd5b Compare May 5, 2025 06:35
@vdkotkar
Contributor

vdkotkar commented May 5, 2025

/ok-to-test

@vdkotkar
Contributor

vdkotkar commented May 5, 2025

/approve

@k8s-ci-robot
Contributor

@vdkotkar: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@akankshapanse
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2025
@kolluria kolluria force-pushed the fix-refactor-bug branch from 09dbd5b to 9882f39 Compare May 9, 2025 09:27
@kolluria
Contributor Author

kolluria commented May 9, 2025

After reverting the changes from #3234 and pulling the latest master, I built the image vsphere-csi:9882f398 from my branch and verified that the controller now comes up without any issues. The logs below are for reference.

Controller logs before my change:

2025-05-09T09:18:08.613Z	ERROR	service/driver.go:197	Cluster ID is present in vSphere Config Secret as well as in vsphere-csi-cluster-id ConfigMap. Please remove the cluster ID from vSphere Config Secret.	{"TraceId": "2b587b75-40bd-459a-bed4-367b36870ca3"}
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe
	/build/pkg/csi/service/driver.go:197
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run
	/build/pkg/csi/service/driver.go:216
main.main
	/build/cmd/vsphere-csi/main.go:96
runtime.main
	/usr/local/go/src/runtime/proc.go:271
2025-05-09T09:18:08.613Z	INFO	service/driver.go:113	Configured: "csi.vsphere.vmware.com" with clusterFlavor: "VANILLA" and mode: "controller"	{"TraceId": "2b587b75-40bd-459a-bed4-367b36870ca3"}
2025-05-09T09:18:08.613Z	ERROR	service/driver.go:217	failed to run the driver. Err: +Cluster ID is present in vSphere Config Secret as well as in vsphere-csi-cluster-id ConfigMap. Please remove the cluster ID from vSphere Config Secret.	{"TraceId": "2b587b75-40bd-459a-bed4-367b36870ca3"}
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run
	/build/pkg/csi/service/driver.go:217
main.main
	/build/cmd/vsphere-csi/main.go:96
runtime.main
	/usr/local/go/src/runtime/proc.go:271

Controller logs after my change:

2025-05-09T09:21:44.713Z	INFO	vsphere/virtualmachine.go:180	Returning VM VirtualMachine:vm-43 [VirtualCenterHost: 10.70.176.27, UUID: 420b72e6-b60a-fd5a-f8c2-e1774431a3c4, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.70.176.27]] for UUID 420b72e6-b60a-fd5a-f8c2-e1774431a3c4	{"TraceId": "e97fabba-aa70-4ee4-b35c-baf26674b770"}
2025-05-09T09:21:44.713Z	INFO	node/manager.go:145	Successfully discovered node with nodeUUID 420b72e6-b60a-fd5a-f8c2-e1774431a3c4 in vm VirtualMachine:vm-43 [VirtualCenterHost: 10.70.176.27, UUID: 420b72e6-b60a-fd5a-f8c2-e1774431a3c4, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.70.176.27]]	{"TraceId": "e97fabba-aa70-4ee4-b35c-baf26674b770"}
2025-05-09T09:21:44.713Z	INFO	node/manager.go:128	Successfully discovered node: "k8s-node-525-1746733815" with nodeUUID "420b72e6-b60a-fd5a-f8c2-e1774431a3c4"	{"TraceId": "e97fabba-aa70-4ee4-b35c-baf26674b770"}
2025-05-09T09:21:44.713Z	INFO	node/manager.go:130	Successfully registered node: "k8s-node-525-1746733815" with nodeUUID "420b72e6-b60a-fd5a-f8c2-e1774431a3c4"	{"TraceId": "e97fabba-aa70-4ee4-b35c-baf26674b770"}
2025-05-09T09:21:44.714Z	INFO	node/manager.go:122	Discovering the node vm using uuid: "420b89cd-92b1-c64d-0651-f4086b65f69a"	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.714Z	INFO	vsphere/virtualmachine.go:152	Initiating asynchronous datacenter listing with uuid 420b89cd-92b1-c64d-0651-f4086b65f69a	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.799Z	INFO	vsphere/virtualmachine.go:172	Found VM VirtualMachine:vm-44 [VirtualCenterHost: 10.70.176.27, UUID: 420b89cd-92b1-c64d-0651-f4086b65f69a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.70.176.27]] given uuid 420b89cd-92b1-c64d-0651-f4086b65f69a on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.70.176.27]	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.827Z	INFO	vsphere/virtualmachine.go:180	Returning VM VirtualMachine:vm-44 [VirtualCenterHost: 10.70.176.27, UUID: 420b89cd-92b1-c64d-0651-f4086b65f69a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.70.176.27]] for UUID 420b89cd-92b1-c64d-0651-f4086b65f69a	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.827Z	INFO	node/manager.go:145	Successfully discovered node with nodeUUID 420b89cd-92b1-c64d-0651-f4086b65f69a in vm VirtualMachine:vm-44 [VirtualCenterHost: 10.70.176.27, UUID: 420b89cd-92b1-c64d-0651-f4086b65f69a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.70.176.27]]	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.827Z	INFO	node/manager.go:128	Successfully discovered node: "k8s-node-553-1746733824" with nodeUUID "420b89cd-92b1-c64d-0651-f4086b65f69a"	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.827Z	INFO	node/manager.go:130	Successfully registered node: "k8s-node-553-1746733824" with nodeUUID "420b89cd-92b1-c64d-0651-f4086b65f69a"	{"TraceId": "f5ab3ec3-c2cc-43ff-a6f3-3348edc56da6"}
2025-05-09T09:21:44.827Z	INFO	node/manager.go:122	Discovering the node vm using uuid: "422145de-4dde-d1d5-208c-3e10166c66c7"	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:44.827Z	INFO	vsphere/virtualmachine.go:152	Initiating asynchronous datacenter listing with uuid 422145de-4dde-d1d5-208c-3e10166c66c7	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:44.873Z	INFO	vsphere/virtualmachine.go:172	Found VM VirtualMachine:vm-52 [VirtualCenterHost: 10.161.16.240, UUID: 422145de-4dde-d1d5-208c-3e10166c66c7, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.161.16.240]] given uuid 422145de-4dde-d1d5-208c-3e10166c66c7 on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.161.16.240]	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:44.944Z	INFO	vsphere/virtualmachine.go:180	Returning VM VirtualMachine:vm-52 [VirtualCenterHost: 10.161.16.240, UUID: 422145de-4dde-d1d5-208c-3e10166c66c7, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.161.16.240]] for UUID 422145de-4dde-d1d5-208c-3e10166c66c7	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:44.944Z	INFO	node/manager.go:145	Successfully discovered node with nodeUUID 422145de-4dde-d1d5-208c-3e10166c66c7 in vm VirtualMachine:vm-52 [VirtualCenterHost: 10.161.16.240, UUID: 422145de-4dde-d1d5-208c-3e10166c66c7, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.161.16.240]]	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:44.944Z	INFO	node/manager.go:128	Successfully discovered node: "k8s-node-697-1746733479" with nodeUUID "422145de-4dde-d1d5-208c-3e10166c66c7"	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:44.944Z	INFO	node/manager.go:130	Successfully registered node: "k8s-node-697-1746733479" with nodeUUID "422145de-4dde-d1d5-208c-3e10166c66c7"	{"TraceId": "b1ac4dec-a881-4120-825e-f00202e4bda5"}
2025-05-09T09:21:45.223Z	INFO	common/topology.go:94	Tags associated with category "k8s-zone" are [{ID:urn:vmomi:InventoryServiceTag:8b97c0c4-013d-4ebc-a315-aa18c087fef2:GLOBAL Description:vctag Name:zone-1 CategoryID:urn:vmomi:InventoryServiceCategory:8e773532-0475-402d-b7ee-c8daffac0b6d:GLOBAL UsedBy:[]}]	{"TraceId": "928232db-4676-4d27-838e-d7ffe965a9a2"}
2025-05-09T09:21:45.365Z	INFO	common/topology.go:100	Entities associated with tag "zone-1" are [{Type:Folder Value:group-d1}]	{"TraceId": "928232db-4676-4d27-838e-d7ffe965a9a2"}
2025-05-09T09:21:45.372Z	INFO	vsphere/tagmanager.go:27	New tag manager with useragent 'k8s-csi-useragent-3a99b5fe-b032-453a-81fe-57c91df63114'	{"TraceId": "928232db-4676-4d27-838e-d7ffe965a9a2"}
2025-05-09T09:21:45.643Z	INFO	common/topology.go:94	Tags associated with category "k8s-zone" are [{ID:urn:vmomi:InventoryServiceTag:a305360f-1fcc-4696-81ca-8129a9b7010d:GLOBAL Description:vctag Name:zone-2 CategoryID:urn:vmomi:InventoryServiceCategory:262f03b8-b74d-4cf5-b36d-ce6219adca98:GLOBAL UsedBy:[]}]	{"TraceId": "928232db-4676-4d27-838e-d7ffe965a9a2"}
2025-05-09T09:21:45.727Z	INFO	common/topology.go:100	Entities associated with tag "zone-2" are [{Type:Folder Value:group-d1}]	{"TraceId": "928232db-4676-4d27-838e-d7ffe965a9a2"}
2025-05-09T09:21:45.731Z	INFO	k8sorchestrator/topology.go:251	Topology service initiated successfully	{"TraceId": "928232db-4676-4d27-838e-d7ffe965a9a2"}
2025-05-09T09:21:45.732Z	INFO	service/driver.go:113	Configured: "csi.vsphere.vmware.com" with clusterFlavor: "VANILLA" and mode: "controller"	{"TraceId": "24d04151-83a9-474e-bfe6-e184e19fde9c"}

I also verified that the pods have been stable:

vsphere-csi-controller-7f7944c788-c4vv4   7/7     Running   0               12m
vsphere-csi-controller-7f7944c788-d8vxb   7/7     Running   0               13m
vsphere-csi-controller-7f7944c788-qfvcn   7/7     Running   0               12m
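
A listing like the one above can also be checked programmatically. The following Go sketch is only an illustration, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...); `unhealthyPods` is a hypothetical helper and not part of the driver or of kubectl.

```go
package main

import (
	"fmt"
	"strings"
)

// unhealthyPods scans lines of `kubectl get pods` table output and returns
// the names of pods whose STATUS is not "Running" or whose READY column
// shows fewer ready containers than the total (for example "2/7").
func unhealthyPods(listing string) []string {
	var bad []string
	for _, line := range strings.Split(listing, "\n") {
		fields := strings.Fields(line)
		// Skip blank lines, short lines, and the header row.
		if len(fields) < 3 || fields[0] == "NAME" {
			continue
		}
		name, ready, status := fields[0], fields[1], fields[2]
		readyCount, total, ok := strings.Cut(ready, "/")
		if !ok || readyCount != total || status != "Running" {
			bad = append(bad, name)
		}
	}
	return bad
}

func main() {
	// Two sample rows taken from the listings in this PR.
	listing := `NAME                                      READY   STATUS             RESTARTS          AGE
vsphere-csi-controller-7f7944c788-c4vv4   7/7     Running            0                 12m
vsphere-csi-controller-7bffffc495-24995   2/7     CrashLoopBackOff   2167 (82s ago)    38h`

	fmt.Println(unhealthyPods(listing))
}
```

Only the first three columns are read, so a RESTARTS value such as `2167 (82s ago)`, which contains spaces, does not affect the parsing.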

@kolluria
Contributor Author

kolluria commented May 9, 2025

/ok-to-test

@akankshapanse
Contributor

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akankshapanse, kolluria, vdkotkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 9, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7673442 into kubernetes-sigs:master May 9, 2025
12 checks passed
@kolluria kolluria deleted the fix-refactor-bug branch May 9, 2025 10:37
nikhilbarge pushed a commit to nikhilbarge/vsphere-csi-driver that referenced this pull request May 14, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 16, 2025
* Reverted the refactoring done as part of #3234 (#3266)

* Fix panic on DeletedFinalStateUnknown (#3262)

Properly handle DeletedFinalStateUnknown when watching Nodes

* fullsync fix: add nil check on claimreference while listing pvs (#3269)

* bump up k8s version to 1.33.0 (#3274)

---------

Co-authored-by: Satyanarayana Kolluri
Co-authored-by: Jan Šafránek