Skip to content

Conversation

@bai3shuo4
Copy link
Contributor

@bai3shuo4 bai3shuo4 commented Apr 25, 2021

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind bug

What this PR does / why we need it:
bugfix for commit before, if we directly remove topology informer run, the topology worker will not run and do not sync. fix this PR #590
Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix capacity information updates when topology changes. Only affected central deployment and network attached storage, not deployment on each node. This broke in v2.2.0 as part of a bug fix for capacity informer handling.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 25, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @bai3shuo4. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 25, 2021
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 25, 2021
@bai3shuo4
Copy link
Contributor Author

@pohly sorry, this PR should be fix further. Do you remember this PR: #590. It will cause problem when topology change

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 26, 2021
@bai3shuo4
Copy link
Contributor Author

/assign pohly

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash your commits and reference the commit where this broke.

klog.Info("Starting Capacity Controller")
defer c.queue.ShutDown()

go c.topologyInformer.Run(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the wrong place to put this. To be consistent with how the other informers are handled, this should be in the location where the informer gets created (i.e. in csi-provisioner.go), not where it is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the place it was placed: ff439ee

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this commit, if this line was removed directly, the node topology worker will not run. If you change the node label or csinode, nothing will happen. It will cause problem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that removal of go c.topologyInformer.Run(ctx) in that commit introduced a bug. The question just is where that Run() call should go instead. I propose cmd/csi-provisioner/csi-provisioner.go.

Copy link
Contributor Author

@bai3shuo4 bai3shuo4 Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, c.topologyInformer.Run(ctx) is not an informer. It only starts a topology queue worker. the node informer and csinode informer are started by factory.Start() in cmd/csi-provisioner/csi-provisioner.go. Moreover, only capacity feature uses this worker and it should belong to capacity controller right?

Copy link
Contributor Author

@bai3shuo4 bai3shuo4 Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can change this method name to avoid ambiguity like changing Run to RunWorker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, c.topologyInformer it is not a SharedInformer instance from an informer factory. But conceptually it is the same.

We can rename Run -> RunWorker to make that more obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pohly I have revised to this way. Can you review again? Thanks!

@bai3shuo4 bai3shuo4 force-pushed the bai3shuo4/fix-topology-worker branch from f9b41df to aaa456d Compare April 28, 2021 06:40
@pohly
Copy link
Contributor

pohly commented Apr 28, 2021

We have unit tests for "topology changes", but that doesn't help here because this is an issue about combining the various pieces into the final binary. An E2E test would be needed which:

  • deploys a CSI driver on multiple nodes, with capacity tracking
  • changes the node topology while the test runs

Right now we don't have a driver for testing this.

@bai3shuo4: You are using this with network attached storage, i.e. you have a central external-provisioner instance, right? Distributed provisioning uses a fixed topology, so it shouldn't matter there whether the informers run.

@bai3shuo4
Copy link
Contributor Author

We have unit tests for "topology changes", but that doesn't help here because this is an issue about combining the various pieces into the final binary. An E2E test would be needed which:

  • deploys a CSI driver on multiple nodes, with capacity tracking
  • changes the node topology while the test runs

Right now we don't have a driver for testing this.

@bai3shuo4: You are using this with network attached storage, i.e. you have a central external-provisioner instance, right? Distributed provisioning uses a fixed topology, so it shouldn't matter there whether the informers run.

Yes, I use central one and network storage

@pohly
Copy link
Contributor

pohly commented Apr 28, 2021

Yes, I use central one and network storage

How important is storage capacity tracking in such a setup? Are there some limitations which nodes have access to which storage?

@bai3shuo4
Copy link
Contributor Author

Yes, I use central one and network storage

How important is storage capacity tracking in such a setup? Are there some limitations which nodes have access to which storage?

We have different storage cluster with different capacity for node which is divided by topology. Your CSIStorageCapacities really help a lot

@bai3shuo4
Copy link
Contributor Author

Yes, I use central one and network storage

How important is storage capacity tracking in such a setup? Are there some limitations which nodes have access to which storage?

One more thing, I think CSIStorageCapacity should refresh when we expand volume or something. Now I'm working with csi-resizer to help this happening

@pohly
Copy link
Contributor

pohly commented Apr 28, 2021

One more thing, I think CSIStorageCapacity should refresh when we expand volume or something. Now I'm working with csi-resizer to help this happening

The periodic refresh will cover that, but I agree, something that is actually responding to changes would be better. If you have some idea then that would be great.

The modular design of the CSI sidecars is working against us here. If all functionality was in a single "controller sidecar", it would be easier.

@bai3shuo4 bai3shuo4 force-pushed the bai3shuo4/fix-topology-worker branch from aaa456d to 3a29b8c Compare April 28, 2021 08:37
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 28, 2021
@bai3shuo4 bai3shuo4 requested a review from pohly April 28, 2021 10:08

func (mt *Mock) RunWorker(ctx context.Context) {
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the Run method, doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you mean I can remove topology informer Run method directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I thought that was the proposal: clarify that this is not a SharedInformer by naming the "Run" method not "Run". Nothing else calls it, so we are free to choose an arbitrary name.

If that makes the patch very large, then perhaps we should simply stick with Run.

klog.Infof("Started node topology worker")
<-ctx.Done()
klog.Infof("Shutting node topology worker")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have simplified this function like this, one other enhancement becomes possible: we no longer need the separate runWorker function and can save one goroutine:

func (nt *nodeTopology) RunWorker(ctx context.Context) {
	klog.Info("Started node topology informer")
        for nt.processNextWorkItem(ctx) {
	}
	klog.Info("Stopped node topology worker")
}

HasSynced() bool

// RunWorker starts a worker to process queue.
RunWorker(ctx context.Context)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

klog.Info("Starting Capacity Controller")
defer c.queue.ShutDown()

go c.topologyInformer.RunWorker(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this belongs into csi-provisioner.go. We should add a comment to NewCentralCapacityController that "all informers are expected to be started by the caller".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node topology worker only works for Capacity controller. If something goes wrong before CapacityController Run, I think we don't even need to start this worker. Therefore, why not put it into controller run function :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is consistent conceptually ("all informers are expected to be started by the caller"- we can even add "started and stopped").

If something goes wrong in csi-provisioner.go, the binary will exit. It'll make no difference how many goroutines get killed by that and the overhead is also irrelevant.

@bai3shuo4 bai3shuo4 force-pushed the bai3shuo4/fix-topology-worker branch from 3a29b8c to d175b80 Compare April 29, 2021 12:11
@bai3shuo4 bai3shuo4 requested a review from pohly April 29, 2021 12:11
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close... now we need to describe the change better in the release note.

Can you change that into:

Fix capacity information updates when topology changes. Only affected central deployment and network attached storage, not deployment on each node. This broke in v2.2.0 as part of a bug fix for capacity informer handling.

go nt.runWorker(ctx)

klog.Info("Started node topology informer")
klog.Infof("Started node topology worker")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please stick with Info, the string is not a format string.

klog.Infof("Started node topology worker")
<-ctx.Done()
klog.Info("Shutting node topology informer")
klog.Infof("Shutting node topology worker")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

klog.Infof("producing CSIStorageCapacity objects with fixed topology segment %s", segment)
topologyInformer = topology.NewFixedNodeTopology(&segment)
}
go topologyInformer.RunWorker(context.TODO())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use context.Background() here. context.TODO() implies that further changes are needed, which is not the case here.

@pohly
Copy link
Contributor

pohly commented Apr 30, 2021

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 30, 2021
@pohly
Copy link
Contributor

pohly commented Apr 30, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 30, 2021
@bai3shuo4 bai3shuo4 force-pushed the bai3shuo4/fix-topology-worker branch from d175b80 to 38ce9c3 Compare May 1, 2021 03:36
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more idea...

klog.Infof("Started node topology worker")
<-ctx.Done()
klog.Infof("Shutting node topology worker")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have simplified this function like this, one other enhancement becomes possible: we no longer need the separate runWorker function and can save one goroutine:

func (nt *nodeTopology) RunWorker(ctx context.Context) {
	klog.Info("Started node topology informer")
        for nt.processNextWorkItem(ctx) {
	}
	klog.Info("Stopped node topology worker")
}

@pohly
Copy link
Contributor

pohly commented May 5, 2021

Sorry, one more idea...

That can be added later. Let me merge it as-is.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2021
@pohly pohly mentioned this pull request May 5, 2021
@pohly
Copy link
Contributor

pohly commented May 5, 2021

/approve

1 similar comment
@msau42
Copy link
Collaborator

msau42 commented May 5, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bai3shuo4, msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit f71827b into kubernetes-csi:master May 5, 2021
@bai3shuo4
Copy link
Contributor Author

Sorry, one more idea...

@pohly Sorry, I just check my github at this moment....yeap, I think we can simplify this func, maybe I can add this commit in further bugfix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants