node topology worker should run #617
Conversation
Hi @bai3shuo4. Thanks for your PR. I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign pohly
pohly left a comment:
Please squash your commits and reference the commit where this broke.
pkg/capacity/capacity.go (outdated):

klog.Info("Starting Capacity Controller")
defer c.queue.ShutDown()

go c.topologyInformer.Run(ctx)
This looks like the wrong place to put this. To be consistent with how the other informers are handled, this should be in the location where the informer gets created (i.e. in csi-provisioner.go), not where it is used.
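For illustration, a self-contained sketch of that pattern; the names topologyWorker, startBackground, and factoryStart are hypothetical stand-ins, not the actual csi-provisioner.go code:

package main

import "context"

// topologyWorker is a hypothetical stand-in for c.topologyInformer; only
// the method relevant to this discussion is included.
type topologyWorker interface {
    Run(ctx context.Context)
}

// startBackground sketches the proposed pattern: the main binary starts
// every informer-like background loop in one place, and the capacity
// controller merely consumes their output.
func startBackground(ctx context.Context, factoryStart func(stopCh <-chan struct{}), ti topologyWorker) {
    factoryStart(ctx.Done()) // starts the shared node/CSINode informers
    go ti.Run(ctx)           // start the topology loop right next to them
}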
This is where it was placed: ff439ee
According to that commit, if this line is removed directly, the node topology worker does not run: if you then change a node label or a CSINode object, nothing happens. That causes problems.
I agree that the removal of go c.topologyInformer.Run(ctx) in that commit introduced a bug. The question is just where that Run() call should go instead. I propose cmd/csi-provisioner/csi-provisioner.go.
In fact, c.topologyInformer is not an informer: its Run(ctx) only starts a topology queue worker. The node informer and CSINode informer are started by factory.Start() in cmd/csi-provisioner/csi-provisioner.go. Moreover, only the capacity feature uses this worker, so it should belong to the capacity controller, right?
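To make the distinction concrete, here is a minimal sketch of such a queue worker (assumed shape with hypothetical names; the real nodeTopology code differs):

package main

import (
    "context"
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

type nodeTopology struct {
    queue workqueue.RateLimitingInterface
}

// processNextWorkItem drains one item from the queue; the items are put
// there by event handlers registered on the node and CSINode informers,
// which are started separately via factory.Start().
func (nt *nodeTopology) processNextWorkItem(ctx context.Context) bool {
    item, shutdown := nt.queue.Get()
    if shutdown {
        return false
    }
    defer nt.queue.Done(item)
    fmt.Println("syncing topology after change to", item) // placeholder for the real sync
    nt.queue.Forget(item)
    return true
}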
Maybe I can rename the method to avoid the ambiguity, e.g. change Run to RunWorker.
Yes, c.topologyInformer is not a SharedInformer instance from an informer factory, but conceptually it is the same.
We can rename Run -> RunWorker to make that more obvious.
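A sketch of the rename on the interface; the surrounding details are hypothetical, trimmed for illustration:

package topology

import "context"

// Informer is reduced here to the method being renamed; the real
// interface has more methods.
type Informer interface {
    // RunWorker processes the topology work queue until ctx is cancelled.
    // Calling it Run would wrongly suggest a SharedInformer's Run.
    RunWorker(ctx context.Context)
}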
@pohly I have revised it this way. Can you review again? Thanks!
Force-pushed from f9b41df to aaa456d
We have unit tests for "topology changes", but that doesn't help here because this is an issue about combining the various pieces into the final binary. An E2E test would be needed; right now we don't have a driver for testing this. @bai3shuo4: You are using this with network-attached storage, i.e. you have a central external-provisioner instance, right? Distributed provisioning uses a fixed topology, so it shouldn't matter there whether the informers run.
Yes, I use a central instance with network storage.
How important is storage capacity tracking in such a setup? Are there limitations on which nodes have access to which storage?
We have different storage clusters with different capacities for nodes, divided by topology. Your CSIStorageCapacities really help a lot.
One more thing: I think CSIStorageCapacity should also be refreshed when we expand a volume. I'm now working with csi-resizer to make that happen.
The periodic refresh will cover that, but I agree, something that actually responds to changes would be better. If you have some idea, that would be great. The modular design of the CSI sidecars is working against us here; if all the functionality was in a single "controller sidecar", it would be easier.
Force-pushed from aaa456d to 3a29b8c
func (mt *Mock) RunWorker(ctx context.Context) {
}
This replaces the Run method, doesn't it?
Yes. Do you mean I can remove the topology informer's Run method directly?
Yes. I thought that was the proposal: clarify that this is not a SharedInformer by giving the "Run" method a different name. Nothing else calls it, so we are free to choose an arbitrary name.
If that makes the patch very large, then perhaps we should simply stick with Run.
klog.Infof("Started node topology worker")
<-ctx.Done()
klog.Infof("Shutting node topology worker")
}
Same here.
HasSynced() bool

// RunWorker starts a worker to process queue.
RunWorker(ctx context.Context)
And here.
pkg/capacity/capacity.go (outdated):

klog.Info("Starting Capacity Controller")
defer c.queue.ShutDown()

go c.topologyInformer.RunWorker(ctx)
I still think this belongs in csi-provisioner.go. We should add a comment to NewCentralCapacityController saying that "all informers are expected to be started by the caller".
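A sketch of such a comment (hypothetical signature, trimmed for illustration):

package capacity

// Controller is reduced to an empty struct for this sketch.
type Controller struct{}

// NewCentralCapacityController creates the central capacity controller.
//
// All informers, including the topology informer, are expected to be
// started (and stopped) by the caller; the controller only consumes them.
func NewCentralCapacityController() *Controller {
    return &Controller{}
}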
The node topology worker only works for the capacity controller. If something goes wrong before the capacity controller's Run, we don't even need to start this worker. Therefore, why not put it into the controller's Run function? :)
Because it is conceptually consistent ("all informers are expected to be started by the caller"; we can even add "started and stopped").
If something goes wrong in csi-provisioner.go, the binary will exit. It makes no difference how many goroutines get killed by that, and the overhead is also irrelevant.
Force-pushed from 3a29b8c to d175b80
pohly left a comment:
Getting close... now we need to describe the change better in the release note.
Can you change that into:
Fix capacity information updates when topology changes. Only affected central deployment and network attached storage, not deployment on each node. This broke in v2.2.0 as part of a bug fix for capacity informer handling.
pkg/capacity/topology/nodes.go (outdated):

go nt.runWorker(ctx)

- klog.Info("Started node topology informer")
+ klog.Infof("Started node topology worker")
Please stick with Info, the string is not a format string.
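The distinction, assuming standard klog semantics:

package main

import "k8s.io/klog/v2"

func logExamples(segment string) {
    // klog.Info takes plain arguments; use it when the message has no format verbs.
    klog.Info("Started node topology worker")
    // klog.Infof treats its first argument as a format string; reserve it
    // for messages that actually interpolate values.
    klog.Infof("producing CSIStorageCapacity objects with fixed topology segment %s", segment)
}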
pkg/capacity/topology/nodes.go (outdated):

klog.Infof("Started node topology worker")
<-ctx.Done()
- klog.Info("Shutting node topology informer")
+ klog.Infof("Shutting node topology worker")
Same here.
klog.Infof("producing CSIStorageCapacity objects with fixed topology segment %s", segment)
topologyInformer = topology.NewFixedNodeTopology(&segment)
}
go topologyInformer.RunWorker(context.TODO())
Please use context.Background() here. context.TODO() implies that further changes are needed, which is not the case here.
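In isolation, the requested change looks like this (worker and start are hypothetical stand-ins, with topologyInformer assumed to implement RunWorker as above):

package main

import "context"

type worker interface {
    RunWorker(ctx context.Context)
}

func start(topologyInformer worker) {
    // context.Background() is the deliberate choice for a top-level context
    // that lives for the whole process; context.TODO() would signal that
    // proper context plumbing is still missing.
    go topologyInformer.RunWorker(context.Background())
}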
/kind bug
/ok-to-test
Force-pushed from d175b80 to 38ce9c3
pohly left a comment:
Sorry, one more idea...
klog.Infof("Started node topology worker")
<-ctx.Done()
klog.Infof("Shutting node topology worker")
}
Now that we have simplified this function like this, one other enhancement becomes possible: we no longer need the separate runWorker function and can save one goroutine:

func (nt *nodeTopology) RunWorker(ctx context.Context) {
    klog.Info("Started node topology worker")
    for nt.processNextWorkItem(ctx) {
    }
    klog.Info("Stopped node topology worker")
}

That can be added later. Let me merge it as-is. /lgtm
/approve

1 similar comment:

/approve
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: bai3shuo4, msau42, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@pohly Sorry, I only just checked my GitHub... Yes, I think we can simplify this func; maybe I can add that in a further bugfix.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Bugfix for an earlier commit: if the topology informer's Run call is removed directly, the topology worker does not run and does not sync. This fixes the regression from PR #590.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: