Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master #18758
Conversation
Force-pushed from 6163713 to 2f10bff (Compare)
Force-pushed from 2f10bff to 0b33729 (Compare)
This change fixes these 2 issues:

- Currently, clearing the NodeNetworkUnavailable node condition only works if we are successful in updating the node status during the first iteration. Subsequent retries will not work because:
  1. knode != node
  2. node.Status is updated in memory
  3. UpdateNodeStatus(knode)
  Step (3) has no effect because in step (2) node.Status is updated but not knode.Status.
- The Node object passed to this method is a pointer to an item in the informer cache and should not be modified directly.
Avoid the NodeNetworkUnavailable condition check for every node status update:

- We know that kubelet sets the NodeNetworkUnavailable condition when the node is created/registered with the API server.
- So we only need to call clearInitialNodeNetworkUnavailableCondition() the first time, not during subsequent node status update events.
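For illustration, here is a minimal sketch of what the corrected retry loop could look like. The client shape (master.kClient), the NoRouteCreated/RouteCreated reason strings, and the client-go retry helper are assumptions for the example and are not guaranteed to match the actual code in this PR. The key points are that the informer-cache object is deep-copied, a fresh Node is fetched whenever a retry is needed, and the condition is modified through a pointer into that same object's Status.Conditions, so the subsequent UpdateStatus call actually carries the change.

```go
// Hedged sketch, not the exact code from this PR. Assumed imports:
// "fmt", "github.com/golang/glog",
// kapi "k8s.io/kubernetes/pkg/apis/core",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
// utilruntime "k8s.io/apimachinery/pkg/util/runtime",
// "k8s.io/client-go/util/retry".
func (master *OsdnMaster) clearInitialNodeNetworkUnavailableCondition(origNode *kapi.Node) {
	// origNode points into the informer cache: never mutate it directly.
	knode := origNode.DeepCopy()
	cleared := false

	resultErr := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		var err error
		if knode == nil {
			// A previous UpdateStatus attempt failed; refetch so this retry
			// works on current data instead of a stale copy.
			knode, err = master.kClient.Core().Nodes().Get(origNode.Name, metav1.GetOptions{})
			if err != nil {
				return err
			}
		}

		// Index-based loop: condition points into knode.Status.Conditions,
		// so the mutation is visible to UpdateStatus(knode).
		for i := range knode.Status.Conditions {
			if knode.Status.Conditions[i].Type != kapi.NodeNetworkUnavailable {
				continue
			}
			condition := &knode.Status.Conditions[i]
			if condition.Status != kapi.ConditionFalse && condition.Reason == "NoRouteCreated" {
				condition.Status = kapi.ConditionFalse
				condition.Reason = "RouteCreated"
				condition.Message = "openshift-sdn cleared kubelet-set NoRouteCreated"
				condition.LastTransitionTime = metav1.Now()
				if _, err = master.kClient.Core().Nodes().UpdateStatus(knode); err != nil {
					knode = nil // force a refetch on the next retry
				} else {
					cleared = true
				}
			}
			break
		}
		return err
	})

	if resultErr != nil {
		utilruntime.HandleError(fmt.Errorf("status update failed for node %q: %v", origNode.Name, resultErr))
	} else if cleared {
		glog.Infof("Cleared NodeNetworkUnavailable condition for node %q", origNode.Name)
	}
}
```

Fetching a fresh object on each retry, rather than reusing the first in-memory copy, is what keeps the locally modified status and the object sent to the API server in sync across retries.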
Force-pushed from 0b33729 to 7d5f2ac (Compare)
I don't have access to GCP, so I simulated the issue on a dind environment and tested the fix.
What is this in reference to? Can you link to the relevant GitHub issue or rhbz bug?
lgtm
cleared = true
for i := range knode.Status.Conditions {
    if knode.Status.Conditions[i].Type == kapi.NodeNetworkUnavailable {
        condition := &knode.Status.Conditions[i]
for _, condition := range knode.Status.Conditions {
    if condition.Type == kapi.NodeNetworkUnavailable {
for _, condition := range knode.Status.Conditions {
With this implementation, 'condition' is a copy of knode.Status.Conditions[i], so modifying 'condition' will not change anything in knode.Status.Conditions.
In this case, we do want to modify the condition.{Status, Reason, ...} fields.
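As a small standalone illustration of that point (the type and values below are made up for the example), ranging by value yields a copy of each element, while indexing into the slice and taking a pointer mutates the element in place:

```go
package main

import "fmt"

type condition struct {
	Type   string
	Status string
}

func main() {
	conds := []condition{{Type: "NetworkUnavailable", Status: "True"}}

	// The range variable is a copy: this mutation is lost.
	for _, c := range conds {
		c.Status = "False"
	}
	fmt.Println(conds[0].Status) // still "True"

	// Index + pointer mutates the slice element itself.
	for i := range conds {
		if conds[i].Type == "NetworkUnavailable" {
			c := &conds[i]
			c.Status = "False"
		}
	}
	fmt.Println(conds[0].Status) // now "False"
}
```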
if oldNodeIP, ok := master.hostSubnetNodeIPs[node.UID]; ok && (nodeIP == oldNodeIP) {
    return
}
// Node status is frequently updated by kubelet, so log only if the above condition is not met
glog.V(5).Infof("Watch %s event for Node %q", eventType, node.Name)
master.clearInitialNodeNetworkUnavailableCondition(node) |
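For context, here is a hedged sketch of how a node event handler built around the snippet above might skip kubelet's frequent status-only updates and clear the condition only when a node is first seen or its IP changes. Helper names such as getNodeInternalIP and addNode, and the exact handler signature, are assumptions for illustration, not necessarily the real functions in this repository.

```go
// Hedged sketch of the add/update handler around the snippet above.
// Assumed imports: "fmt", "github.com/golang/glog",
// kapi "k8s.io/kubernetes/pkg/apis/core",
// utilruntime "k8s.io/apimachinery/pkg/util/runtime",
// "k8s.io/apimachinery/pkg/watch".
func (master *OsdnMaster) handleAddOrUpdateNode(obj, _ interface{}, eventType watch.EventType) {
	node := obj.(*kapi.Node)
	nodeIP, err := getNodeInternalIP(node) // hypothetical helper
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("failed to get node IP for %q: %v", node.Name, err))
		return
	}

	// Kubelet updates node status every few seconds; if we have already seen
	// this node with the same IP, there is nothing for the SDN master to do.
	if oldNodeIP, ok := master.hostSubnetNodeIPs[node.UID]; ok && (nodeIP == oldNodeIP) {
		return
	}
	// Node status is frequently updated by kubelet, so log only if the above condition is not met
	glog.V(5).Infof("Watch %s event for Node %q", eventType, node.Name)

	// First time we see this node (or its IP changed): clear the initial
	// kubelet-set NodeNetworkUnavailable condition and set up its host subnet.
	master.clearInitialNodeNetworkUnavailableCondition(node)
	if err := master.addNode(node.Name, nodeIP); err != nil { // hypothetical helper
		utilruntime.HandleError(fmt.Errorf("error creating subnet for node %s, ip %s: %v", node.Name, nodeIP, err))
		return
	}
	master.hostSubnetNodeIPs[node.UID] = nodeIP
}
```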
Before we do this, we should make sure that we don't run these controllers at all:
kubernetes/pkg/controller/cloud/node_controller.go
kubernetes/pkg/controller/route/router_controller.go
since they both will set NodeNetworkUnavailable on the node in addition to kubelet. I don't think we run the route controller, but I'm not sure about the node controller.
Hm... oh, yeah, for some reason I was thinking we'd still run clearInitialNodeNetworkUnavailableCondition on any "real" Node change, just not on the "Node status is frequently updated by kubelet" changes. But I guess this makes it so we only run clearInitialNodeNetworkUnavailableCondition when the IP changes, which is riskier.
We do not run CloudNodeController (kubernetes/pkg/controller/cloud/node_controller.go), RouteController (kubernetes/pkg/controller/route/router_controller.go), or the other controllers in kubernetes/pkg/controller/node/ipam/{sync, adapter, cloud_cidr_allocator} in OpenShift, which are where the NodeNetworkUnavailable condition is used.
Created bug: https://bugzilla.redhat.com/show_bug.cgi?id=1550266 to ensure there are no issues/regressions on GCP with this change.
/retest
/lgtm
Thanks Ravi
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: knobunc, pravisankar. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test all [submit-queue is verifying that this PR is safe to merge]
@pravisankar: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Automatic merge from submit-queue.
This change fixes these 2 issues:

- Currently, clearing the NodeNetworkUnavailable node condition only works if we are successful in updating the node status during the first iteration. Subsequent retries will not work because:
  1. knode != node
  2. node.Status is updated in memory
  3. UpdateNodeStatus(knode)
  Step (3) has no effect because in step (2) node.Status is updated but not knode.Status.
- The Node object passed to this method is a pointer to an item in the informer cache and should not be modified directly.

Avoid the NodeNetworkUnavailable condition check for every node status update:

- We know that kubelet sets the NodeNetworkUnavailable condition when the node is created/registered with the API server.
- So we only need to call clearInitialNodeNetworkUnavailableCondition() the first time, not during subsequent node status update events.