
Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master #18758

Merged

Conversation


@pravisankar pravisankar commented Feb 27, 2018

This change fixes two issues:

  • Currently, clearing the NodeNetworkUnavailable node condition only works
    if the node status update succeeds on the first iteration.
    Subsequent retries have no effect because:

    1. knode != node
    2. node.Status is updated in memory
    3. UpdateNodeStatus(knode) is called

    Step (3) has no effect because step (2) updated node.Status but not knode.Status.
    (A sketch of the corrected update pattern follows this description.)
  • The Node object passed to this method is a pointer to an item in the informer
    cache and should not be modified directly.

Avoid NodeNetworkUnavailable condition check for every node status update

  • We know that kubelet sets the NodeNetworkUnavailable condition when the node is
    created/registered with the API server.
  • So clearInitialNodeNetworkUnavailableCondition() only needs to be called the
    first time, not on every subsequent node status update event.
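
For reference, a minimal sketch of the corrected pattern: work on a deep copy of the cached Node, and on an update conflict re-fetch a fresh object and apply the mutation to that same object before calling UpdateStatus. This is not the PR's exact diff; it is written against a current client-go API, and the function signature, the kubeClient/ctx parameters, and the "RouteCreated" reason string are illustrative assumptions.

package sdnmaster

import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/util/retry"
)

// clearInitialNodeNetworkUnavailableCondition flips the kubelet-set
// NodeNetworkUnavailable condition to False without ever mutating the
// informer cache entry that was passed in.
func clearInitialNodeNetworkUnavailableCondition(ctx context.Context, kubeClient kubernetes.Interface, cachedNode *corev1.Node) error {
        // Work on a copy; cachedNode points into the informer cache.
        node := cachedNode.DeepCopy()
        return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
                if node == nil {
                        // A previous attempt failed: re-fetch so the mutation below
                        // is applied to the same object we send back to the API server.
                        fresh, err := kubeClient.CoreV1().Nodes().Get(ctx, cachedNode.Name, metav1.GetOptions{})
                        if err != nil {
                                return err
                        }
                        node = fresh
                }
                for i := range node.Status.Conditions {
                        condition := &node.Status.Conditions[i] // pointer into the slice, not a copy
                        if condition.Type != corev1.NodeNetworkUnavailable {
                                continue
                        }
                        if condition.Status != corev1.ConditionFalse {
                                condition.Status = corev1.ConditionFalse
                                condition.Reason = "RouteCreated" // illustrative reason string
                                condition.LastTransitionTime = metav1.Now()
                                if _, err := kubeClient.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{}); err != nil {
                                        node = nil // force a re-fetch on the next retry
                                        return err
                                }
                        }
                        break
                }
                return nil
        })
}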

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 27, 2018
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2018
@pravisankar pravisankar force-pushed the fix-clear-nodenetwork branch from 6163713 to 2f10bff Compare February 27, 2018 00:57
@pravisankar pravisankar force-pushed the fix-clear-nodenetwork branch from 2f10bff to 0b33729 Compare February 27, 2018 03:04
Ravi Sankar Penta added 2 commits February 26, 2018 19:12

  • Fix clearInitialNodeNetworkUnavailableCondition() in sdn master
  • Avoid NodeNetworkUnavailable condition check for every node status update

(The commit messages mirror the PR description above.)
@pravisankar pravisankar force-pushed the fix-clear-nodenetwork branch from 0b33729 to 7d5f2ac Compare February 27, 2018 03:13
@pravisankar pravisankar added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2018
@pravisankar
Author

I don't have access to GCP, so I simulated the issue in a dind environment and tested the fix.
@openshift/sig-networking @dcbw PTAL

@danwinship
Contributor

I don't have access to GCP

What is this in reference to? Can you link to the relevant github issue or rhbz bug?

Contributor

@danwinship danwinship left a comment


lgtm

cleared = true
for i := range knode.Status.Conditions {
        if knode.Status.Conditions[i].Type == kapi.NodeNetworkUnavailable {
                condition := &knode.Status.Conditions[i]
Contributor


for _, condition := range knode.Status.Conditions {
        if condition.Type == kapi.NodeNetworkUnavailable {

Author


for _, condition := range knode.Status.Conditions {

With this implementation, 'condition' is a copy of knode.Status.Conditions[i], so modifying 'condition' does not change anything in knode.Status.Conditions.
In this case we do want to modify the condition.{Status, Reason, ...} fields in place.
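
A tiny standalone illustration of that distinction (hypothetical Condition type, not the actual kapi types):

package main

import "fmt"

type Condition struct {
        Type   string
        Status string
}

func main() {
        conditions := []Condition{{Type: "NetworkUnavailable", Status: "True"}}

        // Range by value: c is a copy of the slice element, so the write is lost.
        for _, c := range conditions {
                if c.Type == "NetworkUnavailable" {
                        c.Status = "False"
                }
        }
        fmt.Println(conditions[0].Status) // prints "True"

        // Range by index and take a pointer into the slice: the write sticks.
        for i := range conditions {
                if conditions[i].Type == "NetworkUnavailable" {
                        condition := &conditions[i]
                        condition.Status = "False"
                }
        }
        fmt.Println(conditions[0].Status) // prints "False"
}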


if oldNodeIP, ok := master.hostSubnetNodeIPs[node.UID]; ok && (nodeIP == oldNodeIP) {
        return
}
// Node status is frequently updated by kubelet, so log only if the above condition is not met
glog.V(5).Infof("Watch %s event for Node %q", eventType, node.Name)

master.clearInitialNodeNetworkUnavailableCondition(node)
Contributor


Before we do this, we should make sure that we don't run these controllers at all:

kubernetes/pkg/controller/cloud/node_controller.go
kubernetes/pkg/controller/route/router_controller.go

since they both will set NodeNetworkUnavailable on the node in addition to kubelet. I don't think we run the route controller, but I'm not sure about the node controller.

Contributor


Hm... oh, yeah, for some reason I was thinking we'd still run clearInitialNodeNetworkUnavailableCondition on any "real" Node change, just not on the "Node status is frequently updated by kubelet" changes. But I guess this makes it so we only run clearInitialNodeNetworkUnavailableCondition when the IP changes, which is riskier.

Author

@pravisankar pravisankar Feb 28, 2018


In OpenShift we do not run CloudNodeController (kubernetes/pkg/controller/cloud/node_controller.go), RouteController (kubernetes/pkg/controller/route/router_controller.go), or the other kubernetes/pkg/controller/node/ipam/{sync, adapter, cloud_cidr_allocator} controllers where the NodeNetworkUnavailable condition is used.

Author

@pravisankar pravisankar Feb 28, 2018


Created bug: https://bugzilla.redhat.com/show_bug.cgi?id=1550266 to ensure there are no issues/regressions on GCP with this change.

@pravisankar pravisankar changed the title Fix clearInitialNodeNetworkUnavailableCondition() in sdn master Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master Feb 28, 2018
@pravisankar
Author

/retest

@knobunc knobunc self-assigned this Mar 16, 2018
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2018
Contributor

@knobunc knobunc left a comment


/lgtm

Thanks Ravi

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, pravisankar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-ci-robot

openshift-ci-robot commented Mar 16, 2018

@pravisankar: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/gcp
Commit: 7d5f2ac
Rerun command: /test gcp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 308bb2e into openshift:master Mar 16, 2018
Labels: approved, component/networking, kind/bug, lgtm, sig/networking, size/M