
Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master #18758

Merged

Conversation


@pravisankar pravisankar commented Feb 27, 2018

This change fixes two issues:

  • Currently, clearing the NodeNetworkUnavailable node condition only works
    if the node status update succeeds on the first iteration.
    Subsequent retries have no effect because:

    1. knode != node
    2. node.Status is updated in memory
    3. UpdateNodeStatus(knode) is called

    Step (3) has no effect because step (2) updated node.Status but not knode.Status.
    (A sketch of the corrected update pattern follows this description.)
  • The Node object passed to this method is a pointer to an item in the informer
    cache and should not be modified directly.

Avoid NodeNetworkUnavailable condition check for every node status update

  • We know that kubelet sets the NodeNetworkUnavailable condition when the node is
    created/registered with the API server.
  • So clearInitialNodeNetworkUnavailableCondition() only needs to be called the
    first time, not on every subsequent node status update event.
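
For reference, a minimal sketch of the corrected pattern: work on a deep copy of the cached Node, and on an update conflict re-fetch a fresh object and apply the mutation to that same object before calling UpdateStatus. This is not the PR's exact diff; it is written against a current client-go API, and the function signature, the kubeClient/ctx parameters, and the "RouteCreated" reason string are illustrative assumptions.

package sdnmaster

import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/util/retry"
)

// clearInitialNodeNetworkUnavailableCondition flips the kubelet-set
// NodeNetworkUnavailable condition to False without ever mutating the
// informer cache entry that was passed in.
func clearInitialNodeNetworkUnavailableCondition(ctx context.Context, kubeClient kubernetes.Interface, cachedNode *corev1.Node) error {
        // Work on a copy; cachedNode points into the informer cache.
        node := cachedNode.DeepCopy()
        return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
                if node == nil {
                        // A previous attempt failed: re-fetch so the mutation below
                        // is applied to the same object we send back to the API server.
                        fresh, err := kubeClient.CoreV1().Nodes().Get(ctx, cachedNode.Name, metav1.GetOptions{})
                        if err != nil {
                                return err
                        }
                        node = fresh
                }
                for i := range node.Status.Conditions {
                        condition := &node.Status.Conditions[i] // pointer into the slice, not a copy
                        if condition.Type != corev1.NodeNetworkUnavailable {
                                continue
                        }
                        if condition.Status != corev1.ConditionFalse {
                                condition.Status = corev1.ConditionFalse
                                condition.Reason = "RouteCreated" // illustrative reason string
                                condition.LastTransitionTime = metav1.Now()
                                if _, err := kubeClient.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{}); err != nil {
                                        node = nil // force a re-fetch on the next retry
                                        return err
                                }
                        }
                        break
                }
                return nil
        })
}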

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 27, 2018
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2018
@pravisankar pravisankar force-pushed the fix-clear-nodenetwork branch from 6163713 to 2f10bff Compare February 27, 2018 00:57
@pravisankar pravisankar force-pushed the fix-clear-nodenetwork branch from 2f10bff to 0b33729 Compare February 27, 2018 03:04
Ravi Sankar Penta added 2 commits February 26, 2018 19:12

  • Fix clearInitialNodeNetworkUnavailableCondition() in sdn master
  • Avoid NodeNetworkUnavailable condition check for every node status update

(The commit messages mirror the PR description above.)
@pravisankar pravisankar force-pushed the fix-clear-nodenetwork branch from 0b33729 to 7d5f2ac Compare February 27, 2018 03:13
@pravisankar pravisankar added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2018
@pravisankar
Author

I don't have access to GCP, so I simulated the issue in a dind environment and tested the fix.
@openshift/sig-networking @dcbw PTAL

@danwinship
Contributor

I don't have access to GCP

What is this in reference to? Can you link to the relevant github issue or rhbz bug?

Contributor

@danwinship danwinship left a comment


lgtm

cleared = true
for i := range knode.Status.Conditions {
        if knode.Status.Conditions[i].Type == kapi.NodeNetworkUnavailable {
                condition := &knode.Status.Conditions[i]
Contributor


for _, condition := range knode.Status.Conditions {
        if condition.Type == kapi.NodeNetworkUnavailable {

Author


for _, condition := range knode.Status.Conditions {

With this implementation, 'condition' is a copy of knode.Status.Conditions[i], so modifying 'condition' does not change anything in knode.Status.Conditions.
In this case we do want to modify the condition.{Status, Reason, ...} fields in place.
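
A tiny standalone illustration of that distinction (hypothetical Condition type, not the actual kapi types):

package main

import "fmt"

type Condition struct {
        Type   string
        Status string
}

func main() {
        conditions := []Condition{{Type: "NetworkUnavailable", Status: "True"}}

        // Range by value: c is a copy of the slice element, so the write is lost.
        for _, c := range conditions {
                if c.Type == "NetworkUnavailable" {
                        c.Status = "False"
                }
        }
        fmt.Println(conditions[0].Status) // prints "True"

        // Range by index and take a pointer into the slice: the write sticks.
        for i := range conditions {
                if conditions[i].Type == "NetworkUnavailable" {
                        condition := &conditions[i]
                        condition.Status = "False"
                }
        }
        fmt.Println(conditions[0].Status) // prints "False"
}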


if oldNodeIP, ok := master.hostSubnetNodeIPs[node.UID]; ok && (nodeIP == oldNodeIP) {
        return
}
// Node status is frequently updated by kubelet, so log only if the above condition is not met
glog.V(5).Infof("Watch %s event for Node %q", eventType, node.Name)

master.clearInitialNodeNetworkUnavailableCondition(node)
Contributor


Before we do this, we should make sure that we don't run these controllers at all:

kubernetes/pkg/controller/cloud/node_controller.go
kubernetes/pkg/controller/route/router_controller.go

since they both will set NodeNetworkUnavailable on the node in addition to kubelet. I don't think we run the route controller, but I'm not sure about the node controller.

Contributor


Hm... oh, yeah, for some reason I was thinking we'd still run clearInitialNodeNetworkUnavailableCondition on any "real" Node change, just not on the "Node status is frequently updated by kubelet" changes. But I guess this makes it so we only run clearInitialNodeNetworkUnavailableCondition when the IP changes, which is riskier.

Author

@pravisankar pravisankar Feb 28, 2018


In OpenShift we do not run CloudNodeController (kubernetes/pkg/controller/cloud/node_controller.go), RouteController (kubernetes/pkg/controller/route/router_controller.go), or the other kubernetes/pkg/controller/node/ipam/{sync, adapter, cloud_cidr_allocator} controllers where the NodeNetworkUnavailable condition is used.

Author

@pravisankar pravisankar Feb 28, 2018


Created bug: https://bugzilla.redhat.com/show_bug.cgi?id=1550266 to ensure there are no issues/regressions on GCP with this change.

@pravisankar pravisankar changed the title Fix clearInitialNodeNetworkUnavailableCondition() in sdn master Bug 1550266 - Fix clearInitialNodeNetworkUnavailableCondition() in sdn master Feb 28, 2018
@pravisankar
Author

/retest

@knobunc knobunc self-assigned this Mar 16, 2018
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2018
Contributor

@knobunc knobunc left a comment


/lgtm

Thanks Ravi

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, pravisankar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-ci-robot

openshift-ci-robot commented Mar 16, 2018

@pravisankar: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/gcp
Commit: 7d5f2ac
Rerun command: /test gcp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 308bb2e into openshift:master Mar 16, 2018
Labels: approved, component/networking, kind/bug, lgtm, sig/networking, size/M