
Conversation

@smarterclayton
Contributor

@smarterclayton smarterclayton commented Jun 29, 2017

Fixes bug 1465361
Fixes #14964

[test]

@smarterclayton
Contributor Author

@deads2k it's too easy for a reflector to fail to watch and fall back to LIST polling without anyone noticing when the reflector's default interval is 1s. We should probably be backing off the reflector or otherwise rate limiting it.

@smarterclayton smarterclayton added this to the 3.6.0 milestone Jun 29, 2017
// sdn-controller
addControllerRole(rbac.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: saRolePrefix + InfraSDNControllerServiceAccountName},
	Rules: []rbac.PolicyRule{
Contributor

This applies to the SDN master but not the nodes, correct?

Contributor Author

Correct

	ObjectMeta: metav1.ObjectMeta{Name: saRolePrefix + InfraSDNControllerServiceAccountName},
	Rules: []rbac.PolicyRule{
		rbac.NewRule("get", "create", "update").Groups(networkGroup, legacyNetworkGroup).Resources("clusternetworks").RuleOrDie(),
		rbac.NewRule("get", "list").Groups(networkGroup, legacyNetworkGroup).Resources("egressnetworkpolicies").RuleOrDie(),
Contributor

The master does not need to get or list egressnetworkpolicies.

rbac.NewRule("get", "list").Groups(networkGroup, legacyNetworkGroup).Resources("egressnetworkpolicies").RuleOrDie(),
rbac.NewRule("get", "list", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
Contributor

(as mentioned on the other bug, this fix is correct but only applies in an obscure case, for F5 integration)

rbac.NewRule("get", "list", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
Contributor

this fix is correct... and the lack of "watch" here should have caused test/cmd/sdn.sh to fail... weird

Contributor Author

How so? You're running as cluster-admin in that test (unless you explicitly switch to another user). And the controllers should only be delayed by 1s max because of the relisting behavior.

rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list").Groups(kapiGroup).Resources("pods").RuleOrDie(),
Contributor
@danwinship danwinship Jun 29, 2017

FWIW it doesn't currently need "get", though it does use "list" (at startup, to make sure that all running pods are inside the current clusterNetworkCIDR, if clusterNetworkCIDR has changed).

Contributor Author

Hrm, we can probably leave it. "get"/"list"/"watch" are so closely coupled that we probably should have just used a placeholder.
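
For context, here is a minimal sketch of the startup check danwinship describes above (all running pod IPs must fall inside the current clusterNetworkCIDR if it has changed). The function and names below are hypothetical illustrations, not the actual SDN master code:

// Hypothetical sketch of the startup check described above: verify that
// every running pod's IP still falls inside the configured
// clusterNetworkCIDR. Illustrative only; not the actual SDN master code.
package main

import (
	"fmt"
	"net"
)

// podIPsOutsideCIDR returns the pod IPs that do not fall inside cidr.
func podIPsOutsideCIDR(cidr string, podIPs []string) ([]string, error) {
	_, clusterNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, fmt.Errorf("invalid clusterNetworkCIDR %q: %v", cidr, err)
	}
	var outside []string
	for _, ip := range podIPs {
		parsed := net.ParseIP(ip)
		if parsed == nil || !clusterNet.Contains(parsed) {
			outside = append(outside, ip)
		}
	}
	return outside, nil
}

func main() {
	// Example with a hypothetical cluster network; the second IP is outside it.
	bad, err := podIPsOutsideCIDR("10.128.0.0/14", []string{"10.128.0.5", "192.168.1.10"})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods outside clusterNetworkCIDR:", bad)
}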

rbac.NewRule("get", "list").Groups(kapiGroup).Resources("pods").RuleOrDie(),
rbac.NewRule("list").Groups(kapiGroup).Resources("services").RuleOrDie(),
rbac.NewRule("list").Groups(kapiGroup).Resources("namespaces").RuleOrDie(),
rbac.NewRule("get").Groups(kapiGroup).Resources("nodes").RuleOrDie(),
Contributor

That's no good. The master needs "list" and "watch" on nodes so that it can create/delete hostsubnets as nodes are created/deleted. (Um... and I have no explanation for how networking-minimal could pass with this being wrong... Is the SDN possibly picking up permissions from somewhere else? Oh, we use a SharedInformer here now, is that it?)

Contributor Author

The master is using a shared informer, but I'll add the permissions explicitly anyway.
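
For reference, a sketch of what an explicit rule might look like, in the same style as the surrounding rules; the exact verbs merged in this PR may differ:

		// Hypothetical sketch, not necessarily the exact rule merged here:
		// give the master explicit list/watch on nodes to match the shared
		// informer's usage, in addition to the existing "get".
		rbac.NewRule("get", "list", "watch").Groups(kapiGroup).Resources("nodes").RuleOrDie(),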

rbac.NewRule("get").Groups(kapiGroup).Resources("nodes").RuleOrDie(),
rbac.NewRule("update").Groups(kapiGroup).Resources("nodes/status").RuleOrDie(),
rbac.NewRule("list").Groups(extensionsGroup).Resources("networkPolicies").RuleOrDie(),
rbac.NewRule("list", "watch").Groups(extensionsGroup).Resources("networkPolicies").RuleOrDie(),
Contributor

the master should not need any access to networkpolicies

@smarterclayton
Contributor Author

Updated with comments addressed

@danwinship
Contributor

LGTM
[testextended][extended:networking]

@pravisankar pravisankar left a comment

LGTM

@openshift-bot
Contributor

Evaluated for origin testextended up to b9822a1

@dcbw
Contributor

dcbw commented Jun 29, 2017

LGTM

@openshift-bot
Contributor

Evaluated for origin test up to b9822a1

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/797/) (Base Commit: dc5a05f) (PR Branch Commit: b9822a1) (Extended Tests: networking)

@smarterclayton smarterclayton merged commit b80c2fb into openshift:master Jun 30, 2017
@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2843/) (Base Commit: dc5a05f) (PR Branch Commit: b9822a1)

@deads2k
Contributor

deads2k commented Jun 30, 2017

@deads2k it's too easy for a reflector to fail to watch and fall back to LIST polling without anyone noticing when the reflector's default interval is 1s. We should probably be backing off the reflector or otherwise rate limiting it.

It's hard to know what is "good". I guess it's a function of whether we want the failures to mostly work or mostly fail. I'm actually inclined to have two modes that are toggled by an env var. If we panic during our tests, we're unlikely to ship a mistake; but if we do ship a mistake, limping along on a relist seems better than failing outright.

The difference between 1 second and 10 seconds on a long list doesn't matter. You're talking about switching to O(minutes)?

@smarterclayton
Contributor Author

I'm thinking of backing off from 1s up to 20-30s. I am extremely concerned about anything that fails forever at a fixed rate; we already know exactly what that looks like in very large clusters, and we're going to stop doing it :)
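
For illustration, a minimal sketch in plain Go (not the actual reflector code) of the capped exponential backoff being discussed: start relisting at 1s, double on each failure, cap around 30s, and reset on success:

// Hypothetical sketch of a capped exponential backoff for reflector
// re-list attempts: start at 1s, double on each failure, cap at 30s,
// and reset on success. This is not the actual reflector implementation.
package main

import (
	"fmt"
	"time"
)

func relistWithBackoff(relist func() error, stopCh <-chan struct{}) {
	delay := 1 * time.Second
	const maxDelay = 30 * time.Second
	for {
		if err := relist(); err != nil {
			delay *= 2 // back off on failure
			if delay > maxDelay {
				delay = maxDelay
			}
			fmt.Printf("relist failed: %v; retrying in %v\n", err, delay)
		} else {
			delay = 1 * time.Second // success resets the backoff
		}
		select {
		case <-stopCh:
			return
		case <-time.After(delay):
		}
	}
}

func main() {
	stop := make(chan struct{})
	attempt := 0
	go relistWithBackoff(func() error {
		attempt++
		return fmt.Errorf("watch unavailable (attempt %d)", attempt)
	}, stop)
	time.Sleep(10 * time.Second) // observe a few backed-off retries
	close(stop)
}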
