
Conversation

@smarterclayton
Contributor

@smarterclayton smarterclayton commented Jun 29, 2017

Fixes bug 1465361
Fixes #14964

[test]

@smarterclayton
Contributor Author

@deads2k it's too easy for a reflector to fail to watch and fall back to LIST polling without anyone noticing when the reflector's default interval is 1s. We should probably be backing off the reflector or otherwise rate limiting it.

@smarterclayton smarterclayton added this to the 3.6.0 milestone Jun 29, 2017
// sdn-controller
addControllerRole(rbac.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: saRolePrefix + InfraSDNControllerServiceAccountName},
	Rules: []rbac.PolicyRule{
Contributor

This applies to the SDN master but not the nodes, correct?

Contributor Author

Correct

	ObjectMeta: metav1.ObjectMeta{Name: saRolePrefix + InfraSDNControllerServiceAccountName},
	Rules: []rbac.PolicyRule{
		rbac.NewRule("get", "create", "update").Groups(networkGroup, legacyNetworkGroup).Resources("clusternetworks").RuleOrDie(),
		rbac.NewRule("get", "list").Groups(networkGroup, legacyNetworkGroup).Resources("egressnetworkpolicies").RuleOrDie(),
Contributor

The master does not need to get or list egressnetworkpolicies.

rbac.NewRule("get", "list").Groups(networkGroup, legacyNetworkGroup).Resources("egressnetworkpolicies").RuleOrDie(),
rbac.NewRule("get", "list", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
Contributor

(as mentioned on the other bug, this fix is correct but only applies in an obscure case, for F5 integration)

rbac.NewRule("get", "list", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
Contributor

this fix is correct... and the lack of "watch" here should have caused test/cmd/sdn.sh to fail... weird

Contributor Author

How so? You're running as cluster-admin in that test (unless you explicitly switch to another user). And the controllers should only be delayed by 1s max because of the relisting behavior.

rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list").Groups(kapiGroup).Resources("pods").RuleOrDie(),
Contributor
@danwinship danwinship Jun 29, 2017

FWIW it doesn't currently need "get", though it does use "list" (at startup, to make sure that all running pods are inside the current clusterNetworkCIDR, if clusterNetworkCIDR has changed).

Contributor Author

Hrm, we can probably leave it. "get"/"list"/"watch" are so closely coupled that we probably should have just used a placeholder.
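
For context, here is a minimal sketch of the startup check danwinship describes above (all running pod IPs must fall inside the current clusterNetworkCIDR if it has changed). The function and names below are hypothetical illustrations, not the actual SDN master code:

// Hypothetical sketch of the startup check described above: verify that
// every running pod's IP still falls inside the configured
// clusterNetworkCIDR. Illustrative only; not the actual SDN master code.
package main

import (
	"fmt"
	"net"
)

// podIPsOutsideCIDR returns the pod IPs that do not fall inside cidr.
func podIPsOutsideCIDR(cidr string, podIPs []string) ([]string, error) {
	_, clusterNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, fmt.Errorf("invalid clusterNetworkCIDR %q: %v", cidr, err)
	}
	var outside []string
	for _, ip := range podIPs {
		parsed := net.ParseIP(ip)
		if parsed == nil || !clusterNet.Contains(parsed) {
			outside = append(outside, ip)
		}
	}
	return outside, nil
}

func main() {
	// Example with a hypothetical cluster network; the second IP is outside it.
	bad, err := podIPsOutsideCIDR("10.128.0.0/14", []string{"10.128.0.5", "192.168.1.10"})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods outside clusterNetworkCIDR:", bad)
}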

rbac.NewRule("get", "list").Groups(kapiGroup).Resources("pods").RuleOrDie(),
rbac.NewRule("list").Groups(kapiGroup).Resources("services").RuleOrDie(),
rbac.NewRule("list").Groups(kapiGroup).Resources("namespaces").RuleOrDie(),
rbac.NewRule("get").Groups(kapiGroup).Resources("nodes").RuleOrDie(),
Contributor

That's no good. The master needs "list" and "watch" on nodes so that it can create/delete hostsubnets as nodes are created/deleted. (Um... and I have no explanation for how networking-minimal could pass with this being wrong... Is the SDN possibly picking up permissions from somewhere else? Oh, we use a SharedInformer here now, is that it?)

Contributor Author

The master is using a shared informer, but I'll add the permissions explicitly anyway.
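
For reference, a sketch of what an explicit rule might look like, in the same style as the surrounding rules; the exact verbs merged in this PR may differ:

		// Hypothetical sketch, not necessarily the exact rule merged here:
		// give the master explicit list/watch on nodes to match the shared
		// informer's usage, in addition to the existing "get".
		rbac.NewRule("get", "list", "watch").Groups(kapiGroup).Resources("nodes").RuleOrDie(),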

rbac.NewRule("get").Groups(kapiGroup).Resources("nodes").RuleOrDie(),
rbac.NewRule("update").Groups(kapiGroup).Resources("nodes/status").RuleOrDie(),
rbac.NewRule("list").Groups(extensionsGroup).Resources("networkPolicies").RuleOrDie(),
rbac.NewRule("list", "watch").Groups(extensionsGroup).Resources("networkPolicies").RuleOrDie(),
Contributor

the master should not need any access to networkpolicies

@smarterclayton
Contributor Author

Updated with comments addressed

@danwinship
Contributor

LGTM
[testextended][extended:networking]

@pravisankar pravisankar left a comment

LGTM

@openshift-bot
Contributor

Evaluated for origin testextended up to b9822a1

@dcbw
Contributor

dcbw commented Jun 29, 2017

LGTM

@openshift-bot
Contributor

Evaluated for origin test up to b9822a1

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/797/) (Base Commit: dc5a05f) (PR Branch Commit: b9822a1) (Extended Tests: networking)

@smarterclayton smarterclayton merged commit b80c2fb into openshift:master Jun 30, 2017
@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2843/) (Base Commit: dc5a05f) (PR Branch Commit: b9822a1)

@deads2k
Contributor

deads2k commented Jun 30, 2017

@deads2k it's too easy for a reflector to fail to watch and fall back to LIST polling without anyone noticing when the reflector's default interval is 1s. We should probably be backing off the reflector or otherwise rate limiting it.

It's hard to know what is "good". I guess it's a function of whether we want the failures to mostly work or mostly fail. I'm actually inclined to have two modes that are toggled by an env var. If we panic during our tests, we're unlikely to ship a mistake; but if we do ship a mistake, limping along on a relist seems better than failing outright.

The difference between 1 second and 10 seconds on a long list doesn't matter. You're talking about switching to O(minutes)?

@smarterclayton
Contributor Author

I'm thinking of backing off from 1s up to 20-30s. I am extremely concerned about anything that fails forever at a fixed rate; we already know exactly what that looks like in very large clusters, and we're going to stop doing it :)
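
For illustration, a minimal sketch in plain Go (not the actual reflector code) of the capped exponential backoff being discussed: start relisting at 1s, double on each failure, cap around 30s, and reset on success:

// Hypothetical sketch of a capped exponential backoff for reflector
// re-list attempts: start at 1s, double on each failure, cap at 30s,
// and reset on success. This is not the actual reflector implementation.
package main

import (
	"fmt"
	"time"
)

func relistWithBackoff(relist func() error, stopCh <-chan struct{}) {
	delay := 1 * time.Second
	const maxDelay = 30 * time.Second
	for {
		if err := relist(); err != nil {
			delay *= 2 // back off on failure
			if delay > maxDelay {
				delay = maxDelay
			}
			fmt.Printf("relist failed: %v; retrying in %v\n", err, delay)
		} else {
			delay = 1 * time.Second // success resets the backoff
		}
		select {
		case <-stopCh:
			return
		case <-time.After(delay):
		}
	}
}

func main() {
	stop := make(chan struct{})
	attempt := 0
	go relistWithBackoff(func() error {
		attempt++
		return fmt.Errorf("watch unavailable (attempt %d)", attempt)
	}, stop)
	time.Sleep(10 * time.Second) // observe a few backed-off retries
	close(stop)
}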
