SDN controller requires access to watch resources #14968
Conversation
@deads2k it's too easy for a reflector to fail to watch and go into LIST polling without noticing if the reflector's default interval is 1s. We should probably be backing off the reflector or otherwise rate limiting it.
// sdn-controller
addControllerRole(rbac.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: saRolePrefix + InfraSDNControllerServiceAccountName},
	Rules: []rbac.PolicyRule{
This applies to the SDN master but not the nodes, correct?
Correct
ObjectMeta: metav1.ObjectMeta{Name: saRolePrefix + InfraSDNControllerServiceAccountName},
Rules: []rbac.PolicyRule{
	rbac.NewRule("get", "create", "update").Groups(networkGroup, legacyNetworkGroup).Resources("clusternetworks").RuleOrDie(),
	rbac.NewRule("get", "list").Groups(networkGroup, legacyNetworkGroup).Resources("egressnetworkpolicies").RuleOrDie(),
The master does not need to get or list egressnetworkpolicies.
rbac.NewRule("get", "list").Groups(networkGroup, legacyNetworkGroup).Resources("egressnetworkpolicies").RuleOrDie(),
rbac.NewRule("get", "list", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
(as mentioned on the other bug, this fix is correct but only applies in an obscure case, for F5 integration)
rbac.NewRule("get", "list", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
this fix is correct... and the lack of "watch" here should have caused test/cmd/sdn.sh to fail... weird
How so? You're running as cluster-admin in that test (unless you explicitly switch to another user). And the controllers should only be delayed by 1s max because of the relisting behavior.
rbac.NewRule("get", "list", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("hostsubnets").RuleOrDie(),
rbac.NewRule("get", "list", "watch", "create", "update", "delete").Groups(networkGroup, legacyNetworkGroup).Resources("netnamespaces").RuleOrDie(),
rbac.NewRule("get", "list").Groups(kapiGroup).Resources("pods").RuleOrDie(),
FWIW it doesn't currently need "get", though it does use "list" (at startup, to make sure that all running pods are inside the current clusterNetworkCIDR, if clusterNetworkCIDR has changed).
Hrm, can probably leave it. "get"/"list"/"watch" are so closely coupled that we probably should have just done a placeholder
rbac.NewRule("get", "list").Groups(kapiGroup).Resources("pods").RuleOrDie(),
rbac.NewRule("list").Groups(kapiGroup).Resources("services").RuleOrDie(),
rbac.NewRule("list").Groups(kapiGroup).Resources("namespaces").RuleOrDie(),
rbac.NewRule("get").Groups(kapiGroup).Resources("nodes").RuleOrDie(),
That's no good. The master needs "list" and "watch" on nodes so that it can create/delete hostsubnets as nodes are created/deleted. (Um... and I have no explanation for how networking-minimal could pass with this being wrong... Is the SDN possibly picking up permissions from somewhere else? Oh, we use a SharedInformer here now, is that it?)
The master is using a shared informer, but I'll add the permissions explicitly anyway.
rbac.NewRule("get").Groups(kapiGroup).Resources("nodes").RuleOrDie(),
rbac.NewRule("update").Groups(kapiGroup).Resources("nodes/status").RuleOrDie(),
rbac.NewRule("list").Groups(extensionsGroup).Resources("networkPolicies").RuleOrDie(),
rbac.NewRule("list", "watch").Groups(extensionsGroup).Resources("networkPolicies").RuleOrDie(),
the master should not need any access to networkpolicies
Force-pushed from 06aefcd to b9822a1
Updated with comments addressed
LGTM
pravisankar left a comment:
LGTM
Evaluated for origin testextended up to b9822a1
LGTM
Evaluated for origin test up to b9822a1
continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/797/) (Base Commit: dc5a05f) (PR Branch Commit: b9822a1) (Extended Tests: networking)
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2843/) (Base Commit: dc5a05f) (PR Branch Commit: b9822a1)
It's hard to know what is "good". I guess it's a function of whether we want the failures to mostly work or mostly fail. I'm actually inclined to have two modes toggled by an env var. If we panic during our tests, we're unlikely to ship a mistake; but if we do ship a mistake, limping along on a relist seems better than failing outright. The difference between 1 second and 10 seconds on a long list doesn't matter. You're talking about switching to O(minutes)?
I'm thinking backoff from 1s up to 20-30s. I am extremely concerned about anything that fails forever at a fixed rate; we already know exactly what that looks like in very large clusters, and we're going to stop doing it :)
Fixes bug 1465361
Fixes #14964
[test]