Conversation

@yuvalk
Contributor

@yuvalk yuvalk commented Jul 2, 2021

- What I did

- How to verify it

- Description for the changelog

@kikisdeliveryservice
Contributor

Can you please update the description & commit to include something explaining why this is necessary?

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label Jul 2, 2021
@kikisdeliveryservice
Contributor

Closing in favor of #2627, which contains only the appropriate on-prem run-level changes. We don't need the change this PR makes to the MCO itself, which has been in effect basically forever.

@yuqi-zhang yuqi-zhang changed the title remove run-level info from operators namespaces Bug 1978581: remove run-level info from operators namespaces Jul 3, 2021
@yuqi-zhang
Contributor

I believe this is associated with https://bugzilla.redhat.com/show_bug.cgi?id=1978581, which, on top of https://bugzilla.redhat.com/show_bug.cgi?id=1805488, has a bit more context. Reopening for now and associating.

Please add some more description to the commit/PR message, or mark WIP if this is just for testing, thanks!

@openshift-ci
Contributor

openshift-ci bot commented Jul 3, 2021

@yuvalk: This pull request references Bugzilla bug 1978581, which is invalid:

  • expected the bug to target the "4.9.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1978581: remove run-level info from operators namespaces

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/severity-low and bugzilla/invalid-bug labels Jul 3, 2021
@yuqi-zhang yuqi-zhang reopened this Jul 3, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 3, 2021

@yuvalk: This pull request references Bugzilla bug 1978581, which is invalid:

  • expected the bug to target the "4.9.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1978581: remove run-level info from operators namespaces

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mcoops

mcoops commented Aug 6, 2021

With this change, the pods for the machine-config-operator will fail admission, for example:

LAST SEEN TYPE REASON OBJECT MESSAGE
12s Warning FailedCreate daemonset/machine-config-daemon Error creating: pods "machine-config-daemon-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted: .spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": ...

We need to set the correct privileges for each ServiceAccount. I've opened a PR against this PR's source branch: yuvalk#1

That then allows the pods to run without errors and behave in the expected way:

# oc get pods
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5d5d9f9657-dbxsl   1/1     Running   0          11h
machine-config-daemon-94gfw                  2/2     Running   0          10h
machine-config-daemon-fkzvf                  2/2     Running   0          11h
machine-config-daemon-hvmd7                  2/2     Running   0          11h
machine-config-daemon-m495h                  2/2     Running   0          11h
machine-config-operator-84756558d7-g84bd     1/1     Running   0          11h
machine-config-server-2cccr                  1/1     Running   0          11h
machine-config-server-pzfcn                  1/1     Running   0          11h
machine-config-server-slg5z                  1/1     Running   0          11h
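
For reference, the imperative equivalent of what that PR grants would look roughly like the following (a sketch only: it assumes the daemon and server ServiceAccounts are named machine-config-daemon and machine-config-server, and the PR itself presumably does this declaratively, e.g. via a ClusterRole with the "use" verb on the SCC):

# oc adm policy add-scc-to-user privileged -z machine-config-daemon -n openshift-machine-config-operator
# oc adm policy add-scc-to-user hostnetwork -z machine-config-server -n openshift-machine-config-operator

The first grant covers the daemon's hostNetwork/hostPID/hostPath/privileged requirements; the hostnetwork SCC is enough for the config-server, which only needs host networking.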

@yuvalk
Contributor Author

yuvalk commented Aug 18, 2021

/retest

1 similar comment

@yuvalk
Contributor Author

yuvalk commented Aug 22, 2021

/retest

@yuvalk
Contributor Author

yuvalk commented Aug 22, 2021

/retest-required

@yuvalk
Contributor Author

yuvalk commented Aug 22, 2021

/retest

Contributor

@yuqi-zhang yuqi-zhang left a comment

I think this is a good opportunity to revisit this. Some comments below. Also please rebase on master when you get a chance. Thanks!

Contributor

A few questions on the clusterrole changes:

  1. Is it required as part of the run-level removal, or is it a parallel nice-to-have? Is that tracked somewhere, e.g. via another bug?
  2. I see that in the commits you added them for the controller, but then removed them in a later commit. What is the reasoning behind that?
  3. Could you squash the commits and provide more context via the commit message?


  1. It is. Otherwise the MCO will fail to start; specifically, the daemon will fail to start due to a lack of permissions. For example:
2s Warning FailedCreate daemonset/machine-config-daemon Error creating: pods "machine-config-daemon-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted: .spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNe...

From this requirement:

plus the fact that it needs:
hostNetwork: true
hostPID: true

I gave the config-server hostnetwork to allow that to run as well, from memory due to


  2. The controller doesn't appear to need any special permissions. I originally added them, but then realized the controller didn't need anything special. Looking at its deployment.yaml, there's nothing there that would require extra permissions, so it can run restricted.

With the daemon running privileged, the server with hostnetwork, and the controller restricted, all pods seem to start OK:

# oc get pods
NAME                                       READY   STATUS    RESTARTS   AGE
machine-config-controller-b559cbf8-2wp7v   1/1     Running   0          80m
machine-config-daemon-4s9tb                2/2     Running   0          80m
machine-config-daemon-r82bm                2/2     Running   0          51m
machine-config-daemon-z5xfb                2/2     Running   0          80m
machine-config-daemon-zzc9d                2/2     Running   0          80m
machine-config-operator-68cdd77788-kxd92   1/1     Running   1          102m
machine-config-server-gqrsn                1/1     Running   0          79m
machine-config-server-kmz9b                1/1     Running   0          79m
machine-config-server-ttp5t                1/1     Running   0          79m

And no errors related to admission:

# oc get events | grep -i failed
81m Warning FailedMount pod/machine-config-daemon-4s9tb MountVolume.SetUp failed for volume "proxy-tls" : secret "proxy-tls" not found
81m Warning FailedMount pod/machine-config-daemon-z5xfb MountVolume.SetUp failed for volume "proxy-tls" : secret "proxy-tls" not found
81m Warning FailedMount pod/machine-config-daemon-zzc9d MountVolume.SetUp failed for volume "proxy-tls" : secret "proxy-tls" not found
102m Warning FailedScheduling pod/machine-config-operator-68cdd77788-kxd92 no nodes available to schedule pods
102m Warning FailedScheduling pod/machine-config-operator-68cdd77788-kxd92 no nodes available to schedule pods
84m Warning FailedScheduling pod/machine-config-operator-68cdd77788-kxd92 0/2 nodes are available: 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
83m Warning FailedScheduling pod/machine-config-operator-68cdd77788-kxd92 0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
81m Warning FailedScheduling pod/machine-config-operator-68cdd77788-kxd92 0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
# oc get events | grep -i error
# oc get events | grep -i scc
104m        Normal    CreatedSCCRanges    namespace/openshift-machine-config-operator     created SCC ranges
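
If useful, a quick way to confirm which SCC each pod was actually admitted under is the openshift.io/scc annotation that admission stamps on the pods, e.g.:

# oc -n openshift-machine-config-operator get pods -o yaml | grep 'openshift.io/scc'

Roughly, I'd expect privileged for the daemon pods, hostnetwork for the server pods and restricted for the controller.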

Of course @yuqi-zhang, if you know of anything in the controller that might need more than the restricted SCC, we can definitely alter that. But from testing, the MCO seems to run OK and doesn't fail admission.

  3. @yuvalk, happy to squash and add context to the commit message?

Contributor

Thanks for the detailed comments! One follow-up, since I don't know run-levels that well: are there any other risks to revoking run-level 1 for the MC* pods, other than hostNetwork and privileged being required?


I think the main risk is just permissions. Using run-levels pretty much bypasses SCC entirely, so the pods basically run with no restrictions, which is why we're keen to minimize what uses it.

Given it passes admission (meaning it has the required permissions), I think it's fairly unlikely we'll see it fail later on. As far as I'm aware we've had no issues with components like kni removing it either: #2627

It used to be a requirement back in the early days of OpenShift (4.0) due to slow startup, but since about 4.6 there no longer seems to be any requirement for it. We'll be looking at pushing for the CVO to do the same: openshift/cluster-version-operator#623, now that we've confirmed it should be fine to remove there as well.
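
For context, it's easy to see which namespaces still set the label by selecting on it, e.g.:

# oc get namespaces -l openshift.io/run-level --show-labels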

@sinnykumari
Contributor

/cc @cgwalters will be nice to know your thoughts on this PR

@openshift-ci openshift-ci bot requested a review from cgwalters November 4, 2021 09:11
@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2021

@sinnykumari: GitHub didn't allow me to request PR reviews from the following users: to, your, will, be, on, this, PR, nice, know, thoughts.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

/cc @cgwalters will be nice to know your thoughts on this PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member

cgwalters commented Nov 9, 2021

Honestly I don't have any more expertise in this subject than anyone else here. I read through the PR and some of the linked bugzilla comments, and based on that I am fine to say
/approve

AFAICS the run-level dates to the primordial epoch of bf6ac87; it may be that it was never necessary.

@mcoops

mcoops commented Nov 10, 2021

Dang it. Similar to the CVO (openshift/cluster-version-operator#623), we're going to have to account for upgrades too.

curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2655/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1456000980163235840/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/namespaces.json | jq '.items[].metadata | select(.name == "openshift-machine-config-operator").labels'
{
  "kubernetes.io/metadata.name": "openshift-machine-config-operator",
  "name": "openshift-machine-config-operator",
  "olm.operatorgroup.uid/59920ef9-792f-461b-b4cd-364910998083": "",
  "openshift.io/cluster-monitoring": "true",
  "openshift.io/run-level": "1",
  "pod-security.kubernetes.io/audit": "privileged",
  "pod-security.kubernetes.io/enforce": "privileged",
  "pod-security.kubernetes.io/warn": "privileged"
}
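
On a cluster that has already upgraded, the stale label can be checked and, purely as a manual workaround, dropped by hand (the real fix presumably needs the operator to actively remove it on upgrade, similar to what's being discussed for the CVO):

# oc get namespace openshift-machine-config-operator --show-labels
# oc label namespace openshift-machine-config-operator openshift.io/run-level-

The trailing "-" removes the label.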

@mcoops

mcoops commented Nov 10, 2021

/hold

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

14 similar comments

@yuvalk
Contributor Author

yuvalk commented Nov 26, 2021

/retest

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-ci
Contributor

openshift-ci bot commented Nov 26, 2021

@yuvalk: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                             Commit    Details   Required   Rerun command
ci/prow/okd-e2e-aws                   8c4dda1   link      false      /test okd-e2e-aws
ci/prow/e2e-aws-workers-rhel7         8c4dda1   link      false      /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-upgrade-single-node   8c4dda1   link      false      /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-disruptive            8c4dda1   link      false      /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 2d8c153 into openshift:master Nov 26, 2021
@openshift-ci
Contributor

openshift-ci bot commented Nov 26, 2021

@yuvalk: All pull requests linked via external trackers have merged:

Bugzilla bug 1978581 has been moved to the MODIFIED state.


In response to this:

Bug 1978581: remove run-level info from operators namespaces

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
