Add --privileged-datamover-pods option to installer and node agent #8243
Conversation
We should probably find out exactly which permission was needed, i.e. which one caused the permission denied errors.
It's probably worth eventually figuring out why we need the permissions, but we still need the configurability. This is an action that in Velero 1.14 and earlier would have always been done under privileged pod conditions with the Node Agent (OpenShift always required a privileged node agent pod), so requiring it for datamover pods is not a regression. If in the future we can eliminate the need for OpenShift to have these pods privileged, that would be nice, but we can't guarantee that some other Kubernetes environment won't still need this, so I think figuring that out is outside the scope of this bugfix.
As discussed, let's find the root cause before making these changes, as the data mover micro-service design is not expected to require privileged mode. Therefore, we should make sure that it is really required according to the root cause, and also document the reasons and scenarios.
@sseago One more question: when you see the permission denied error, does it happen to the pod (so the data mover pod cannot start) or to the execution of the backup (so the data mover backup starts but fails)?
@Lyndon-Li Looks like the pod runs. Here's the full log:
@Lyndon-Li So the permission denied error is happening during the actual kopia Upload call.
@Lyndon-Li This is a MySQL PV. Here's the full dir listing, but I'm not seeing any unusual file permissions here:
Permission denied seems to be coming from this func.
@Lyndon-Li With debug logging enabled, here's the actual permission denied call. Directory access:
Narrowing it down: it looks like the mounted cloned volume cannot be accessed by the environment where kopia runs (root user, but non-privileged container). Here's what I got by keeping the pod open and connecting directly via rsh:
I suspect that in an OpenShift environment, we may need some additional permissions on the Velero SA, or to modify the OpenShift SecurityContextConstraints, or to modify the pod or container security context for the datamover backup pod. The basic issue here seems to be that the cloned volume is owned/created by a different user, and even as root, in OpenShift, the backup pod doesn't have read access. Restore works fine because we're working with a newly-provisioned volume that the restore pod owns, and the ownership change operation at the end is also fine, since the pod runs as root already.
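To make the distinction concrete, here's a hedged sketch of the container security context in play (field names per the Kubernetes API, values illustrative): running as root inside a container is not the same as running privileged, so SELinux and the SecurityContextConstraints still apply.

```yaml
securityContext:
  runAsUser: 0        # root inside the container...
  privileged: false   # ...but not privileged, so SELinux labels on the
                      # cloned volume can still deny access, even to root
```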
@Lyndon-Li I think we isolated the problem. The way the pod is currently created, both the volumeMounts and the volumes entries have "readOnly: true" -- if we remove that from the volumes entry, but leave it for volumeMounts, backup works just fine on OpenShift. Here's my diff:
In my pod definition, this results in readOnly for volumeMounts but not volumes -- but effectively the volume is read-only in the pod:
Read-only is enforced:
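The diff itself isn't captured above, but as an illustration of the shape being described (names are hypothetical, not the actual Velero-generated pod spec):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: datamover-backup-example   # hypothetical name
spec:
  containers:
  - name: datamover
    image: velero/velero           # illustrative image
    volumeMounts:
    - name: source-volume
      mountPath: /source
      readOnly: true               # readOnly kept on the volumeMount
  volumes:
  - name: source-volume
    persistentVolumeClaim:
      claimName: cloned-pvc        # hypothetical claim name
      # readOnly removed from this volumes entry; the mount above
      # still enforces read-only access inside the pod
```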
@Lyndon-Li PR for the proposed fix: #8248
I have two concerns with the fix #8248:

Therefore, let's further troubleshoot it rather than simply removing readOnly.
@Lyndon-Li
We created a test pod with 2 volumes, both with readOnly: true set. It seems that the only thing OpenShift-specific here is that OpenShift enables SELinux by default. I suspect non-OpenShift clusters would see the same problem with the 1.15 datamover if SELinux were enabled. I think there are several possible resolutions, although some of these may not be feasible:

I share your concern about the ceph shallow copy use case -- removing readOnly could break that.

cc @shubham-pampattiwar @msfrucht @weshayutin @shawn-hurley @kaovilai
@Lyndon-Li I found something that works without setting "privileged=true" or changing the pod mounts. Since the problem is SELinux-specific, we found a way to make this work via the seLinuxOptions type spc_t in the pod security context.
As @msfrucht pointed out to me today, this also has a performance advantage. For volumes with a large number of files, the SELinux relabeling on mount is a slow process. Setting the spc_t type avoids that relabeling. So this should be completely ignored in non-SELinux environments, as long as it's not a Windows pod -- but since it's the velero container image, it's never a Windows pod. It might be good to test this out in your non-OpenShift env just to make sure it doesn't break or change anything, but it shouldn't -- it should only alter the SELinux context it's running in if SELinux is enabled.
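For reference, a minimal sketch of the kind of security context change being described, assuming it is applied at the pod level (names are illustrative, not the exact spec the fix generates):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: datamover-backup-example   # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      type: spc_t   # "super privileged container" type: SELinux skips
                    # relabeling the volume on mount, and the setting is
                    # ignored on hosts where SELinux is disabled
  containers:
  - name: datamover
    image: velero/velero           # illustrative image
```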
@sseago @msfrucht Besides, I have two more questions. Though I think the current fix is good enough for now, they might be something we need to follow up on in the future:
@Lyndon-Li I'm not 100% sure on 1) -- regarding 2), I think that's at least reasonably close to the answer. I don't know about the mechanics of why relabeling fails, but from reading docs (and talking to some people who know this stuff better than I do), it's pretty clear that relabeling won't happen with a read-only access mode (and also when the volumes entry is marked readOnly). It may be that for the first concern there will be users who don't want to open up permissions like this (although it's still an improvement over making the pod privileged, which disables SELinux enforcement and more), so for those users we may in the future want to provide an option so they can either use spc_t or remove the readOnly attribute. But for now let's just get this working. We'll be testing the PR today.
Closing in favor of #8250
Thank you for contributing to Velero!
Please add a summary of your change
For some Kubernetes variants (for example, OpenShift), the 1.15 podified datamover pods fail with permission denied errors unless the pod is privileged, like the node agent pod is.
This PR adds a new installer flag, --privileged-datamover-pods, and a corresponding velero node-agent server flag.

Does your change fix a particular issue?
Fixes #(issue)
Please indicate you've done the following:
- Created a changelog file (make new-changelog) or comment /kind changelog-not-required on this PR.
- Updated the corresponding documentation in site/content/docs/main.
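As a rough illustration of where the new node-agent flag would land (the actual DaemonSet spec generated by the installer may differ):

```yaml
# Fragment of the node-agent DaemonSet container spec (illustrative)
containers:
- name: node-agent
  image: velero/velero
  command:
  - /velero
  args:
  - node-agent
  - server
  - --privileged-datamover-pods   # new flag added by this PR
```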