-
Notifications
You must be signed in to change notification settings - Fork 519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve support for container storage mounts #2240
improve support for container storage mounts #2240
Conversation
# Mask `/local/mnt` to avoid confusion, since it will have most of the contents | ||
# of `/mnt` but not any of the mounts. | ||
ExecStart=/usr/bin/mount --bind --options nosuid,nodev,noexec,private /srv /local/mnt | ||
ExecStop=/usr/bin/umount /local/mnt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a oneshot
, do you still need the ExecStop
directive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This guards against the case where /mnt
is unmounted.
It should never happen in practice, but if it did, then if /mnt
were mounted again via the mnt.mount
unit, it might bind-mount the /srv
mount over /mnt
, instead of bind-mounting the underlying directory /local/mnt
.
@@ -0,0 +1,5 @@ | |||
[Service] | |||
# Create the directories for symlinks in /etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both in the commit message and in the systemd services: the directories are created in /var
not in /etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this help clarify? The symlinks are still created in /etc
, and these are the directories they point to.
# Create the directories for symlinks in /etc | |
# Create the backing directories for symlinks in /etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the suggest comment is clearer 👍
storageDirs := []string{ | ||
"/var/lib/containerd", | ||
"/var/lib/docker", | ||
"/var/lib/kubelet", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a constant, declared close to where it used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give me an example of what you mean?
It feels pretty close to where it's used already but I'm guessing from your comment that it might be idiomatic to declare it as part of the for
statement somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const storageDirs := []string{
"/var/lib/containerd",
"/var/lib/docker",
"/var/lib/kubelet",
}
func withStorageMounts() oci.SpecOpts {
var mounts []runtimespec.Mount
for _, storageDir := range storageDirs {
// ...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go doesn't support const
arrays or slices, so I don't think this alternative approach can work.
} | ||
} | ||
|
||
return oci.WithMounts(mounts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consistency with the other with*Mounts
functions in the file:
return oci.WithMounts(mounts) | |
return withMounts(mounts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It felt unnecessary to call that function since all the mounts are known to have propagation options specified.
I could add a comment instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a comment will work 👍
|
||
# Mask `/local/mnt` to avoid confusion, since it will have most of the contents | ||
# of `/mnt` but not any of the mounts. | ||
ExecStart=/usr/bin/mount --bind --options nosuid,nodev,noexec,private /srv /local/mnt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL /srv
, somewhat unexpected way of using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely odd. I figured relying on an empty /srv
as an implementation detail was a less grievous sin against FHS than inventing a top-level /empty
directory to serve the same purpose.
I'm open to other suggestions if this is too weird.
a660a07
to
11c4d9e
Compare
The default dependencies are already specified explicitly, and this makes the units consistent with other mounts. Signed-off-by: Ben Cressey <[email protected]>
Rather than setting up `/mnt`, `/opt`, and `/var` in the same unit, switch to dedicated units that prepare the corresponding mount. This lets the system do more in parallel, and makes it trivial to reason about when `/local` is fully ready and done - which it now is as soon as it's mounted. Signed-off-by: Ben Cressey <[email protected]>
Previously, `/local` was mounted with the "shared" option by default, and any mounts done under the bind mounts (`/mnt', `/opt`, `/var`) would also propagate to `/local`. This meant that container-related mounts under `/var`, and storage- related mounts created by bootstrap containers, would be listed more than once in `/proc/self/mountinfo`. Each of these duplicate mount entries has to be processed by systemd and by components like kubelet, which is inefficient at best and a potential scaling concern at high pod densities. Signed-off-by: Ben Cressey <[email protected]>
Since mounts are not propagated back to `/local`, subdirectories like `/local/var` will no longer be identical and interchangeable with `/var`. This is confusing and could lead to situations where files written to some subdirectories under `/local/var` are not visible to processes reading from `/var`, which might look like data loss or application bugs. Mounting an empty, read-only directory over each directory prevents the confusion, at the cost of potentially disrupting containers that try to mount these directories using the `/local` prefixed paths. That is a relatively obvious failure mode - the container will fail to start, or encounter errors when writing to the mount - which is preferable to subtle misbehavior with no obvious errors. Signed-off-by: Ben Cressey <[email protected]>
Create the directories for container storage early in boot, so that they can be mounted into bootstrap containers. This is split up by container runtime so that e.g. `/var/lib/docker` is not created on a host that does not include Docker. Existing symlinks in these locations are removed. These may have been created by bootstrap containers, and if so may point to directories in `/mnt` that are not yet available. This would prevent bootstrap containers from starting and mounting the intended storage device. Signed-off-by: Ben Cressey <[email protected]>
To allow kubelet storage to be swapped out by bootstrap containers, create the backing directories for symlinks in `/etc` when kubelet starts, rather than when `systemd-tmpfiles` runs. Signed-off-by: Ben Cressey <[email protected]>
Previously, the implementation anticipated that bootstrap containers would use symlinks back to `/mnt` to redirect container storage to a different mounted filesystem. This turned out to be quite brittle in practice. Redirecting kubelet volume storage in that way was found to be fundamentally incompatible with CSI driver expections, which require the "real" mount point to be `/var/lib/kubelet` in order to work correctly. To accommodate this, container storage directories are now treated as a special case. If they exist in the host, they are mounted into the bootstrap container with the "rbind,rshared" options so that changes will propagate back to the host. Signed-off-by: Ben Cressey <[email protected]>
11c4d9e
to
8f1cc01
Compare
Force push with no changes other than a fix for the merge conflict in |
Issue number:
#2218
Description of changes:
The first commit is a clean-up commit to end up with consistent dependencies for all mount units.
The second set of commits is all related to setting the "private" option for
/local
and dealing with the fallout./var
and/opt
join/mnt
in getting the "rshared" option; this is what ultimately allows for/var
to have new mounts for container storage to propagate toward it.Previously all mounts under
/var
were also listed under/local
, which could lead to a lot of duplicate mount entries for Kubernetes and Docker volumes. When allowing container storage mounts,/proc/self/mountinfo
could potentially end up with four entries for each mount, if a/mnt
directory was bound to a/var
directory:/mnt/blah
,/var/blah
,/local/mnt/blah
,/local/var/blah
.Hence
/local
is made private so it is not part of the peer group. Since/local
is private, the contents of/local/var/lib/containerd
might not match/var/lib/containerd
, if a separate container storage filesystem was mounted. To avoid this inconsistency, each of the/local
directories are masked with an empty, read-only directory.This is potentially a breaking change for pods that expect
/local/var
to match/var
. However, I could not find any public examples of pod specs using/local/var
in GitHub search, and it seems unusual enough that there's a decent chance that nobody is depending on this behavior./local
setup is split into three parts - setup for/var
,/opt
, and/mnt
- so that it's easier to reason about when setup is "done" for each directory, and when it's OK to mask the contents by mounting an empty directory over it.The final set of commits is what actually enables bootstrap containers to mount custom volumes over the container storage directories. A directory is created for each container runtime, after first removing any symlink that may have been created by a bootstrap container, since otherwise it might point to a missing directory. Then
host-ctr
mounts any of these directories it finds into bootstrap containers with the "rbind,rshared" flags so mounts will propagate back to the host.Testing done:
Verified that I could use a bootstrap container to create custom storage mounts and bind mount them over
/var/lib/containerd
,/var/lib/docker
, and/var/lib/kubelet
. In thekubelet
case, all CSI drivers I tested continued to work: EFS, EBS, and Secrets Store.Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.