Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve support for container storage mounts #2240

Merged
merged 7 commits into from
Jul 7, 2022

Conversation

bcressey
Copy link
Contributor

@bcressey bcressey commented Jun 24, 2022

Issue number:
#2218

Description of changes:

release: set DefaultDependencies=no

The first commit is a clean-up commit to end up with consistent dependencies for all mount units.

release: adjust mount propagation for /local
release: split up /local preparation
release: mask storage on /local

The second set of commits is all related to setting the "private" option for /local and dealing with the fallout. /var and /opt join /mnt in getting the "rshared" option; this is what ultimately allows for /var to have new mounts for container storage to propagate toward it.

Previously all mounts under /var were also listed under /local, which could lead to a lot of duplicate mount entries for Kubernetes and Docker volumes. When allowing container storage mounts, /proc/self/mountinfo could potentially end up with four entries for each mount, if a /mnt directory was bound to a /var directory: /mnt/blah, /var/blah, /local/mnt/blah, /local/var/blah.

Hence /local is made private so it is not part of the peer group. Since /local is private, the contents of /local/var/lib/containerd might not match /var/lib/containerd, if a separate container storage filesystem was mounted. To avoid this inconsistency, each of the /local directories are masked with an empty, read-only directory.

This is potentially a breaking change for pods that expect /local/var to match /var. However, I could not find any public examples of pod specs using /local/var in GitHub search, and it seems unusual enough that there's a decent chance that nobody is depending on this behavior.

/local setup is split into three parts - setup for /var, /opt, and /mnt - so that it's easier to reason about when setup is "done" for each directory, and when it's OK to mask the contents by mounting an empty directory over it.

create mount points for container storage
kubelet: defer creation of subdirectories
host-ctr: propagate container storage mounts

The final set of commits is what actually enables bootstrap containers to mount custom volumes over the container storage directories. A directory is created for each container runtime, after first removing any symlink that may have been created by a bootstrap container, since otherwise it might point to a missing directory. Then host-ctr mounts any of these directories it finds into bootstrap containers with the "rbind,rshared" flags so mounts will propagate back to the host.

Testing done:
Verified that I could use a bootstrap container to create custom storage mounts and bind mount them over /var/lib/containerd, /var/lib/docker, and /var/lib/kubelet. In the kubelet case, all CSI drivers I tested continued to work: EFS, EBS, and Secrets Store.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

# Mask `/local/mnt` to avoid confusion, since it will have most of the contents
# of `/mnt` but not any of the mounts.
ExecStart=/usr/bin/mount --bind --options nosuid,nodev,noexec,private /srv /local/mnt
ExecStop=/usr/bin/umount /local/mnt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a oneshot, do you still need the ExecStop directive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guards against the case where /mnt is unmounted.

It should never happen in practice, but if it did, then if /mnt were mounted again via the mnt.mount unit, it might bind-mount the /srv mount over /mnt, instead of bind-mounting the underlying directory /local/mnt.

@@ -0,0 +1,5 @@
[Service]
# Create the directories for symlinks in /etc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both in the commit message and in the systemd services: the directories are created in /var not in /etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this help clarify? The symlinks are still created in /etc, and these are the directories they point to.

Suggested change
# Create the directories for symlinks in /etc
# Create the backing directories for symlinks in /etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the suggest comment is clearer 👍

Comment on lines +770 to +776
storageDirs := []string{
"/var/lib/containerd",
"/var/lib/docker",
"/var/lib/kubelet",
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a constant, declared close to where it used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me an example of what you mean?

It feels pretty close to where it's used already but I'm guessing from your comment that it might be idiomatic to declare it as part of the for statement somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const storageDirs := []string{
                "/var/lib/containerd",
		"/var/lib/docker",
		"/var/lib/kubelet",
}

func withStorageMounts() oci.SpecOpts {
    var mounts []runtimespec.Mount

     for _, storageDir := range storageDirs {
          // ...
     }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doesn't support const arrays or slices, so I don't think this alternative approach can work.

}
}

return oci.WithMounts(mounts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consistency with the other with*Mounts functions in the file:

Suggested change
return oci.WithMounts(mounts)
return withMounts(mounts)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It felt unnecessary to call that function since all the mounts are known to have propagation options specified.

I could add a comment instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a comment will work 👍


# Mask `/local/mnt` to avoid confusion, since it will have most of the contents
# of `/mnt` but not any of the mounts.
ExecStart=/usr/bin/mount --bind --options nosuid,nodev,noexec,private /srv /local/mnt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL /srv, somewhat unexpected way of using it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely odd. I figured relying on an empty /srv as an implementation detail was a less grievous sin against FHS than inventing a top-level /empty directory to serve the same purpose.

I'm open to other suggestions if this is too weird.

The default dependencies are already specified explicitly, and this
makes the units consistent with other mounts.

Signed-off-by: Ben Cressey <[email protected]>
Rather than setting up `/mnt`, `/opt`, and `/var` in the same unit,
switch to dedicated units that prepare the corresponding mount. This
lets the system do more in parallel, and makes it trivial to reason
about when `/local` is fully ready and done - which it now is as soon
as it's mounted.

Signed-off-by: Ben Cressey <[email protected]>
Previously, `/local` was mounted with the "shared" option by default,
and any mounts done under the bind mounts (`/mnt', `/opt`, `/var`)
would also propagate to `/local`.

This meant that container-related mounts under `/var`, and storage-
related mounts created by bootstrap containers, would be listed more
than once in `/proc/self/mountinfo`.

Each of these duplicate mount entries has to be processed by systemd
and by components like kubelet, which is inefficient at best and a
potential scaling concern at high pod densities.

Signed-off-by: Ben Cressey <[email protected]>
Since mounts are not propagated back to `/local`, subdirectories like
`/local/var` will no longer be identical and interchangeable with
`/var`.

This is confusing and could lead to situations where files written to
some subdirectories under `/local/var` are not visible to processes
reading from `/var`, which might look like data loss or application
bugs.

Mounting an empty, read-only directory over each directory prevents
the confusion, at the cost of potentially disrupting containers that
try to mount these directories using the `/local` prefixed paths.

That is a relatively obvious failure mode - the container will fail
to start, or encounter errors when writing to the mount - which is
preferable to subtle misbehavior with no obvious errors.

Signed-off-by: Ben Cressey <[email protected]>
Create the directories for container storage early in boot, so that
they can be mounted into bootstrap containers. This is split up by
container runtime so that e.g. `/var/lib/docker` is not created on a
host that does not include Docker.

Existing symlinks in these locations are removed. These may have been
created by bootstrap containers, and if so may point to directories
in `/mnt` that are not yet available. This would prevent bootstrap
containers from starting and mounting the intended storage device.

Signed-off-by: Ben Cressey <[email protected]>
To allow kubelet storage to be swapped out by bootstrap containers,
create the backing directories for symlinks in `/etc` when kubelet
starts, rather than when `systemd-tmpfiles` runs.

Signed-off-by: Ben Cressey <[email protected]>
Previously, the implementation anticipated that bootstrap containers
would use symlinks back to `/mnt` to redirect container storage to a
different mounted filesystem.

This turned out to be quite brittle in practice. Redirecting kubelet
volume storage in that way was found to be fundamentally incompatible
with CSI driver expections, which require the "real" mount point to
be `/var/lib/kubelet` in order to work correctly.

To accommodate this, container storage directories are now treated as
a special case. If they exist in the host, they are mounted into the
bootstrap container with the "rbind,rshared" options so that changes
will propagate back to the host.

Signed-off-by: Ben Cressey <[email protected]>
@bcressey
Copy link
Contributor Author

bcressey commented Jul 6, 2022

Force push with no changes other than a fix for the merge conflict in local.mount.

@bcressey bcressey merged commit e86aa00 into bottlerocket-os:develop Jul 7, 2022
@bcressey bcressey deleted the container-storage-mounts branch July 7, 2022 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants