Remove seccomp and landlock #47

cd-work · 2023-09-28T22:06:26Z

See individual commits for details.

This patch completely removes landlock from the Linux sandbox. While landlock provides solid filesystem isolation even without requiring any intricate Linux sandboxing knowledge, it is currently still too limited to build a "bulletproof" filesystem sandbox. As such, it is better suited for best-effort isolation of "assumed safe" applications, rather than sandboxing of "potentially hazardous" software. For Birdcage, these are the biggest limitations: - Locking down sockets/pipes - High kernel requirement for all basic features (FS_TRUNCATE is 6.2)

This patch completely removes seccomp from the Linux sandbox. Currently the only usage of seccomp was to block system calls for network filtering. However since user namespaces already isolate networking, seccomp isn't necessary anymore. Seccomp could be useful in the future to limit some system calls that could cause undesired system changes, but these types of system calls usually require elevated permissions already.

cd-work · 2023-09-28T22:13:00Z

Pinging @l0kod since I know you've documented our landlock usage in some places, so I just wanted to make sure you're aware we're not using it anymore.

I think in our past discussions you've already pointed out that landlock is better suited as a "best-effort sandboxing" mechanism for "normal" applications. So this might not come as much of a surprise to you.

I definitely do feel like it's a pleasure to use and I'm sure once development has progressed a little further its application in areas other than "best-effort sandboxing" will become more prevalent too.

kylewillmon

As far as I can tell, the kernel requirement now should be Linux 3.8 (for user namespaces). That's awesome.

cd-work · 2023-09-29T19:30:21Z

As far as I can tell, this should make the kernel requirement now should be Linux 3.8 (for user namespaces). That's awesome.

Yeah I came to the same conclusion and considered adding it, but I wasn't 100% sure and that is so low that most setups should just work anyway, so no need to make guarantees you can't keep.

l0kod · 2023-10-02T13:59:20Z

Pinging @l0kod since I know you've documented our landlock usage in some places, so I just wanted to make sure you're aware we're not using it anymore.

Thanks for the heads up.

I think in our past discussions you've already pointed out that landlock is better suited as a "best-effort sandboxing" mechanism for "normal" applications. So this might not come as much of a surprise to you.

To clarify a bit, Landlock is designed with backward and forward compatibility in mind, and we encourage everyone to sandbox as much as possible. Given the nature of the Linux kernel and distros, most application developers cannot have guarantees on which kernel version (and then which security features) will be available at run time. To protect users as much as possible, we then encourage to follow a best-effort security approach, which means to not disable the whole sandboxing even if only some security features are available on users' kernels. This is easier said than done, hence the ongoing effort on the rust-landlock library to make developers' live easier.

I definitely do feel like it's a pleasure to use and I'm sure once development has progressed a little further its application in areas other than "best-effort sandboxing" will become more prevalent too.

Good to know, thanks!

Commenting on the commit messages:

While landlock provides solid filesystem isolation even without
requiring any intricate Linux sandboxing knowledge, it is currently
still too limited to build a "bulletproof" filesystem sandbox.

This is correct, Landlock is not feature-complete yet (e.g. compared to SELinux), but as explained above still very useful. The reason is because developing new kernel features requires time, and it is much better to follow an incremental development and adoption.

About the "bulletproof filesystem sandbox", only SELinux could be described like that.

As such, it is better suited for best-effort isolation of "assumed safe"
applications, rather than sandboxing of "potentially hazardous"
software.

With the current Linux kernel, it is correct to say that Landlock alone is not enough to protect against any malicious code that would like to harm the system, but as always, it depends on the threat model. Landlock can already protect against illegitimate access to (sensitive) data stored on the filesystem, but it cannot protect against e.g., network data transfers yet.

For Birdcage, these are the biggest limitations:

Locking down sockets/pipes

FYI, restricting sockets according to their type is planned, and control of other IPC mechanisms should follow.

I'm wondering about the threat model specific to pipes though.

High kernel requirement for all basic features (FS_TRUNCATE is 6.2)

We indeed need a 6.2 kernel (e.g. Ubuntu 22.04 LTS) to be able to restrict truncate operations . It should be noted that it will become more and more common, especially thanks to the LTS kernel lifetime shrinking (and then bringing new features more quickly).

Currently the only usage of seccomp was to block system calls for
network filtering. However since user namespaces already isolate
networking, seccomp isn't necessary anymore.

The initial goal of seccomp is to limit kernel's attack surface. Replacing its use with user namespaces would mainly increase the kernel attack surface.

Seccomp could be useful in the future to limit some system calls that
could cause undesired system changes, but these types of system calls
usually require elevated permissions already.

Not only "privileged" syscalls can harm the kernel. This is why most container runtimes use seccomp. Linux namespaces are not intended to be used by untrusted processes.

For any sandboxer, I strongly suggest to use seccomp (to limit kernel attack surface and avoid sandbox escape) and Landlock (for an ephemeral and run-time-configurable access control system), and to disable user namespaces (and drop capabilities) inside the sandbox.

cd-work · 2023-10-02T16:53:23Z

FYI, restricting sockets according to their type is planned, and control of other IPC mechanisms should follow.

Looking forward to test this out.

I'm wondering about the threat model specific to pipes though.

There's no threat model with pipes, I just copied that from the landlock kernel docs. The issue is just with sockets specifically because of Wayland/X11 allowing nefarious stuff done through socket communication.

The initial goal of seccomp is to limit kernel's attack surface. Replacing its use with user namespaces would mainly increase the kernel attack surface.

That's fair, though the previous seccomp sandbox was extremely limited in its restrictions anyway, focusing only on networking. I definitely want to bring it back, but I'd like to focus more on a general "sensible syscall" list rather than just locking down networking.

Not only "privileged" syscalls can harm the kernel.

That's fair and our previous seccomp filter was heavily inspired by docker's filter already. It just needs to be reworked in its focus.

Linux namespaces are not intended to be used by untrusted processes

Namespace usage can definitely be locked down with seccomp inside the sandbox in the future and specifically the comment about complexity (Moreover, their complexity can lead to security issues, especially when untrusted processes can manipulate them) is why I initially went with Landlock instead.

However for now, they still seem like a more reliable tool to lock down a sandbox than landlock, which is probably why most sanboxes I've looked into use namespaces extensively.

For any sandboxer, I strongly suggest to use seccomp (to limit kernel attack surface and avoid sandbox escape) and Landlock (for an ephemeral and run-time-configurable access control system), and to disable user namespaces (and drop capabilities) inside the sandbox.

While this might be true in 2-3 years, I'm not sure I can see this working today. As you've said socket restrictions are upcoming, but not supported yet. So it'll be years before the common LTS releases have this functionality. Layering of landlock/mount namespaces isn't possible, so the only options are using landlock without locking down sockets, or using mount namespaces which supports locking down everything.

So considering the options available today, it seems to me like user mount namespaces are the only viable choice, despite their numerous issues. Especially for an application that doesn't want to provide an suid sandbox binary.

Also I really appreciate your feedback, so thanks for volunteering your time.

kylewillmon · 2023-10-02T18:33:19Z

To protect users as much as possible, we then encourage to follow a best-effort security approach, which means to not disable the whole sandboxing even if only some security features are available on users' kernels.

Unfortunately, this has been our exact dilemma with the first Landlock ABI because of FS_REFER being denied by default... We would have to disable Landlock on these kernel versions for many tools.

But, of course, the newer Landlock ABIs have resolved this issue.

FYI, restricting sockets according to their type is planned, and control of other IPC mechanisms should follow.

Looking forward to this! Once kernels with this feature are generally available, I hope to see Birdcage make the switch back to Landlock.

l0kod · 2023-10-03T09:29:12Z

There's no threat model with pipes, I just copied that from the landlock kernel docs. The issue is just with sockets specifically because of Wayland/X11 allowing nefarious stuff done through socket communication.

Indeed. I was curious because I think restricting pipes/fifos doesn't make sense for a sandbox, but we'll see if there is a need.

The initial goal of seccomp is to limit kernel's attack surface. Replacing its use with user namespaces would mainly increase the kernel attack surface.

That's fair, though the previous seccomp sandbox was extremely limited in its restrictions anyway, focusing only on networking. I definitely want to bring it back, but I'd like to focus more on a general "sensible syscall" list rather than just locking down networking.

Not only "privileged" syscalls can harm the kernel.

That's fair and our previous seccomp filter was heavily inspired by docker's filter already. It just needs to be reworked in its focus.

Sounds good.

Linux namespaces are not intended to be used by untrusted processes

Namespace usage can definitely be locked down with seccomp inside the sandbox in the future and specifically the comment about complexity (Moreover, their complexity can lead to security issues, especially when untrusted processes can manipulate them) is why I initially went with Landlock instead.

However for now, they still seem like a more reliable tool to lock down a sandbox than landlock, which is probably why most sanboxes I've looked into use namespaces extensively.

The main reason most sandboxes use namespaces is because Landlock is pretty new and it wasn't available a few years ago: not everyone is aware of Landlock, not every users have an up-to-date kernel, and namespaces may seem good-enough for sandboxers that invested in them. And as you pointed out, some filesystem access types cannot be restricted with Landlock yet.

For any sandboxer, I strongly suggest to use seccomp (to limit kernel attack surface and avoid sandbox escape) and Landlock (for an ephemeral and run-time-configurable access control system), and to disable user namespaces (and drop capabilities) inside the sandbox.

While this might be true in 2-3 years, I'm not sure I can see this working today. As you've said socket restrictions are upcoming, but not supported yet. So it'll be years before the common LTS releases have this functionality. Layering of landlock/mount namespaces isn't possible, so the only options are using landlock without locking down sockets, or using mount namespaces which supports locking down everything.

So considering the options available today, it seems to me like user mount namespaces are the only viable choice, despite their numerous issues. Especially for an application that doesn't want to provide an suid sandbox binary.

Not requiring a SUID sandbox binary is definitely a good thing. Using namespaces (when available) makes sense too (e.g. to create a limited view of the filesystem). I understand that seccomp is not the priority, which is OK. If "layering of landlock/mount namespace isn't possible", layering of mount/Landlock definitely works. I'm suggesting to use namespaces, then seccomp (locking down the availability of user namespaces and uncommon syscalls), then Landlock (partial restrictions but could still help with fine-grained access control, if supported by the running system).

I'm not sure about the availability of user namespaces to unprivileged users though. I know some (most?) distros disable it. I encourage you to check that. For instance, see Unprivileged user namespace restrictions via AppArmor in Ubuntu.

Also I really appreciate your feedback, so thanks for volunteering your time.

I also appreciate this serene discussion and your feedback on Landlock. It's of course up to you to choose what suits you best, taking into account different constraints. As a Landlock developer, I might be biased, but these are genuine security suggestions. One reason Landlock exists is because of the lack of real sandboxing features for Linux: namespaces are very useful, but we'd like the right tool for the right job.

l0kod · 2023-10-03T09:38:32Z

To protect users as much as possible, we then encourage to follow a best-effort security approach, which means to not disable the whole sandboxing even if only some security features are available on users' kernels.

Unfortunately, this has been our exact dilemma with the first Landlock ABI because of FS_REFER being denied by default... We would have to disable Landlock on these kernel versions for many tools.

If the FS_REFER is required to work properly, then you should indeed disable Landlock for kernels not supporting this feature. FS_REFER is denied with the first Landlock version because it was too complex to properly implement this feature at first, which requires to be very careful about the side effects (that could have been a way to bypass a sandbox). We also need to take into account backward and forward compatibility, which implies limitations.

l0kod · 2023-10-03T09:47:04Z

About the FS_REFER access right, it should be noted that a sandbox that doesn't support it will not deny a rename or link operation, but receive an EXDEV error instead, and then the processes (like cp) will usually copy the file (instead of linking it). This also happen when linking or renaming files between different mount points (even bind mounts), which can then also be a limitation of namespace/mount according to their use.

cd-work · 2023-10-04T10:38:41Z

I'm suggesting to use namespaces, then seccomp (locking down the availability of user namespaces and uncommon syscalls), then Landlock (partial restrictions but could still help with fine-grained access control, if supported by the running system).

Using landlock for more granular control on top of mount namespaces might be worthwhile. My main concern with landlock on top of mount namespaces to restrict hierarchy access was that the filesystem tree is basically, fake, so you're not really enforcing anything that you haven't enforced with mount namespaces already anyway.

But one feature regression I did notice with mount namespaces is that you cannot create a write-only bind mount. So that's where landlock could come in again. That said, I don't expect write-only access to be a major requirement of any application we sandbox.

l0kod · 2023-10-04T14:49:58Z

Using landlock for more granular control on top of mount namespaces might be worthwhile. My main concern with landlock on top of mount namespaces to restrict hierarchy access was that the filesystem tree is basically, fake, so you're not really enforcing anything that you haven't enforced with mount namespaces already anyway.

This might be close but not the same. For instance, with Landlock you can allow access to a file or directory, but not its parent (e.g. /). You can allow the creation of some file types but not others. You can allow rename/link between directories with different access (if it doesn't bypass the policy)... Landlock will get more features over time. In a nutshell, namespace/mount's goal is not to be an access control.

cd-work · 2023-10-04T15:49:40Z

This might be close but not the same. For instance, with Landlock you can allow access to a file or directory, but not its parent (e.g. /). You can allow the creation of some file types but not others. You can allow rename/link between directories with different access (if it doesn't bypass the policy)... Landlock will get more features over time. In a nutshell, namespace/mount's goal is not to be an access control.

This is fair, but very hard to abstract on top of with a generic sandbox that should work on macOS too. And for most scenarios this is probably too granular anyway, at least with our sandbox I don't actually know which operations exactly are going to be performed.

We're already working on sandboxing by [blocking all unneeded syscalls](#882). However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready. For security we block the following with `seccomp`: - creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus. - `io_uring` - as discussed [here](paritytech/polkadot#7334 (comment)), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this. - `connect`ing to sockets - the above two points are enough for networking and is what birdcage does (or [used to do](phylum-dev/birdcage#47)) to restrict networking. However, it is possible to [connect to abstract unix sockets](https://lore.kernel.org/landlock/[email protected]/T/#u) to do some kinds of sandbox escapes, so we also block the `connect` syscall. (Intentionally left out of implementer's guide because it felt like too much detail.) `io_uring` is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use `io_uring` in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned `io_uring` for these reasons. Considering `io_uring`'s status, and that it very likely would get detected either by our [recently-added static analysis](#1663) or by testing, I think it is fairly safe to block it. If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us. Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen: 1. An update to `wasmtime` is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications) 2. The new syscall is not detected by our static analysis 3. It is never triggered in any of our tests 4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely) 5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?) Considering how many things would have to go wrong here, we believe it's safe to block `io_uring`. Closes #619 Original PR in Polkadot repo: paritytech/polkadot#7334

cd-work requested a review from kylewillmon September 28, 2023 22:06

cd-work added 2 commits September 29, 2023 00:07

cd-work force-pushed the remove_seccomp_landlock branch from 723ae65 to aab3123 Compare September 28, 2023 22:07

cd-work mentioned this pull request Sep 28, 2023

Allow controlling Landlock ABI versions #41

Closed

kylewillmon approved these changes Sep 29, 2023

View reviewed changes

cd-work mentioned this pull request Oct 4, 2023

Add extension API for Linux sandbox control phylum-dev/cli#1234

Closed

cd-work merged commit 77db8aa into main Oct 5, 2023
11 checks passed

cd-work deleted the remove_seccomp_landlock branch October 5, 2023 21:22

kylewillmon assigned cd-work Oct 6, 2023

mrcnski mentioned this pull request Oct 24, 2023

PVF worker: Add seccomp restrictions (restrict networking) paritytech/polkadot-sdk#2009

Merged

mrcnski mentioned this pull request Nov 29, 2023

PVF: Add Secure Validator Mode paritytech/polkadot-sdk#2486

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove seccomp and landlock #47

Remove seccomp and landlock #47

cd-work commented Sep 28, 2023

cd-work commented Sep 28, 2023

kylewillmon left a comment •

edited

Loading

cd-work commented Sep 29, 2023

l0kod commented Oct 2, 2023

cd-work commented Oct 2, 2023

kylewillmon commented Oct 2, 2023

l0kod commented Oct 3, 2023

l0kod commented Oct 3, 2023

l0kod commented Oct 3, 2023

cd-work commented Oct 4, 2023 •

edited

Loading

l0kod commented Oct 4, 2023

cd-work commented Oct 4, 2023

Remove seccomp and landlock #47

Remove seccomp and landlock #47

Conversation

cd-work commented Sep 28, 2023

cd-work commented Sep 28, 2023

kylewillmon left a comment • edited Loading

Choose a reason for hiding this comment

cd-work commented Sep 29, 2023

l0kod commented Oct 2, 2023

cd-work commented Oct 2, 2023

kylewillmon commented Oct 2, 2023

l0kod commented Oct 3, 2023

l0kod commented Oct 3, 2023

l0kod commented Oct 3, 2023

cd-work commented Oct 4, 2023 • edited Loading

l0kod commented Oct 4, 2023

cd-work commented Oct 4, 2023

kylewillmon left a comment •

edited

Loading

cd-work commented Oct 4, 2023 •

edited

Loading