PVF host: sandbox/harden worker process #882
Comments
Should be done before parathreads! |
Unsurprisingly, all of the syscall options are Linux-only, and we again run into portability concerns. Looks like there is potentially some alternative for macOS, though it's marked deprecated and would also increase our complexity/testing burden. However, I'm wondering if running workers on Docker would be an acceptable solution. It would solve the portability issues we are repeatedly running into. The performance cost should (theoretically) be negligible, and we would get

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 1
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.49-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 7.667GiB
 Name: docker-desktop
 ID: SCCV:JEJX:743F:W33L:I3VH:NTKL:GKQF:EMIT:E5BR:Q5LG:HMOW:NNYS
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false |
cc @koute in case you're not subscribed. |
I think sandboxing is pretty much non-negotiable so we have to go through with it regardless of any portability concerns. (And if we want security we are pretty much forced to use OS-specific APIs to do it.) For macOS (and any other OS) for now I'd suggest we just don't sandbox in there and refuse to start a validator unless a mandatory e.g. |
Is there any data on OS distribution for validators? It's something I've been curious about for a while. cc @wpank. It would definitely be nice to drop our portability concerns as it simplifies things. I agree that sandboxing is non-negotiable! |
What portability concerns? |
I just mean things we've been wanting to do that do not have good cross-platform support/alternatives. As we need this now, I'll begin work on it. |
I think the "solution" @koute proposed above sounds like a good enough solution for now? We don't need to support Mac or Linux kernels from 2016 or whatever. |
I’ve been researching And maybe I’m overcomplicating this, but I also realized the difficulty of testing every possible path through execution. E.g. if there is a panic during execution, it may trigger some unexpected syscalls. Or if a signal happens and some signal handler runs some unexpected code (and we have to account for the |
Yes.
Yes, different versions of libc could use different syscalls, and that is a problem. We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable. Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch. There's also the issue of the libc picking different syscalls depending on the kernel that the host is running. But we could probably easily simulate that by mocking the kernel version to an older one, running the thing, and seeing if it works. Another (crazier) alternative would be to compile the PVF worker as a unikernel (e.g. with something like this) and run it under KVM. This would also give us easy portability to other OSes (we could relatively easily make it also work on macOS).
That was originally the plan, wasn't it? For now only amd64's going to be supported. In general, I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should actually be practical for us to implement. |
Why not use existing proven technology (e.g. Docker) at that point? Either way, this would be very nice for determinism, too.
Oh, I didn't know that. Has this been discussed before?
Yes, I might be overthinking it. If something unexpected happens (e.g. a signal handler runs some unaccounted syscall), killing the process is probably what we want to do anyway. |
I would be for gradually improving security here. E.g. start with launching the worker threads with a different user in a chroot environment. Assuming that different user does not have access to keys and such, we already have three security barriers in place that an attacker has to break through to get to keys or otherwise manipulate the machine:
It would also be good to identify some threat model. What are we actually worried about? Things that come to mind:
In addition, limiting the attack surface is generally a good idea obviously. Any syscall that is exposed could have some bug leading to privilege escalation. At the same time, if we are too restrictive we risk DoSing parachains and causing consensus issues on upgrades. Therefore ramping the security up there slowly might make sense. E.g. in addition to the above start by only blocking syscalls that are clearly not needed (e.g. access to network), then once we have more tooling/tests or whatever in place to be able to block more syscalls with confidence, we do so. |
I agree that gradually amping up the security is a good idea. I started with
In my research I found that Also, is it safe to assume that validators are running as root?
One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed. Or, if they can get some source of randomness they can do an untargeted attack, and this wouldn't require much of an escalation at all. I.e. a baddie can do significant economic damage by voting against with 1/3 chance, without even stealing keys or completely replacing the binary. |
by e.g. killing its own worker? |
Yeah, or also by causing an error (right now all execution errors lead to invalid). If it always did this, then all validators would vote the same way, but if it could cause an invalid vote only sometimes, in either a targeted or untargeted way, then it would trigger some chaos. So I agree that we can iteratively bump up security (splitting up the changes is good anyway), but eventually we need it to be pretty airtight. Luckily preparation/execution should require very few capabilities: opening/reading a file, memory operations. (May not be the full list, so far I ran into a bug and enabled seccomp on the wrong thread.) (Would be nice to obviate the first one and read the file ahead of time, passing in the contents to the worker. Not sure if that is supported, but it would also make this concern moot.) |
BTW I just found this issue about specializing on Linux -- linking it here since I didn't know about it. Let's share more thoughts on the topic there. |
@koute Can you elaborate on this? |
Calling some of Nevertheless, there are two possible issues here:
This I think can be mostly solved by:
So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered. |
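The "refuse to run if an even older kernel is encountered" check suggested above could look roughly like the following. This is only a sketch: `MIN_KERNEL` is a hypothetical placeholder (the thread does not settle on an oldest supported version), the function names are made up for illustration, and the release string is assumed to come from something like `uname -r`.

```rust
/// Hypothetical oldest supported kernel (major, minor). The thread does not
/// name an actual minimum; this value is an assumption for illustration.
const MIN_KERNEL: (u32, u32) = (5, 4);

/// Parse the leading `major.minor` out of a kernel release string such as
/// "5.15.49-linuxkit". Returns `None` if the string is malformed.
fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix (e.g. "15-generic"), so keep
    // only its leading digits.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor: u32 = digits.parse().ok()?;
    Some((major, minor))
}

/// Returns true only if the release parses and is at least `MIN_KERNEL`.
/// Unparseable strings are treated as unsupported, so the caller refuses
/// to start rather than guessing.
fn kernel_is_supported(release: &str) -> bool {
    match parse_kernel_release(release) {
        // Tuples compare lexicographically: major first, then minor.
        Some(version) => version >= MIN_KERNEL,
        None => false,
    }
}
```

Failing closed on a malformed release string matches the spirit of the proposal: an unknown kernel is treated the same as one that is too old.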
This is something that could also be done by the parent process and then sending/receiving the artifact as we do the other communication. |
Thanks for the proposal @koute!
I don't think that (1) is possible with
Without (1), does your proposal still work?
Looks like musl does this too, but it also has a lot of
I would really want to test this on different kernel versions as any oversights here can lead to disputes/slashing. Also, can we assume an oldest supported kernel version, or that amd64 is the only arch? There are some values in the Validators guide under "Reference Hardware," but it explicitly states that this is "not a hard requirement". And while I see now that we only provide an amd64 release binary, I don't see a warning anywhere for validators not to build the source and run on e.g. arm, unless I missed it. That said, I think we do need to restrict "secure mode" (seccomp mode) to amd64 and arm64 because
Counter-proposal
This would ensure that in legitimate execution we never call an unexpected syscall unless we upgrade musl. (2) requires coordination with the build team. Not clear how big of an effort it is, but non-trivial for sure. For (5), I believe that for unrecognized syscalls we should kill the process as opposed to returning |
AFAIK it is possible. You use
Yes, this must be amd64-only for now. Each architecture has slightly different syscalls (e.g. amd64 uses the As I've previously said, should not be a big deal since essentially everyone who's running a validator runs on amd64. Later on we can expand it to aarch64 once this works properly on amd64.
Yes. It'd be good to have e.g. some debugging flag or something for This does have some logistical issues though. Do we build everything under musl and just ship one My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in
Most likely the call would return an error, and would be passed along to the worker. (so e.g. if there's an |
Thanks! Indeed, looks like that would work if we hooked up ptrace, here's an amd64-specific way to do it. That would be a nice solution. We would have a small surface area of syscalls, making it more secure and also more predictable and deterministic (because we know that only one of e.g. My only reservation is that I don't feel confident about relying on My approach (casting a wider net of syscalls and killing the process when seccomp is triggered) still seems safer to me. The main issue with it is just that we would need to revisit the whitelist whenever we upgrade musl, so we would need to have some process around that, but I expect it to be done rarely. This way we also don't need the "hack" with uname and ptrace. Let me know what you think.
I agree that a separate binary for the workers makes sense. It would indeed be better for security and executing with a known libc would be a win for determinism. Writing the binary to disk on execution would avoid the UX problems that are being discussed on the forum.
Definitely, thanks for the tip! |
FWIW, there is this nice crate that we could use to build the filter BPF programs: https://github.com/firecracker-microvm/firecracker/tree/main/src/seccompiler |
Unfortunately, building a separate crate without tokio's extra features still brings in a union of all tokio features from all transitive deps. I think it's possible to remove the immediate dep on tokio, but removing other deps that bring in tokio seems like a bigger project. Maybe cargo-guppy could help with it, but not sure it's worth going down the rabbit hole. And building with (Though it did reduce the binary size by about 25% (20 MB -> 15 MB), which is nice if we want to embed it in Polkadot. But it took almost 5x longer to compile (2m -> 9m).)

For now I suggest that we block the following I/O syscalls, even though they are present in the binary:

0 (read) 1 (write) 2 (open) 3 (close) 4 (stat) 5 (fstat) 7 (poll) 8 (lseek)
...
16 (ioctl) 19 (readv) 20 (writev)
...
41 (socket) 42 (connect) 45 (recvfrom) 46 (sendmsg) 53 (socketpair) 55 (getsockopt)
...
82 (rename) 83 (mkdir) 87 (unlink) 89 (readlink)
...
213 (epoll_create) 232 (epoll_wait) 233 (epoll_ctl) 281 (epoll_pwait) 291 (epoll_create1)
...
257 (openat) 262 (newfstatat) 263 (unlinkat)
...
284 (eventfd) 290 (eventfd2)
...
318 (getrandom)

It's a lot, but I don't think there's a legitimate reason for the worker threads to ever call these. Letting them through the sandbox would defeat the point. Next up, I'll work on embedding the musl binaries and extracting them when |
As this should only be done when doing a production build, there should be no problem with compilation taking 9m. |
About getrandom, PVFs must be deterministic, but if folks want signing code to run in off-chain workers, then we should expose system randomness there somehow. |
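For illustration, the block list in the comment above can be written down as a lookup table. This is a sketch under stated assumptions: it only contains the entries actually quoted in the comment (the elided "..." entries are not filled in), the numbers are the amd64 syscall numbers as given there, and `BLOCKED_IO_SYSCALLS` and `blocked_syscall_name` are hypothetical names, not anything from the Polkadot codebase. A real filter would be compiled to BPF and installed with seccomp rather than checked in user space.

```rust
/// (number, name) pairs quoted in the comment above; amd64 only.
/// The list there is elided in places, so this table is partial.
const BLOCKED_IO_SYSCALLS: &[(u32, &str)] = &[
    (0, "read"), (1, "write"), (2, "open"), (3, "close"),
    (4, "stat"), (5, "fstat"), (7, "poll"), (8, "lseek"),
    (16, "ioctl"), (19, "readv"), (20, "writev"),
    (41, "socket"), (42, "connect"), (45, "recvfrom"),
    (46, "sendmsg"), (53, "socketpair"), (55, "getsockopt"),
    (82, "rename"), (83, "mkdir"), (87, "unlink"), (89, "readlink"),
    (213, "epoll_create"), (232, "epoll_wait"), (233, "epoll_ctl"),
    (281, "epoll_pwait"), (291, "epoll_create1"),
    (257, "openat"), (262, "newfstatat"), (263, "unlinkat"),
    (284, "eventfd"), (290, "eventfd2"),
    (318, "getrandom"),
];

/// Returns the name of the syscall if `nr` is on the block list,
/// or `None` if the table does not mention it.
fn blocked_syscall_name(nr: u32) -> Option<&'static str> {
    BLOCKED_IO_SYSCALLS
        .iter()
        .find(|entry| entry.0 == nr)
        .map(|entry| entry.1)
}
```

Keeping the names next to the numbers makes it easier to log a readable message when a violation is detected, and to review the table whenever the bundled libc is upgraded.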
We already do this. Offchain worker has a randomness function. |
We're already working on sandboxing by [blocking all unneeded syscalls](#882). However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready.

For security we block the following with `seccomp`:

- creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus.
- `io_uring` - as discussed [here](paritytech/polkadot#7334 (comment)), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this.
- `connect`ing to sockets - the above two points are enough to restrict networking and are what birdcage does (or [used to do](phylum-dev/birdcage#47)). However, it is possible to [connect to abstract unix sockets](https://lore.kernel.org/landlock/[email protected]/T/#u) to do some kinds of sandbox escapes, so we also block the `connect` syscall.

(Intentionally left out of implementer's guide because it felt like too much detail.)

`io_uring` is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use `io_uring` in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned `io_uring` for these reasons.

Considering `io_uring`'s status, and that it very likely would get detected either by our [recently-added static analysis](#1663) or by testing, I think it is fairly safe to block it.

If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator.

On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us.

Actually, an even worse scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen:

1. An update to `wasmtime` is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications)
2. The new syscall is not detected by our static analysis
3. It is never triggered in any of our tests
4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely)
5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?)

Considering how many things would have to go wrong here, we believe it's safe to block `io_uring`.

Closes #619

Original PR in Polkadot repo: paritytech/polkadot#7334
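The three rules in the PR description above can be modelled as a small decision function. This is a user-space sketch of the policy only, not the filter the PR installs: the constants are the amd64 syscall numbers, the choice of `socket` and `socketpair` as the "creation of new sockets" calls is an assumption, and treating a violation as fatal (`Verdict::Kill`) rather than returning an error is also an assumption. `Verdict` and `network_verdict` are hypothetical names.

```rust
/// What the (modelled) filter does with a syscall.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Allow,
    /// Assumption: a violation terminates the worker.
    Kill,
}

// amd64 syscall numbers.
const SYS_SOCKET: u64 = 41;
const SYS_CONNECT: u64 = 42;
const SYS_SOCKETPAIR: u64 = 53;
const SYS_IO_URING_SETUP: u64 = 425;
const SYS_IO_URING_ENTER: u64 = 426;
const SYS_IO_URING_REGISTER: u64 = 427;

/// Network-only policy: everything is allowed except socket creation,
/// `connect`, and the io_uring entry points.
fn network_verdict(nr: u64) -> Verdict {
    match nr {
        // Creation of new sockets (assumed to mean these two calls).
        SYS_SOCKET | SYS_SOCKETPAIR => Verdict::Kill,
        // Closes the abstract unix socket escape mentioned above.
        SYS_CONNECT => Verdict::Kill,
        // io_uring can issue networking operations asynchronously.
        SYS_IO_URING_SETUP | SYS_IO_URING_ENTER | SYS_IO_URING_REGISTER => Verdict::Kill,
        // Default-allow: this PR deliberately has a narrow scope.
        _ => Verdict::Allow,
    }
}
```

Note that this is a deny list with a default of allow, which is the opposite of the stricter allow-list approach discussed earlier in the thread; that is what keeps the risk of accidentally breaking legitimate execution low.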
We have landlock, networking restrictions with seccomp, unshare/pivot_root, separate processes, and are in the process of implementing clone with sandbox args. We've decided that this is enough security until the PolkaVM migration. |
Co-authored-by: David Dunn <[email protected]> Co-authored-by: Clara van Staden <[email protected]> Co-authored-by: Alistair Singh <[email protected]> Co-authored-by: Ron <[email protected]> Co-authored-by: claravanstaden <Cats 4 life!>
Overview
As future work for the PVF validation host we envisioned that the workers should be hardened.
Ideally, if a worker process is compromised, the attacker shouldn't get the rest of the system on a silver platter. To do that, we may want to consider:
Summary / Roadmap
https://hackmd.io/@gIXf9c2_SLijWKSGjgDJrg/BkNikRcf2
Potentially Useful Resources
https://www.redhat.com/sysadmin/container-security-seccomp