Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF host: sandbox/harden worker process #882

Closed
pepyakin opened this issue Jan 14, 2022 · 62 comments
Closed

PVF host: sandbox/harden worker process #882

pepyakin opened this issue Jan 14, 2022 · 62 comments
Assignees
Labels
I1-security The node fails to follow expected, security-sensitive, behaviour.

Comments

@pepyakin
Copy link
Contributor

pepyakin commented Jan 14, 2022

Overview

As future work for the PVF validation host we envisioned that the workers should be hardened.

Ideally, if a worker process was compromised then the attacker won't get the rest of the system on a silver plate. To do that, we may want to consider:

  1. running workers under a different user
  2. jailing/unshare or clone(2) with fresh new namespace
  3. seccomp

Summary / Roadmap

https://hackmd.io/@gIXf9c2_SLijWKSGjgDJrg/BkNikRcf2

Potentially Useful Resources

https://www.redhat.com/sysadmin/container-security-seccomp

@pepyakin pepyakin changed the title PVF validation host: security hardening PVF host: security hardening Jan 14, 2022
@eskimor
Copy link
Member

eskimor commented Mar 24, 2023

Should be done before parathreads!

@mrcnski mrcnski self-assigned this Mar 26, 2023
@mrcnski
Copy link
Contributor

mrcnski commented Mar 26, 2023

Unsurprisingly, all of the syscall options are Linux-only, and we again run into portability concerns. Looks like there is potentially some alternative for MacOS, though it's marked deprecated and would also increase our complexity/testing burden.

However, I'm wondering if running workers on Docker would be an acceptable solution. It would solve the portability issues we are repeatedly running into. The performance cost should (theoretically) be negligible, and we would get seccomp (and cgroups for #767). These syscalls should be supported on Docker on Macs, at least according to docker info on my M1 Mac.

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 1
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.49-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 7.667GiB
 Name: docker-desktop
 ID: SCCV:JEJX:743F:W33L:I3VH:NTKL:GKQF:EMIT:E5BR:Q5LG:HMOW:NNYS
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false

@mrcnski
Copy link
Contributor

mrcnski commented Mar 26, 2023

cc @koute in case you're not subscribed.

@koute
Copy link
Contributor

koute commented Mar 26, 2023

I think sandboxing is pretty much non-negotiable so we have to go through with it regardless of any portability concerns. (And if we want security we are pretty much forced to use OS-specific APIs to do it.)

For macOS (and any other OS) for now I'd suggest we just don't sandbox in there and refuse to start a validator unless a mandatory e.g. --running-validator-on-this-os-is-unsecure (or something along those lines) command line argument is given. Virtually no one is currently running validators on macOS anyway, so our priority should be to make this work well on Linux which the majority uses and then we can expand the support to other OSes if we want. (Or as you'd say - just have a blessed Docker setup for macOS.)

@mrcnski
Copy link
Contributor

mrcnski commented Mar 26, 2023

Is there any data on OS distribution for validators? It's something I've been curious about for a while. cc @wpank. It would definitely be nice to drop our portability concerns as it simplifies things.

I agree that sandboxing is non-negotiable!

@bkchr
Copy link
Member

bkchr commented Mar 26, 2023

What portability concerns?

@mrcnski
Copy link
Contributor

mrcnski commented Mar 27, 2023

What portability concerns?

I just mean things we've been wanting to do that do not have good cross-platform support/alternatives. cgroups as mentioned above is one example. So far no such features have been an absolute "must-have", until sandboxing -- so we haven't yet had to make a clear decision about how to handle non-Linux systems.

As we need this now, I'll begin work on it.

@bkchr
Copy link
Member

bkchr commented Mar 27, 2023

so we haven't yet had to make a clear decision about how to handle non-Linux systems.

I think the "solution" @koute proposed above sounds like a good enough solution for now? We don't need to support Mac or Linux kernels from 2016 or whatever.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 6, 2023

I’ve been researching seccomp and playing around with a PoC, and I have some concerns. The most sensible strategy is blocking by default, and explicitly specifying allowed syscalls. But, there are many different versions of syscalls and new ones are added all the time. This is a problem because when workers start running syscalls we didn’t account for, we will get disputes. At a minimum, we would need to restrict the arch and c std lib validators can run on, to a handful that we have approved and tested...

And maybe I’m overcomplicating this, but I also realized the difficulty of testing every possible path through execution. E.g. if there is a panic during execution, it may trigger some unexpected syscalls. Or if a signal happens and some signal handler runs some unexpected code (and we have to account for the sigreturn syscall). Or something else I haven't thought of. I guess in these exceptional cases we would kill the process due to an unexpected syscall and it would be fine, because we retry once before disputing. But all these assumptions feel so fragile...

@koute
Copy link
Contributor

koute commented Apr 7, 2023

The most sensible strategy is blocking by default, and explicitly specifying allowed syscalls.

Yes.

But, there are many different versions of syscalls and new ones are added all the time. This is a problem because when workers start running syscalls we didn’t account for, we will get disputes. At a minimum, we would need to restrict the arch and c std lib validators can run on, to a handful that we have approved and tested...

Yes, different versions of libc could use different syscalls, and that is a problem.

We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable.

Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch.

There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.

Another (crazier alternative) would be to maybe compile the PVF worker as a unikernel (e.g. with something like this) and run it under KVM. This would also give us easy portability to other OSes (we could relatively easily make it also work on macOS)

At a minimum, we would need to restrict the arch

That was originally the plan, wasn't it? For now only amd64's going to be supported.


I general I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should be actually practical for us to implement.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 7, 2023

We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable.

Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch.

Why not use existing proven technology (e.g. Docker) at that point? Either way, this would be very nice for determinism, too.

That was originally the plan, wasn't it? For now only amd64's going to be supported.

Oh, I didn't know that. Has this been discussed before?

I general I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should be actually practical for us to implement.

Yes, I might be overthinking it. If something unexpected happens (e.g. a signal handler runs some unaccounted syscall), killing the process is probably what we want to do anyway.

@eskimor
Copy link
Member

eskimor commented Apr 7, 2023

I would be for gradually improving security here. E.g. start with launching the worker threads with a different user in a chroot environment. Assuming that different user does not have access to keys and such, then we already have 3 security barriers in place, an attacker has to break through, to get to keys or otherwise manipulate the machine:

  1. Break through the wasm sandboxing.
  2. Break through chroot.
  3. Privilege escalation.

It would also be good to identify some threat model. What are we actually worried about? Things that come to mind:

  1. Stealing keys - that would be pretty bad. Assuming memory protection is in place, this is already very hard to do given the above three. We could make it even harder, by not allowing network access via seccomp for example.
  2. Taking control over the validator. E.g. replacing the polkadot binary with a polkadot-evil binary. Should again not be possible with the above.
  3. Intercepting and manipulating packages - effect very similar to 2, hard to do without also being able to do 1 or 2.
  4. ?

In addition, limiting the attack surface is generally a good idea obviously. Any syscall that is exposed could have some bug leading to privilege escalation. At the same time, if we are too restrictive we risk DoSing parachains and causing consensus issues on upgrades. Therefore ramping the security up there slowly might make sense. E.g. in addition to the above start by only blocking syscalls that are clearly not needed (e.g. access to network), then once we have more tooling/tests or whatever in place to be able to block more syscalls with confidence, we do so.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 7, 2023

I agree that gradually amping up the security is a good idea. I started with seccomp per @koute's suggestion in #828, but seems like that still requires more research/effort to not break determinism, i.e. we have to control the c lib and arch. I'm not concerned about dependency upgrades (wasmtime) because new syscalls would be caught in testing, it would just slow down our ability to upgrade wasmtime in case of critical CVEs.

E.g. start with launching the worker threads with a different user in a chroot environment.

In my research I found that chroot is not an effective security mechanism, and easy to break out of. I do think that running unshare with CLONE_NEWUSER is a good start, though that requires the calling process to not be threaded, so we would have to spawn an intermediate process first, as seen here.

Also, is it safe to assume that validators are running as root?

  1. ?

One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed.

Or, if they can get some source of randomness they can do an untargeted attack, and this wouldn't require much of an escalation at all. I.e. a baddie can do significant economic damage by voting against with 1/3 chance, without even stealing keys or completely replacing the binary.

@eskimor
Copy link
Member

eskimor commented Apr 7, 2023

One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed.

by e.g. killing its own worker?

@mrcnski
Copy link
Contributor

mrcnski commented Apr 7, 2023

by e.g. killing its own worker?

Yeah, or also by causing an error (right now all execution errors lead to invalid). If it always did this, then all validators would vote the same way, but if it could cause an invalid vote only sometimes, either target or untargeted way, then it would trigger some chaos.

So I agree that we can iteratively bump up security (splitting up the changes is good anyway), but eventually we need it to be pretty airtight. Luckily preparation/execution should require very little capabilities: opening/reading a file, memory operations. (May not be the full list, so far I ran into a bug and enabled seccomp on the wrong thread.) (Would be nice to obviate the first one and read the file ahead of time, passing in the contents to the worker. Not sure if that is supported, but it would also make this concern moot.)

@mrcnski
Copy link
Contributor

mrcnski commented Apr 7, 2023

BTW I just found this issue about specializing on Linux -- linking it here since I didn't know about it. Let's share more thoughts on the topic there.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 7, 2023

There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.

@koute Can you elaborate on this?

@koute
Copy link
Contributor

koute commented Apr 7, 2023

There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.

@koute Can you elaborate on this?

Calling some of libc functions may end up calling different syscalls depending on the kernel the host's running. For example, there is the clone syscall and Linux 5.3+ only clone3, so glibc will try to use clone3 and if that fails will fall back to clone. Although now that I think about it, IIRC for clone3/clone at least glibc looks at whether clone3 returns ENOSYS and then tries clone instead of explicitly checking the kernel version.

Nevertheless, there are two possible issues here:

  1. Different libc versions can call different "versions" of a particular syscall for the same libc function.
  2. Different libc versions can call a syscall or can instead emulate in userspace the functionality of a particular libc function.

This I think can be mostly solved by:

  1. hooking into the uname syscall through seccomp and get it to always return the same values for the kernel version (so the "calls different syscall based on the kernel version" problem is solved),
  2. by having it return ENOSYS for blocked syscalls and not abort the process when an unknown syscall is called (so the "opportunistically tries to call newer versions of syscalls when available" problem is solved, as it'll fall back to the older one)

So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered.

@bkchr
Copy link
Member

bkchr commented Apr 9, 2023

opening/reading a file

This is something that could also be done by the parent process and then sending/receiving the artifact as we do the other communication.

@mrcnski
Copy link
Contributor

mrcnski commented Apr 11, 2023

Thanks for the proposal @koute!

This I think can be mostly solved by:

  1. hooking into the uname syscall through seccomp and get it to always return the same values for the kernel version (so the "calls different syscall based on the kernel version" problem is solved),
  2. by having it return ENOSYS for blocked syscalls and not abort the process when an unknown syscall is called (so the "opportunistically tries to call newer versions of syscalls when available" problem is solved, as it'll fall back to the older one)

I don't think that (1) is possible with seccomp, though please correct me if I'm wrong. The closest thing is SECCOMP_RET_ERRNO which doesn't quite work (we could use it for (2) though):

This value results in the SECCOMP_RET_DATA portion of the
filter's return value being passed to user space as the
errno value without executing the system call.

Without (1), does your proposal still work?

Although now that I think about it, IIRC for clone3/clone at least glibc looks at whether clone3 returns ENOSYS and then tries clone instead of explicitly checking the kernel version.

Looks like musl does this too, but it also has a lot of ifdefs e.g. #ifdef SYS_select. So I am guessing it will also try different syscalls based on the architecture it's compiled on, which complicates things unless we restrict the arch.

So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered.

I would really want to test this on different kernel versions as any oversights here can lead to disputes/slashing.

Also, can we assume an oldest supported kernel version, or that amd64 is the only arch? There are some values in the Validators guide under "Reference Hardware," but it explicitly states that this is "not a hard requirement". And while I see now that we only provide an amd64 release binary, I don't see a warning anywhere for validators not to build the source and run on e.g. arm, unless I missed it. That said, I think we do need to restrict "secure mode" (seccomp mode) to amd64 and arm64 because seccompiler only supports those

Counter-proposal

  1. Trace and determine what libc calls are made by the PVF jobs. (by executing real PVFs)
  2. We build against musl as @koute proposed above.
  3. Determine every syscall that could be made by each libc call we observed (just by looking through the musl source for each function).
  4. Whitelist these syscalls.
  5. Set the seccomp filter to kill the process on all unrecognized syscalls.
  6. Implement "secure mode" flag and restrict it to amd64 and arm64.

This would ensure that in legitimate execution we never call an unexpected syscall unless we upgrade musl.

(2) requires coordination with the build team. Not clear how big of an effort it is, but non-trivial for sure.

For (5), I believe that for unrecognized syscalls we should kill the process as opposed to returning ENOSYS. With the latter strategy, if an unknown syscall is called (e.g. by an attacker) the thread would keep running and eventually (probably?) return an error, but the error could really be anything. This is bad because we would like to treat certain errors like file-not-found as internal (do not vote against the candidate). I'm not sure what would happen in reality if we tried to read a file and ENOSYS was returned, but that's the point -- we need a deterministic failure mode.

@koute
Copy link
Contributor

koute commented Apr 12, 2023

I don't think that (1) is possible with seccomp, though please correct me if I'm wrong.

AFAIK it is possible. You use SECCOMP_RET_TRACE and then have the parent process modify the syscall's return value with ptrace.

So I am guessing it will also try different syscalls based on the architecture it's compiled on, which complicates things unless we restrict the arch.

Yes, this must be amd64-only for now. Each architecture has slightly different syscalls (e.g. amd64 uses the mmap syscall which takes the offset in bytes and arm uses the mmap2 syscall which takes the offset in pages) and the same syscalls can have different syscall numbers (but this can be easily mitigated by just grabbing the syscall numbers from the libc crate).

As I've previously said, should not be a big deal since essentially everyone who's running a validator runs on amd64. Later on we can expand it to aarch64 once this works properly on amd64.

1. Trace and determine what libc calls are made by the PVF jobs. (by executing real PVFs)

Yes. It'd be good to have e.g. some debugging flag or something for polkadot which would run the PVF worker under strace and dump those logs somewhere.

2. We build against musl as @koute proposed above.

This does have some logistical issues though. Do we build everything under musl and just ship one polkadot binary? We'd have to disallow running a validator with non-musl binaries, which from the users' POV might be a pain (some people compile their own binaries). Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it.

My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs. This will work automatically for the users manually compiling polkadot (as long as they have the musl target installed, but we can just print an error like we do for WASM if it's missing), doesn't need much extra work from the CI people (just needs the target to be rustup-installed in the CI docker image), and in theory is also better for security as a smaller binary will give less opportunities for attackers to chain together ROP gadgets if they get remote code execution somehow. (Although with good enough sandboxing it probably won't matter, but still.)

I'm not sure what would happen in reality if we tried to read a file and ENOSYS was returned, but that's the point -- we need a deterministic failure mode.

Most likely the call would return an error, and would be passed along to the worker. (so e.g. if there's an unwrap there then it'd panic) So this shouldn't really be any less deterministic (as long as the libc used is the same).

@mrcnski
Copy link
Contributor

mrcnski commented Apr 12, 2023

Thanks! Indeed, looks like that would work if we hooked up ptrace, here's a amd64-specific way to do it. That would be a nice solution. We would have a small surface area of syscalls, making it more secure and also more predictable and deterministic (because we know that only one of e.g. clone and clone3 would ever be used). I'm also good with restricting the arch and keeping the initial scope low, since other validators can just run in insecure mode for now.

My only reservation is that I don't feel confident about relying on ENOSYS from unsupported syscalls to trigger a panic. I'm not convinced it would panic in the file-not-found case, and there also may be other error states in the future that we would want to treat as internal i.e. not vote against. (Also, we currently treat panics as internal errors, though I have an issue open to change that.)

My approach (casting a wider net of syscalls and killing the process when seccomp is triggered) still seems safer to me. The main issue with it is just that we would need to revisit the whitelist whenever we upgrade musl, so we would to have some process around that, but I expect it to be done rarely. This way we also don't need the "hack" with uname and ptrace. Let me know what you think.

My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.

I agree that a separate binary for the workers makes sense. It would indeed be better for security and executing with a known libc would be a win for determinism. Writing the binary to disk on execution would avoid the UX problems that are being discussed on the forum.

Yes. It'd be good to have e.g. some debugging flag or something for polkadot which would run the PVF worker under strace and dump those logs somewhere.

Definitely, thanks for the tip!

@sandreim
Copy link
Contributor

FWIW, there is this nice crate that we could use to build the filter BPF programs: https://github.com/firecracker-microvm/firecracker/tree/main/src/seccompiler

@mrcnski
Copy link
Contributor

mrcnski commented Apr 21, 2023

Unfortunately, building a separate crate without tokio's extra features still brings in a union of all tokio features from all transitive deps. I think it's possible to remove the immediate dep on tokio, but removing other deps that bring in tokio seems like a bigger project. Maybe cargo-guppy could help with it, but not sure it's worth going down the rabbithole.

And building with lto = "fat" did not succeed in reducing the number of syscalls in the final binary.

(Though it did reduce the binary size by about 25% (20mb -> 15mb), which is nice if we want to embed it in Polkadot. But it took almost 5x longer to compile (2m -> 9m). "thin" took 7m for hardly any size reduction. My overall sense is LTO is not worth it.)

For now I suggest that we block the following I/O syscalls, even though they are present in the binary:

    0 (read)
    1 (write)
    2 (open)
    3 (close)
    4 (stat)
    5 (fstat)
    7 (poll)
    8 (lseek)
    ...
    16 (ioctl)
    19 (readv)
    20 (writev)
    ...
    41 (socket)
    42 (connect)
    45 (recvfrom)
    46 (sendmsg)
    53 (socketpair)
    55 (getsockopt)
    ...
    82 (rename)
    83 (mkdir)
    87 (unlink)
    89 (readlink)
    ...
    213 (epoll_create)
    232 (epoll_wait)
    233 (epoll_ctl)
    281 (epoll_pwait)
    291 (epoll_create1)
    ...
    257 (openat)
    262 (newfstatat)
    262 (newfstatat)
    263 (unlinkat)
    ...
    284 (eventfd)
    290 (eventfd2)
    ...
    318 (getrandom)

It's a lot, but I don't think there's a legitimate reason for the work threads to ever call these. Letting them through the sandbox would defeat the point.

Next up, I'll work on embedding the musl binaries and extracting them when polkadot is run.

@bkchr
Copy link
Member

bkchr commented Apr 21, 2023

(Though it did reduce the binary size by about 25% (20mb -> 15mb), which is nice if we want to embed it in Polkadot. But it took almost 5x longer to compile (2m -> 9m). "thin" took 7m for hardly any size reduction. My overall sense is LTO is not worth it.)

As this is should only be done when doing a production build, there should be no problem with compiling 9m.

@burdges
Copy link

burdges commented Apr 22, 2023

About getrandom, PVFs must be deterministic, but if folks want singing code to run in off-chain workers, then we should expose system randomness there somehow.

@bkchr
Copy link
Member

bkchr commented Apr 22, 2023

About getrandom, PVFs must be deterministic, but if folks want singing code to run in off-chain workers, then we should expose system randomness there somehow.

We already do this. Offchain worker has a randomness function.

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I1-security The node fails to follow expected, security-sensitive, behaviour. and removed D9-needsaudit 👮 labels Aug 25, 2023
@mrcnski mrcnski mentioned this issue Sep 7, 2023
8 tasks
@the-right-joyce the-right-joyce moved this to In Progress in parachains team board Oct 18, 2023
mrcnski added a commit that referenced this issue Oct 24, 2023
We're already working on sandboxing by [blocking all unneeded syscalls](#882). However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready.

For security we block the following with `seccomp`:

- creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus.

- `io_uring` - as discussed [here](paritytech/polkadot#7334 (comment)), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this.

- `connect`ing to sockets - the above two points are enough for networking and is what birdcage does (or [used to do](phylum-dev/birdcage#47)) to restrict networking. However, it is possible to [connect to abstract unix sockets](https://lore.kernel.org/landlock/[email protected]/T/#u) to do some kinds of sandbox escapes, so we also block the `connect` syscall.

(Intentionally left out of implementer's guide because it felt like too much detail.)

`io_uring` is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use `io_uring` in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned `io_uring` for these reasons.

Considering `io_uring`'s status, and that it very likely would get detected either by our [recently-added static analysis](#1663) or by testing, I think it is fairly safe to block it.

If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us.

Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen:

1. An update to `wasmtime` is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications)
2. The new syscall is not detected by our static analysis
3. It is never triggered in any of our tests
4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely)
5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?)

Considering how many things would have to go wrong here, we believe it's safe to block `io_uring`.

Closes #619
Original PR in Polkadot repo: paritytech/polkadot#7334
@mrcnski
Copy link
Contributor

mrcnski commented Nov 26, 2023

We have landlock, networking restrictions with seccomp, unshare/pivot_root, separate processes, and are in the process of implementing clone with sandbox args. We've decided that this is enough security until the PolkaVM migration.

@mrcnski mrcnski closed this as completed Nov 26, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in PVF Security Hardening Nov 26, 2023
claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
Co-authored-by: David Dunn <[email protected]>
Co-authored-by: Clara van Staden <[email protected]>
Co-authored-by: Alistair Singh <[email protected]>
Co-authored-by: Ron <[email protected]>
Co-authored-by: claravanstaden <Cats 4 life!>
@eskimor eskimor moved this from In Progress to Completed in parachains team board Jan 30, 2024
@sandreim sandreim assigned s0me0ne-unkn0wn and unassigned mrcnski Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I1-security The node fails to follow expected, security-sensitive, behaviour.
Projects
Status: Backlog
Status: Completed
Status: In Progress
Development

No branches or pull requests

9 participants