
Communication with devices (shared memory, ioctl) #353

Closed
dimakuv opened this issue Jan 20, 2022 · 7 comments · Fixed by #827
Comments

@dimakuv

dimakuv commented Jan 20, 2022

This describes the current state of my "communication with devices" project.

This project is in the context of the Linux-SGX PAL, though it should be applicable to other TEEs as well. This project is trivially implementable in the Linux PAL.

Legend:

✔️ Done (merged to master)
🚧 In progress (usually has a PR open)
⭐ Next (usually will be unlocked by current "in progress")

Rationale

Intel SGX is a CPU-only technology. This project proposes a generic way to enable communication with host devices from within the SGX (Gramine) enclave. There are two key interfaces in CPU–device communication: (1) the device-backed mmap system call and (2) the ioctl system call.

The device-backed mmap system call allows creating memory regions that are shared between the SGX enclave and an arbitrary host device. The ioctl system call provides a generic interface for sending an arbitrary request to an arbitrary device, passing arbitrarily complex objects to the device's memory and back to CPU memory (RAM). Adding support for these two system calls, together with the standard and already-supported open, read, write, and close system calls on devices, is sufficient to enable interactions between the trusted SGX enclave and the host device.

The goal of this project is the enablement of communication between SGX enclaves and host devices. Protecting this communication from eavesdropping and other attacks is a non-goal of this project. Adding integrity checks, encryption, side-channel mitigations, etc. on top of the enabled insecure communication is the responsibility of the application.

This project was developed locally for about a year, with a particular focus on Direct Rendering Manager (DRM) workloads and interactions. Several interesting workloads were enabled and run adequately. Now it's time to upstream this work (except for device-specific hacks). The branch with the latest snapshot is here: b12f340 (danger! some of the code there consists of terrible hacks).

TODO: Add a link to the whitepaper that contains explanations, diagrams and pseudo-code. UPDATE: The whitepaper: https://arxiv.org/abs/2203.01813

Sub-projects

  • ✔️ Standard I/O operations via read()/write() (already implemented in Gramine)
  • 🚧 Shared untrusted memory via mmap() -- done in [LibOS,PAL/Linux-SGX] Add shared untrusted memory support #827
  • ✔️ Device-specific I/O operations via ioctl()
  • ⛔ Example with a CUSE device that uses all of the above: read/write/mmap/ioctl

Below is the approximate split of these sub-projects into series of PRs.

Example with CUSE device

Shared untrusted memory via mmap()

  • Introduce two non-overlapping address ranges: private range and shared range

    • For LibOS: private range is used for mmap(MAP_PRIVATE), shared range is used for mmap(MAP_SHARED)
    • For common PAL: existing g_pal_public_state.user_address becomes private_range, plus a new shared_range
    • For Linux PAL: private range is e.g. 0x0 - 1TB and shared range is e.g. 1TB - 2TB
    • For Linux-SGX PAL: private range is enclave memory and shared range is e.g. 1TB - 2TB of untrusted memory (actually, a 1TB-sized region allocated with host's mmap() to reserve this memory)
    • For Linux-SGX PAL: only allow shared range if sgx.insecure__enable_shared_range = "true"
    • Main changes in: shim_vma.c, shim_mmap.c, disallowing addr = NULL in PAL memory-alloc functions
  • Add device-backed mmap: mmap(..., <device-fd>, ...)

    • For LibOS: add device-backed support for mmap(), mprotect(), munmap()
    • For Linux PAL: add .mmap callback to the "device" PAL handle
    • For Linux-SGX PAL: add .mmap callback to the "device" PAL handle
  • Augment CUSE device example with mmap() tests

    • Test for mmap(MAP_SHARED, <cuse-device-fd>)
    • Test for mmap(MAP_PRIVATE, <cuse-device-fd>)
    • Negative test for opening with O_DIRECT and then mmapping -- this is not supported by current in-Linux-kernel FUSE/CUSE

Device-specific I/O operations via ioctl()

  • Add new PAL API: DkDeviceIoControl(handle, cmd, arg)

    • For LibOS: trivially call in shim_do_ioctl()
    • For Linux PAL: trivially call host-level ioctl()
    • For Linux-SGX PAL: add trivial (flat arg, no-data-nesting) ocall_ioctl()
    • Augment CUSE device with CUSE_UNRESTRICTED_IOCTL and ioctl(<dummy-cmd>, <flat-arg>)
  • Add first (simple) version of IOCTL struct definition to Linux-SGX PAL

    • Add sgx.ioctl_structs and sgx.allowed_ioctls
    • No caching
    • No boolean expressions (no onlyif)
    • Augment CUSE device with ioctl(<dummy-cmd2>, <complex-arg>)
  • Add final version of IOCTL struct definition to Linux-SGX PAL

    • No caching
    • Add boolean expressions (onlyif)
    • Augment CUSE device with ioctl(<dummy-cmd3>, <onlyif-arg>)
  • Add caching of recently-used IOCTL structs to the final version of the IOCTL struct definition in the Linux-SGX PAL

    • [ Performance optimization: add some loop in CUSE to manually check performance? ]
  • Add documentation for sgx.ioctl_structs and sgx.allowed_ioctls and the IOCTL struct definitions in the manifest
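For illustration, a manifest fragment using the sgx.ioctl_structs/sgx.allowed_ioctls scheme discussed above might look like the sketch below. The request code, struct name, and field layout are invented, and the exact key names are subject to the final documentation:

```toml
# Hypothetical sketch: allow one ioctl on a device and describe the memory
# layout of its argument so the PAL can copy it between enclave and host.
sgx.allowed_ioctls = [
  { request_code = 0xc0085901, struct = "dummy_flat_arg" },
]

sgx.ioctl_structs.dummy_flat_arg = [
  { size = 4, direction = "in" },   # 32-bit input field
  { size = 4, direction = "out" },  # 32-bit output field
]
```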

dimakuv self-assigned this Jan 20, 2022
@dimakuv
Author

dimakuv commented Feb 1, 2022

Example with CUSE device

Some more details on this sub-project.

How CUSE works

The Linux kernel exposes the /dev/cuse driver. The user-mode driver (UMD) is a normal C application that must be started with root privileges. This UMD calls the cuse_lowlevel_main() function to connect to /dev/cuse, and the Linux kernel's CUSE module creates a new device with the name requested by the UMD, e.g. /dev/mydevice. After this, the UMD may drop privileges. The UMD now listens on /dev/mydevice and gets client requests via the callbacks in struct cuse_lowlevel_ops = {.open, .read, .write, .ioctl}.

Now another (client) application can open the /dev/mydevice file (the client does not use /dev/cuse at all). The client can now perform normal read(), write(), ioctl() using this device fd.

The UMD program removes the /dev/mydevice file when it terminates.

Unfortunately, CUSE doesn't have the mmap() callback. It was added in 2011 but then was reverted in 2012.
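For reference, a minimal UMD skeleton against libfuse's CUSE API could look like the sketch below (not runnable here: it must be linked against libfuse and started as root so that /dev/cuse is accessible; the device name and the fixed read reply are made up):

```c
/* Minimal CUSE UMD sketch; link with libfuse and run as root. */
#define FUSE_USE_VERSION 29
#include <cuse_lowlevel.h>

static void mydev_open(fuse_req_t req, struct fuse_file_info *fi) {
    fuse_reply_open(req, fi);              /* accept every open() */
}

static void mydev_read(fuse_req_t req, size_t size, off_t off,
                       struct fuse_file_info *fi) {
    (void)size; (void)off; (void)fi;
    fuse_reply_buf(req, "hello\n", 6);     /* answer read() with fixed data */
}

static const struct cuse_lowlevel_ops mydev_ops = {
    .open = mydev_open,
    .read = mydev_read,
    /* .write and .ioctl callbacks would go here */
};

int main(int argc, char** argv) {
    /* DEVNAME tells the kernel's CUSE module which /dev node to create */
    const char* dev_info_argv[] = { "DEVNAME=mydevice" };
    struct cuse_info ci = {
        .dev_info_argc = 1,
        .dev_info_argv = dev_info_argv,
        .flags = CUSE_UNRESTRICTED_IOCTL,
    };
    return cuse_lowlevel_main(argc, argv, &ci, &mydev_ops, NULL);
}
```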

CI Issues

As can be seen above, the CUSE UMD app must create a new device under /dev/. Creating new devices is unsupported in Docker when it runs in unprivileged mode (without --privileged). Even when Docker runs in privileged mode, the new device may be created but will not be visible to applications inside the Docker container (see moby/moby#27886).

This means that the CUSE UMD app cannot be started inside a Docker container (i.e., with our current CI infrastructure). The app must be started with root privileges (to create /dev/mydevice) but can drop them immediately afterwards.

There are two workarounds:

  • Start the CUSE UMD app on the host and propagate the created device via docker run --device=/dev/mydevice.

    • Pros: simple, no changes to the CI (except a one-time effort of installing this CUSE app on each CI machine).
    • Cons: each CUSE app modification needs to be installed manually on each CI machine.
  • Spawn a minimal Linux VM which will start the CUSE UMD app and then run the Gramine CUSE test.

    • Pros: CUSE UMD app can be easily modified.
    • Cons: complex:
      • Must enable the Docker container to spawn new VMs (probably --device=/dev/kvm is enough, but maybe some extra privileges are needed?).
      • Must get a Linux kernel image (bzImage); can copy it from the host, I guess.
      • Must create an initramfs: put in some minimal libraries (e.g. glibc), copy the Gramine installation, copy the CUSE test.
      • Must create an init script that starts the CUSE UMD app and runs Gramine.
      • Must install an SGX-enabled QEMU and spawn an SGX-enabled VM via the QEMU command line (with KVM support).
      • Must propagate and parse the results of the VM execution.
      • A starting point could be the "Exercise 1" from here: http://www.lockett.ca/linuxboot/linuxboot.html
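The VM route boils down to something like the following sketch (all paths and options are illustrative; an SGX-enabled QEMU would additionally need the SGX-related machine options):

```shell
# Sketch: boot a minimal kernel + initramfs under QEMU/KVM; the init script
# inside the initramfs starts the CUSE UMD app and runs the Gramine CUSE test.
qemu-system-x86_64 \
    -enable-kvm \
    -m 2G \
    -kernel /boot/vmlinuz-"$(uname -r)" \
    -initrd initramfs.cpio.gz \
    -append "console=ttyS0 quiet" \
    -nographic \
    | tee vm-output.log
# afterwards, parse vm-output.log for the test verdict
```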

mmap issues

A large part of the proposed communication with devices is shared memory (allocated via device-backed mmap()). But CUSE doesn't support mmap, so we won't be able to test shared memory. Recall that most ioctl() commands use pointers to such shared memory regions.

Interestingly, classic FUSE supports mmap() but doesn't support unrestricted IOCTLs. So we cannot emulate device communication with FUSE either (it would be weird anyway to use a custom filesystem as a device approximation).

This probably means that testing with a CUSE device is not an option for this use case.

UPDATE: On the other hand, I couldn't find any other framework or existing device that would be simple to use and would support mmap(). So maybe we use a home-grown CUSE device for ioctl() verification, and e.g. /dev/zero for mmap() verification. I also looked at existing dummy-GPU drivers like Xvfb and Xdummy, but they are way too complicated.

Misc references

About FUSE and CUSE:

About creating VMs:

@boryspoplawski
Contributor

boryspoplawski commented Feb 1, 2022

Even when Docker runs in privileged mode, the new device may be created but it will not be visible to application inside this Docker container (see moby/moby#27886).

Wouldn't it be enough to do mknod inside the container?

Start the CUSE UMD app on the host

That will be a nightmare to maintain. Realistically we have to expect some bugs, and then we'd be submitting a new app version manually each time? Ugh.

Spawn a minimal Linux VM which will start the CUSE UMD
complex

I disagree that it's complex; I've done it many times, and it's super easy. The only hard thing here is an SGX-enabled QEMU; I don't know the state of that.

This probably means that testing with a CUSE device is not an option for this use case.

Yeah, isn't mmap like half of the complexity we want to test (other being ioctl)? I would say CUSE is a no-go for us then...

What about creating a normal kernel module and using a VM? There are 2 "hard" parts about it:

  • qemu with SGX support - no idea what the state of it is
  • writing a kernel module - that wouldn't be much harder than using CUSE, would it?

I can help you with this.

@dimakuv
Author

dimakuv commented Feb 2, 2022

Yeah, isn't mmap like half of the complexity we want to test (other being ioctl)? I would say CUSE is a no-go for us then...
What about creating a normal kernel module and using a VM?

Yes, I agree. With all the limitations of the other approaches, it seems we should go for the obvious solution: a kernel module, which leads to using a VM to run it in.

@llly
Contributor

llly commented May 10, 2022

Can ioctl also support and passthrough TYPE_SOCK fd?
When Java code wants to retrieve the IP or MAC address, it uses the syscalls ioctl(sockfd, SIOCGIFCONF, &ifc) and ioctl(sockfd, SIOCGIFHWADDR, &ifreq). With current Gramine, I get java.net.SocketException: Function not implemented (ioctl(SIOCGIFCONF) failed) and java.net.SocketException: No such device (ioctl(SIOCGIFHWADDR) failed)

@dimakuv
Author

dimakuv commented May 10, 2022

Can ioctl also support and passthrough TYPE_SOCK fd?

What is TYPE_SOCK fd? If you mean normal TCP/IP sockets, then no, my IOCTL support is tailored for devices (like the ones under /dev/), not for generic objects like TCP/IP sockets.

I took a quick look at SIOCGIFCONF and SIOCGIFHWADDR ioctls. These are quite tricky to emulate inside Gramine. We definitely don't want to have a passthrough, because this goes against the "virtualize all hardware" philosophy of Gramine.

I think these ioctls must be emulated by the following:

  1. On startup, Gramine fetches all of the host's L2/L3 network addresses.
  2. Gramine chooses one of these network addresses (maybe the priority of which addresses to choose could be specified in the manifest file?).
  3. Gramine memorizes this pair of L2+L3 network addresses, sanitizes it, and keeps it in enclave memory.
  4. During runtime, when the LibOS emulates the SIOCGIFCONF and SIOCGIFHWADDR ioctls, the corresponding L2/L3 address is taken from enclave memory and returned to the application.

@mkow @boryspoplawski Does this sound right?

@boryspoplawski
Contributor

@dimakuv We definitely don't want one address only; there are cases where you want to use multiple interfaces (not to mention multiple addresses on one interface).

@dimakuv
Author

dimakuv commented Mar 9, 2023

I added P0 priority label simply because we're planning to add this feature to the next release of Gramine.
