
Communication with devices (shared memory, ioctl) #353

Closed
dimakuv opened this issue Jan 20, 2022 · 7 comments · Fixed by #827
Comments

@dimakuv

dimakuv commented Jan 20, 2022

This describes the current state of my "communication with devices" project.

This project is in the context of the Linux-SGX PAL, though it should be applicable to other TEEs as well. This project is trivially implementable in the Linux PAL.

Legend:

✔️ Done (merged to master)
🚧 In progress (usually has a PR open)
⭐ Next (usually will be unlocked by current "in progress")

Rationale

Intel SGX is a CPU-only technology. This project proposes a generic way to enable communication with host devices from within the SGX (Gramine) enclave. There are two key interfaces in CPU–device communication: (1) the device-backed mmap system call and (2) the ioctl system call.

The device-backed mmap system call allows creating memory regions that are shared between the SGX enclave and an arbitrary host device. The ioctl system call provides a generic interface for sending an arbitrary request to an arbitrary device, passing arbitrarily complex objects to the device's memory and back to CPU memory (RAM). Adding support for these two system calls, together with the standard and already-supported open, read, write, and close system calls on devices, is sufficient to enable interactions between the trusted SGX enclave and the host device.

The goal of this project is the enablement of communication between SGX enclaves and host devices. Protecting this communication from eavesdropping and other attacks is a non-goal of this project. Adding integrity checks, encryption, side-channel mitigations, etc. on top of the enabled insecure communication is the responsibility of the application.

This project was developed locally for about a year, with a particular focus on Direct Rendering Manager (DRM) workloads and interactions. Several interesting workloads were enabled and run adequately. Now it's time to upstream this work (except for device-specific hacks). The branch with the latest snapshot is here: b12f340 (danger! some of the code there consists of terrible hacks).

TODO: Add a link to the whitepaper that contains explanations, diagrams and pseudo-code. UPDATE: The whitepaper: https://arxiv.org/abs/2203.01813

Sub-projects

  • ✔️ Standard I/O operations via read()/write() (already implemented in Gramine)
  • 🚧 Shared untrusted memory via mmap() -- done in [LibOS,PAL/Linux-SGX] Add shared untrusted memory support #827
  • ✔️ Device-specific I/O operations via ioctl()
  • ⛔ Example with a CUSE device that uses all of the above: read/write/mmap/ioctl

Below is the approximate split of these sub-projects into series of PRs.

Example with CUSE device

Shared untrusted memory via mmap()

  • Introduce two non-overlapping address ranges: private range and shared range

    • For LibOS: private range is used for mmap(MAP_PRIVATE), shared range is used for mmap(MAP_SHARED)
    • For common PAL: existing g_pal_public_state.user_address becomes private_range, plus a new shared_range
    • For Linux PAL: private range is e.g. 0x0 - 1TB and shared range is e.g. 1TB - 2TB
    • For Linux-SGX PAL: private range is enclave memory and shared range is e.g. 1TB - 2TB of untrusted memory (actually, a 1TB-sized region allocated with host's mmap() to reserve this memory)
    • For Linux-SGX PAL: only allow shared range if sgx.insecure__enable_shared_range = "true"
    • Main changes in: shim_vma.c, shim_mmap.c, disallowing addr = NULL in PAL memory-alloc functions
  • Add device-backed mmap: mmap(..., <device-fd>, ...)

    • For LibOS: add device-backed support for mmap(), mprotect(), munmap()
    • For Linux PAL: add .mmap callback to the "device" PAL handle
    • For Linux-SGX PAL: add .mmap callback to the "device" PAL handle
  • Augment CUSE device example with mmap() tests

    • Test for mmap(MAP_SHARED, <cuse-device-fd>)
    • Test for mmap(MAP_PRIVATE, <cuse-device-fd>)
    • Negative test for opening with O_DIRECT and then mmapping -- this is not supported by current in-Linux-kernel FUSE/CUSE

Device-specific I/O operations via ioctl()

  • Add new PAL API: DkDeviceIoControl(handle, cmd, arg)

    • For LibOS: trivially call in shim_do_ioctl()
    • For Linux PAL: trivially call host-level ioctl()
    • For Linux-SGX PAL: add trivial (flat arg, no-data-nesting) ocall_ioctl()
    • Augment CUSE device with CUSE_UNRESTRICTED_IOCTL and ioctl(<dummy-cmd>, <flat-arg>)
  • Add first (simple) version of IOCTL struct definition to Linux-SGX PAL

    • Add sgx.ioctl_structs and sgx.allowed_ioctls
    • No caching
    • No boolean expressions (no onlyif)
    • Augment CUSE device with ioctl(<dummy-cmd2>, <complex-arg>)
  • Add final version of IOCTL struct definition to Linux-SGX PAL

    • No caching
    • Add boolean expressions (onlyif)
    • Augment CUSE device with ioctl(<dummy-cmd3>, <onlyif-arg>)
  • Add caching of recently-used IOCTL structs to the final version of the IOCTL struct definition in the Linux-SGX PAL

    • [ Performance optimization: add some loop in CUSE to manually check performance? ]
  • Add documentation for sgx.ioctl_structs and sgx.allowed_ioctls and the IOCTL struct definitions in the manifest
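For illustration, a manifest fragment using the sgx.ioctl_structs/sgx.allowed_ioctls scheme discussed above might look like the sketch below. The request code, struct name, and field layout are invented, and the exact key names are subject to the final documentation:

```toml
# Hypothetical sketch: allow one ioctl on a device and describe the memory
# layout of its argument so the PAL can copy it between enclave and host.
sgx.allowed_ioctls = [
  { request_code = 0xc0085901, struct = "dummy_flat_arg" },
]

sgx.ioctl_structs.dummy_flat_arg = [
  { size = 4, direction = "in" },   # 32-bit input field
  { size = 4, direction = "out" },  # 32-bit output field
]
```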

dimakuv self-assigned this Jan 20, 2022
@dimakuv
Author

dimakuv commented Feb 1, 2022

Example with CUSE device

Some more details on this sub-project.

How CUSE works

The Linux kernel exposes the /dev/cuse driver. The user-mode driver (UMD) is a normal C application that must be started with root privileges. This UMD calls the cuse_lowlevel_main() function to connect to /dev/cuse, and the Linux kernel's CUSE module creates a new device with the name requested by the UMD, e.g. /dev/mydevice. After this, the UMD may drop privileges. The UMD now listens on /dev/mydevice and gets client requests via the callbacks in struct cuse_lowlevel_ops = {.open, .read, .write, .ioctl}.

Now another (client) application can open the /dev/mydevice file (the client does not use /dev/cuse at all). The client can now perform normal read(), write(), ioctl() using this device fd.

The UMD program removes the /dev/mydevice file when it terminates.

Unfortunately, CUSE doesn't have the mmap() callback. It was added in 2011 but then was reverted in 2012.
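For reference, a minimal UMD skeleton against libfuse's CUSE API could look like the sketch below (not runnable here: it must be linked against libfuse and started as root so that /dev/cuse is accessible; the device name and the fixed read reply are made up):

```c
/* Minimal CUSE UMD sketch; link with libfuse and run as root. */
#define FUSE_USE_VERSION 29
#include <cuse_lowlevel.h>

static void mydev_open(fuse_req_t req, struct fuse_file_info *fi) {
    fuse_reply_open(req, fi);              /* accept every open() */
}

static void mydev_read(fuse_req_t req, size_t size, off_t off,
                       struct fuse_file_info *fi) {
    (void)size; (void)off; (void)fi;
    fuse_reply_buf(req, "hello\n", 6);     /* answer read() with fixed data */
}

static const struct cuse_lowlevel_ops mydev_ops = {
    .open = mydev_open,
    .read = mydev_read,
    /* .write and .ioctl callbacks would go here */
};

int main(int argc, char** argv) {
    /* DEVNAME tells the kernel's CUSE module which /dev node to create */
    const char* dev_info_argv[] = { "DEVNAME=mydevice" };
    struct cuse_info ci = {
        .dev_info_argc = 1,
        .dev_info_argv = dev_info_argv,
        .flags = CUSE_UNRESTRICTED_IOCTL,
    };
    return cuse_lowlevel_main(argc, argv, &ci, &mydev_ops, NULL);
}
```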

CI Issues

As can be seen above, the CUSE UMD app must create a new device under /dev/. Creating new devices is unsupported in Docker when it runs in unprivileged mode (without --privileged). Even when Docker runs in privileged mode, the new device may be created but will not be visible to applications inside the Docker container (see moby/moby#27886).

This means that the CUSE UMD app cannot be started inside a Docker container (i.e., with our current CI infrastructure). The app must be started with root privileges (to create /dev/mydevice) but can drop them immediately afterwards.

There are two workarounds:

  • Start the CUSE UMD app on the host and propagate the created device via docker run --device=/dev/mydevice.

    • Pros: simple, no changes to the CI (except a one-time effort of installing this CUSE app on each CI machine).
    • Cons: each CUSE app modification needs to be installed manually on each CI machine.
  • Spawn a minimal Linux VM which will start the CUSE UMD app and then run the Gramine CUSE test.

    • Pros: CUSE UMD app can be easily modified.
    • Cons: complex:
      • Must enable the Docker container to spawn new VMs (probably --device=/dev/kvm is enough, but maybe some extra privileges are needed?).
      • Must get a Linux kernel image (bzImage); can copy it from the host, I guess.
      • Must create an initramfs: put in some minimal libraries (e.g. glibc), copy the Gramine installation, copy the CUSE test.
      • Must create an init script that starts the CUSE UMD app and runs Gramine.
      • Must install an SGX-enabled QEMU and spawn an SGX-enabled VM via the QEMU command line (with KVM support).
      • Must propagate and parse the results of the VM execution.
      • A starting point could be the "Exercise 1" from here: http://www.lockett.ca/linuxboot/linuxboot.html
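The VM route boils down to something like the following sketch (all paths and options are illustrative; an SGX-enabled QEMU would additionally need the SGX-related machine options):

```shell
# Sketch: boot a minimal kernel + initramfs under QEMU/KVM; the init script
# inside the initramfs starts the CUSE UMD app and runs the Gramine CUSE test.
qemu-system-x86_64 \
    -enable-kvm \
    -m 2G \
    -kernel /boot/vmlinuz-"$(uname -r)" \
    -initrd initramfs.cpio.gz \
    -append "console=ttyS0 quiet" \
    -nographic \
    | tee vm-output.log
# afterwards, parse vm-output.log for the test verdict
```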

mmap issues

A large part of the proposed communication with devices is shared memory (allocated via device-backed mmap()). But CUSE doesn't support mmap, so we won't be able to test shared memory. Recall that most ioctl() commands use pointers to such shared memory regions.

Interestingly, classic FUSE supports mmap() but doesn't support unrestricted IOCTLs. So we cannot emulate device communication with FUSE either (it would be weird anyway to use a custom filesystem as a device approximation).

This probably means that testing with a CUSE device is not an option for this use case.

UPDATE: On the other hand, I couldn't find any other framework or existing device that would be simple to use and would support mmap(). So maybe we use a home-grown CUSE device for ioctl() verification, and e.g. /dev/zero for mmap() verification. I also looked at existing dummy-GPU drivers like Xvfb and Xdummy, but they are way too complicated.

Misc references

About FUSE and CUSE:

About creating VMs:

@boryspoplawski
Contributor

boryspoplawski commented Feb 1, 2022

Even when Docker runs in privileged mode, the new device may be created but it will not be visible to application inside this Docker container (see moby/moby#27886).

Wouldn't it be enough to do mknod inside the container?

Start the CUSE UMD app on the host

That will be a nightmare to maintain. Realistically we have to expect some bugs, and then we'd be submitting a new app version manually each time? Ugh.

Spawn a minimal Linux VM which will start the CUSE UMD
complex

I disagree that it's complex; I've done it many times, and it's super easy. The only hard thing here is an SGX-enabled QEMU; I don't know the state of that.

This probably means that testing with a CUSE device is not an option for this use case.

Yeah, isn't mmap like half of the complexity we want to test (other being ioctl)? I would say CUSE is a no-go for us then...

What about creating a normal kernel module and using a VM? There are 2 "hard" parts about it:

  • qemu with SGX support - no idea what the state of it is
  • writing a kernel module - that wouldn't be much harder than using CUSE, would it?

I can help you with this.

@dimakuv
Author

dimakuv commented Feb 2, 2022

Yeah, isn't mmap like half of the complexity we want to test (other being ioctl)? I would say CUSE is a no-go for us then...
What about creating a normal kernel module and using a VM?

Yes, I agree. With all the limitations of the other approaches, it seems we should go for the obvious solution: a kernel module, which leads to using a VM to run it in.

@llly
Contributor

llly commented May 10, 2022

Can ioctl also support and passthrough TYPE_SOCK fd?
When Java code wants to retrieve the IP or MAC address, it uses the syscalls ioctl(sockfd, SIOCGIFCONF, &ifc) and ioctl(sockfd, SIOCGIFHWADDR, &ifreq). With current Gramine, I get java.net.SocketException: Function not implemented (ioctl(SIOCGIFCONF) failed) and java.net.SocketException: No such device (ioctl(SIOCGIFHWADDR) failed)

@dimakuv
Author

dimakuv commented May 10, 2022

Can ioctl also support and passthrough TYPE_SOCK fd?

What is TYPE_SOCK fd? If you mean normal TCP/IP sockets, then no, my IOCTL support is tailored for devices (like the ones under /dev/), not for generic objects like TCP/IP sockets.

I took a quick look at SIOCGIFCONF and SIOCGIFHWADDR ioctls. These are quite tricky to emulate inside Gramine. We definitely don't want to have a passthrough, because this goes against the "virtualize all hardware" philosophy of Gramine.

I think these ioctls must be emulated by the following:

  1. On startup, Gramine fetches all of the host's L2/L3 network addresses.
  2. Gramine chooses one of these network addresses (maybe the priority of which addresses to choose could be specified in the manifest file?).
  3. Gramine memorizes this pair of L2+L3 network addresses, sanitizes it, and keeps it in enclave memory.
  4. During runtime, when the LibOS emulates the SIOCGIFCONF and SIOCGIFHWADDR ioctls, the corresponding L2/L3 address is taken from enclave memory and returned to the application.

@mkow @boryspoplawski Does this sound right?

@boryspoplawski
Contributor

@dimakuv We definitely don't want one address only; there are cases where you want to use multiple interfaces (not to mention multiple addresses on one interface).

@dimakuv
Author

dimakuv commented Mar 9, 2023

I added P0 priority label simply because we're planning to add this feature to the next release of Gramine.
