[LibOS] RFC: support for System-V semaphores #1248

Open
dimakuv opened this issue Mar 24, 2023 · 2 comments

dimakuv commented Mar 24, 2023

Description of the feature

System V semaphores (aka Sys-V semaphores) are a primitive for inter-process synchronization. They are not to be confused with newer POSIX semaphores.

Sys-V semaphores use four system calls: semget(), semop(), semctl() and semtimedop().

[ Just for the record, POSIX semaphores are implemented in user-space. ]
  • Glibc implements POSIX semaphores' API: sem_open(), sem_wait(), sem_post(), etc. See the man page for details.
  • POSIX semaphores open a shared-memory file under /dev/shm -- that's why POSIX semaphores are hard/impossible to implement in Gramine-SGX.
  • POSIX semaphores use futexes on shared-memory locations, that's why they are typically faster than Sys-V semaphores.
  • Also note that Sys-V semaphores are system-wide (any process can connect to any semaphore, if it has sufficient permissions), whereas POSIX semaphores are specific to a group of processes. That's why Sys-V semaphores have a problem of system-wide semaphore leaks: if some group of processes terminates without removing its semaphores, the semaphores stay available until the next Linux reboot (or manual cleanup of these orphaned semaphores).
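For orientation, here is a minimal sketch of the native Linux lifecycle of a Sys-V semaphore used as a binary lock: create, initialize, lock, unlock, remove. The helper names (sem_change, sysv_sem_lifecycle) are made up for this illustration and are not from Gramine. The explicit IPC_RMID at the end is what prevents the leak described above.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* On Linux the caller must define this union for semctl(). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Add `delta` to semaphore 0 of the set; blocks if the value would drop below zero. */
static int sem_change(int semid, short delta) {
    assert(delta != 0); /* sem_op == 0 means "wait for zero", a different operation */
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* Runs one lock/unlock cycle on a private binary semaphore.
 * Returns 0 on success and -1 if any step fails. */
int sysv_sem_lifecycle(void) {
    int ret = -1;
    union semun arg = { .val = 1 };

    /* A private set with a single semaphore, accessible by the owner only. */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;

    if (semctl(semid, 0, SETVAL, arg) < 0) /* start "unlocked" */
        goto out;
    if (sem_change(semid, -1) < 0 || semctl(semid, 0, GETVAL) != 0) /* lock */
        goto out;
    if (sem_change(semid, +1) < 0 || semctl(semid, 0, GETVAL) != 1) /* unlock */
        goto out;
    ret = 0;
out:
    /* Explicit removal: without it the set outlives the process. */
    if (semctl(semid, 0, IPC_RMID) < 0)
        ret = -1;
    return ret;
}
```

Calling sysv_sem_lifecycle() from a test program exercises three of the four syscalls that Gramine would have to emulate; semtimedop() is not covered.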

I will not describe how Sys-V semaphores work in this issue. Here are the links I found useful:

Here are some links to the Linux source code on Sys-V semaphores:

Gramine needs to implement

  • 4 syscalls: semget(), semop(), semctl(), semtimedop()
  • 2 pseudo-files:
    • /proc/sys/kernel/sem -- read-only, with hard-coded limits for semaphores:
    • /proc/sysvipc/sem -- read-only, shows all semaphores in the system:
      • we should not implement this file for now, as it is probably not used by real-world apps,
      • but implementation should be simple: ask the master process about all semaphores, the master process sends back the results, print these results in the format similar to Linux.
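A sketch of how the contents of /proc/sys/kernel/sem could be generated. On Linux the file holds four tab-separated numbers in the order SEMMSL, SEMMNS, SEMOPM, SEMMNI. The SKETCH_* constants and the function name are hypothetical; the values below mirror current Linux defaults and are only an assumption about what Gramine would hard-code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical hard-coded limits; the values mirror current Linux defaults. */
#define SKETCH_SEMMSL 32000      /* max semaphores per set */
#define SKETCH_SEMMNS 1024000000 /* max semaphores system-wide */
#define SKETCH_SEMOPM 500        /* max operations per semop() call */
#define SKETCH_SEMMNI 32000      /* max semaphore sets */

/* Writes the pseudo-file contents into buf.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 if the buffer is too small. */
int format_proc_sys_kernel_sem(char *buf, size_t size) {
    assert(buf != NULL);
    int len = snprintf(buf, size, "%d\t%d\t%d\t%d\n",
                       SKETCH_SEMMSL, SKETCH_SEMMNS, SKETCH_SEMOPM, SKETCH_SEMMNI);
    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```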

Proposed simplification 1

Ignore permissions and their checks in all 4 syscalls; we should set sem_perm.cuid, sem_perm.uid, sem_perm.cgid, sem_perm.gid, sem_perm.mode, but for simplicity we can skip verifying them during syscalls.

Well, it may be trivial to implement such checks. If it is, let's implement them immediately. But if it would require some changes in other parts of Gramine, I would leave it as future work.

Proposed simplification 2

The man page for semop() says this:

The calling thread catches a signal: the value of semzcnt is decremented and semop() fails, with errno set to EINTR.

In other words, if a signal arrives during a blocking semop(), then the syscall must mark this process as "not waiting anymore" and fail with EINTR. The first part is problematic: in the distributed logic of Gramine, this would require sending a special "not waiting anymore" message from this process to the leader process, then the leader process must decrement sem::semzcnt and send the acknowledgement message back... This problem is non-trivial, and it is similar to this issue: #12

So, we should not implement this logic for now. This essentially renders all semaphore operations non-interruptible.

Also, this limitation will most probably affect the semtimedop() syscall -- the timeout will probably be useless, because we won't be able to "undo" the operations if the timeout is triggered.
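To make the bookkeeping concrete, here is a small single-process model of what the leader would have to track if interruptible waits were implemented. All names (sem_wait_counts_sketch, the MSG_* values, leader_handle_wait_msg) are hypothetical and do not correspond to existing Gramine IPC messages; the return value stands in for the acknowledgement message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical leader-side counters for one semaphore (field names as in semctl(2)). */
struct sem_wait_counts_sketch {
    int semncnt; /* processes waiting for the value to increase */
    int semzcnt; /* processes waiting for the value to become zero */
};

/* Hypothetical message kinds sent from a client process to the leader. */
enum sem_wait_msg_sketch {
    MSG_WAIT_FOR_ZERO,        /* blocked on sem_op == 0 */
    MSG_WAIT_FOR_INCREASE,    /* blocked on sem_op < 0 */
    MSG_CANCEL_ZERO_WAIT,     /* signal arrived while waiting for zero */
    MSG_CANCEL_INCREASE_WAIT, /* signal arrived while waiting for an increase */
};

/* Returns 0 (the "ack") if the message was applied,
 * or -1 if a cancel arrives with no matching waiter registered. */
int leader_handle_wait_msg(struct sem_wait_counts_sketch *c, enum sem_wait_msg_sketch msg) {
    assert(c != NULL);
    switch (msg) {
        case MSG_WAIT_FOR_ZERO:
            c->semzcnt++;
            return 0;
        case MSG_WAIT_FOR_INCREASE:
            c->semncnt++;
            return 0;
        case MSG_CANCEL_ZERO_WAIT:
            if (c->semzcnt == 0)
                return -1;
            c->semzcnt--;
            return 0;
        case MSG_CANCEL_INCREASE_WAIT:
            if (c->semncnt == 0)
                return -1;
            c->semncnt--;
            return 0;
    }
    return -1;
}
```

The -1 path hints at the hard part: a cancel can cross on the wire with a wake-up that the leader has already sent, so the client cannot know whether to return EINTR or success until the acknowledgement arrives.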

Proposed simplification 3

The semantics of the SEM_UNDO flag are complicated, especially because SEM_UNDO metadata (semaphore adjustments) are per-process and are kept on execve() and clone(CLONE_SYSVSEM). This would necessitate a separate LibOS handle for each semaphore, and corresponding checkpoint-restore code in Gramine.

Even though the adjustments logic itself is pretty simple, I think we can silently ignore SEM_UNDO for now.

By the way, the Linux Programming Interface book also mentions that SEM_UNDO is not as useful as it may seem, and that applications shouldn't really rely on this flag (see the "Limitations of SEM_UNDO" section).

UPDATE 27. March 2023: Apache APR uses SEM_UNDO. Looks like we need to implement SEM_UNDO, but probably we can get away with just silently ignoring it.
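For reference, the adjustments logic can be modeled in a few lines. This is only a sketch: the struct, the fixed set size and the function names are hypothetical. It follows the semop(2) description, where every SEM_UNDO operation subtracts sem_op from the per-process semadj value, and on process exit Linux adds semadj back, lowering the semaphore only as far as zero.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NSEMS 4 /* hypothetical fixed set size, for the sketch only */

/* Hypothetical per-process undo record for one semaphore set. */
struct sem_undo_sketch {
    int semadj[SKETCH_NSEMS];
};

/* Record an operation that was already applied with SEM_UNDO set. */
void undo_record(struct sem_undo_sketch *un, size_t sem_num, int sem_op) {
    assert(sem_num < SKETCH_NSEMS);
    un->semadj[sem_num] -= sem_op;
}

/* Apply the accumulated adjustments when the process exits. */
void undo_apply_on_exit(const struct sem_undo_sketch *un, int semvals[SKETCH_NSEMS]) {
    for (size_t i = 0; i < SKETCH_NSEMS; i++) {
        int v = semvals[i] + un->semadj[i];
        /* Linux lowers the value as far as possible (to zero) instead of blocking. */
        semvals[i] = v < 0 ? 0 : v;
    }
}
```

The arithmetic is trivial; the cost is in where semadj lives, since it must follow the process across execve() and be shared under clone(CLONE_SYSVSEM).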

Random notes

  • References to Sys-V semaphores are shared, because the semaphores don't belong to any process. So they are preserved across fork and execve -- in the sense that semid can be accessed by any process. IIUC, Sys-V semaphores do not have a state in the process (other than SEM_UNDO adjustments which we don't plan to support yet), so there is nothing to checkpoint/restore in Gramine. Also note that semaphore IDs (semid integers) are not file descriptors, so they can't be used in e.g. poll().

  • The standards say: "Where multiple processes are trying to decrease a semaphore by the same amount, it is indeterminate which process will actually be permitted to perform the operation first." Gramine can rely on this to simplify iterating over waiting processes when some semaphore-set becomes available. It is the responsibility of the app to prevent starvation scenarios, not Gramine's.

  • Sys-V semaphores API is clumsy and overcomplicated. For example, almost all applications use simpler binary semaphores: the semaphore set is reduced to the size of 1, and the semaphore values can be only 0 and 1. I don't think we should do any simplifications based on this, but this is an interesting remark. (Also note that semaphore sets were designed like this to have atomic guarantees -- all semaphores in the set are either processed, or not at all.)

  • A long time ago, Gramine had an implementation of Sys-V semaphores that was very buggy and limited. See this commit: 356ae6e
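The all-or-nothing semantics of a semaphore set and the "any order is fine" wake-up rule can be sketched as a pure function over an array of semaphore values. Everything here (op_sketch, waiter_sketch, try_apply_ops, wake_waiters) is a hypothetical illustration, not Gramine code. Limits such as SEMVMX and flags such as IPC_NOWAIT are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sembuf. */
struct op_sketch {
    size_t sem_num;
    int sem_op;
};

/* Hypothetical record of one blocked semop() request. */
struct waiter_sketch {
    const struct op_sketch *ops;
    size_t nops;
    bool done;
};

/* Applies all ops in array order, or none of them.
 * Returns true on success; on failure semvals is left unchanged. */
bool try_apply_ops(int *semvals, size_t nsems, const struct op_sketch *ops, size_t nops) {
    for (size_t i = 0; i < nops; i++) {
        assert(ops[i].sem_num < nsems);
        int *val = &semvals[ops[i].sem_num];
        bool ok = ops[i].sem_op == 0 ? (*val == 0) : (*val + ops[i].sem_op >= 0);
        if (!ok) {
            /* Roll back the ops that were already applied. */
            while (i-- > 0)
                semvals[ops[i].sem_num] -= ops[i].sem_op;
            return false;
        }
        *val += ops[i].sem_op;
    }
    return true;
}

/* Rescans the waiters after semvals changed; returns how many were released.
 * The scan order is arbitrary, which the standard permits. */
size_t wake_waiters(int *semvals, size_t nsems, struct waiter_sketch *waiters, size_t nwaiters) {
    size_t woken = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < nwaiters; i++) {
            if (waiters[i].done)
                continue;
            if (try_apply_ops(semvals, nsems, waiters[i].ops, waiters[i].nops)) {
                waiters[i].done = true;
                woken++;
                progress = true; /* a released waiter may unblock others */
            }
        }
    }
    return woken;
}
```

With semaphore values {1, 0}, a waiter that needs to decrement both semaphores stays blocked and leaves the values untouched, while a waiter that only decrements the first one is released.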

Notes on implementation

The implementation must closely follow the one for POSIX file locks: gramineproject/graphene#2481

In particular, the following files should be modified/created:

  • libos/include/libos_sysv_sem.h -- new file with structs, enums, constants for Sys-V semaphores
  • libos/include/libos_ipc.h -- add new functions for Sys-V operations via IPC messages
  • libos/src/ipc/libos_ipc_worker.c -- add Sys-V specific callbacks
  • libos/src/ipc/libos_ipc_sysv_sem.c -- new file with glue code to transform from high-level ipc_sysv_sem_xxx() operations to Gramine IPC functionality like ipc_send_msg_and_get_response()
  • libos/src/sys/libos_sysv_sem.c -- new file with syscall implementations
  • libos/src/bookkeep/libos_sysv_sem.c -- new file with logic of the leader process (how it iterates through semaphores in the set and decides whether to allow or block requests)

Also tests should be added:

  • libos/test/ltp -- enable as many Sys-V semaphore tests as possible
  • libos/test/regression -- add one or two Sys-V semaphore tests: single-process and multi-process, testing:
    • IPC_PRIVATE
    • IPC_CREAT
    • IPC_EXCL
    • sem_otime (because of the synchronization trick between semget and semctl, we must implement this)
    • IPC_SET (at least for the trick with sem_otime above)
    • IPC_RMID
    • IPC_INFO
    • GET.../SET... operations in semctl()
    • IPC_NOWAIT
    • EIDRM error code
    • SEM_UNDO (if we implement it)
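A rough sketch of what the single-process regression test could check, written against native Linux behavior. The function name, the CHECK macro and the choice of key are assumptions for this illustration; a real Gramine test would follow the conventions of the existing regression suite. After IPC_RMID, the stale ID is rejected with EINVAL on current Linux; EIDRM is accepted too, to be safe.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* On Linux the caller must define this union for semctl(). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Record the failing line and jump to cleanup. */
#define CHECK(cond)                 \
    do {                            \
        if (!(cond)) {              \
            failed_line = __LINE__; \
            goto out;               \
        }                           \
    } while (0)

/* Returns 0 if all checks pass, otherwise the source line of the first failed check. */
int sysv_sem_flags_sketch(key_t key) {
    assert(key != IPC_PRIVATE); /* IPC_EXCL is only meaningful with a real key */

    int failed_line = 0;
    int semid = -1;
    int removed = -1;
    struct semid_ds ds;
    union semun arg = { .buf = &ds };
    struct sembuf inc = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    struct sembuf dec_nowait = { .sem_num = 1, .sem_op = -1, .sem_flg = IPC_NOWAIT };

    /* Remove a stale set left over from an earlier failed run, if any. */
    int stale = semget(key, 0, 0);
    if (stale >= 0)
        semctl(stale, 0, IPC_RMID);

    semid = semget(key, 2, IPC_CREAT | IPC_EXCL | 0600);
    CHECK(semid >= 0);

    /* IPC_EXCL: a second exclusive create on the same key must fail. */
    CHECK(semget(key, 2, IPC_CREAT | IPC_EXCL | 0600) == -1 && errno == EEXIST);
    /* A plain lookup returns the same ID. */
    CHECK(semget(key, 0, 0) == semid);

    /* sem_otime stays 0 until the first semop(). */
    CHECK(semctl(semid, 0, IPC_STAT, arg) == 0);
    CHECK(ds.sem_nsems == 2 && ds.sem_otime == 0);
    CHECK(semop(semid, &inc, 1) == 0);
    CHECK(semctl(semid, 0, IPC_STAT, arg) == 0);
    CHECK(ds.sem_otime != 0);

    /* IPC_NOWAIT: decrementing a zero semaphore fails instead of blocking
     * (Linux initializes new semaphores to 0). */
    CHECK(semop(semid, &dec_nowait, 1) == -1 && errno == EAGAIN);
    CHECK(semctl(semid, 0, GETVAL) == 1);

    /* IPC_RMID: afterwards the ID is no longer valid. */
    CHECK(semctl(semid, 0, IPC_RMID) == 0);
    removed = semid;
    semid = -1;
    CHECK(semctl(removed, 0, GETVAL) == -1 && (errno == EINVAL || errno == EIDRM));
out:
    if (semid >= 0)
        semctl(semid, 0, IPC_RMID);
    return failed_line;
}
```

The multi-process variant would additionally cover blocking semop() calls and the EIDRM error seen by a process that is blocked while another process removes the set.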

Random note: maybe the Sync Engine could be useful for Sys-V semaphores (but I doubt it): 0e75cad#diff-53c705e096c216c76af82a3affd8a17766e8539c58d1f8b1243ae44010cc74da

Why should Gramine implement it?

The main user is Apache web-server and its derivatives like Apache proxy.

  • Apache web-server and its plugins use Sys-V semaphores (e.g. google for APR_USE_SYSVSEM_SERIALIZE). In particular:
    • Apache httpd example (that we had for Gramine) uses them
    • Apache Beam with Flink uses them

On the other hand, Python's multiprocessing package unfortunately does not use Sys-V semaphores but instead uses POSIX semaphores. That's unfortunate, because implementing POSIX semaphores in Gramine/SGX would require allowing untrusted shared memory (/dev/shm), which will probably never happen...

dimakuv commented Apr 20, 2023

This was de-prioritized.

dimakuv commented Apr 17, 2024

Just a note:

On the other hand, Python's multiprocessing package unfortunately does not use Sys-V semaphores but instead uses POSIX semaphores. That's unfortunate, because implementing POSIX semaphores in Gramine/SGX would require allowing untrusted shared memory (/dev/shm), which will probably never happen...

Even though Gramine supports POSIX shared memory (/dev/shm) in an insecure way (see https://gramine.readthedocs.io/en/stable/manifest-syntax.html#untrusted-shared-memory-1), that support is restricted only to devices that use mmap. See e.g. this code:

struct libos_fs_ops shm_fs_ops = {
    .mount = shm_mount,
    /* .read and .write are intentionally not supported according to POSIX shared memory API. */
    .mmap = shm_mmap,
    .hstat = generic_inode_hstat,
    .truncate = generic_truncate,
};

On the other hand, POSIX semaphores want to use the write() and read() syscalls on shared memory. Example from Python's multiprocessing:

trace: ---- newfstatat(AT_FDCWD, "/dev/shm/sem.1ZQAk7", 0x1db576d230, 256) = -2
trace: ---- openat(AT_FDCWD, "/dev/shm/sem.1ZQAk7", O_RDWR|O_CREAT|O_EXCL|0x80000, 0600) = 0x4
trace: ---- write(4, 0x1db576d3e0, 0x20) ...
trace: ---- return from write(...) = -9
trace: ---- unlink("/dev/shm/sem.1ZQAk7") = 0x0

So, even the latest Gramine version with support for POSIX shared memory does not support POSIX semaphores.

Projects
Status: Backburner