[LibOS] RFC: support for System-V semaphores #1248

dimakuv · 2023-03-24T13:39:02Z

Description of the feature

System V semaphores (aka Sys-V semaphores) are a primitive for inter-process synchronization. They are not to be confused with newer POSIX semaphores.

Sys-V semaphores use four system calls:

semget(), currently unimplemented in Gramine
semop(), currently unimplemented in Gramine
semctl(), currently unimplemented in Gramine
semtimedop(), currently unimplemented in Gramine
These can be implemented on top of Gramine's IPC mechanism.
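
For orientation, here is a minimal sketch of how an application typically drives these syscalls on Linux. The function name `sysv_sem_roundtrip()` is made up for illustration, and `union semun` has to be defined by the caller, as semctl(2) requires.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* On Linux the caller must define this union itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

/* Hypothetical helper: creates a private set with one semaphore, sets it to 1,
 * decrements it with semop(), reads the value back and removes the set.
 * Returns the final semaphore value, or -1 on any error. */
int sysv_sem_roundtrip(void) {
    int semid = semget(IPC_PRIVATE, /*nsems=*/1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;

    int result = -1;
    union semun arg = { .val = 1 };
    if (semctl(semid, /*semnum=*/0, SETVAL, arg) == 0) {
        struct sembuf acquire = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
        if (semop(semid, &acquire, /*nsops=*/1) == 0)
            result = semctl(semid, 0, GETVAL);
    }

    /* Without IPC_RMID the set would stay around after the process exits. */
    semctl(semid, 0, IPC_RMID);
    return result;
}
```

Calling `sysv_sem_roundtrip()` on a native Linux host should return 0, because the single `semop()` takes the semaphore from 1 to 0.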

[ Just for the record, POSIX semaphores are implemented in user-space. ]

Glibc implements POSIX semaphores' API: sem_open(), sem_wait(), sem_post(), etc. See the man page for details.
POSIX semaphores open a shared-memory file under /dev/shm -- that's why POSIX semaphores are hard/impossible to implement in Gramine-SGX.
POSIX semaphores use futexes on shared-memory locations; that's why they are typically faster than Sys-V semaphores.
Also note that Sys-V semaphores are system-wide (any process can connect to any semaphore, if it has sufficient permissions), whereas POSIX semaphores are specific to a group of processes. That's why Sys-V semaphores have a problem of system-wide semaphore leaks: if some group of processes terminates without removing its semaphores, the semaphores stay available until the next Linux reboot (or until manual cleanup of these orphaned semaphores).

I will not describe how Sys-V semaphores work in this issue. Here are the links I found useful:

https://man7.org/linux/man-pages/man7/sysvipc.7.html
https://man7.org/linux/man-pages/man2/semget.2.html
https://man7.org/linux/man-pages/man2/semctl.2.html
https://man7.org/linux/man-pages/man2/semop.2.html
https://docs.oracle.com/cd/E19455-01/806-4750/svipc-65382/index.html -- okish overview
https://www.softprayog.in/programming/system-v-semaphores -- contains good example
https://tldp.org/LDP/lpg/node47.html#SECTION00743100000000000000 -- old but ok, see other pages as well
https://www.oreilly.com/library/view/the-linux-programming/9781593272203/xhtml/ch47.xhtml -- great thorough description, but requires subscription

Here are some links to the Linux source code on Sys-V semaphores:

Gramine needs to implement

4 syscalls: semget(), semop(), semctl(), semtimedop()
2 pseudo-files:
- /proc/sys/kernel/sem -- read-only, with hard-coded limits for semaphores:
  - see the /proc man page for details on these limits;
  - hard-code the same default values as in Linux sources.
- /proc/sysvipc/sem -- read-only, shows all semaphores in the system:
  - we should not implement this file for now, as it is probably not used by real-world apps,
  - but implementation should be simple: ask the master process about all semaphores, the master process sends back the results, print these results in the format similar to Linux.
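
A rough sketch of what the read-only /proc/sys/kernel/sem content could look like. The helper name and the `*_DEFAULT` macros are hypothetical; the values and the tab-separated layout are what I believe recent Linux kernels use (include/uapi/linux/sem.h), so they should be double-checked against the kernel sources before being hard-coded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed defaults, believed to match recent Linux kernels; verify before use. */
#define SEMMSL_DEFAULT 32000                             /* max semaphores per set */
#define SEMMNI_DEFAULT 32000                             /* max number of sets */
#define SEMMNS_DEFAULT (SEMMNI_DEFAULT * SEMMSL_DEFAULT) /* max semaphores system-wide */
#define SEMOPM_DEFAULT 500                               /* max operations per semop() */

/* Hypothetical helper: fills `buf` with the pseudo-file contents in the order
 * documented in proc(5): SEMMSL, SEMMNS, SEMOPM, SEMMNI.
 * Returns the string length, or -1 if `buf` is too small. */
int format_proc_sys_kernel_sem(char* buf, size_t size) {
    int ret = snprintf(buf, size, "%d\t%d\t%d\t%d\n", SEMMSL_DEFAULT, SEMMNS_DEFAULT,
                       SEMOPM_DEFAULT, SEMMNI_DEFAULT);
    if (ret < 0 || (size_t)ret >= size)
        return -1;
    return ret;
}
```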

Proposed simplification 1

Ignore permissions and their checks in all 4 syscalls; we should set sem_perm.cuid, sem_perm.uid, sem_perm.cgid, sem_perm.gid, sem_perm.mode, but for simplicity we can ignore their verifications during syscalls.

Well, it may be trivial to implement such checks. If it is, let's implement them immediately. But if it would require some changes in other parts of Gramine, I would leave it as future work.
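
To give a feeling for how small such a check could be, here is a sketch modeled loosely on the classic owner/group/other logic. The struct and function names are invented for this example, and capabilities (CAP_IPC_OWNER) and supplementary groups are deliberately left out, so this is not a full replica of the Linux check.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, trimmed-down stand-in for the sem_perm fields named above. */
struct sem_perm_sketch {
    uid_t uid, cuid; /* owner and creator user IDs */
    gid_t gid, cgid; /* owner and creator group IDs */
    mode_t mode;     /* lower 9 bits: owner/group/other permissions */
};

/* Returns true if a caller with (uid, gid) may access the set with the wanted
 * bits (04 = read, 02 = alter). Capabilities and supplementary groups are
 * ignored in this sketch. */
bool sem_perm_allows(const struct sem_perm_sketch* perm, uid_t uid, gid_t gid, mode_t want) {
    assert((want & ~(mode_t)06) == 0);

    mode_t granted = perm->mode;
    if (uid == perm->uid || uid == perm->cuid)
        granted >>= 6; /* owner/creator class */
    else if (gid == perm->gid || gid == perm->cgid)
        granted >>= 3; /* group class */
    /* otherwise the "other" class: the lowest three bits as-is */

    return (want & ~granted & 07) == 0;
}
```

With mode 0640, the owner may both read and alter, a group member may only read, and everyone else is rejected.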

Proposed simplification 2

The man page for semop() says this:

The calling thread catches a signal: the value of semzcnt is decremented and semop() fails, with errno set to EINTR.

In other words, if a signal arrives during a blocking semop(), the syscall must mark this process as "not waiting anymore" and fail with EINTR. The first part is problematic: in the distributed logic of Gramine, this would require sending a special "not waiting anymore" message from this process to the leader process, then the leader process must decrement sem::semzcnt and send an acknowledgement message back... This problem is non-trivial, and it is similar to this issue: #12

So, we should not implement this logic for now. This essentially renders all semaphore operations non-interruptible.

Also, this limitation will most probably affect semtimedop() syscall -- the timeout will probably be useless, because we won't be able to "undo" the operations if timeout is triggered.
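
For reference, this is the native Linux behavior that the simplification would give up, shown as a small sketch. The helper name `timed_wait_for_zero()` is hypothetical. It blocks on a "wait for zero" operation, lets the timeout expire, and then reads semzcnt to show that the kernel rolled back the waiter count.

```c
#define _GNU_SOURCE /* semtimedop() is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <time.h>

/* On Linux the caller must define this union itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

/* Hypothetical helper: waits for a semaphore with value 1 to become zero, for at
 * most `timeout_ms`. Returns the errno of the blocked operation (EAGAIN on
 * timeout, EINTR if a signal handler ran), 0 if it unexpectedly succeeded, or
 * -1 on setup failure. `*semzcnt_after` receives semzcnt after the call. */
int timed_wait_for_zero(long timeout_ms, int* semzcnt_after) {
    assert(timeout_ms >= 0 && semzcnt_after);

    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;

    int result = -1;
    union semun arg = { .val = 1 };
    if (semctl(semid, 0, SETVAL, arg) == 0) {
        struct sembuf wait_for_zero = { .sem_num = 0, .sem_op = 0, .sem_flg = 0 };
        struct timespec timeout = {
            .tv_sec  = timeout_ms / 1000,
            .tv_nsec = (timeout_ms % 1000) * 1000000L,
        };
        int ret = semtimedop(semid, &wait_for_zero, 1, &timeout);
        result = ret < 0 ? errno : 0; /* capture errno before the next syscall */
        *semzcnt_after = semctl(semid, 0, GETZCNT);
    }

    semctl(semid, 0, IPC_RMID);
    return result;
}
```

On Linux, `timed_wait_for_zero(50, &z)` is expected to return EAGAIN with `z == 0`. Under the proposed simplification, Gramine would not be able to reproduce either the timeout or the semzcnt rollback.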

Proposed simplification 3

The semantics of the SEM_UNDO flag are complicated, especially because SEM_UNDO metadata (semaphore adjustments) are per-process and are kept on execve() and clone(CLONE_SYSVSEM). This would necessitate a separate LibOS handle for each semaphore, and corresponding checkpoint-restore code in Gramine.

Even though the adjustments logic itself is pretty simple, I think we can silently ignore SEM_UNDO for now.

By the way, the Linux Programming Interface book also mentions that SEM_UNDO is not as useful as it may seem, and applications shouldn't rely on this flag really (see Limitations of SEM_UNDO section).

UPDATE 27. March 2023: Apache APR uses SEM_UNDO. Looks like we need to implement SEM_UNDO, but probably we can get away with just silently ignoring it.
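
To illustrate why the adjustments logic itself is simple, here is a sketch of the bookkeeping in plain C. All names (`sem_set_sketch`, `sem_undo_sketch`, `apply_op_with_undo()`, `undo_on_exit()`) and the fixed set size are assumptions made for this example. The clamping at zero on exit follows the Linux behavior described in the semop(2) notes.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NSEMS 4 /* hypothetical fixed set size, to keep the sketch short */

/* Leader-side view of one semaphore set (only the values). */
struct sem_set_sketch {
    int semval[SKETCH_NSEMS];
};

/* Per-process adjustments for that set; this is the state that would have to
 * survive execve() and be shared under clone(CLONE_SYSVSEM). */
struct sem_undo_sketch {
    int semadj[SKETCH_NSEMS];
};

/* Applies an operation that was issued with SEM_UNDO and is already known not
 * to block: the value moves by sem_op, the adjustment by the opposite amount. */
void apply_op_with_undo(struct sem_set_sketch* set, struct sem_undo_sketch* undo,
                        size_t semnum, int sem_op) {
    assert(semnum < SKETCH_NSEMS);
    set->semval[semnum] += sem_op;
    undo->semadj[semnum] -= sem_op;
}

/* On process exit, adds every adjustment back to the semaphore value. If that
 * would make a value negative, it is clamped to zero instead of blocking. */
void undo_on_exit(struct sem_set_sketch* set, struct sem_undo_sketch* undo) {
    for (size_t i = 0; i < SKETCH_NSEMS; i++) {
        int restored = set->semval[i] + undo->semadj[i];
        set->semval[i] = restored < 0 ? 0 : restored;
        undo->semadj[i] = 0;
    }
}
```

The hard part is not this arithmetic but where `sem_undo_sketch` lives: it is per-process state, so it would need a LibOS handle and checkpoint-restore support.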

Random notes

References to Sys-V semaphores are shared, because the semaphores don't belong to any process. So they are preserved across fork and execve -- in the sense that semid can be accessed by any process. IIUC, Sys-V semaphores do not have a state in the process (other than SEM_UNDO adjustments which we don't plan to support yet), so there is nothing to checkpoint/restore in Gramine. Also note that semaphore IDs (semid integers) are not file descriptors, so they can't be used in e.g. poll().
The standards say: "Where multiple processes are trying to decrease a semaphore by the same amount, it is indeterminate which process will actually be permitted to perform the operation first." Gramine can rely on this to simplify iterating over waiting processes when some semaphore-set becomes available. It is the responsibility of the app to prevent starvation scenarios, not Gramine's.
Sys-V semaphores API is clumsy and overcomplicated. For example, almost all applications use simpler binary semaphores: the semaphore set is reduced to the size of 1, and the semaphore values can be only 0 and 1. I don't think we should do any simplifications based on this, but this is an interesting remark. (Also note that semaphore sets were designed like this to have atomic guarantees -- all semaphores in the set are either processed, or not at all.)
A long time ago, Gramine had an implementation of Sys-V semaphores that was very buggy and limited. See this commit: 356ae6e

Notes on implementation

The implementation must closely follow the one for POSIX file locks: gramineproject/graphene#2481

In particular, the following files should be modified/created:

libos/include/libos_sysv_sem.h -- new file with structs, enums, constants for Sys-V semaphores
libos/include/libos_ipc.h -- add new functions for Sys-V operations via IPC messages
libos/src/ipc/libos_ipc_worker.c -- add Sys-V specific callbacks
libos/src/ipc/libos_ipc_sysv_sem.c -- new file with glue code to transform from high-level ipc_sysv_sem_xxx() operations to Gramine IPC functionality like ipc_send_msg_and_get_response()
libos/src/sys/libos_sysv_sem.c -- new file with syscall implementations
libos/src/bookkeep/libos_sysv_sem.c -- new file with logic of the leader process (how it iterates through semaphores in the set and decides whether to allow or block requests)
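
The leader-side decision mentioned for libos/src/bookkeep/libos_sysv_sem.c could be sketched roughly as below. The function name `try_apply_sops()` and the plain `int` array of values are assumptions for illustration; flags such as IPC_NOWAIT and SEM_UNDO, as well as the SEMVMX upper bound, are ignored here. The point is the all-or-nothing guarantee: either every operation in the request is applied, or none is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/sem.h> /* struct sembuf */
#include <sys/types.h>

/* Hypothetical helper: applies all `nsops` operations to `semval` if every one
 * of them can proceed right now and returns true. Otherwise rolls back the
 * operations applied so far, leaves `semval` unchanged and returns false (the
 * caller would then queue the request or fail it with EAGAIN). */
bool try_apply_sops(int* semval, size_t nsems, const struct sembuf* sops, size_t nsops) {
    size_t i;
    for (i = 0; i < nsops; i++) {
        assert(sops[i].sem_num < nsems);
        int* val = &semval[sops[i].sem_num];

        if (sops[i].sem_op == 0) {
            if (*val != 0)
                break; /* "wait for zero" would block */
            continue;
        }
        if (*val + sops[i].sem_op < 0)
            break; /* decrement would block */
        *val += sops[i].sem_op;
    }

    if (i == nsops)
        return true;

    /* Undo operations [0, i) in reverse order; each one added sem_op. */
    while (i-- > 0)
        semval[sops[i].sem_num] -= sops[i].sem_op;
    return false;
}
```

When a semaphore value changes, the leader could re-run this check for each queued request in any order, since the order in which waiters are woken is left indeterminate.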

Also tests should be added:

libos/test/ltp -- enable as many Sys-V semaphore tests as possible
libos/test/regression -- add one or two Sys-V semaphore tests: single-process and multi-process, testing:
- IPC_PRIVATE
- IPC_CREAT
- IPC_EXCL
- sem_otime (because of the synchronization trick between semget and semctl, we must implement this)
- IPC_SET (at least for the trick with sem_otime above)
- IPC_RMID
- IPC_INFO
- GET.../SET... operations in semctl()
- IPC_NOWAIT
- EIDRM error code
- SEM_UNDO (if we implement it)
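
A possible starting point for the single-process regression test, covering a few of the items above (IPC_EXCL, sem_otime, IPC_NOWAIT, IPC_RMID). This is only a sketch: the function name and the pid-based key are made up, and the expectations reflect native Linux behavior as I understand it (in particular, a removed ID is reported as EINVAL or EIDRM, so both are accepted).

```c
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <unistd.h>

/* On Linux the caller must define this union itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

/* Hypothetical test body: returns 0 if all checks pass, otherwise the number
 * of the first failed step. */
int sysv_sem_regression_sketch(void) {
    /* Made-up key derivation; a real test might prefer ftok(). */
    key_t key = (key_t)(0x53000000 | (getpid() & 0xffffff));
    struct semid_ds ds;
    union semun arg;
    struct sembuf dec = { .sem_num = 0, .sem_op = -1, .sem_flg = IPC_NOWAIT };
    struct sembuf inc = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    int step = 0;

    int semid = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
    if (semid < 0)
        return 1;

    /* IPC_EXCL: creating the same key again must fail with EEXIST. */
    if (semget(key, 1, IPC_CREAT | IPC_EXCL | 0600) != -1 || errno != EEXIST) {
        step = 2;
        goto out;
    }
    /* sem_otime stays 0 until the first successful semop(). */
    arg.buf = &ds;
    if (semctl(semid, 0, IPC_STAT, arg) < 0 || ds.sem_otime != 0) {
        step = 3;
        goto out;
    }
    /* IPC_NOWAIT: decrementing a zero-valued semaphore fails with EAGAIN. */
    arg.val = 0;
    if (semctl(semid, 0, SETVAL, arg) < 0 || semop(semid, &dec, 1) != -1 || errno != EAGAIN) {
        step = 4;
        goto out;
    }
    /* A successful semop() must make sem_otime non-zero. */
    arg.buf = &ds;
    if (semop(semid, &inc, 1) < 0 || semctl(semid, 0, IPC_STAT, arg) < 0 || ds.sem_otime == 0)
        step = 5;
out:
    if (semctl(semid, 0, IPC_RMID) < 0 && step == 0)
        step = 6;
    /* After IPC_RMID the ID must no longer be usable. */
    if (step == 0 && (semctl(semid, 0, GETVAL) != -1 || (errno != EINVAL && errno != EIDRM)))
        step = 7;
    return step;
}
```

A multi-process variant would additionally fork a child that blocks in semop() and check that it fails with EIDRM once the parent removes the set.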

Random note: maybe the Sync Engine could be useful for Sys-V semaphores (but I doubt it): 0e75cad#diff-53c705e096c216c76af82a3affd8a17766e8539c58d1f8b1243ae44010cc74da

Why Gramine should implement it?

The main user is Apache web-server and its derivatives like Apache proxy.

Apache web-server and its plugins use Sys-V semaphores (e.g. google for APR_USE_SYSVSEM_SERIALIZE). In particular:
- Apache httpd example (that we had for Gramine) uses them
- Apache Beam with Flink uses them

On the other hand, Python's multiprocessing package unfortunately does not use Sys-V semaphores but instead uses POSIX semaphores. That's unfortunate, because implementing POSIX semaphores in Gramine/SGX would require allowing untrusted shared memory (/dev/shm), which will probably never happen...

See these references:
- multiprocessing: use SysV semaphores on FreeBSD python/cpython#54557
- https://github.com/python/cpython/blob/a87c46eab3c306b1c5b8a072b7b30ac2c50651c0/Modules/_multiprocessing/semaphore.c#L210-L219
This means that Sys-V semaphores will not fix the following issues:
- [Error:38]Function not implemented. multiprocessing in graphene graphene#2689
- error： _multiprocessing.SemLock( FileNotFoundError: [Errno 2] No such file or directory examples#33


dimakuv · 2023-04-20T07:27:52Z

This was de-prioritized.

dimakuv · 2024-04-17T11:50:49Z

Just a note:

On the other hand, Python's multiprocessing package unfortunately does not use Sys-V semaphores but instead uses POSIX semaphores. That's unfortunate, because implementing POSIX semaphores in Gramine/SGX would require allowing untrusted shared memory (/dev/shm), which will probably never happen...

Even though Gramine supports POSIX shared memory (/dev/shm) in an insecure way (see https://gramine.readthedocs.io/en/stable/manifest-syntax.html#untrusted-shared-memory-1), that support is restricted only to devices that use mmap. See e.g. this code:

gramine/libos/src/fs/shm/fs.c

Lines 171 to 177 in c59c041

    
    struct libos_fs_ops shm_fs_ops = {
        .mount      = shm_mount,
        /* .read and .write are intentionally not supported according to POSIX shared memory API. */
        .mmap       = shm_mmap,
        .hstat      = generic_inode_hstat,
        .truncate   = generic_truncate,
    };

On the other hand, POSIX semaphores want to use the write() and read() syscalls on shared memory. Example from Python's multiprocessing:

trace: ---- newfstatat(AT_FDCWD, "/dev/shm/sem.1ZQAk7", 0x1db576d230, 256) = -2
trace: ---- openat(AT_FDCWD, "/dev/shm/sem.1ZQAk7", O_RDWR|O_CREAT|O_EXCL|0x80000, 0600) = 0x4
trace: ---- write(4, 0x1db576d3e0, 0x20) ...
trace: ---- return from write(...) = -9
trace: ---- unlink("/dev/shm/sem.1ZQAk7") = 0x0

So, even the latest Gramine version with support for POSIX shared memory does not support POSIX semaphores.

dimakuv moved this to Backburner in Gramine Roadmap Apr 5, 2023

dimakuv added this to Gramine Roadmap Apr 5, 2023

dimakuv added feature request P: 2 labels Apr 20, 2023
