diff --git a/text/0009-lifetime-event_loop.md b/text/0009-lifetime-event_loop.md
new file mode 100644
index 0000000..1e542dd
--- /dev/null
+++ b/text/0009-lifetime-event_loop.md
@@ -0,0 +1,389 @@

- Feature Name: `lifetime-event_loop`
- Start Date: 2024-11-11
- RFC PR:
  [crystal-lang/rfcs#0009](https://github.com/crystal-lang/rfcs/pull/9)
- Issue:
  [crystal-lang/crystal#14996](https://github.com/crystal-lang/crystal/pull/14996)

# Summary

Integrate the Crystal event loop directly with the system selectors,
[`epoll`](https://linux.die.net/man/7/epoll) (Linux, Android) and
[`kqueue`](https://man.freebsd.org/cgi/man.cgi?kqueue) (BSDs, macOS), instead of
going through [`libevent`](https://libevent.org/).

# Motivation

Direct integration with the system selectors enables a more performant
implementation thanks to a design change: going from ad-hoc event subscriptions
to lifetime subscriptions per file descriptor (`fd`) reduces the overhead per
operation.

Dropping `libevent` also removes an external runtime dependency from Crystal
programs.

Finally, this prepares the event loop foundation for implementing execution
contexts from [RFC #0002](https://github.com/crystal-lang/rfcs/pull/2).

# Guide-level explanation

This new event loop driver builds on top of the event loop refactor from
[RFC #0007](./0007-event_loop-refactor.md). It plugs right into the runtime and
does not require any changes in user code, even for direct consumers of the
`Crystal::EventLoop` API.

The new event loop driver is enabled automatically on supported targets (see
[*Availability*](#availability)). The `libevent` driver serves as a fallback for
unsupported targets and can be selected explicitly with the compile-time flag
`-Devloop=libevent`.

### Design

The logic of the event loop doesn't change much from the one based on
`libevent`:

* We try to execute an operation (e.g. `read`, `write`, `connect`, ...) on
  nonblocking `fd`s.
* If the operation would block (`EAGAIN`), we create an event that references
  the operation along with the current fiber.
* We eventually rely on the polling system (`epoll` or `kqueue`) to report when
  an `fd` is ready, which dequeues a pending event and resumes its associated
  fiber (one at a time).

Unlike `libevent`, which adds and removes the `fd` to and from the polling
system on every blocking operation, the lifetime event loop driver adds every
`fd` _once_ and only removes it when closing the `fd`.

The argument is that being notified for readiness (which happens only once
thanks to edge-triggering) is less expensive than modifying the polling system
on every operation.

The mechanism only requires fine-grained locks (read, write) which are usually
uncontended, unless you share an `fd` for the same read or write operation
across multiple fibers.

> [!NOTE]
>
> On a purely IO-bound benchmark with long-running sockets, we noticed up to a
> 20% performance improvement. Real applications would see smaller improvements,
> though.
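The sketch below condenses that flow for a read. It is illustrative only, not
the runtime's actual code: the `wait_readable` block stands in for "record the
current fiber as a reader of this `fd` and suspend it until the selector reports
readiness", and error handling is reduced to the minimum.

```crystal
# Condensed sketch of the evented read path (not the actual runtime code).
# The fd was added to epoll/kqueue once and stays registered until it is
# closed; blocking only parks the current fiber.
def evented_read(fd : Int32, slice : Bytes, &wait_readable : ->) : Int32
  loop do
    ret = LibC.read(fd, slice.to_unsafe, LibC::SizeT.new(slice.size))
    return ret.to_i32 unless ret == -1

    if Errno.value == Errno::EAGAIN
      wait_readable.call # suspend; the event loop resumes us on readiness
    else
      raise IO::Error.from_errno("read")
    end
  end
end
```

The point of the lifetime design is that `wait_readable` never needs to call
`epoll_ctl` or `kevent`: the `fd` is registered only once, so the blocking path
reduces to recording the waiter and suspending the fiber.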
### Multi-Threading

> [!CAUTION]
>
> Moving file descriptors with pending operations between event loop instances
> (i.e. between threads with `-Dpreview_mt`) is an error. If you need this
> behaviour, it's best to stay with `libevent` via `-Devloop=libevent`.

The driver supports the multi-threading preview (`-Dpreview_mt`) with one event
loop instance per thread (a scheduler requirement).

Execution contexts ([RFC #0002](https://github.com/crystal-lang/rfcs/pull/2))
will have one event loop instance per context.

With multiple event loop instances, it's necessary to define what happens when
the same `fd` is used from multiple instances.

When an operation starts on an `fd` that is owned by a different event loop
instance, we transfer it: the new event loop becomes the sole "owner" of the
`fd`. This transfer happens implicitly.

This leads to a caveat: we can't have multiple fibers waiting for the same `fd`
in different event loop instances (i.e. different threads/contexts). Trying to
transfer the `fd` raises if there is a waiting fiber in the old event loop. This
is because an IO read/write can have a timeout which is registered in the
current event loop's timers, and timers aren't transferred. It also allows for
future enhancements (e.g. enqueues are _always_ local).

This can be an issue with `preview_mt`, for example when multiple fibers are
waiting for connections on a server socket. It shall be mitigated with execution
contexts, where an event loop instance is shared per context: just don't share
an `fd` between contexts.

### Availability

The lifetime event loop driver is supported on:

- Android
- FreeBSD
- Linux
- macOS

On these operating systems, it's enabled automatically by default. Unsupported
systems keep using `libevent`.

Compile-time flags allow choosing a different event loop driver:

- `-Devloop=libevent`: use `libevent`
- `-Devloop=epoll`: use `epoll` (e.g. on Solaris)
- `-Devloop=kqueue`: use `kqueue` (on *BSD)

The event loop drivers on Windows and WebAssembly (WASI) are not affected by
this change. The event loop driver for Windows already integrates directly with
the system selector (`IOCP`).

## Terminology

- **Event loop:** An abstraction to wait on specific events, such as timers
  (e.g. wait until a certain amount of time has passed) or IO (e.g. wait for a
  socket to be readable or writable).

- **Crystal event loop:** Component of the Crystal runtime that subscribes to
  and dispatches events from the OS and other sources to enqueue a waiting fiber
  when a desired event (such as IO availability) has occurred. It's an interface
  between Crystal's scheduler and the event loop driver.

- **Event loop driver:** Connects the event loop backend (`libevent`, `IOCP`,
  `epoll`, `kqueue`, `io_uring`) with the Crystal runtime.

- **System selector:** The system implementation of an event loop which the
  runtime component depends on to receive events from the operating system.
  Example implementations: `epoll`, `kqueue`, `IOCP`, `io_uring`. `libevent` is
  a cross-platform wrapper for system selectors.

- **Scheduler:** Manages the execution of fibers inside the program (controlled
  by Crystal), unlike threads, which are scheduled by the operating system
  (outside the program's control).

# Reference-level explanation

The run loop first waits on `epoll`/`kqueue`, canceling IO timeouts as it
resumes the ready fibers, then proceeds to process timers.

The `epoll_wait` and `kevent` syscalls can wait until a timeout is reached. The
event loop implementations don't use that timeout and instead wait forever
(blocking) or don't wait at all (nonblocking). There are a couple of reasons:

1. Epoll has a minimum timeout precision of 1ms, but what if the next expiring
   timer is in 10us?

2. In a single-threaded context we could set the timeout to the next expiring
   timer; that would work without MT as well as with `preview_mt`, but if we
   share the event loop instance between threads (which will be the case with
   execution contexts) we have a race condition: another thread can register a
   sooner-expiring timer just before we start waiting, and a mutex wouldn't
   solve it (we can't keep it locked until the syscall returns).

Instead, we use an extra `timerfd` (Linux) or `EVFILT_TIMER` (macOS, *BSD),
which have sub-millisecond precision (except on DragonFly BSD) and which we can
update in parallel when the next expiring timer has changed, using a mutex to
prevent parallel races.
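As an illustration of that mechanism, here is a minimal, Linux-only sketch
(assuming a 64-bit target). The `LibTimer` binding and the `TimerArm` class are
ad-hoc names for this example, not the runtime's `Crystal::System::TimerFD`: the
timer `fd` is created once, would be registered in `epoll` like any other `fd`,
and is re-armed under a mutex whenever the earliest timer changes.

```crystal
# Minimal binding for the two timerfd syscalls used below (Linux only).
lib LibTimer
  TFD_CLOEXEC = 0o2000000

  struct Itimerspec
    it_interval : LibC::Timespec # interval for periodic timers (zero = one-shot)
    it_value : LibC::Timespec    # first expiration (relative when flags == 0)
  end

  fun timerfd_create(clock_id : LibC::Int, flags : LibC::Int) : LibC::Int
  fun timerfd_settime(fd : LibC::Int, flags : LibC::Int,
                      new_value : Itimerspec*, old_value : Itimerspec*) : LibC::Int
end

class TimerArm
  @mutex = Mutex.new

  def initialize
    @fd = LibTimer.timerfd_create(LibC::CLOCK_MONOTONIC, LibTimer::TFD_CLOEXEC)
    raise RuntimeError.from_errno("timerfd_create") if @fd == -1
    # The timerfd would be added to epoll here, once, like any other fd.
  end

  # Re-arm the timerfd so that epoll_wait wakes up when the earliest timer is
  # due; `wake_in` is the relative delay until that timer.
  def arm(wake_in : Time::Span) : Nil
    value = LibC::Timespec.new(tv_sec: wake_in.to_i, tv_nsec: wake_in.nanoseconds.to_i64)
    spec = LibTimer::Itimerspec.new(it_value: value)

    # The mutex only guards concurrent re-arms; the wait itself is not locked.
    @mutex.synchronize do
      if LibTimer.timerfd_settime(@fd, 0, pointerof(spec), nil) == -1
        raise RuntimeError.from_errno("timerfd_settime")
      end
    end
  end
end

# Usage: wake the selector up in 250 microseconds.
# timer = TimerArm.new
# timer.arm(250.microseconds)
```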
## Components

Each syscall is abstracted in its own little struct: `Crystal::System::Epoll`,
`Crystal::System::TimerFD`, etc.

- The `Crystal::Evented` namespace (`src/crystal/system/unix/evented`) contains
  the base implementation that the system-specific drivers
  `Crystal::Epoll::EventLoop` (`src/crystal/system/unix/epoll`) and
  `Crystal::Kqueue::EventLoop` (`src/crystal/system/unix/kqueue`) are built on.
- `Crystal::Evented::Timers` is a basic data structure that keeps a list of
  timers (one instance per event loop).
- `Crystal::Evented::Event` holds the event, be it IO, sleep, select timeout, or
  IO with timeout, while `FiberEvent` wraps an `Event` for sleeps and select
  timeouts.
- `Crystal::Evented::PollDescriptor` is allocated in a Generational Arena and
  keeps the list of readers and writers (events/fibers waiting on IO). It takes
  advantage of the fact that the OS kernel guarantees `fd` numbers are unique,
  always reuses the numbers of closed `fd`s, and only hands out higher numbers
  when needed.

## Poll Descriptors

To avoid keeping pointers to IO objects, which could prevent the GC from
collecting lost IO objects, this proposal introduces *Poll Descriptor* objects
(the name comes from Go's netpoll) that keep the list of readers and writers and
don't point back to the IO object. The GC collecting an IO object is fine: the
finalizer will close the `fd` and tell the event loop to clean up the associated
*Poll Descriptor* (so we can safely reuse the `fd`).

To avoid pushing raw pointers into the kernel data structures, to quickly
retrieve the *Poll Descriptor* from a mere `fd`, and to avoid programming errors
that would segfault the program, this proposal introduces a *Generational Arena*
to store the *Poll Descriptors* (the name is inherited from Go's netpoll), so we
only store an index into the polling system. Another benefit is that we can
reuse the existing allocation when an `fd` is reused. If we try to retrieve an
outdated index (the allocation was freed or reallocated), the arena raises an
explicit exception.

> [!NOTE]
>
> The goals of the arena are:
>
> - avoid repeated allocations;
> - avoid polluting the IO object with the PollDescriptor (it doesn't exist in
>   other event loops);
> - avoid saving raw pointers into kernel data structures;
> - safely detect allocation issues instead of segfaulting because of raw
>   pointers.
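A toy version of such an arena, with hypothetical names, a fixed capacity, and
none of the real implementation's growth or thread-safety handling, could look
like this:

```crystal
# Toy generational arena: the selector only ever stores an `Arena::Index`
# (fd + generation), and a stale index raises instead of dereferencing freed
# memory. Illustrative only; names and behaviour are simplified.
class Arena(T)
  record Index, fd : Int32, generation : UInt32

  @slots : Array(T?)
  @generations : Array(UInt32)

  def initialize(capacity : Int32)
    @slots = Array(T?).new(capacity, nil)
    @generations = Array(UInt32).new(capacity, 0_u32)
  end

  # Allocate (or reuse) the slot for an fd; the generation changes on every
  # reuse so outdated indexes can be detected.
  def allocate(fd : Int32, value : T) : Index
    @generations[fd] &+= 1
    @slots[fd] = value
    Index.new(fd, @generations[fd])
  end

  def get(index : Index) : T
    raise "stale arena index" if @generations[index.fd] != index.generation
    value = @slots[index.fd]
    raise "stale arena index" if value.nil?
    value
  end

  def free(index : Index) : Nil
    @slots[index.fd] = nil if @generations[index.fd] == index.generation
  end
end

arena = Arena(String).new(1024)
idx = arena.allocate(7, "poll descriptor for fd 7")
arena.get(idx) # => "poll descriptor for fd 7"
arena.free(idx)

begin
  arena.get(idx)
rescue ex
  puts "detected stale index: #{ex.message}"
end
```

The runtime's arena additionally grows on demand; per the proposal, only such an
index (never a raw pointer) is stored into the polling system.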
The *Poll Descriptors* associate an `fd` to an event loop instance, so we can
still have multiple event loops per process, yet make sure that an `fd` is only
ever registered in one event loop. When an `fd` would block on another event
loop instance, it is transferred automatically (i.e. removed from the old
instance and added to the new one). The benefits are numerous: it avoids having
multiple event loops notified at the same time; it avoids having to close/remove
the `fd` from each event loop instance; and it avoids cross-event-loop enqueues,
which are much slower than local enqueues in execution contexts.

A limitation is that trying to move an `fd` from one event loop to another while
there are pending waiters raises an exception. We could move timeout events
along with the `fd` from one event loop instance to another, but that would
break the "always local enqueues" benefit.

Most applications shouldn't notice any impact from this design choice, since an
`fd` is usually not shared across fibers, to avoid concurrency issues. An
exception may be a server socket with multiple accepting fibers. In that case
you'll need to make sure the fibers run on the same thread (`preview_mt`) or in
the same execution context.

If this ever proves to be an issue, we can think of solutions, for example
migrating timeouts along with the `fd` or triaging external fibers from local
ones, at the expense of some performance.

# Drawbacks

* There's more code to maintain: instead of only the `libevent` driver we now
  have additional ones to care about.

# Rationale and alternatives

# Prior art

Golang's [netpoll](https://go.dev/src/runtime/netpoll.go) uses a similar design,
with the difference that Go only has a single event loop instance per process.

# Unresolved questions

## Timers and timeouts

We're missing a proper data structure to store timers and timeouts. It must be
thread safe and efficient. Ideas are a min-heap (4-heap) or a skip list. Timers
and timeouts may need to be handled separately.

This issue is blocking: an inefficient data structure wrecks performance for
timers and timeouts.

## Performance issues on BSDs

The `kqueue` driver is disabled on DragonFly BSD, OpenBSD and NetBSD due to
performance regressions:

- DragonFly BSD: Running `std_spec` is dramatically slower than with libevent.
  It regularly hangs on `event_loop.run` until the stack pool collector timeout
  kicks in (i.e. loops on 5s pauses).

- OpenBSD: Running `std_spec` is noticeably slower (4:15 minutes) compared to
  libevent (1:16 minutes). It appears that the main fiber keeps re-enqueueing
  itself from the event loop run (10us on each run).

- NetBSD: The event loop doesn't work: `kevent` returns `ENOENT` for the signal
  loop `fd` and `EINVAL` when trying to set an `EVFILT_TIMER`.

The `kqueue` driver works fine on FreeBSD and Darwin.

These issues are non-blocking. We can keep using `libevent` on these operating
systems until they are resolved.

# Future possibilities

* `aio` for async read/write on regular disk files with `kqueue`.
* Integrate more evented operations, such as `signalfd`/`EVFILT_SIGNAL` and
  `pidfd`/`EVFILT_PROC`.