- Feature Name: `lifetime-event_loop`
- Start Date: 2024-11-11
- RFC PR:
[crystal-lang/rfcs#0009](https://github.com/crystal-lang/rfcs/pull/9)
- Issue:
[crystal-lang/crystal#14996](https://github.com/crystal-lang/crystal/pull/14996)

# Summary

Integrate the Crystal event loop directly with system selectors,
[`epoll`](https://linux.die.net/man/7/epoll) (Linux, Android) and
[`kqueue`](https://man.freebsd.org/cgi/man.cgi?kqueue) (BSDs, macOS) instead of
going through [`libevent`](https://libevent.org/).

# Motivation

Direct integration with the system selectors enables a more performant
implementation thanks to a design change. Going from ad-hoc event subscriptions
to lifetime subscriptions per file descriptor (`fd`) reduces overhead per
operation.

Dropping `libevent` also removes an external runtime dependency from Crystal
programs.

This also prepares the event loop foundation for implementing execution contexts
from [RFC #0002](https://github.com/crystal-lang/rfcs/pull/2).

# Guide-level explanation

<!--
Explain the proposal as if it was already included in the language and you were
teaching it to another Crystal programmer. That generally means:

- Introducing new named concepts.
- Explaining the feature largely in terms of examples.
- Explaining how Crystal programmers should *think* about the feature, and how
it should impact the way they use Crystal. It should explain the impact as
concretely as possible.
- If applicable, provide sample error messages, deprecation warnings, or
migration guidance.
- If applicable, describe the differences between teaching this to existing
Crystal programmers and new Crystal programmers.
- If applicable, discuss how this impacts the ability to read, understand, and
maintain Crystal code. Code is read and modified far more often than written;
will the proposed feature make code easier to maintain?

For implementation-oriented RFCs (e.g. for compiler internals), this section
should focus on how compiler contributors should think about the change, and
give examples of its concrete impact. For policy RFCs, this section should
provide an example-driven introduction to the policy, and explain its impact in
concrete terms.
-->

This new event loop driver builds on top of the event loop refactor from [RFC
#0007](./0007-event_loop-refactor.md). It plugs right into the runtime and does
not require any changes in user code, even for direct consumers of the
`Crystal::EventLoop` API.

The new event loop driver is enabled automatically on supported targets (see
[*Availability*](#availability)). The `libevent` driver serves as a fallback for
unsupported targets and can be selected explicitly with the compile-time flag
`-Devloop=libevent`.

### Design

The logic of the event loop doesn't change much from the one based on
`libevent`:

* We try to execute an operation (e.g. `read`, `write`, `connect`, ...) on
nonblocking `fd`s.
* If the operation would block (`EAGAIN`), we create an event that references
the operation along with the current fiber.
* We eventually rely on the polling system (`epoll` or `kqueue`) to report when
an `fd` is ready, which will dequeue a pending event and resume its associated
fiber (one at a time), as sketched below.
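
The per-operation flow, as a minimal C-flavoured sketch of the mechanism (not
the actual Crystal code; `wait_readable` is a hypothetical placeholder for
suspending the current fiber until the event loop reports readiness):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical placeholder: park the current fiber until the selector
 * reports that `fd` is readable, then return. */
extern void wait_readable(int fd);

ssize_t evented_read(int fd, void *buf, size_t len) {
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                  /* completed (or EOF) without blocking */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                 /* real error */
        wait_readable(fd);             /* would block: wait for readiness, then retry */
    }
}
```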

Unlike `libevent`, which adds and removes the `fd` to and from the polling
system on every blocking operation, the lifetime event loop driver adds every
`fd` _once_ and only removes it when closing the `fd`.

The argument is that being notified of readiness (which happens only once
thanks to edge-triggering) is less expensive than constantly modifying the
polling system.
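
For example, with `epoll` the registration can happen exactly once per `fd`, in
edge-triggered mode for both directions. This is a sketch of the general
mechanism rather than the actual Crystal code; `arena_index` stands for the
Poll Descriptor index described in the reference-level explanation:

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Register the fd once, for read and write readiness, edge-triggered.
 * No epoll_ctl() call is needed per operation afterwards; the fd is only
 * removed (EPOLL_CTL_DEL) when it gets closed. */
int add_for_lifetime(int epfd, int fd, uint64_t arena_index) {
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
    ev.data.u64 = arena_index;  /* an arena index, not a raw pointer */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```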

The mechanism only requires fine-grained locks (one for read, one for write)
which are usually uncontended, unless you share an `fd` for the same read or
write operation across multiple fibers.

> [!NOTE]
>
> On a purely IO-bound benchmark with long-running sockets, we noticed up to a
> 20% performance improvement. Real applications would see smaller improvements,
> though.

### Multi-Threading

> [!CAUTION]
>
> Moving file descriptors with pending operations between event loop instances –
> i.e. between threads with `-Dpreview_mt` – is an error. If you need this
> behaviour, it's best to stay with `libevent` via `-Devloop=libevent`.

The new driver supports the multi-threading preview (`-Dpreview_mt`) with one
event loop instance per thread (a scheduler requirement).

Execution contexts ([RFC #0002](https://github.com/crystal-lang/rfcs/pull/2))
will have one event loop instance per context.

With multiple event loop instances, it's necessary to define what happens when
the same `fd` is used from multiple instances.

When an operation starts on an `fd` that is owned by a different event loop
instance, we transfer it. The new event loop becomes the sole "owner" of the
`fd`. This transfer happens implicitly.

This leads to a caveat: we can't have multiple fibers waiting for the same `fd`
in different event loop instances (i.e. threads/contexts). Trying to transfer
the `fd` raises if there is a waiting fiber in the old event loop. This is
because an IO read/write can have a timeout that is registered in the current
event loop's timers, and timers aren't transferred. This also allows for future
enhancements (e.g. enqueues are _always_ local).

This can be an issue for `preview_mt`, for example when multiple fibers are
waiting for connections on a server socket. This will be mitigated with
execution contexts, where an event loop instance is shared per context; just
don't share an `fd` between contexts.

### Availability

The lifetime event loop driver is supported on:

- Android
- FreeBSD
- Linux
- macOS

On these operating systems, it's enabled by default. Unsupported systems keep
using `libevent`.

Compile-time flags allow choosing a different event loop driver:

- `-Devloop=libevent`: use `libevent`
- `-Devloop=epoll`: use `epoll` (e.g. on Solaris)
- `-Devloop=kqueue`: use `kqueue` (on *BSD)

The event loop drivers on Windows and WebAssembly (WASI) are not affected by
this change. The event loop driver for Windows already integrates directly with
the system selector (`IOCP`).

## Terminology

- **Event loop**: an abstraction to wait on specific events, such as timers
(e.g. wait until a certain amount of time has passed) or IO (e.g. wait for a
socket to be readable or writable).

- **Crystal event loop:** Component of the Crystal runtime that subscribes to
and dispatches events from the OS and other sources to enqueue a waiting fiber
when a desired event (such as IO availability) has occurred. It’s an interface
between Crystal’s scheduler and the event loop driver.

- **Event loop driver:** Connects the event loop backend (`libevent`, `IOCP`,
`epoll`, `kqueue`, `io_uring`) with the Crystal runtime.

- **System selector:** The system implementation of an event loop which the
runtime component depends on to receive events from the operating system.
Example implementations: `epoll`, `kqueue`, `IOCP`, `io_uring`. `libevent` is
a cross-platform wrapper for system selectors.

- **Scheduler:** manages the execution of fibers inside the program (controlled
by Crystal), unlike threads, which are scheduled by the operating system
(outside the program's control).

# Reference-level explanation

<!--
This is the technical portion of the RFC. Explain the design in sufficient
detail that:

- Its interaction with other features is clear.
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.

The section should return to the examples given in the previous section, and
explain more fully how the detailed proposal makes those examples work.
-->

The run loop first waits on `epoll`/`kqueue`, canceling IO timeouts as it
resumes the ready fibers, then proceeds to process timers.
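
The overall shape of one iteration, as a hedged C-level sketch of the `epoll`
flavour (`dispatch` and `process_expired_timers` are placeholders for the
runtime's own logic, and the buffer size is arbitrary):

```c
#include <stdint.h>
#include <sys/epoll.h>

extern void dispatch(uint64_t arena_index, uint32_t events);  /* placeholder */
extern void process_expired_timers(void);                     /* placeholder */

void run_once(int epfd, int blocking) {
    struct epoll_event events[128];
    int timeout = blocking ? -1 : 0;   /* wait forever or don't wait at all (see below) */
    int n = epoll_wait(epfd, events, 128, timeout);

    for (int i = 0; i < n; i++) {
        /* resolve the Poll Descriptor from the stored arena index, cancel a
         * pending IO timeout if any, and resume one waiting reader/writer */
        dispatch(events[i].data.u64, events[i].events);
    }

    process_expired_timers();
}
```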

The `epoll_wait` and `kevent` syscalls can wait until a timeout is reached.
The event loop implementations don't use that timeout and instead either wait
forever (blocking) or don't wait at all (nonblocking). There are a couple of
reasons:

1. Epoll has a minimum timeout precision of 1ms, but what if the next expiring
timer is in 10us?

2. In a single-threaded context we could set the timeout to the next expiring
timer; that would work without MT as well as with `preview_mt`. But if we share
the event loop instance between threads (which will be the case with execution
contexts) we have a race condition: another thread can set a sooner-expiring
timer just before we'd wait, and a mutex wouldn't solve it (we can't keep it
locked until the syscall returns).

Instead, we use an extra `timerfd` (Linux) or `EVFILT_TIMER` (macOS, *BSD).
Both have sub-millisecond precision (except on DragonFly BSD), and we can
re-arm them from another thread whenever the next expiring timer changes, using
a mutex to prevent parallel races.
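
On Linux, for instance, re-arming boils down to a `timerfd_settime` call with
an absolute monotonic deadline, the `timerfd` itself being registered in the
`epoll` set like any other `fd` (a sketch, not the actual Crystal code):

```c
#include <sys/timerfd.h>
#include <time.h>

/* One timerfd per event loop instance, created once and added to epoll. */
int make_timerfd(void) {
    return timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
}

/* Re-arm to the absolute deadline of the next expiring timer (nanosecond
 * granularity); called under a mutex whenever the earliest timer changes. */
int arm_timerfd(int tfd, time_t deadline_sec, long deadline_nsec) {
    struct itimerspec its = {0};    /* zero interval: one-shot timer */
    its.it_value.tv_sec  = deadline_sec;
    its.it_value.tv_nsec = deadline_nsec;
    return timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL);
}
```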

## Components

Each syscall is abstracted in its own little struct: `Crystal::System::Epoll`,
`Crystal::System::TimerFD`, etc.

- `Crystal::Evented` namespace (`src/crystal/system/unix/evented`) contains the
base implementation that the system-specific drivers
`Crystal::Epoll::EventLoop` (`src/crystal/system/unix/epoll`) and
`Crystal::Kqueue::EventLoop` (`src/crystal/system/unix/kqueue`) are built on.
- `Crystal::Evented::Timers` is a basic data structure to keep a list of timers
(one instance per event loop).
- `Crystal::Evented::Event` holds an event, be it IO, sleep, select timeout,
or IO with timeout, while `FiberEvent` wraps an `Event` for sleeps and select
timeouts.
- `Crystal::Evented::PollDescriptor` is allocated in a Generational Arena and
keeps the list of readers and writers (events/fibers waiting on IO). It takes
advantage of the fact that the OS kernel guarantees unique `fd` numbers and
always reuses the numbers of closed `fd`s, only allocating higher numbers when
needed.

## Poll Descriptors

To avoid keeping pointers to the IO object that could prevent the GC from
collecting lost IO objects, this proposal introduces *Poll Descriptor* objects
(the name comes from Go's netpoll) that keep the list of readers and writers
and don't point back to the IO object. The GC collecting an IO object is fine:
the finalizer will close the `fd` and tell the event loop to clean up the
associated *Poll Descriptor* (so we can safely reuse the `fd`).
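
As an illustration, a Poll Descriptor boils down to something like the
following (a hedged sketch with illustrative names, not the actual
`Crystal::Evented::PollDescriptor` layout):

```c
#include <pthread.h>

struct event_loop;   /* the owning event loop instance */
struct waiters;      /* queue of events (operation + fiber) per direction */

struct poll_descriptor {
    struct event_loop *owner;       /* the single event loop this fd belongs to */
    pthread_mutex_t    read_lock;   /* fine-grained, usually uncontended */
    pthread_mutex_t    write_lock;
    struct waiters    *readers;     /* fibers waiting for readability */
    struct waiters    *writers;     /* fibers waiting for writability */
    /* deliberately no pointer back to the IO object, so the GC may collect it */
};
```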

To avoid pushing raw pointers into the kernel data structures, to quickly
retrieve the *Poll Descriptor* from a mere `fd`, and to avoid programming
errors that would segfault the program, this proposal introduces a
*Generational Arena* to store the *Poll Descriptors* (the name is inherited
from Go's netpoll), so we only store an index into the polling system. Another
benefit is that we can reuse the existing allocation when an `fd` is reused. If
we try to retrieve an outdated index (the allocation was freed or reallocated),
the arena raises an explicit exception.

> [!NOTE]
>
> The goals of the arena are:
> - avoid repeated allocations;
> - avoid polluting the IO object with the PollDescriptor (doesn't exist in
> other event loops);
> - avoid saving raw pointers into kernel data structures;
> - safely detect allocation issues instead of segfaults because of raw
> pointers.
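
A minimal sketch of the idea, assuming a 64-bit handle that packs a 32-bit
index and a 32-bit generation (illustrative only; the real arena grows on
demand and raises a Crystal exception on a stale handle):

```c
#include <stddef.h>
#include <stdint.h>

struct slot {
    uint32_t generation;  /* bumped every time the slot is freed */
    int      used;
    void    *value;       /* the Poll Descriptor */
};

struct arena {
    struct slot *slots;
    size_t       capacity;
};

/* The 64-bit handle stored in the kernel (e.g. epoll_event.data.u64). */
static inline uint64_t make_handle(uint32_t index, uint32_t generation) {
    return ((uint64_t)generation << 32) | index;
}

void *arena_get(const struct arena *a, uint64_t handle) {
    uint32_t index      = (uint32_t)handle;
    uint32_t generation = (uint32_t)(handle >> 32);
    if (index >= a->capacity)
        return NULL;                          /* out of bounds */
    const struct slot *s = &a->slots[index];
    if (!s->used || s->generation != generation)
        return NULL;                          /* stale handle: would raise in Crystal */
    return s->value;
}
```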

The *Poll Descriptors* associate an `fd` with an event loop instance, so we can
still have multiple event loops per process, yet make sure that an `fd` is only
ever in one event loop. When an `fd` would block on another event loop
instance, the `fd` is transferred automatically (i.e. removed from the old
instance and added to the new one). The benefits are numerous: it avoids having
multiple event loops notified at the same time; it avoids having to
close/remove the `fd` from each event loop instance; it avoids cross-event-loop
enqueues, which are much slower than local enqueues in execution contexts.

A limitation is that trying to move an `fd` from one event loop to another
while there are pending waiters will raise an exception. We could move timeout
events along with the `fd` from one event loop instance to another, but that
would also break the "always local enqueues" benefit.
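
The transfer roughly amounts to the following check (a sketch under assumed
names; the Poll Descriptor is simplified from the earlier sketch,
`selector_remove`/`selector_add` stand for `EPOLL_CTL_DEL`/`EPOLL_CTL_ADD` or
their `kevent` equivalents, and the `false` return is where the Crystal runtime
raises):

```c
#include <stdbool.h>
#include <stddef.h>

struct event_loop;

struct poll_descriptor {
    struct event_loop *owner;
    bool has_waiters;   /* any pending readers or writers */
};

extern void selector_remove(struct event_loop *loop, int fd);  /* placeholder */
extern void selector_add(struct event_loop *loop, int fd);     /* placeholder */

bool take_ownership(struct poll_descriptor *pd, struct event_loop *current, int fd) {
    if (pd->owner == current)
        return true;                    /* fast path: already ours */
    if (pd->has_waiters)
        return false;                   /* waiters (and their timeouts) can't be moved */
    if (pd->owner != NULL)
        selector_remove(pd->owner, fd); /* drop the fd from the old instance */
    selector_add(current, fd);          /* register it in the new instance */
    pd->owner = current;
    return true;
}
```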

Most applications shouldn't notice any impact from this design choice, since an
`fd` is usually not shared across fibers because of concurrency issues. An
exception may be a server socket with multiple accepting fibers. In that case
you'll need to make sure the fibers run on the same thread (`preview_mt`) or in
the same execution context.

If this ever proves to be an issue, we can think of solutions, for example
migrating timeouts along with the `fd`, triaging external fibers from local
ones, and more, of course at the expense of some performance.

# Drawbacks

<!--
Why should we *not* do this?
-->

* There's more code to maintain: instead of only the `libevent` driver, we now
have additional drivers to care about.

# Rationale and alternatives

<!--
- Why is this design the best in the space of possible designs?
- What other designs have been considered and what is the rationale for not
choosing them?
- What is the impact of not doing this?
- If this is a language proposal, could this be done in a library instead? Does
the proposed change make Crystal code easier or harder to read, understand,
and maintain?
-->

# Prior art

<!--
Discuss prior art, both the good and the bad, in relation to this proposal. A
few examples of what this can include are:

- For language, library, tools, and compiler proposals: Does this feature exist
in other programming languages and what experience have their community had?
- For community proposals: Is this done by some other community and what were
their experiences with it?
- Papers: Are there any published papers or great posts that discuss this? If
you have some relevant papers to refer to, this can serve as a more detailed
theoretical background.

This section is intended to encourage you as an author to think about the
lessons from other languages, provide readers of your RFC with a fuller picture.
If there is no prior art, that is fine - your ideas are interesting to us
whether they are brand new or if it is an adaptation from other languages.

Note that while precedent set by other languages is some motivation, it does not
on its own motivate an RFC. Please also take into consideration that Crystal
sometimes intentionally diverges from common language features.
-->

Golang's [netpoll](https://go.dev/src/runtime/netpoll.go) uses a similar design, with the difference that Go only has a single event loop instance per process.

# Unresolved questions

<!--
- What parts of the design do you expect to resolve through the RFC process
before this gets merged?*
- What parts of the design do you expect to resolve through the implementation
of this feature before stabilization?
- What related issues do you consider out of scope for this RFC that could be
addressed in the future independently of the solution that comes out of this
RFC?
-->

## Timers and timeouts

We're missing a proper data structure to store timers and timeouts. It must be
thread-safe and efficient. Ideas are a min-heap (e.g. a 4-heap) or a skip list.
Timers and timeouts may need to be handled separately.

This issue is blocking. An inefficient data structure wrecks performance for
timers and timeouts.
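
For reference, the 4-heap idea is a standard d-ary min-heap keyed by deadline.
A minimal, non-thread-safe sketch (the real structure also needs locking and
efficient cancellation):

```c
#include <stddef.h>
#include <stdint.h>

struct timer {
    uint64_t wake_at_ns;  /* absolute deadline; plus fiber/event data in practice */
};

static void swap_timers(struct timer *a, struct timer *b) {
    struct timer t = *a; *a = *b; *b = t;
}

/* Insert, sifting up towards the root (parent of i is (i - 1) / 4). */
void heap_push(struct timer *heap, size_t *count, struct timer value) {
    size_t i = (*count)++;
    heap[i] = value;
    while (i > 0) {
        size_t parent = (i - 1) / 4;
        if (heap[parent].wake_at_ns <= heap[i].wake_at_ns) break;
        swap_timers(&heap[parent], &heap[i]);
        i = parent;
    }
}

/* Remove the earliest timer, sifting the last element down (children of i
 * are 4i + 1 .. 4i + 4). Assumes *count > 0. */
struct timer heap_pop(struct timer *heap, size_t *count) {
    struct timer earliest = heap[0];
    heap[0] = heap[--(*count)];
    size_t i = 0;
    for (;;) {
        size_t best = i;
        for (size_t c = 4 * i + 1; c <= 4 * i + 4 && c < *count; c++)
            if (heap[c].wake_at_ns < heap[best].wake_at_ns) best = c;
        if (best == i) break;
        swap_timers(&heap[i], &heap[best]);
        i = best;
    }
    return earliest;
}
```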

## Performance issues on BSDs

The `kqueue` driver is disabled on DragonFly BSD, OpenBSD and NetBSD due to
performance regressions.

- DragonFly BSD: Running `std_spec` is eons slower than libevent. It regularly
hangs on `event_loop.run` until the stack pool collector timeout kicks in
(i.e. loops on 5s pauses).

- OpenBSD: Running `std_spec` is noticeably slower (4:15 minutes) compared to
libevent (1:16 minutes). It appears that the main fiber keeps re-enqueueing
itself from the event loop run (10us on each run).

- NetBSD: The event loop doesn't work with `kevent` returning `ENOENT` for the
signal loop `fd` and `EINVAL` when trying to set an `EVFILT_TIMER`.

The `kqueue` driver works fine on FreeBSD and Darwin.

These issues are non-blocking. We can keep using `libevent` on these operating
systems until resolved.

# Future possibilities

<!--
Think about what the natural extension and evolution of your proposal would be
and how it would affect the language and project as a whole in a holistic way.
Try to use this section as a tool to more fully consider all possible
interactions with the project and language in your proposal. Also consider how
this all fits into the roadmap for the project and of the relevant sub-team.

This is also a good place to "dump ideas", if they are out of scope for the RFC
you are writing but otherwise related.

If you have tried and cannot think of any future possibilities, you may simply
state that you cannot think of anything.

Note that having something written down in the future-possibilities section is
not a reason to accept the current or a future RFC; such notes should be in the
section on motivation or rationale in this or subsequent RFCs. The section
merely provides additional information.
-->

* Use `aio` for async reads/writes of regular disk files with `kqueue`.
* Integrate more evented operations, such as `signalfd`/`EVFILT_SIGNAL` and
`pidfd`/`EVFILT_PROC`.