From bca292a6a0a3638577dd0bbac52af9a7b65a8582 Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Fri, 22 Nov 2024 19:53:30 +0100
Subject: [PATCH 1/9] RFC: timers

---
 text/0000-timers.md | 310 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 310 insertions(+)
 create mode 100644 text/0000-timers.md

diff --git a/text/0000-timers.md b/text/0000-timers.md
new file mode 100644
index 0000000..7e4f65e
--- /dev/null
+++ b/text/0000-timers.md
@@ -0,0 +1,310 @@
+- Feature Name: `timers`
+- Start Date: 2024-11-22
+- RFC PR: [crystal-lang/rfcs#0000](https://github.com/crystal-lang/rfcs/pull/0000)
+- Issue: ...
+
+# Summary
+
+Determine a general interface and internal data structure to handle and store
+timers in the Crystal runtime.
+
+# Motivation
+
+With the Event Loop overhaul made possible by [RFC 7] and achieved in [RFC 9]
+where we remove the libevent dependency, that we already didn't use on Windows,
+we need to handle the correct execution of timers ourselves.
+
+We must handle timers, we must store them into efficient data structure(s), and
+we must suppor the following operations:
+
+- create a timer;
+- cancel a timer;
+- execute expired timers;
+- determine the next timer to expire, so we can decide for how long a process or
+  thread can be suspended (usually when there is nothing to do).
+
+The IOCP event loop currently uses an unordered `Deque`, and thus needs a simple
+O(1) operation to insert a time, but needs a linear scan to delete the timer and
+a full scan to decide the next expiring timer or to dequeue the expired timers.
+
+The Polling event-loop (wraps `epoll` and `kqueue`) uses an ordered `Deque` and
+needs a linear scan for insert and delete, but getting the next expiring timer
+and dequeueing the expired timers is O(1).
+
+This is far from efficient. We can do better.
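The trade-off between the two `Deque` strategies described in the Motivation can be sketched as follows. This is only an illustration of the complexity argument, not the actual IOCP or polling event loop code: `SketchTimer`, `UnorderedTimers` and `OrderedTimers` are hypothetical names, and only `add` and `next_expiring_at?` are shown.

```crystal
# Hypothetical stand-in for an event loop timer: only the absolute expiry time
# (monotonic clock) matters for this sketch.
record SketchTimer, id : Int32, wake_at : Time::Span

# Unordered storage: O(1) insert, but finding the next timer needs a full scan.
class UnorderedTimers
  def initialize
    @timers = Deque(SketchTimer).new
  end

  def add(timer : SketchTimer) : Nil
    @timers.push(timer) # O(1): append without looking at the other timers
  end

  def next_expiring_at? : Time::Span?
    @timers.min_of?(&.wake_at) # O(n): every timer must be visited
  end
end

# Ordered storage: O(n) insert, but the next timer is always the first entry.
class OrderedTimers
  def initialize
    @timers = Deque(SketchTimer).new
  end

  def add(timer : SketchTimer) : Nil
    # O(n): linear scan for the first timer that expires later than the new one
    index = @timers.index { |t| t.wake_at > timer.wake_at } || @timers.size
    @timers.insert(index, timer)
  end

  def next_expiring_at? : Time::Span?
    @timers.first?.try(&.wake_at) # O(1)
  end
end
```

Both classes answer the same question; they only differ in where the linear cost is paid, which is the inefficiency the RFC sets out to remove.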
+
+# Guide-level explanation
+
+First we emphasize that Crystal cannot be a realtime language (at least without
+dropping the whole stdlib) because it relies on a GC that can stop the world at
+any time and for a long time; the fiber schedulers also only reach to the event
+loop when there is nothing left to do. These necessarily **introduce latencies
+to the execution of expired timers**.
+
+We can categorize timers into two categories, that I shamelessly took from the
+[Hrtimers and Beyond: Transforming the Linux Time
+Subsystems](https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf)
+paper about the introduction of high resolution timers in the Linux kernel:
+
+1. **Timeouts**: Timeouts are used primarily to detect when an event (I/O
+   completion, for example) does not occur as expected. They have low resolution
+   requirements, and they are almost always removed before they actually expire.
+
+   In Crystal such a `timeout` may be created before every blocking read or
+   write IO operation (when configured on the IO object) or to handle the
+   timeout action of a `select` statement. They're usually cancelled once the IO
+   operation or a channel operation becomes ready; they may expire, that is
+   raise an `IO::Timeout` exception or execute the timeout branch of the
+   `select` action.
+
+   The low resolution is because timeouts are mostly about bookkeeping, to
+   eventually close a connection after some time has passed for example, so a
+   10s timeout running after 11s won't create issues.
+
+2. **Timers**: Timers are used to schedule ongoing events. They can have high
+   resolution requirements, and usually expire.
+
+   In crystal such a `timer` is created when we call `sleep(Time::Span)` or
+   `Fiber.yield` that behaves as a `sleep(0.seconds)`. There are no public API
+   to cancel a sleep, and they always expire.
+
+   The high resolution is because timers are expected to run at the scheduled
+   time.
+   As explained above this might be hard, but we can still try to avoid too much latency.
+
+Both categories share common traits:
+
+- fast `insert` operation (lower impact on performance, especially with
+  parallelism);
+- fast `get-min` operation (same as `insert` but less frequently called);
+- reasonably fast `delete-min` operation (only needed when processing expired
+  timers);
+
+However they differ in these other traits:
+
+1. Timeouts:
+
+   - low precision (some milliseconds is acceptable);
+   - fast `delete` operation (likely to be cancelled);
+   - must accomodate many timeouts at any given time (e.g. c10k problem).
+
+2. Timers (sleeps):
+
+   - high precision (sub-millisecond and below is desireable);
+   - no need for `delete` (never cancelled);
+   - more reasonable number of timers (**BOLD CLAIM TO BE VERIFIED**)
+
+These requirements can help us to shape which data structure(s) to choose.
+
+## Relative clock
+
+The relative clock to compare the time against. For example `libevent` uses the
+monotonic clock, and the other event loop implementations followed suits (AFAIK).
+
+This hasn't been an issue for the current usages in Crystal that always consider
+an interval from now (be it a timeout or a sleep).
+
+# Reference-level explanation
+
+> [!CAUTION]
+> This is a rough draft, asking more questions than providing answers!
+>
+> The technical definition will come and evolve as we experiment and refactor
+> the different event loops.
+>
+> For example the technical details of abstracting the interface to be usable
+> from different event loops lead to technical issues, notably around how to
+> define the individual `Timer` interface, its relationship with the event loop
+> `Event` actual object (e.g. struct pointer in the polling evloop), ...
+
+**TBD**: the general internal interface, for example (loosely shaped from the
+polling event loop, with different wording):
+
+```crystal
+# The type `T` must implement `#wake_at : Time::Span` and return the absolute
+# time at which a timer expires (monotonic clock).
+
+class Crystal::Timers(T)
+  # Schedules a timer. Returns true if it is the next timer to expire.
+  abstract def schedule(timer : T) : Bool
+
+  # Cancels a previously scheduled timer. Returns a tuple(deleted,
+  # was_next_timer_to_expire).
+  abstract def cancel(timer : T) : {Bool, Bool}
+
+  # Yields and dequeues expired timers to be executed (cancel timeout, resume
+  # fiber, ...).
+  abstract def dequeue_expired(& : T ->) : Nil
+
+  # Returns the absolute time at which the next expiring timer is scheduled at.
+  # Returns nil if there are no timers.
+  abstract def next_expiring_at? : Time::Span?
+end
+```
+
+## Data structure: min pairing heap
+
+A min-heap is a simple, fast and efficient tree data structure, that keeps the
+smaller value as the HEAD of the tree (the rest isn't ordered). This is enough
+for timers in general as we only really need to know about the next expiring
+timer, we don't need the list to be fully ordered.
+
+From the [wikipedia page](https://en.wikipedia.org/wiki/Pairing_heap): in
+practice a D-ary heap is always faster unless the `decrease-key` operation is
+needed, in which case the Pairing HEAP often becomes faster (even to supposedly
+more efficient algorithms, like the Fibonacci HEAP).
+
+An initial implementation (twopass algorithm, no auxiliary insert, intrusive
+nodes) led to to slighly faster `insert` time than a D-ary Heap (that needs more
+swaps) especially when timers come out of order, but a noticeably slower
+`delete-min` since it must rebalance the tree. The `delete` operation however
+quickly outperforms the 4-heap, even at low occupancy (a hundred timers) and
+never balloons.
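For readers unfamiliar with the structure, here is a minimal sketch of a two-pass pairing heap with intrusive nodes, in the spirit of the implementation described above. It is not the benchmarked implementation: `SketchNode` and `SketchPairingHeap` are hypothetical names, the node layout (`child`, `sibling`, `previous`) is one common choice among several, and thread safety is ignored.

```crystal
# Intrusive node: the links live in the timer itself, so `delete` needs no lookup.
class SketchNode
  getter wake_at : Time::Span
  property child : SketchNode?    # first child
  property sibling : SketchNode?  # next sibling
  property previous : SketchNode? # parent (when first child) or previous sibling

  def initialize(@wake_at : Time::Span)
  end
end

class SketchPairingHeap
  @head : SketchNode?

  # get-min: the head is always the next timer to expire.
  def first? : SketchNode?
    @head
  end

  # insert: a single comparison against the current head.
  def add(node : SketchNode) : Nil
    head = @head
    @head = head ? meld(head, node) : node
  end

  # delete-min: remove the head, then merge its children in two passes.
  def shift? : SketchNode?
    head = @head
    return unless head
    @head = detach_children(head)
    head
  end

  # delete: unlink the node, then merge its children back into the heap.
  def delete(node : SketchNode) : Nil
    head = @head
    return unless head
    if node.same?(head)
      shift?
    elsif prev = node.previous
      unlink(node, prev)
      if subtree = detach_children(node)
        @head = meld(head, subtree)
      end
    end
  end

  # Makes the later-expiring root the first child of the other root.
  private def meld(a : SketchNode, b : SketchNode) : SketchNode
    a, b = b, a if b.wake_at < a.wake_at
    if first_child = a.child
      first_child.previous = b
    end
    b.sibling = a.child
    b.previous = a
    a.child = b
    a.sibling = nil
    a.previous = nil
    a
  end

  private def detach_children(node : SketchNode) : SketchNode?
    root = merge_pairs(node.child)
    node.child = nil
    root.previous = nil if root
    root
  end

  private def merge_pairs(node : SketchNode?) : SketchNode?
    # First pass: meld siblings two by two, left to right, and chain the
    # resulting roots in reverse order through `previous`.
    tail = nil
    while a = node
      if b = a.sibling
        node = b.sibling
        a = meld(a, b)
      else
        node = nil
      end
      a.previous = tail
      tail = a
    end
    # Second pass: meld the chained roots, right to left.
    root = nil
    while a = tail
      tail = a.previous
      root = root ? meld(a, root) : a
    end
    root
  end

  private def unlink(node : SketchNode, prev : SketchNode) : Nil
    next_sibling = node.sibling
    if (first_child = prev.child) && first_child.same?(node)
      prev.child = next_sibling # node was the first child of its parent
    else
      prev.sibling = next_sibling # node had a sibling on its left
    end
    next_sibling.previous = prev if next_sibling
    node.previous = nil
    node.sibling = nil
  end
end
```

The sketch shows why `insert` and `delete` stay cheap (a few pointer updates) while `delete-min` carries the rebalancing cost: all the restructuring work is deferred to `merge_pairs`.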
+
+Despite the drawback on the `delete-min` operation, a benchmark using mixed
+operations (insert + delete-min, insert + delete) led the pairing heap to have
+the best overall performance. See the [benchmark
+results](https://gist.github.com/ysbaddaden/a5d98c88105ea58ba85f4db1ed814d70à)
+for more details.
+
+Since it performs well for timers (add / delete-min) and timeouts (add / delete
+and sometimes delete-min) as well I propose to use it to store both categories
+in a single data structure.
+
+Reference:
+
+- [Pairing Heaps: Experiments and Analysis](https://dl.acm.org/doi/pdf/10.1145/214748.214759)
+
+# Drawbacks
+
+TBD.
+
+# Rationale
+
+This is an initial proposal for a long term work to internally handle timers in
+the Crystal runtime. It aims to forge the path forward as we refactor the
+different event loops (`IOCP`, `Polling`), introduce new ones (`io_uring`), and
+as we evolve the public API interface.
+
+# Alternatives
+
+## Deque
+
+We could treat `Fiber.yield` and `sleep(0.seconds)` and by extension any already
+expired timer specifically with a push to a mere `Deque`: no need to keep these
+in an ordered data structure.
+
+## 4-heap (D-ary HEAP)
+
+A [D-ary HEAP] can be implemented as a flat array to take full advantage of CPU
+caches, and be binary or higher. Even at large occupancy (million timers) the
+overall performance is excellent... except for the `delete` operation that
+cannot benefit from the tree structure, and requires a linear scan. Performance
+quickly plummets at low to moderate occupancy (thousand timers) and becomes
+unbearable at higher occupancies.
+
+Aside from timeouts, timers (sleeps) could take advantage of this data structure
+since we can't cancel a sleep (so far).
+
+## Skip list
+
+An alternative to heaps is the [skip list](https://en.wikipedia.org/wiki/Skip_list)
+data structure. It's a simple doubly linked list but with multiple levels.
+The lowest level is the whole list, while the higher levels skip over more and more
+entries, leading to quick arbitrary lookups (from highest down to the lowest).
+
+While the `delete-min` has excellent performance, the increased cost of keeping
+the whole list ordered on every add/remove and creating and deleting multiple
+links reduces the overall performance compared to the pairing heap.
+
+## Non-cascading timer wheel
+
+> [!NOTE]
+> The concept is a total rip-off from the Linux kernel!
+> - [documentation](https://www.kernel.org/doc/html/latest/timers/highres.html)
+> - [LWN article](https://lwn.net/Articles/646950/) that explains the core idea;
+> - [implementation](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) (warning: GPL license!)
+
+The idea derives from the "hierarchical timing wheels" design. This is a ring
+(circular array) of N slots sub-divided into M slots where each individual slot
+represents a jiffy (or moment) with a specific precision (1ms, 4ms or 10ms for
+example). Each slot is a doubly linked list of events scheduled for the
+specified jiffy. Each M slots represent a wheel, with less precision the higher
+we climb up the wheels. When we process timers, we process the expired timers
+from the "last" processed slot up to the "current" slot.
+
+The usual disadvantage of hierarchical timer wheels is that whenever we loop on
+the initial wheel we must cascade down the timers from the upper wheel into the
+lower wheel. This can lead to multiple cascades in a row.
+
+The trick is to skip the cascade altogether. This means losing precision (the
+farther in the future the larger the delta), which is unacceptable for timers,
+but for timeouts? They're usually cancelled and we don't need to run precisely
+at the scheduled time, we just need them to run.
+
+Example table from the current linux kernel (jiffies at 10ms precision, aka
+100HZ).
+The ring has 512 slots in total and can accomodate timers up to 15 days from now:
+
+    Level  Offset  Granularity         Range
+    0      0       10 ms               0 ms - 630 ms
+    1      64      80 ms               640 ms - 5110 ms (640ms - ~5s)
+    2      128     640 ms              5120 ms - 40950 ms (~5s - ~40s)
+    3      192     5120 ms (~5s)       40960 ms - 327670 ms (~40s - ~5m)
+    4      256     40960 ms (~40s)     327680 ms - 2621430 ms (~5m - ~43m)
+    5      320     327680 ms (~5m)     2621440 ms - 20971510 ms (~43m - ~5h)
+    6      384     2621440 ms (~43m)   20971520 ms - 167772150 ms (~5h - ~1d)
+    7      448     20971520 ms (~5h)   167772160 ms - 1342177270 ms (~1d - ~15d)
+
+The technical operations are:
+
+- `insert`: determine the slot (relative to the current slot), append (or
+  prepend) to the linked list;
+- `delete`: remove the timer from any linked list it may be in (no need to
+  lookup the timer);
+- `get-min`: the delta between the current and the first non empty slot (can be
+  sped up with a bitmap of (not)empty slots);
+- `delete-min`: process the linked list(s) as we advance the slot(s).
+
+Aside from deciding the slot, all these operations involve mere doubly linked
+list operations.
+
+**NOTE** I didn't test this solution, it currently sounds overkill; yet the
+overall simplicity makes it a good contender to the pairing heap for storing
+timeouts. In that case maybe a dual D-ary Heap for timers and a Timing Wheel for
+timeouts would be a better choice than the single Pairing Heap?
+
+# Prior art
+
+- `libevent` stores events with a timer into a min-heap, but it also keeps a
+  list of "common timeouts"... I didn't investigate what they mean by it
+  exactly.
+
+- Go stores all timers into a min-heap (4-ary) but allocates timers in the GC
+  HEAP and merely marks cancelled timers on delete. I didn't investigate how
+  it deals with the tombstones.
+
+- The Linux kernel keeps timeouts in a non cascading timing wheel, and timers in
+  a red-black tree. See the [hrtimers] page.
+
+# Unresolved questions
+
+TBD.
+
+# Future possibilities
+
+The monotonic relative clock can be an issue for timers that need to execute at
+a specific realtime, that is relative to the realtime clock. We might want to
+introduce an explicit `Timer` type that could expire once or at a defined
+interval, using different clocks (realtime, monotonic, boottime), as well as be
+absolute or relative to the current time.
+
+These would fall into the *timers* category, and change the requirements for
+them from "never cancelled" to "sometimes cancelled", though in practice it
+should probably be implemented using system timers, for example `timerfd` on
+Linux, `EVFILT_TIMER` on BSD, something else on Windows.
+
+[RFC 7]: https://github.com/crystal-lang/rfcs/pull/7
+[RFC 9]: https://github.com/crystal-lang/rfcs/pull/9
+[D-ary HEAP]: https://en.wikipedia.org/wiki/D-ary_heap
+[hrtimers]: https://www.kernel.org/doc/html/latest/timers/hrtimers.html

From fcaad286b47fda12ff7cb786683e8ce29b1b7f84 Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Fri, 22 Nov 2024 19:56:50 +0100
Subject: [PATCH 2/9] RFC 0000 -> RFC 0012

---
 text/{0000-timers.md => 0012-timers.md} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename text/{0000-timers.md => 0012-timers.md} (99%)

diff --git a/text/0000-timers.md b/text/0012-timers.md
similarity index 99%
rename from text/0000-timers.md
rename to text/0012-timers.md
index 7e4f65e..79057c2 100644
--- a/text/0000-timers.md
+++ b/text/0012-timers.md
@@ -1,6 +1,6 @@
 - Feature Name: `timers`
 - Start Date: 2024-11-22
-- RFC PR: [crystal-lang/rfcs#0000](https://github.com/crystal-lang/rfcs/pull/0000)
+- RFC PR: [crystal-lang/rfcs#12](https://github.com/crystal-lang/rfcs/pull/12)
 - Issue: ...

 # Summary

From 240d789c9d0656d433617211f6bf07751aa02f4d Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Sat, 23 Nov 2024 20:09:09 +0100
Subject: [PATCH 3/9] Update text/0012-timers.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Johannes Müller
---
 text/0012-timers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index 79057c2..bdb8d53 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -86,7 +86,7 @@ However they differ in these other traits:

    - low precision (some milliseconds is acceptable);
    - fast `delete` operation (likely to be cancelled);
-   - must accomodate many timeouts at any given time (e.g. c10k problem).
+   - must accomodate many timeouts at any given time (e.g. [c10k problem](https://en.wikipedia.org/wiki/C10k_problem)).

 2. Timers (sleeps):

From 576618f31956bd8a49e9bcc5ec6e3b4faf037c81 Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Sun, 1 Dec 2024 15:26:51 +0100
Subject: [PATCH 4/9] Fix: bad link to benchmarks...

Co-authored-by: Vlad Zarakovsky
---
 text/0012-timers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index bdb8d53..d3c3c37 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -164,7 +164,7 @@ never balloons.
 Despite the drawback on the `delete-min` operation, a benchmark using mixed
 operations (insert + delete-min, insert + delete) led the pairing heap to have
 the best overall performance. See the [benchmark
-results](https://gist.github.com/ysbaddaden/a5d98c88105ea58ba85f4db1ed814d70à)
+results](https://gist.github.com/ysbaddaden/a5d98c88105ea58ba85f4db1ed814d70)
 for more details.

 Since it performs well for timers (add / delete-min) and timeouts (add / delete

From 7ee1b15184555191e9d49f66c6177843f49fb8fe Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Fri, 21 Feb 2025 18:39:11 +0100
Subject: [PATCH 5/9] Update text/0012-timers.md

Co-authored-by: Vlad Zarakovsky
---
 text/0012-timers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index d3c3c37..e216328 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -15,7 +15,7 @@ where we remove the libevent dependency, that we already didn't use on Windows,
 we need to handle the correct execution of timers ourselves.

 We must handle timers, we must store them into efficient data structure(s), and
-we must suppor the following operations:
+we must support the following operations:

 - create a timer;
 - cancel a timer;

From 00a91171f99328800d7a7e6daef828c95e6f7407 Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Fri, 21 Feb 2025 18:43:09 +0100
Subject: [PATCH 6/9] Update text/0012-timers.md

---
 text/0012-timers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index e216328..b873067 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -145,7 +145,7 @@ end
 ## Data structure: min pairing heap

 A min-heap is a simple, fast and efficient tree data structure, that keeps the
-smaller value as the HEAD of the tree (the rest isn't ordered). This is enough
+smallest value as the HEAD of the tree (the rest isn't ordered). This is enough
 for timers in general as we only really need to know about the next expiring
 timer, we don't need the list to be fully ordered.

From 8ee766c611142210ac2eb96c55f0c20fd9c3896b Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Fri, 21 Feb 2025 18:45:08 +0100
Subject: [PATCH 7/9] Update text/0012-timers.md

---
 text/0012-timers.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index b873067..c766110 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -64,9 +64,8 @@ paper about the introduction of high resolution timers in the Linux kernel:
 2. **Timers**: Timers are used to schedule ongoing events. They can have high
    resolution requirements, and usually expire.

-   In crystal such a `timer` is created when we call `sleep(Time::Span)` or
-   `Fiber.yield` that behaves as a `sleep(0.seconds)`. There are no public API
-   to cancel a sleep, and they always expire.
+   In crystal such a `timer` is created when we call `sleep(Time::Span)`.
+   There is no public API to cancel a sleep, and they always expire.

    The high resolution is because timers are expected to run at the scheduled
    time. As explained above this might be hard, but we can still try to avoid

From 1ec077d24ede752f80574296fc85238091978bdd Mon Sep 17 00:00:00 2001
From: Julien Portalier
Date: Fri, 21 Feb 2025 18:46:09 +0100
Subject: [PATCH 8/9] Update text/0012-timers.md

---
 text/0012-timers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index c766110..6cfb950 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -91,7 +91,7 @@ However they differ in these other traits:

    - high precision (sub-millisecond and below is desireable);
    - no need for `delete` (never cancelled);
-   - more reasonable number of timers (**BOLD CLAIM TO BE VERIFIED**)
+   - more reasonable number of timers compared to timeouts (**BOLD CLAIM TO BE VERIFIED**)

 These requirements can help us to shape which data structure(s) to choose.

From 487463890b6fe20040b3d6bef0701064d753ba81 Mon Sep 17 00:00:00 2001
From: Johannes Müller
Date: Wed, 26 Feb 2025 00:07:27 +0100
Subject: [PATCH 9/9] Convert metadata into YAML frontmatter (see #3)

---
 text/0012-timers.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/text/0012-timers.md b/text/0012-timers.md
index 6cfb950..4fca075 100644
--- a/text/0012-timers.md
+++ b/text/0012-timers.md
@@ -1,7 +1,9 @@
-- Feature Name: `timers`
-- Start Date: 2024-11-22
-- RFC PR: [crystal-lang/rfcs#12](https://github.com/crystal-lang/rfcs/pull/12)
-- Issue: ...
+---
+Feature Name: timers
+Start Date: 2024-11-22
+RFC PR: https://github.com/crystal-lang/rfcs/pull/12
+Issue:
+---

 # Summary