
Add io_uring event loop#15634

Closed
ysbaddaden wants to merge 34 commits into crystal-lang:master from ysbaddaden:feature/io-uring

Conversation


@ysbaddaden ysbaddaden commented Apr 4, 2025

Initial attempt at writing an EventLoop backed by io_uring for Linux targets.

It requires different features that have been implemented in different versions of the kernel (I'm not sure exactly which), one of the most important being "don't drop any CQE as long as there is memory" 😅

It supports SQPOLL (implicit submissions) to further reduce the number of syscalls.

Unlike the polling EventLoop (libevent, epoll, kqueue) it's fully async, meaning that any attempt to read or write will yield the current fiber, even when there could have been something to read right away. Expect lots of fiber yields!

WARNING: THREAD UNSAFE! MT will segfault. Only try it with single thread 💣 💣 💣

PREREQUISITES:

TODO:

  • LibC bindings for all linux targets;
  • determine minimal linux kernel version => Linux 5.6+;
  • epoll fallback (old kernel, not compiled, disabled, ...);
  • GC finalizers: cancel operations on .remove_impl(FileDescriptor|Socket) so we don't leak fibers:
    • Can the IO object actually be collected? if there's a pending op, then there should be at least a reference to the IO object on the stack (we pass the IO object around).
    • At worst we can add a reference to the IO object to the Event object;
  • drop @timespecs from System::IoUring and put each LibC::Timespec on the Event object to have a stable pointer for SQPOLL submissions.

MT:

  • Challenge: each thread/scheduler needs its own SQ + CQ rings;
  • EC::Parallel shall create a ring for the first scheduler, then the other schedulers share its resources (IORING_SETUP_ATTACH_WQ);
  • Add [e]poll and wait on it instead of io_uring_enter (only when blocking?);
  • Register an eventfd per ring + add to [e]poll;
  • Add timerfd to handle select action timeouts + add to [e]poll;
  • close(IO::FileDescriptor) must cancel any read and any write operation on the IO::FileDescriptor on any ring (Socket doesn't need this: shutdown will do the job).

BONUSES (over Epoll):

  • only make syscalls when strictly necessary (full submission queue, empty completion queue) otherwise they're offloaded to the kernel threads;
  • disk file open, read and write are async;
  • close file/socket is async;
  • an always ready IO now yields the fiber.

DRAWBACK-ish:

  • unbuffered IO reads/writes will yield the fiber every time (e.g. socket << 'a' << 'b' will yield twice by default) instead of doing a direct syscall, so 🤔

FOLLOW UP:

  • moving open to the eventloop could make opening a fifo or character device asynchronous and fix the issue where the thread is blocked until another process also opens the fifo (regardless of the blocking arg);
  • push MORE things to the event-loop (open, fsync, fstat, mkdir, link, listen, bind, ...) 🤑

NOTES:

I didn't use liburing to avoid bringing an external dependency; plus, the io_uring_prep_* functions are inlined in the io_uring.h header and would have had to be rewritten anyway. The Crystal::System::IoUring struct does most of the job, then Crystal::EventLoop::IoUring directly fills the SQE.

Adding support for MT and EC involves a few hurdles. See #10740 (comment)

Abstraction of `io_uring` syscalls to create a ring, map the kernel
buffers into userspace, submit operations and iterate completions.

Also provides optional support for SQPOLL with proper wakeup of the SQ
thread when needed.

Since read can also fail with EINTR we may always have to retry...
We must submit operation chains in a single shot, that is update the SQ
tail shared with the kernel (sq_ktail) in a single STORE after
populating all the SQE to chain together. This led to an overhaul
refactor of the System::IoUring abstraction and the EventLoop::IoUring
async helpers.

Fixes the issue where CLOSE happens after ASYNC_CANCEL when closing a
file descriptor. Makes sure that LINK_TIMEOUT will always be correctly
registered to the previous READ, WRITE or POLL.
# one thread closing a fd won't interrupt reads or writes happening in
# other threads, for example a blocked read on a fifo will keep blocking,
# while close would have finished and closed the fd; we thus explicitly
# cancel any pending operations on the fd before we try to close

The close(2) manpage explicitly states that some systems interrupt any blocking read or write, but the Linux behavior is to not interrupt them 🙈

# TODO: we could check if tail changed and iterate more, until we reach the
# maximum iterations count
end


The following enums are only used to enhance Crystal.trace.


def interrupt : Nil
# the atomic makes sure we only write once (no need to write multiple times)
@eventfd.write(1) if @interrupted.test_and_set

This is broken: there is no @eventfd.


yxhuvud commented Apr 4, 2025

plus the io_uring_prep_* functions are inlined in the io_uring.h header, and would have had to be rewritten anyway.

Well, someone (cough) has already done that job, though I can definitely understand not wanting the extra dependency. That said, there is a flag that can be passed when building liburing that doesn't inline anything (thanks, Rust people!), but assuming that build to be available is optimistic, unless we build it ourselves.

(Will look at the actual code later, but a take on it that might perhaps inspire may be https://github.com/yxhuvud/nested_scheduler_io_uring_context/blob/main/src/nested_scheduler/io_uring_context.cr , which is a plugin to nested_scheduler to use io_uring. It is definitely broken in some aspects, not even counting the general shift of the codebase that has happened since the nested_scheduler set of monkeypatches worked. FWIW, the best part of nested_scheduler was how much it made it possible to clean up the specs.)

EDIT: Oh, and there is a nice io_uring Discord available if you want to bounce ideas with people. Some of your musings, like the close fd parts, may have good ideas or at least answers there: https://discord.gg/T9WqsqPZ


@yxhuvud yxhuvud left a comment


Neat.

Regarding TODO:

push MORE things to the event-loop (mkdir, listen, bind, ...) 🤑

And more importantly, the nonsocket file write, read, fsync, fstat etc that lives in FileDescriptor.

end

def finalize
close

A drain could perhaps be necessary. But perhaps it is like exit and flushing writes 🤷


@ysbaddaden ysbaddaden Apr 5, 2025


Let's err on the safe side and say no. We'll need the ability to drain a ring if we want to shut down an execution context or thread anyway.

IORING_FEAT_LINKED_FILE = 1_u32 << 12
IORING_FEAT_REG_REG_RING = 1_u32 << 13

IORING_OP_NOP = 0_u32

I find this weird, as the op field in the struct is a u8 and not a u32, but the weirdness is also present in liburing, so I guess it doesn't matter. The same confusion exists in SQE_FLAGS.

The age-old question of whether to mirror the C headers' structure or to use properly sized enums, I guess. Compare

enum Op : UInt8
  NOP
  READV
  ..
end

Which may be a bit less prone to copy-pasta errors, as long as the order is kept correct.

def delete_timer(event : Event*) : Nil
sqe = @ring.next_sqe
sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS

I have a hard time seeing this being enough requests to matter either way (though I am open to being shown wrong).

Isn't it easier to just put a user_data on it that triggers a nop? I used 0 for this. Or check the CQE result for the canceled result, if that is what you are trying to avoid?


It's really just a "don't even bother pushing a CQE I don't care about".


ysbaddaden commented Apr 5, 2025

And more importantly, the nonsocket file write, read, fsync, fstat etc that lives in FileDescriptor.

At least read and write are already async, but fsync and fstat aren't.


yxhuvud commented Apr 5, 2025

At least read and write are already

Ah, I got confused by the *_fully methods.

Said differently: we don't need the sys/uio header just to bring the
iovec struct because sys/socket shall define it.
# Call `io_uring_enter` syscall. Panics on EBADR (can't recover from lost
# CQE), returns -EINTR or -EBUSY, and raises on other errnos, otherwise
# returns the int returned by the syscall.
def enter(to_submit : UInt32 = 0, min_complete : UInt32 = 0, flags : UInt32 = 0) : Int32

By the way, one thing that has changed with regards to external libraries is that liburing now also provides a liburing-ffi.so in addition to the old .so. The old library had a million functions missing because they were declared static inline and therefore required a file with C shims to be usable; the new one drops the static inlines. This is due to Rust people wanting to use the library (they have mostly the same issues with static inline as we do, even if they are a bit further along). So it is a lot less of a hassle than it used to be.

So if we wanted we could exchange a whole lot of the complexity in this file in exchange for linking against liburing-ffi. This seems complex enough that it may be worth the trouble of extra dependencies.


Sadly that would require a specific version that won't be available in distributions for years to come, and it's not installed by default (unlike libc)... unless we build and distribute our own copy (nope).

Each prep method also calls a bunch of other prep methods, so we'd lose the benefit of inlining all the SQE setup.

It's also not that complex (setup is textbook boilerplate for example) though I admit I shall check the memory ordering of atomics again 😅

# FIXME: with threads and multiple rings, we'll need to know which rings
# have pending operations for the fd (which op/event for each ring) and
# tell the rings to cancel said ops (can't just say to cancel all ops for
# fd so we can close in parallel)

@ysbaddaden ysbaddaden Sep 9, 2025


MT: we can use IORING_OP_MSG_RING on the local ring (one for each ring) to generate a CQE for any ring and can pass some data, then the CQE can be processed on that thread to submit a SQE to cancel operations on the fd.

We "just" have to remember which rings have a pending operation, but with #16127 we'd have at most a single reader and a single writer (i.e. a couple ring fds) and not a dynamic list of ring fds 👍

The issue is that the CQE must be handled by the thread that owns the ring (it must submit a SQE), and we can't have any thread process all the CQEs any more for the whole EC (this is the current EC/EV design).


ysbaddaden commented Sep 11, 2025

Merged with master, with all the non-blocking changes: we no longer set O_NONBLOCK on the fd / sockfd.

I determined that the minimum required Linux kernel is 5.6+, which helped simplify a bit (we can expect IORING_OP_OPENAT and IORING_OP_SENDMSG, and we could also assume IORING_FEAT_SINGLE_MMAP).

I also just realized that I could remove the timespecs ring by having a timespec on the Event object directly (maybe replacing the Time::Span), and point the SQE to it. I'll do that next.

@ysbaddaden

Closing. I have a better implementation coming.



Successfully merging this pull request may close these issues.

Linux's IO_Uring interface (2x IO performance!)
