
Add io_uring event loop (linux) #16264

Open
ysbaddaden wants to merge 19 commits into crystal-lang:master from ysbaddaden:feature/io-uring

Conversation

@ysbaddaden
Collaborator

@ysbaddaden commented Oct 24, 2025

Implements an event loop that leverages io_uring on Linux targets.

Requirements

The event loop requires features that were added across different versions of the kernel. At a minimum, Linux 5.19 is required, while the recent Linux 6.13 is recommended. It is thus compatible with the Linux 6.1 SLTS kernel, but not with previous (S)LTS kernels.

The io_uring event loop is disabled by default. It must be enabled manually at compile time with the -Devloop=io_uring flag.

The SQPOLL feature is supported but disabled by default. It avoids syscalls on submissions & completions, which is very cool... but it uses a lot of CPU 🔥. It can be enabled at compile time with the IORING_SQ_THREAD_IDLE environment variable, which sets the idle time (in milliseconds) for the SQPOLL thread.

For example:

export IORING_SQ_THREAD_IDLE=200
crystal build app.cr -Devloop=io_uring

Implementation details

The basic implementation was straightforward. It's essentially an async framework: submit an operation, suspend the fiber, and resume it when the operation has completed.
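As a rough sketch of that pattern (all names below are hypothetical illustrations, not the actual API introduced by this PR):

```crystal
# Illustrative sketch only: Ring, Operation and the helper methods are
# hypothetical names, not the PR's actual API.
def async_read(ring : Ring, fd : Int32, buffer : Bytes) : Int32
  op = Operation.new(Fiber.current)  # remember which fiber to resume
  ring.submit(:read, fd, buffer, op) # push an SQE referencing the operation
  Fiber.suspend                      # park the fiber until the CQE arrives
  op.result                          # filled in by the completion handler
end
```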

This is also the second event loop that uses blocking IO (after IOCP on Windows), and the first one on UNIX.

The main issue is a Linux limitation: close(2) doesn't interrupt operations pending in the kernel, so we must, for example, shut down sockets and cancel pending operations on files ourselves.
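A hedged sketch of the consequence (the helper names are hypothetical; IORING_OP_ASYNC_CANCEL is the real opcode for cancelling a submitted operation):

```crystal
# Illustrative sketch: close(2) won't interrupt pending SQEs, so we tear
# them down first. FileDescriptor and the helpers are hypothetical names.
def evented_close(fd : FileDescriptor) : Nil
  if fd.socket?
    # shutdown(2) completes pending recv/send operations with an error
    fd.shutdown(read: true, write: true)
  else
    # files have no shutdown; submit IORING_OP_ASYNC_CANCEL for each
    # operation still in flight and wait for their CQEs
    cancel_pending_operations(fd)
  end
  fd.close
end
```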

Threads Support & Safety

The MT-safe implementation (preview_mt, execution_context) was much more complex. Unlike the other event loops, we can't have a single ring: it would require locking on every submission, which with multiple threads would create contention and would likely require syscalls (defeating the point). We thus need a ring per thread (sharing the same kernel resources).

There's thus a new API to register execution context schedulers to the event loop, so we can create/close rings as needed. Since a scheduler can shut down (e.g. after a resize down), the execution context must also drain its ring before the scheduler can stop: all the pending operations must have completed and all the pending fibers must have been enqueued.

We need cross-ring communication for a couple of scenarios: to interrupt a thread waiting on the event loop, and to cancel pending read/write file operations (the serial R/W of #16209 is required). At worst, this communication needs a lock on submit (which is avoided on Linux 6.13+). Unlike with a single ring, the lock should usually not be contended in practice (unless you open lots of files, read/write from many fibers to the same file, and close from whatever fiber).
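For example, cancelling an operation that lives on another thread's ring could look like this (hypothetical sketch with hypothetical names; the lock-free path available on newer kernels is elided):

```crystal
# Illustrative sketch: submitting into another thread's ring requires
# synchronizing on its submission queue. All names are hypothetical.
def remote_cancel(target : Ring, user_data : UInt64) : Nil
  target.sq_lock.synchronize do
    sqe = target.next_sqe
    sqe.opcode = :async_cancel # IORING_OP_ASYNC_CANCEL
    sqe.addr = user_data       # identifies the operation to cancel
    target.submit_and_notify   # may need a syscall to wake the target ring
  end
end
```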

Unlike the other event loops, there isn't a single system instance for the whole event loop (e.g. one epoll, kqueue or IOCP), and each scheduler is responsible for its own completion queue... which means we're back to the situation where a busy thread can block the runnable fibers in its completion queue while other threads might be starving. A busy thread can be running a CPU-bound fiber, or a pair of fibers that keep re-enqueueing each other.

To avoid this situation, once in a while, and every time a scheduler would otherwise wait on the event loop (starving), the event loop instead iterates the completion rings and tries to steal runnable fibers from the other threads. That requires a lock on each completion queue, which should also usually not be contended (it's only taken once in a while).
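In sketch form (hypothetical names throughout; try_lock mirrors the cq_trylock? helper quoted later in this thread):

```crystal
# Illustrative sketch: a starving scheduler scans the other rings instead
# of blocking on its own; try_lock keeps it from waiting behind the owner.
def steal_completions : Bool
  stolen = false
  each_ring do |ring|
    next unless ring.cq_lock.try_lock
    begin
      ring.each_completion do |cqe|
        enqueue(fiber_for(cqe)) # make the fiber runnable on this thread
        stolen = true
      end
    ensure
      ring.cq_lock.unlock
    end
  end
  stolen
end
```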

TODO

  • segfault with musl-libc when initializing STDERR (~STDERR:const_read; doesn't happen with glibc): it's actually raise trying to initialize STDERR, which depends on the evloop that isn't available yet.
  • replace ENV["IORING_SQ_THREAD_IDLE"] with -Dio_uring_sq_thread_idle=200 — let's use compile-time flag values!
  • Add a CI job to run std specs for *-linux-* with -Devloop=io_uring — that's a bunch more targets, but std specs are quick (and compiler specs are irrelevant).

MAYBE

  • consider a runtime IORING_SQ_THREAD_CPU ENV variable (sadly, it can't be changed after the ring is created).
  • consider a runtime IORING_SQ_THREAD_IDLE ENV variable (sadly, it can't be changed after the ring is created).

Obsoletes #15634
Depends on #16209
Closes #10740

# immediately if the CQ lock couldn't be acquired.
def cq_trylock?(&)
  {% if flag?(:execution_context) %}
    if @cq_lock.try_lock
Contributor

Meta: Would a try_synchronize method make sense?

@ysbaddaden
Collaborator Author

Fixed a bug with execution contexts, and rebased from master. Ready for review!

@zw963
Contributor

zw963 commented Jan 12, 2026

Wow, ysbaddaden is such an amazing developer!

@ysbaddaden
Collaborator Author

  • Rebased from master to bring the latest changes
  • No need to specify CLOCK_BOOTTIME since we use relative time
  • Fixed compatibility with Time::Instant
  • Fixed evloop interrupt from scheduler without an evloop (e.g. raw thread or isolated context that lazily creates an evloop)

if exc = done.receive
  raise exc
end
raise exception if exception
Collaborator Author

This helper seems to be dependent on fiber context switches. Weirdly it only fails on the OAuth2::Client specs.

With the patch, EC + io_uring passes but all other cases hang.
Without the patch, EC + io_uring fails but all other cases pass.

@ysbaddaden force-pushed the feature/io-uring branch 3 times, most recently from 653017e to 24d5e4a on January 27, 2026 at 10:42
Member

@straight-shoota left a comment


I must admit that I'm not super familiar with the details of io_uring and did not scrutinize all the implementations in depth. So there's a good chance I would miss a logic bug if there is one.
But the code looks good overall and it has been more or less working for months now. So I'm pretty confident about merging it. Exposing it more easily for people to try out with their code should help find any issues that might still be there somewhere.

This is amazing work 🚀

@ysbaddaden
Collaborator Author

Rebased from master to bring the refactored Linux CI workflow, remove the custom changes, and merely add the io_uring jobs to the test stdlib matrix (4 in total).

# which will block the current thread until the ring has been fully drained (all
# the SQE have completed), at which point it will be unregistered from the event
# loop that will nillify the entry in the rings array.
class Crystal::EventLoop::IoUring < Crystal::EventLoop
Collaborator Author

Suggested change
class Crystal::EventLoop::IoUring < Crystal::EventLoop
@[Experimental]
class Crystal::EventLoop::IoUring < Crystal::EventLoop

