
Add io_uring event loop#15634

Closed
ysbaddaden wants to merge 34 commits into crystal-lang:master from ysbaddaden:feature/io-uring

Conversation


@ysbaddaden ysbaddaden commented Apr 4, 2025

Initial attempt at writing an EventLoop backed by io_uring for Linux targets.

It requires different features that have been implemented in different versions of the kernel (I'm not sure exactly which), one of the most important being "don't drop any CQE as long as there is memory" 😅

It supports SQPOLL (implicit submissions) to further reduce the number of syscalls.

Unlike the polling EventLoop (libevent, epoll, kqueue) it's fully async, meaning that any attempt to read or write will yield the current fiber, even when there could have been something to read right away. Expect lots of fiber yields!

WARNING: THREAD UNSAFE! MT will segfault. Only try it with single thread 💣 💣 💣

PREREQUISITES:

TODO:

  • LibC bindings for all linux targets;
  • determine minimal linux kernel version => Linux 5.6+;
  • epoll fallback (old kernel, not compiled, disabled, ...);
  • GC finalizers: cancel operations on .remove_impl(FileDescriptor|Socket) so we don't leak fibers:
    • Can the IO object actually be collected? if there's a pending op, then there should be at least a reference to the IO object on the stack (we pass the IO object around).
    • At worst we can add a reference to the IO object to the Event object;
  • drop @timespecs from System::IoUring and put each LibC::Timespec on the Event object to have a stable pointer for SQPOLL submissions.

MT:

  • Challenge: each thread/scheduler needs its own SQ + CQ rings;
  • EC::Parallel shall create a ring for the first scheduler, then the other schedulers share its resources (IORING_SETUP_ATTACH_WQ);
  • Add [e]poll and wait on it instead of io_uring_enter (only when blocking?);
  • Register an eventfd per ring + add to [e]poll;
  • Add timerfd to handle select action timeouts + add to [e]poll;
  • close(IO::FileDescriptor) must cancel any read and any write operation on the IO::FileDescriptor on any ring (Socket doesn't need this: shutdown will do the job).

BONUSES (over Epoll):

  • only make syscalls when strictly necessary (full submission queue, empty completion queue) otherwise they're offloaded to the kernel threads;
  • disk file open, read and write are async;
  • close file/socket is async;
  • an always ready IO now yields the fiber.

DRAWBACK-ish:

  • unbuffered IO reads/writes will yield the fiber every time (e.g. socket << 'a' << 'b' will yield twice by default) instead of doing a direct syscall, so 🤔

FOLLOW UP:

  • moving open to the eventloop could make opening a fifo or character device asynchronous and fix the issue where the thread is blocked until another process also opens the fifo (regardless of the blocking arg);
  • push MORE things to the event-loop (open, fsync, fstat, mkdir, link, listen, bind, ...) 🤑

NOTES:

I didn't use liburing to avoid bringing an external dependency; plus, the io_uring_prep_* functions are inlined in the io_uring.h header and would have had to be rewritten anyway. The Crystal::System::IoUring struct does most of the job, then Crystal::EventLoop::IoUring directly fills the SQE.

Adding support for MT and EC involves a few hurdles. See #10740 (comment)

Abstraction of `io_uring` syscalls to create a ring, map the kernel
buffers into userspace, submit operations and iterate completions.

Also provides optional support for SQPOLL with proper wakeup of the SQ
thread when needed.

Since read can also fail with EINTR we may always have to retry...
We must submit operation chains in a single shot, that is update the SQ
tail shared with the kernel (sq_ktail) in a single STORE after
populating all the SQE to chain together. This led to an overhaul
refactor of the System::IoUring abstraction and the EventLoop::IoUring
async helpers.

Fixes the issue where CLOSE happens after ASYNC_CANCEL when closing a
file descriptor. Makes sure that LINK_TIMEOUT will always be correctly
registered to the previous READ, WRITE or POLL.
# one thread closing a fd won't interrupt reads or writes happening in
# other threads, for example a blocked read on a fifo will keep blocking,
# while close would have finished and closed the fd; we thus explicitly
# cancel any pending operations on the fd before we try to close

The close(2) manpage explicitly states that some systems interrupt any blocking read or write, but the Linux behavior is to not interrupt them 🙈

# TODO: we could check if tail changed and iterate more, until we reach the
# maximum iterations count
end


The following enums are only used to enhance Crystal.trace.


def interrupt : Nil
# the atomic makes sure we only write once (no need to write multiple times)
@eventfd.write(1) if @interrupted.test_and_set

This is broken: there is no @eventfd.


yxhuvud commented Apr 4, 2025

plus the io_uring_prep_* functions are inlined in the io_uring.h header, and would have had to be rewritten anyway.

Well, someone (cough) has already done that job, though I can definitely understand not wanting the extra dependency. That said, there is a flag that can be passed when building liburing that doesn't inline anything (thanks, Rust people!), but assuming that build to be available is optimistic, unless we build it ourselves.

(Will look at the actual code later, but a take on it that might perhaps inspire may be https://github.com/yxhuvud/nested_scheduler_io_uring_context/blob/main/src/nested_scheduler/io_uring_context.cr , which is a plugin to nested_scheduler to use io_uring. It is definitely broken in some aspects, not even counting the general shift of the codebase that has happened since the nested_scheduler set of monkeypatches worked. FWIW, the best part of nested_scheduler was how much it made it possible to clean up the specs.)

EDIT: Oh, and there is a nice io_uring Discord available if you want to bounce ideas with people. Some of your musings, like the close fd parts, may have good ideas or at least answers there: https://discord.gg/T9WqsqPZ


@yxhuvud yxhuvud left a comment


Neat.

Regarding TODO:

push MORE things to the event-loop (mkdir, listen, bind, ...) 🤑

And more importantly, the nonsocket file write, read, fsync, fstat etc that lives in FileDescriptor.

end

def finalize
close

A drain could perhaps be necessary. But perhaps it is like exit and flushing writes 🤷


@ysbaddaden ysbaddaden Apr 5, 2025


Let's err on the safe side and say no. We'll need the ability to drain a ring if we want to shut down an execution context or thread anyway.

IORING_FEAT_LINKED_FILE = 1_u32 << 12
IORING_FEAT_REG_REG_RING = 1_u32 << 13

IORING_OP_NOP = 0_u32

I find this weird, as the op field in the struct is a u8 and not a u32, but the weirdness is also present in liburing, so I guess it doesn't matter. The same confusion exists in SQE_FLAGS.

The age-old question of whether to mirror the C headers' structure or to use properly sized enums, I guess. Compare

enum Op : UInt8
  NOP
  READV
  ..
end

Which may be a bit less prone to copy-pasta errors, as long as the order is kept correct.

def delete_timer(event : Event*) : Nil
sqe = @ring.next_sqe
sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS

I have a hard time seeing this being enough requests to matter either way (though I am open to being shown wrong).

Isn't it easier to just put a user_data on it that triggers a nop? I used 0 for this. Or check the CQE result for the canceled result, if that is what you are trying to avoid?


It's really just a "don't even bother pushing a CQE I don't care about".


ysbaddaden commented Apr 5, 2025

And more importantly, the nonsocket file write, read, fsync, fstat etc that lives in FileDescriptor.

At least read and write are already async, but fsync and fstat aren't.


yxhuvud commented Apr 5, 2025

At least read and write are already

Ah, I got confused by the *_fully methods.

Said differently: we don't need the sys/uio header just to bring the
iovec struct because sys/socket shall define it.
# Call `io_uring_enter` syscall. Panics on EBADR (can't recover from lost
# CQE), returns -EINTR or -EBUSY, and raises on other errnos, otherwise
# returns the int returned by the syscall.
def enter(to_submit : UInt32 = 0, min_complete : UInt32 = 0, flags : UInt32 = 0) : Int32

By the way, one thing that has changed with regards to external libraries is that liburing now also provides a liburing-ffi.so in addition to the old .so. The old library had a million functions missing because they were declared static inline and therefore required a file with C shims to be usable; the new one drops the static inlines. This is due to Rust people wanting to use the library (they have mostly the same issues with static inline as we do, even if they are a bit further along). So it is a lot less of a hassle than it used to be.

So if we wanted we could exchange a whole lot of the complexity in this file in exchange for linking against liburing-ffi. This seems complex enough that it may be worth the trouble of extra dependencies.


Sadly that would require a specific version that won't be available in distributions for years to come, and it's not installed by default (unlike libc)... unless we build and distribute our own copy (nope).

Each prep method also calls a bunch of other prep methods, so we'd lose the benefit of inlining all the SQE setup.

It's also not that complex (setup is textbook boilerplate for example) though I admit I shall check the memory ordering of atomics again 😅

# FIXME: with threads and multiple rings, we'll need to know which rings
# have pending operations for the fd (which op/event for each ring) and
# tell the rings to cancel said ops (can't just say to cancel all ops for
# fd so we can close in parallel)

@ysbaddaden ysbaddaden Sep 9, 2025


MT: we can use IORING_OP_MSG_RING on the local ring (one for each ring) to generate a CQE for any ring and can pass some data, then the CQE can be processed on that thread to submit a SQE to cancel operations on the fd.

We "just" have to remember which rings have a pending operation, but with #16127 we'd have at most a single reader and a single writer (i.e. a couple ring fds) and not a dynamic list of ring fds 👍

The issue is that the CQE must be handled by the thread that owns the ring (it must submit a SQE), and we can't have any thread process all the CQEs any more for the whole EC (this is the current EC/EV design).


ysbaddaden commented Sep 11, 2025

Merged with master, with all the non-blocking changes: we no longer set O_NONBLOCK on the fd / sockfd.

I determined that the minimum required Linux kernel is 5.6+, which helped simplify a bit (we can expect IORING_OP_OPENAT and IORING_OP_SENDMSG, and we could also assume IORING_FEAT_SINGLE_MMAP).

I also just realized that I could remove the timespecs ring by having a timespec on the Event object directly (maybe replacing the Time::Span), and point the SQE to it. I'll do that next.

@ysbaddaden

Closing. I have a better implementation coming.



Successfully merging this pull request may close these issues.

Linux's IO_Uring interface (2x IO performance!)
