Conversation
Abstraction of `io_uring` syscalls to create a ring, map the kernel buffers into userspace, submit operations and iterate completions. Also provides optional support for SQPOLL, with proper wakeup of the SQ thread when it has gone to sleep.
Since read(2) can also fail with EINTR, we may always have to retry...
We must submit operation chains in a single shot, that is, update the SQ tail shared with the kernel (sq_ktail) in a single STORE after populating all the SQEs to chain together. This led to an overhaul of the System::IoUring abstraction and the EventLoop::IoUring async helpers. Fixes the issue where CLOSE happens after ASYNC_CANCEL when closing a file descriptor. Makes sure that LINK_TIMEOUT will always be correctly registered to the previous READ, WRITE or POLL.
```crystal
# one thread closing a fd won't interrupt reads or writes happening in
# other threads, for example a blocked read on a fifo will keep blocking,
# while close would have finished and closed the fd; we thus explicitly
# cancel any pending operations on the fd before we try to close
```
The close(2) manpage explicitly states that some systems interrupt any blocking read or write, but the Linux behavior is to 🙈
```crystal
# TODO: we could check if tail changed and iterate more, until we reach the
# maximum iterations count
end
```
The following enums are only used to enhance Crystal.trace.
```crystal
def interrupt : Nil
  # the atomic makes sure we only write once (no need to write multiple times)
  @eventfd.write(1) if @interrupted.test_and_set
```
This is broken: there is no @eventfd.
Well, someone (cough) has already done that job, though I can definitely understand not wanting the extra dependency. That said, there is a flag that can be passed when building liburing that doesn't inline anything (thanks, Rust people!), but assuming that build to be available is optimistic, unless we build it ourselves. (I will look at the actual code later, but a take on it that might perhaps inspire may be https://github.com/yxhuvud/nested_scheduler_io_uring_context/blob/main/src/nested_scheduler/io_uring_context.cr , which is a plugin for nested_scheduler to use io_uring. It is definitely broken in some aspects, not even counting the general shift of the codebase that has happened since the nested_scheduler set of monkeypatches worked. FWIW, the best part of nested_scheduler was how much it made it possible to clean up the specs.) EDIT: Oh, and there is a nice io_uring Discord available if you want to bounce ideas with people. Some of your musings, like the close fd parts, may have good ideas or at least answers there: https://discord.gg/T9WqsqPZ
yxhuvud left a comment
Neat.
Regarding TODO:
push MORE things to the event-loop (mkdir, listen, bind, ...) 🤑
And more importantly, the nonsocket file write, read, fsync, fstat etc. that live in FileDescriptor.
```crystal
end

def finalize
  close
```
Drain could perhaps be necessary. But perhaps it is like exit and flushing writes 🤷
Let's err on the safe side and say no. We'll need the ability to drain a ring if we want to shut down an execution context or thread anyway.
```crystal
IORING_FEAT_LINKED_FILE  = 1_u32 << 12
IORING_FEAT_REG_REG_RING = 1_u32 << 13

IORING_OP_NOP = 0_u32
```
I find this weird, as the op field in the struct is a u8 and not a u32, but the weirdness is also present in liburing, so I guess it doesn't matter. The same confusion exists in SQE_FLAGS.
The age-old question of whether to mirror the C files' structure or to use properly sized enums, I guess. Compare:

```crystal
enum Op : UInt8
  NOP
  READV
  # ..
end
```

Which may be a bit less prone to copy-pasta errors, as long as the order is kept correct.
```crystal
def delete_timer(event : Event*) : Nil
  sqe = @ring.next_sqe
  sqe.value.opcode = LibC::IORING_OP_TIMEOUT_REMOVE
  sqe.value.flags = LibC::IOSQE_CQE_SKIP_SUCCESS
```
I have a hard time seeing that this is enough requests to matter either way (though I am open to being shown wrong).
Isn't it easier to just put a user_data on it that triggers a nop? I used 0 for this. Or check the CQE result for the canceled result, if that is what you are trying to avoid?
It's really just a "don't even bother pushing a CQE I don't care about", at least.
Ah, I got confused by the *_fully methods.
Said differently: we don't need the sys/uio header just to bring in the iovec struct, because sys/socket shall define it.
```crystal
# Call `io_uring_enter` syscall. Panics on EBADR (can't recover from lost
# CQE), returns -EINTR or -EBUSY, and raises on other errnos, otherwise
# returns the int returned by the syscall.
def enter(to_submit : UInt32 = 0, min_complete : UInt32 = 0, flags : UInt32 = 0) : Int32
```
By the way, one thing that has changed with regards to external libraries is that liburing now provides a liburing-ffi.so in addition to the old .so, which had a million functions missing because they were declared static inline and therefore required a file of C shims to be usable. The new liburing-ffi.so has dropped those static inlines, thanks to Rust people wanting to use the library; they have mostly the same issues with static inline as we do, even if they are a bit further along. So that is a lot less of a hassle than it used to be.
So if we wanted, we could exchange a whole lot of the complexity in this file for linking against liburing-ffi. This seems complex enough that it may be worth the trouble of the extra dependency.
Sadly, that would require a specific version that won't be available in distributions for years to come, and it's not installed by default (unlike libc)... unless we build and distribute our own copy (nope).
Each prep method also calls a bunch of other prep methods, so we'd lose the benefit of inlining all the SQE setup.
It's also not that complex (the setup is textbook boilerplate, for example), though I admit I shall check the memory ordering of the atomics again 😅
```crystal
# FIXME: with threads and multiple rings, we'll need to know which rings
# have pending operations for the fd (which op/event for each ring) and
# tell the rings to cancel said ops (can't just say to cancel all ops for
# fd so we can close in parallel)
```
MT: we can use IORING_OP_MSG_RING on the local ring (one for each ring) to generate a CQE for any ring and can pass some data, then the CQE can be processed on that thread to submit a SQE to cancel operations on the fd.
We "just" have to remember which rings have a pending operation, but with #16127 we'd have at most a single reader and a single writer (i.e. a couple ring fds) and not a dynamic list of ring fds 👍
The issue is that the CQE must be handled by the thread that owns the ring (it must submit a SQE), and we can't have any thread process all the CQEs any more for the whole EC (this is the current EC/EV design).
Merged with master, with all the non-blocking changes: we no longer set O_NONBLOCK on the fd / sockfd. I determined that the minimum required Linux kernel is 5.6+, which helped simplify a bit (we can expect the ...). I also just realized that I could remove the timespecs ring by having a timespec on the `Event` object.
Closing. I have a better implementation coming.
Initial attempt at writing an EventLoop backed by io_uring for Linux targets.
It requires different features that have been implemented in different versions of the kernel (I'm not sure exactly which), one of the most important being "don't drop any CQE as long as there is memory" 😅
It supports SQPOLL (implicit submissions) to further reduce the number of syscalls.
Unlike the polling EventLoop (libevent, epoll, kqueue), it's fully async, meaning that any attempt to read or write will yield the current fiber, for example, regardless of whether there could have been something to read. Expect lots of fiber yields!
WARNING: THREAD UNSAFE! MT will segfault. Only try it with single thread 💣 💣 💣
PREREQUISITES:

- `Crystal::EventLoop#reopened(FileDescriptor)` hook #15640
- `Crystal::EventLoop#close` do the actual close (not just cleanup) #15641
- `O_NONBLOCK` (see Add io_uring event loop #15634 (comment))
- `blocking` arg on `File` (?)

TODO:

- `.remove_impl(FileDescriptor|Socket)` so we don't leak fibers: `Event` object;
- remove `@timespecs` from `System::IoUring` and put each `LibC::Timespec` on the `Event` object to have a stable pointer for SQPOLL submissions.

MT:

- `EC::Parallel` shall create a ring for the first scheduler, then the other schedulers share its resources (`IORING_SETUP_ATTACH_WQ`);
- `io_uring_enter` (only when blocking?);
- `close(IO::FileDescriptor)` must cancel any read and any write operations to the `IO::FileDescriptor` on any ring; `Socket` doesn't, `shutdown` will do the job.

BONUSES (over Epoll):

DRAWBACK-ish:

- (`socket << 'a' << 'b'` will yield twice by default)... instead of doing a syscall, so 🤔

FOLLOW UP:

- (`blocking` arg);
- (`open`, `fsync`, `fstat`, `mkdir`, `link`, `listen`, `bind`, ...) 🤑

NOTES:
I didn't use liburing to avoid bringing an external dependency, plus the `io_uring_prep_*` functions are inlined in the `io_uring.h` header, and would have had to be rewritten anyway. The `Crystal::System::IoUring` struct does most of the whole job, then `Crystal::EventLoop::IoUring` directly fills the SQE.

Adding support for MT and EC involves a few hurdles. See #10740 (comment)