Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eio_linux: make it safe to share FDs across domains #440

Merged
merged 1 commit into from
Feb 13, 2023

Conversation

talex5
Copy link
Collaborator

@talex5 talex5 commented Feb 9, 2023

It was previously not safe to share file descriptors between domains. The fd_sharing.md test demonstrates the problem: one domain enqueues an operation using an FD and another domain closes it first.

Eio_unix.Private.Rcfd provides a ref-counted wrapper for file descriptors to fix this problem. It's lock-free and designed to be fast in the common case where an FD is only used from one domain, but safe in the case where it's used from several.

The bench_fd benchmark is, if anything, slightly faster with this change, though I don't know why:

fd

There is a dscheck test that tests Rcfd with all combinations of two users and two closers.

It includes an (unsafe) peek operation, as the public API already provided this. We might want to remove that at some point and provide a safe Eio_unix.FD.use operation instead.

There is support here for safely having two domains try to close an FD at once, although eio_linux itself still doesn't support that (you must close from the domain that opened it).

It could also allow us to support a try_close operation at some point.

Note that closing is now always done with Unix.close, not uring. Using uring previously avoided races in the single-domain case, but that's not necessary now that we have a general solution. I did have a remove_or_close function that would use uring in the common case, but Unix.close if an operation was in progress, but that seemed over-complicated.

I'm intending to use this for the generic eio_posix backend too.

It was previously not safe to share file descriptors between domains.
The `fd_sharing.md` test demonstrates the problem: one domain enqueues
an operation using an FD and another domain closes it first.

`Eio_unix.Private.Rcfd` provides a ref-counted wrapper for file
descriptors to fix this problem. It's lock-free and designed to be fast
in the common case where an FD is only used from one domain, but safe in
the case where it's used from several.

The `bench_fd` benchmark is, if anything, slightly faster with this
change, though I don't know why.

There is a dscheck test that tests Rcfd with all combinations of two
users and two closers.

It includes an (unsafe) `peek` operation, as the public API already
provided this. We might want to remove that at some point and provide a
safe `Eio_unix.FD.use` operation instead.

There is support here for safely having two domains try to close an FD
at once, although eio_linux itself still doesn't support that (you must
close from the domain that opened it).

It could also allow us to support a `try_close` operation at some point.

Note that closing is now always done with `Unix.close`, not uring.
Using uring previously avoided races in the single-domain case, but
that's not necessary now that we have a general solution. I did have a
`remove_or_close` function that would use uring in the common case,
but `Unix.close` if an operation was in progress, but that seemed
unnecessary.
@talex5 talex5 requested a review from haesbaert February 9, 2023 14:42
@talex5
Copy link
Collaborator Author

talex5 commented Feb 9, 2023

Notes:

  • It's easiest to review the eio_linux.ml changes with whitespace off. It's mostly just moving the extraction of the FD out of the scheduler context and into the fiber context (where it's easier to deal with exceptions too).
  • The is_tty check is no longer lazy, since lazy isn't domain-safe.
  • The pool of never-freed eventfds has gone :-)

@haesbaert
Copy link
Contributor

On a preliminary review it seems like a sane approach, but I keep wondering, would it not make it easier to condensce both states (the reference counting and the Open/Close) into one Atomic.t ? This way you don't need to do get/put dances, something like:

type state =
  | Open of (int * Unix.file_desc)
  | Closing of (int * (unit -> unit))

On use you would Atomic.compare_and_set (Open (x, fd)) (Open (succ x, fd), and closing would be the inverse dance. Open -> Close would only happen on actual close and would carry the count to the new state.

@talex5
Copy link
Collaborator Author

talex5 commented Feb 13, 2023

I didn't benchmark that, but doing two allocations on every use seemed bad. The current code is only updating an integer and reading a shared location, so no allocations.

Also, I think compare_and_set is generally slower than Atomic.incr, though I don't have figures.

@haesbaert
Copy link
Contributor

I didn't benchmark that, but doing two allocations on every use seemed bad. The current code is only updating an integer and reading a shared location, so no allocations.

Also, I think compare_and_set is generally slower than Atomic.incr, though I don't have figures.

Another idea to avoid it is to use the most significant bit of the reference counting as the Closing state, so you don't need any allocations. I'm mentioning since it's way easier to reason about a protocol with only one atomic word instead of two.

Copy link
Contributor

@haesbaert haesbaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've found a race, see if I'm making any sense

- closing/closed: No need to do anything, but notifying is harmless.
*)
if Atomic.get t.ops > 0 then () (* Someone else will deal with it. *)
else if Atomic.compare_and_set t.fd prev fully_closed then (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume Domain A is here, it did the 1->0 transition, did not see t.ops be incremented and proceeded to call no_users.

)

let get t =
Atomic.incr t.ops;
Copy link
Contributor

@haesbaert haesbaert Feb 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Domain B gets here, meanwhile Domain A already decided to close with no_users.
Domain B sees Closing since the atomics are ordered, it aborts by doing put on line 97.
B sees the same sequence of events of A in put, and ends up calling no_users as well, so you end up with a double close. The assumption is that the no_close of A would be carried over to B, which I think doesn't hold here, both end up calling it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put always does a CAS to Closing ignore before calling no_users, so the real function can only get called once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahA! indeed, I've missed the ignore in fully_closed

@talex5
Copy link
Collaborator Author

talex5 commented Feb 13, 2023

Another idea to avoid it is to use the most significant bit of the reference counting as the Closing state, so you don't need any allocations.

That might work... let me give it a try...

@talex5
Copy link
Collaborator Author

talex5 commented Feb 13, 2023

After more thought, I don't think using one bit of ops to indicate "closing" helps much.

t.fd still needs to be atomic because we still need to store the callback function (at least for remove). We can't have two closers overwriting each other's callbacks, and we can't have two put calls both calling the same callback.

So, it's still two atomic locations, but working with t.ops becomes harder as it's not just an int any more.

I didn't rewrite it all to check, but suspect it won't be much/any simpler (though you're welcome to have a go - it might turn out OK).

@haesbaert
Copy link
Contributor

Ah indeed, the callback complicates it.
I wanna have a final review later today but I think it's fine as it is.

I understand this solves the re-use race which is dangerous, do you also have something in mind to "cancel the operation on other domains ?"

The way it is after this change. If multiple domains block on accept on the same fd, if one issues a close, the actual close operation will pend until every domain receives one connection.

@talex5
Copy link
Collaborator Author

talex5 commented Feb 13, 2023

I understand this solves the re-use race which is dangerous, do you also have something in mind to "cancel the operation on other domains ?" The way it is after this change. If multiple domains block on accept on the same fd, if one issues a close, the actual close operation will pend until every domain receives one connection.

The idea is you use Eio's cancellation system to cancel, not close. This is similar to Unix's behaviour too (calling close(2) in one thread will not abort an accept(2) performed on the same FD by another thread), just one level up.

Normally, if you're using things in the obvious way, the structured concurrency will have cancelled all operations by the time the switch finishes and closes the FD.

@haesbaert haesbaert merged commit f9b2924 into ocaml-multicore:main Feb 13, 2023
@talex5 talex5 deleted the safe-fds branch April 5, 2023 12:52
talex5 added a commit to talex5/opam-repository that referenced this pull request Apr 11, 2023
CHANGES:

New features:

- Add eio_posix backend (@talex5 @haesbaert ocaml-multicore/eio#448 ocaml-multicore/eio#477, reviewed by @avsm @patricoferris @polytypic).
  This replaces eio_luv on all platforms except Windows (which will later switch to its own backend). It is a lot faster, provides access to more modern features (such as `openat`), and can safely share OS resources between domains.

- Add subprocess support (@patricoferris @talex5 ocaml-multicore/eio#461 ocaml-multicore/eio#464 ocaml-multicore/eio#472, reviewed by @haesbaert @avsm).
  This is the low-level API support for eio_linux and eio_posix. A high-level cross-platform API will be added in the next release.

- Add `Fiber.fork_seq` (@talex5 ocaml-multicore/eio#460, reviewed by @avsm).
  This is a light-weight alternative to using a single-producer, single-consumer, 0-capacity stream, similar to a Python generator function.

Bug fixes:

- eio_linux: make it safe to share FDs across domains (@talex5 ocaml-multicore/eio#440, reviewed by @haesbaert).
  It was previously not safe to share file descriptors between domains because if one domain used an FD just as another was closing it, and the FD got reused, then the original operation could act on the wrong file.

- eio_linux: release uring if Linux is too old (@talex5 ocaml-multicore/eio#476).
  Avoids a small resource leak.

- eio_linux: improve error handling creating pipes and sockets (@talex5 ocaml-multicore/eio#474, spotted by @avsm).
  If we get an error (e.g. too many FDs) then report it to the calling fiber, instead of exiting the event loop.

- eio_linux: wait for uring to finish before exiting (@talex5 ocaml-multicore/eio#470, reviewed by @avsm).
  If the main fiber raised an exception then it was possible to exit while a cancellation operation was still in progress.

- eio_main: make `EIO_BACKEND` handling more uniform (@talex5 ocaml-multicore/eio#447).
  Previously this environment variable was only used on Linux. Now all platforms check it.

- Tell dune about `EIO_BACKEND` (@talex5 ocaml-multicore/eio#442).
  If this changes, dune needs to re-run the tests.

- eio_linux: add some missing close-on-execs (@talex5 ocaml-multicore/eio#441).

- eio_linux: `read_exactly` fails to update file offset (@talex5 ocaml-multicore/eio#438).

- Work around dune `enabled_if` bug on non-Linux systems (@polytypic ocaml-multicore/eio#475, reviewed by @talex5).

- Use raw system call of `getrandom` for glibc versions before 2.25 (@zenfey ocaml-multicore/eio#482).

Documentation:

- Add `HACKING.md` with hints for working on Eio (@talex5 ocaml-multicore/eio#443, reviewed by @avsm @polytypic).

- Improve worker pool example (@talex5 ocaml-multicore/eio#454).

- Add more Conditions documentation (@talex5 ocaml-multicore/eio#436, reviewed by @haesbaert).
  This adds a discussion of conditions to the README and provides examples using them to handle signals.

- Condition: fix the example in the docstring (@avsm ocaml-multicore/eio#468).

Performance:

- Add a network benchmark using an HTTP-like protocol (@talex5 ocaml-multicore/eio#478, reviewed by @avsm @patricoferris).

- Add a benchmark for reading from `/dev/zero` (@talex5 ocaml-multicore/eio#439).

Other changes:

- Add CI for macOS (@talex5 ocaml-multicore/eio#452).

- Add tests for `pread`, `pwrite` and `readdir` (@talex5 ocaml-multicore/eio#451).

- eio_linux: split into multiple files (@talex5 ocaml-multicore/eio#465 ocaml-multicore/eio#466, reviewed by @avsm).

- Update Dockerfile (@talex5 ocaml-multicore/eio#471).

- Use dune.3.7.0 (@patricoferris ocaml-multicore/eio#457).

- Mint exclusive IDs across domains (@TheLortex ocaml-multicore/eio#480, reported by @haesbaert, reviewed by @talex5).
  The tracing currently only works with a single domain anyway, but this will change when OCaml 5.1 is released.
talex5 added a commit to talex5/opam-repository that referenced this pull request Jun 2, 2023
CHANGES:

New features:

- Add eio_posix backend (@talex5 @haesbaert ocaml-multicore/eio#448 ocaml-multicore/eio#477, reviewed by @avsm @patricoferris @polytypic).
  This replaces eio_luv on all platforms except Windows (which will later switch to its own backend). It is a lot faster, provides access to more modern features (such as `openat`), and can safely share OS resources between domains.

- Add subprocess support (@patricoferris @talex5 ocaml-multicore/eio#461 ocaml-multicore/eio#464 ocaml-multicore/eio#472, reviewed by @haesbaert @avsm).
  This is the low-level API support for eio_linux and eio_posix. A high-level cross-platform API will be added in the next release.

- Add `Fiber.fork_seq` (@talex5 ocaml-multicore/eio#460, reviewed by @avsm).
  This is a light-weight alternative to using a single-producer, single-consumer, 0-capacity stream, similar to a Python generator function.

Bug fixes:

- eio_linux: make it safe to share FDs across domains (@talex5 ocaml-multicore/eio#440, reviewed by @haesbaert).
  It was previously not safe to share file descriptors between domains because if one domain used an FD just as another was closing it, and the FD got reused, then the original operation could act on the wrong file.

- eio_linux: release uring if Linux is too old (@talex5 ocaml-multicore/eio#476).
  Avoids a small resource leak.

- eio_linux: improve error handling creating pipes and sockets (@talex5 ocaml-multicore/eio#474, spotted by @avsm).
  If we get an error (e.g. too many FDs) then report it to the calling fiber, instead of exiting the event loop.

- eio_linux: wait for uring to finish before exiting (@talex5 ocaml-multicore/eio#470, reviewed by @avsm).
  If the main fiber raised an exception then it was possible to exit while a cancellation operation was still in progress.

- eio_main: make `EIO_BACKEND` handling more uniform (@talex5 ocaml-multicore/eio#447).
  Previously this environment variable was only used on Linux. Now all platforms check it.

- Tell dune about `EIO_BACKEND` (@talex5 ocaml-multicore/eio#442).
  If this changes, dune needs to re-run the tests.

- eio_linux: add some missing close-on-execs (@talex5 ocaml-multicore/eio#441).

- eio_linux: `read_exactly` fails to update file offset (@talex5 ocaml-multicore/eio#438).

- Work around dune `enabled_if` bug on non-Linux systems (@polytypic ocaml-multicore/eio#475, reviewed by @talex5).

- Use raw system call of `getrandom` for glibc versions before 2.25 (@zenfey ocaml-multicore/eio#482).

Documentation:

- Add `HACKING.md` with hints for working on Eio (@talex5 ocaml-multicore/eio#443, reviewed by @avsm @polytypic).

- Improve worker pool example (@talex5 ocaml-multicore/eio#454).

- Add more Conditions documentation (@talex5 ocaml-multicore/eio#436, reviewed by @haesbaert).
  This adds a discussion of conditions to the README and provides examples using them to handle signals.

- Condition: fix the example in the docstring (@avsm ocaml-multicore/eio#468).

Performance:

- Add a network benchmark using an HTTP-like protocol (@talex5 ocaml-multicore/eio#478, reviewed by @avsm @patricoferris).

- Add a benchmark for reading from `/dev/zero` (@talex5 ocaml-multicore/eio#439).

Other changes:

- Add CI for macOS (@talex5 ocaml-multicore/eio#452).

- Add tests for `pread`, `pwrite` and `readdir` (@talex5 ocaml-multicore/eio#451).

- eio_linux: split into multiple files (@talex5 ocaml-multicore/eio#465 ocaml-multicore/eio#466, reviewed by @avsm).

- Update Dockerfile (@talex5 ocaml-multicore/eio#471).

- Use dune.3.7.0 (@patricoferris ocaml-multicore/eio#457).

- Mint exclusive IDs across domains (@TheLortex ocaml-multicore/eio#480, reported by @haesbaert, reviewed by @talex5).
  The tracing currently only works with a single domain anyway, but this will change when OCaml 5.1 is released.
talex5 added a commit to talex5/opam-repository that referenced this pull request Jun 2, 2023
CHANGES:

New features:

- Add eio_posix backend (@talex5 @haesbaert ocaml-multicore/eio#448 ocaml-multicore/eio#477, reviewed by @avsm @patricoferris @polytypic).
  This replaces eio_luv on all platforms except Windows (which will later switch to its own backend). It is a lot faster, provides access to more modern features (such as `openat`), and can safely share OS resources between domains.

- Add subprocess support (@patricoferris @talex5 ocaml-multicore/eio#461 ocaml-multicore/eio#464 ocaml-multicore/eio#472, reviewed by @haesbaert @avsm).
  This is the low-level API support for eio_linux and eio_posix. A high-level cross-platform API will be added in the next release.

- Add `Fiber.fork_seq` (@talex5 ocaml-multicore/eio#460, reviewed by @avsm).
  This is a light-weight alternative to using a single-producer, single-consumer, 0-capacity stream, similar to a Python generator function.

Bug fixes:

- eio_linux: make it safe to share FDs across domains (@talex5 ocaml-multicore/eio#440, reviewed by @haesbaert).
  It was previously not safe to share file descriptors between domains because if one domain used an FD just as another was closing it, and the FD got reused, then the original operation could act on the wrong file.

- eio_linux: release uring if Linux is too old (@talex5 ocaml-multicore/eio#476).
  Avoids a small resource leak.

- eio_linux: improve error handling creating pipes and sockets (@talex5 ocaml-multicore/eio#474, spotted by @avsm).
  If we get an error (e.g. too many FDs) then report it to the calling fiber, instead of exiting the event loop.

- eio_linux: wait for uring to finish before exiting (@talex5 ocaml-multicore/eio#470, reviewed by @avsm).
  If the main fiber raised an exception then it was possible to exit while a cancellation operation was still in progress.

- eio_main: make `EIO_BACKEND` handling more uniform (@talex5 ocaml-multicore/eio#447).
  Previously this environment variable was only used on Linux. Now all platforms check it.

- Tell dune about `EIO_BACKEND` (@talex5 ocaml-multicore/eio#442).
  If this changes, dune needs to re-run the tests.

- eio_linux: add some missing close-on-execs (@talex5 ocaml-multicore/eio#441).

- eio_linux: `read_exactly` fails to update file offset (@talex5 ocaml-multicore/eio#438).

- Work around dune `enabled_if` bug on non-Linux systems (@polytypic ocaml-multicore/eio#475, reviewed by @talex5).

- Use raw system call of `getrandom` for glibc versions before 2.25 (@zenfey ocaml-multicore/eio#482).

Documentation:

- Add `HACKING.md` with hints for working on Eio (@talex5 ocaml-multicore/eio#443, reviewed by @avsm @polytypic).

- Improve worker pool example (@talex5 ocaml-multicore/eio#454).

- Add more Conditions documentation (@talex5 ocaml-multicore/eio#436, reviewed by @haesbaert).
  This adds a discussion of conditions to the README and provides examples using them to handle signals.

- Condition: fix the example in the docstring (@avsm ocaml-multicore/eio#468).

Performance:

- Add a network benchmark using an HTTP-like protocol (@talex5 ocaml-multicore/eio#478, reviewed by @avsm @patricoferris).

- Add a benchmark for reading from `/dev/zero` (@talex5 ocaml-multicore/eio#439).

Other changes:

- Add CI for macOS (@talex5 ocaml-multicore/eio#452).

- Add tests for `pread`, `pwrite` and `readdir` (@talex5 ocaml-multicore/eio#451).

- eio_linux: split into multiple files (@talex5 ocaml-multicore/eio#465 ocaml-multicore/eio#466, reviewed by @avsm).

- Update Dockerfile (@talex5 ocaml-multicore/eio#471).

- Use dune.3.7.0 (@patricoferris ocaml-multicore/eio#457).

- Mint exclusive IDs across domains (@TheLortex ocaml-multicore/eio#480, reported by @haesbaert, reviewed by @talex5).
  The tracing currently only works with a single domain anyway, but this will change when OCaml 5.1 is released.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants