Conversation

@lalinsky

When tick() is called with wait=0 (no_wait mode), it would exit immediately after processing events without submitting pending EV_DELETE changes to kqueue. This caused a use-after-free bug:

  1. Events complete and callbacks return .disarm
  2. DELETEs are queued in local events[] array
  3. Completions are marked .dead and recycled
  4. tick() exits with wait==0 before submitting DELETEs
  5. kqueue still has stale registrations with dead udata pointers
  6. Later events (like EOF) arrive with corrupted pointers

Fix: Only exit early if both wait==0 AND changes==0. This ensures we loop back to submit pending DELETEs via kevent_syscall with zero timeout before exiting, properly cleaning up kqueue state.
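At the kqueue level, the cleanup this guarantees is just one more kevent() call that submits the queued EV_DELETE entries with a zero timeout, so nothing blocks in no-wait mode. A standalone C sketch of that final pass (plain kqueue, not libxev's Zig internals; the helper name is made up):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    /* Hypothetical helper, not libxev code: apply whatever changes are
     * queued (in particular EV_DELETE entries) and pick up anything that
     * is already pending, without blocking.  The zero timeout makes
     * kevent() return immediately even if no events are ready, so a
     * no-wait tick can still do this pass before returning. */
    static int flush_pending(int kq,
                             const struct kevent *changes, int nchanges,
                             struct kevent *events, int nevents)
    {
        const struct timespec zero = { 0, 0 };
        return kevent(kq, changes, nchanges, events, nevents, &zero);
    }

With the fix, a no-wait tick only returns once there are no queued changes left; until then it keeps doing passes like this.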

@mitchellh
Owner

Thanks! For any core loop changes like this, I'd love to see the change paired with a unit test that demonstrates the problem (i.e. removing the changed code would cause the test to fail).

@lalinsky
Author

lalinsky commented Oct 16, 2025

It's a bit tricky to make a test fail reliably, because the bug is memory corruption with random behavior. I'm not sure how to test it.

The test would need to involve:

  1. some TCP server
  2. a TCP connect
  3. a TCP read (single-shot)
  4. when the server calls shutdown, we receive EOF, even though the single-shot completion is no longer allocated

It can be driven with a loop of run(.no_wait) followed by run(.once), or just run(.no_wait) in a busy loop.
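The same failure mode can also be shown without libxev at all, at the raw kqueue level. This is only a C sketch of the mechanism (a socketpair stands in for the TCP server and the completion struct is made up), not the unit test requested above:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Stand-in for a completion object; kqueue only keeps the raw
     * pointer we hand it as udata. */
    struct completion { char tag[16]; };

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return 1;

        int kq = kqueue();
        struct completion *c = malloc(sizeof(*c));
        strcpy(c->tag, "read");

        /* Register read interest like the log's flags=0x5
         * (EV_ADD | EV_ENABLE): persistent, not EV_ONESHOT. */
        struct kevent chg;
        EV_SET(&chg, sv[0], EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, c);
        kevent(kq, &chg, 1, NULL, 0, NULL);

        /* Peer sends one byte; consume the event and the data,
         * as a completed single-shot read would. */
        write(sv[1], "x", 1);
        struct kevent ev;
        kevent(kq, NULL, 0, &ev, 1, NULL);
        char buf[8];
        read(sv[0], buf, sizeof(buf));

        /* Model of the bug: recycle the completion without ever
         * submitting the queued EV_DELETE for this registration. */
        free(c);

        /* Peer shuts down; kqueue reports EOF on the still-registered
         * filter, carrying the stale udata pointer. */
        shutdown(sv[1], SHUT_WR);
        kevent(kq, NULL, 0, &ev, 1, NULL);
        printf("EOF event with stale udata=%p flags=0x%x\n",
               ev.udata, (unsigned)ev.flags);
        return 0;
    }

Whether that stale pointer then crashes anything depends on whether the allocation has already been reused, which is why the behavior is so random in a test.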

@lalinsky
Author

I was debugging it via a unit test in a library that uses libxev, but even there the only signal was whether it crashed or not, and sometimes even not crashing didn't mean it was working correctly.

https://github.com/lalinsky/zio/pull/50/files

@mitchellh
Owner

Thanks, understood. I'll take a closer look soon!

@lalinsky
Author

lalinsky commented Oct 16, 2025

This is the log sequence that helped me understand it:

We submitted a READ for fd 6:

[KQUEUE] submit: submitting 2 changes to kevent:
  udata=0x1048e2c10 ident=6 filter=-1 flags=0x5

then we got the result for the READ:

[KQUEUE] tick: got event udata=0x1048e2c10 ident=6 filter=-1 flags=0x5
[KQUEUE] tick: processing completion backend.kqueue.Completion@1048e2c10 state=3 op=8
[KQUEUE] tick: calling perform on backend.kqueue.Completion@1048e2c10

then we submitted a WRITE for fd 6:

[KQUEUE] submit: submitting 1 changes to kevent:
  udata=0x1048e2e38 ident=6 filter=-2 flags=0x5

and got the result for the WRITE:

[KQUEUE] submit: got event udata=0x1048e2e38 ident=6 filter=-2 flags=0x5
[KQUEUE] submit: processing completion backend.kqueue.Completion@1048e2e38 state=3 op=7
[KQUEUE] submit: calling perform on backend.kqueue.Completion@1048e2e38

then we tried to delete the WRITE for fd 6 (because here it was running in .once, not .no_wait):

[KQUEUE] tick: submitting 1 changes to kevent:
  udata=0x0 ident=6 filter=-2 flags=0x2

and then got EOF for the READ on fd 6, whose completion is no longer in use:

[KQUEUE] tick: got event udata=0x1048e2c10 ident=6 filter=-1 flags=0x8005
[KQUEUE] tick: processing completion backend.kqueue.Completion@1048e2c10 state=0 op=8
[KQUEUE] tick: calling perform on backend.kqueue.Completion@1048e2c10

(this was one of the cases where it did not crash, because the memory had not been reclaimed yet)

See https://github.com/lalinsky/zio/pull/49/files for the location of the prints.
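For anyone decoding the filter/flag numbers in the log above, here is a tiny C check against <sys/event.h> (values as on macOS/BSD):

    #include <assert.h>
    #include <sys/types.h>
    #include <sys/event.h>

    int main(void)
    {
        /* filter -1 / -2 are the read and write filters */
        assert(EVFILT_READ == -1 && EVFILT_WRITE == -2);
        /* flags 0x5: persistent registration (note: no EV_ONESHOT) */
        assert((EV_ADD | EV_ENABLE) == 0x5);
        /* flags 0x2: the queued delete that the early return skipped */
        assert(EV_DELETE == 0x2);
        /* flags 0x8005: EOF reported on the still-registered read filter */
        assert((EV_EOF | EV_ADD | EV_ENABLE) == 0x8005);
        return 0;
    }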

@lalinsky lalinsky closed this Nov 11, 2025