Skip to content

test fix: keep connection alive#8458

Merged
bw-solana merged 1 commit intoanza-xyz:masterfrom
bw-solana:quic_test_fix_conn_alive
Oct 14, 2025
Merged

test fix: keep connection alive#8458
bw-solana merged 1 commit intoanza-xyz:masterfrom
bw-solana:quic_test_fix_conn_alive

Conversation

@bw-solana
Copy link
Copy Markdown

@bw-solana bw-solana commented Oct 13, 2025

Problem

test_quic_server_multiple_connections_on_single_client_endpoint is flaky in CI. For example, see https://buildkite.com/anza/agave/builds/31839#0199de40-d9cd-4a5d-b814-1a48e19f8d23

We rely on this receiver being alive to keep the server alive, but we don't use the receiver so we drop it immediately.

There seems to be some timing/coalesce variance that will occasionally cause the following sequence to happen:

  1. server tries to send
  2. recognizes channel is disconnected because receiver dropped
  3. calls cancel on the cancellation token
  4. connection gets dropped

This can then lead to multiple different failure modes in this test (and maybe others) depending on exact timing:

  1. We might jump up to 2 removed connections and spin forever in this loop because we're explicitly waiting for 1 removed connection
  2. We might try and write to the connection, fail, try to unwrap the error, and panic

I'm guessing #8025 is what caused this issue to start popping up

Summary of Changes

Explicitly drop the receiver at the end of the test to prevent dropping it earlier

@bw-solana bw-solana marked this pull request as ready for review October 13, 2025 20:24
@bw-solana bw-solana requested a review from KirillLykov October 13, 2025 20:24
@bw-solana
Copy link
Copy Markdown
Author

@KirillLykov It seems like there are other tests that might need a similar fix

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.2%. Comparing base (d3247fa) to head (e738a15).

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #8458     +/-   ##
=========================================
- Coverage    83.2%    83.2%   -0.1%     
=========================================
  Files         841      841             
  Lines      368425   368430      +5     
=========================================
- Hits       306635   306615     -20     
- Misses      61790    61815     +25     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@KirillLykov
Copy link
Copy Markdown

interesting, before it was exit.store(true, Ordering::Relaxed); instead of cancel.cancel(); when receiver dropped in run_packet_batch_sender. But before we didn't kill the tasks handling connections when cancellation happens, maybe that's why.

Copy link
Copy Markdown

@KirillLykov KirillLykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will check out if we have other tests like that.

@bw-solana bw-solana added this pull request to the merge queue Oct 14, 2025
Merged via the queue into anza-xyz:master with commit 769645f Oct 14, 2025
43 checks passed
@bw-solana bw-solana deleted the quic_test_fix_conn_alive branch October 14, 2025 17:49
let SpawnTestServerResult {
join_handle,
receiver: _,
receiver,
Copy link
Copy Markdown

@steviez steviez Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I feel like I might have known this could happen at some point but didn't remember this time around. Had to dig a little to find this in the book:

Note that there is a subtle difference between using only _ and using a name that starts with an underscore. The syntax _x still binds the value to the variable, whereas _ doesn’t bind at all. To show a case where this distinction matters, Listing 19-21 will provide us with an error.
...
However, using the underscore by itself doesn’t ever bind to the value

Even then, I'm surprised the docs aren't a little more direct about this potential footgun. Some issues in rust-lang would suggest that other people find this bit a little tricky too but 🤷

FWIW, it looks like doing receiver: _receiver would have sufficed too but the explicit drop() at the end does the job as well !

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, such a foot gun... I'm happy to change this to _receiver as opposed to the explicit drop at the end.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicit is better than implicit.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also should not wait without timeout for some atomics to have desired value. Will fix this as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants