
Wait for sessions to drain during sync client shutdown #6293

Merged: 12 commits from jbr/drain_sessions_async into master on Feb 17, 2023

Conversation

@jbreams (Contributor) commented Feb 10, 2023

What, How & Why?

There's a race when force-closing a connection while sessions are still being finalized. With this change, if there are still active sessions on a connection when force_close() is called, the connection closes as soon as the last session has ended and skips the linger period timing entirely.
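A minimal sketch of the intended behavior, pieced together from the diff hunks below; finish_force_close() is a placeholder name, not the PR's actual code:

// Sketch only: if sessions are still active when force_close() is called,
// remember that and defer the real close until the last session ends.
// Either way, the eventual close skips the usual linger period.
void Connection::force_close()
{
    if (m_num_active_unsuspended_sessions > 0) {
        m_force_closed = true; // picked up when the last session ends
        return;
    }
    finish_force_close(); // placeholder: tear the connection down immediately
}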

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • C-API, if public C++ API changed.

@cla-bot (bot) added the cla: yes label Feb 10, 2023
@jbreams marked this pull request as ready for review February 10, 2023 15:00
@@ -1264,6 +1266,12 @@ inline void ClientImpl::Connection::one_less_active_unsuspended_session()
{
    if (--m_num_active_unsuspended_sessions != 0)
        return;

    if (m_force_closed) {
Collaborator

should the check also include m_state != ConnectionState::disconnected?

Contributor Author

Good call.
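For reference, a sketch of the check with the suggested guard added; finish_force_close() is again a placeholder, not the PR's actual teardown call:

inline void ClientImpl::Connection::one_less_active_unsuspended_session()
{
    REALM_ASSERT(m_num_active_unsuspended_sessions);
    if (--m_num_active_unsuspended_sessions != 0)
        return;

    // Only finish a pending force close if we haven't already disconnected.
    if (m_force_closed && m_state != ConnectionState::disconnected) {
        finish_force_close();
    }
}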

@jbreams (Contributor Author) commented Feb 10, 2023

If this goes into the current release, it doesn't need a changelog entry, since the assertions it's fixing haven't been released yet.

@michael-wb (Contributor) left a comment

LGTM - I didn't see the new debug statements in any of the test runs I checked, but the logic looks good.

@danieltabacaru (Collaborator)

There is a failure in CI:
ERROR in Sync_FailingReadsOnServerSide: CHECK(compare_groups(rt_1, rt_2))

I don't think it's related to this change.

@jbreams (Contributor Author) commented Feb 11, 2023

The test failures were actually relevant. I've gone and re-worked some things and added more asserts. I've run all the realm-sync-tests locally 1000 times with the latest changes.

@michael-wb (Contributor) left a comment

These changes look good - just a few minor questions/comments

@@ -544,6 +568,11 @@ void Connection::initiate_reconnect_wait()
    REALM_ASSERT(!m_reconnect_delay_in_progress);
    REALM_ASSERT(!m_disconnect_delay_in_progress);

    // If we've been force closed then we don't need/want to reconnect. Just return early here.
    if (m_force_closed) {
        return;
Contributor

Should this also ensure the m_reconnect_disconnect_timer is stopped (e.g. m_reconnect_disconnect_timer.reset()), or should that still happen in Connection::force_close()?

Contributor Author

Yes, I've also added some assertions to make sure that we end up disconnected at the end of voluntary_disconnect and that we're in the correct delay mode if we are disconnected.
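A sketch of the first assertion described here; the exact placement in the PR may differ, and the "correct delay mode" check is omitted because its expression isn't shown in this thread:

void Connection::voluntary_disconnect()
{
    // ... existing disconnect logic ...
    // A voluntary disconnect must always leave the connection disconnected.
    REALM_ASSERT(m_state == ConnectionState::disconnected);
}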

{
    std::lock_guard lock{m_mutex};
    m_actualize_and_finalize_needed = false;
    swap(m_unactualized_session_wrappers, unactualized_session_wrappers);
    swap(m_abandoned_session_wrappers, abandoned_session_wrappers);
    stopped = m_stopped;
Contributor

Do you need this since stopped is now an atomic? You can read m_stopped directly on line 602 and avoid a tiny window where m_stopped might be changed between here and there.

Contributor Author

I decided not to make m_stopped atomic after all. I think since we're threading force_closed states down into sessions now, we can just use that to efficiently check if it's safe to call callbacks etc rather than introducing an atomic bool.
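Roughly the shape of that idea; the member and method names below are illustrative, not the PR's actual code:

// The force-closed state is threaded down from the client into each session
// wrapper, so callback sites can bail out without an atomic m_stopped.
void SessionWrapper::on_sync_progress() // hypothetical callback site
{
    if (m_force_closed)
        return; // client is shutting down; don't call back into user code
    // ... invoke the registered progress handler ...
}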

@@ -491,24 +495,17 @@ void ClientImpl::drain_connections_on_loop()
{
    post([this](Status status) mutable {
        REALM_ASSERT(status.is_ok());
        actualize_and_finalize_session_wrappers();
        drain_connections();
    });
}

void ClientImpl::drain()
Contributor

What do you think about renaming Client::drain() to something like Client::shutdown() or Client::shutdown_and_wait() so the name is a bit more descriptive?

Contributor Author

Good idea! I was actually thinking the same thing but just hadn't typed it up yet.
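For illustration, the rename might look like this at the public API level; the delegation shown is an assumption, not the PR's actual implementation:

// Sketch: same blocking drain behavior as before, under a clearer name.
void Client::shutdown_and_wait()
{
    m_impl->drain(); // waits until all sessions and connections are torn down
}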

@jbreams jbreams force-pushed the jbr/drain_sessions_async branch from 8212db9 to 7cfef56 Compare February 15, 2023 15:46

@@ -566,7 +566,9 @@ class MultiClientServerFixture {
    {
        unit_test::TestContext& test_context = m_test_context;
        stop();
        m_clients.clear();
        for (int i = 0; i < m_num_clients; ++i) {
            m_clients[i]->shutdown_and_wait();
Contributor Author

It turns out there can be callbacks that fire during the call to shutdown_and_wait(), and if we clear this array then the unique_ptrs in m_clients will be null and those callbacks will segfault. So we need to defer destroying the Clients until they are all actually shut down.
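A sketch of the teardown order being described, with the surrounding fixture code elided:

// Shut every client down first; callbacks that fire during
// shutdown_and_wait() may still dereference the pointers in m_clients,
// so the container has to stay populated until all clients are done.
for (int i = 0; i < m_num_clients; ++i)
    m_clients[i]->shutdown_and_wait(); // blocks until this client has drained
m_clients.clear(); // safe now: no client can fire a callback anymore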

m_conn.one_less_active_unsuspended_session(); // Throws
on_suspended(SessionErrorInfo{error.code(), false});
Contributor Author

This was actually a bug where we'd end up making the number of active unsuspended sessions underflow.

@@ -1262,8 +1266,10 @@ inline void ClientImpl::Connection::one_more_active_unsuspended_session()

inline void ClientImpl::Connection::one_less_active_unsuspended_session()
{
    REALM_ASSERT(m_num_active_unsuspended_sessions);
Contributor Author

I've added these assertions wherever we decrement (--) a variable, because there was at least one case where we were underflowing and turning this value into uint64_t(-1).

Contributor

I like having the assertions there - it will help with tracking down any issues in the future.


@michael-wb (Contributor) left a comment

Looks good - I walked through a couple of shutdown scenarios and couldn't come up with any possible issues that aren't already handled. One area I thought might be an issue was the outstanding_posts handling, but the timers always run their handlers when canceled and the code waits for all the posts to complete.
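A sketch of the bookkeeping being referred to; the member names are illustrative and the actual implementation may differ:

// Every posted handler (and timer) bumps the counter when scheduled and
// drops it when it runs -- canceled timers still run their handlers -- so
// shutdown can simply wait for the count to reach zero.
void ClientImpl::incr_outstanding_posts()
{
    std::lock_guard lock{m_drain_mutex};
    ++m_outstanding_posts;
}

void ClientImpl::decr_outstanding_posts()
{
    std::lock_guard lock{m_drain_mutex};
    if (--m_outstanding_posts == 0)
        m_drain_cv.notify_all(); // wakes the thread blocked in shutdown_and_wait()
}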


@jbreams (Contributor Author) commented Feb 16, 2023

@danieltabacaru , do you want to weigh in here?

@danieltabacaru (Collaborator)

> @danieltabacaru , do you want to weigh in here?

Yes, I'll have a look this evening.

std::vector<Session*> to_close;
for (auto& session_pair : m_sessions) {
    if (session_pair.second->m_state == Session::State::Active) {
        to_close.push_back(session_pair.second.get());
Collaborator

nit: why not close the sessions directly here?

@jbreams (Contributor Author) commented Feb 16, 2023 via email

@danieltabacaru (Collaborator) left a comment

LGTM

@jbreams merged commit b05f9f0 into master Feb 17, 2023
@jbreams deleted the jbr/drain_sessions_async branch February 17, 2023 00:46
@kiburtse mentioned this pull request Mar 3, 2023
@github-actions (bot) locked as resolved and limited conversation to collaborators Mar 21, 2024