Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection #7469

Merged
merged 29 commits into from
Mar 20, 2024

Conversation

michael-wb
Copy link
Contributor

@michael-wb michael-wb commented Mar 12, 2024

What, How & Why?

Updated the sync_route provided to then SyncManager so it receives a generated initial value that should be correct most of the time. If the websocket fails to connect using this value, the Session will force a location update attempt (via updating the access token) to request the proper websocket address using the current base URL. If that fails, the Session will use the current connection reconnect timer to try again after the usual holdoff.

A "verified" flag was added to determine if the websocket address was either provided via a location request or has been successfully connected to prevent excessively querying the location endpoint.

In addition, when the sync route is updated via a call to SyncManager::set_sync_route(), this function will also restart the existing sessions to ensure they are using the updated sync route value.

Fixes #7349

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed
  • [ ] bindgen/spec.yml, if public C++ API changed

@michael-wb michael-wb self-assigned this Mar 12, 2024
@cla-bot cla-bot bot added the cla: yes label Mar 12, 2024
Copy link

coveralls-official bot commented Mar 12, 2024

Pull Request Test Coverage Report for Build michael.wilkersonbarker_1007

Details

  • 550 of 573 (95.99%) changed or added relevant lines in 16 files are covered.
  • 42 unchanged lines in 13 files lost coverage.
  • Overall coverage increased (+0.03%) to 91.848%

Changes Missing Coverage Covered Lines Changed/Added Lines %
test/object-store/util/test_file.cpp 37 42 88.1%
src/realm/object-store/sync/sync_manager.cpp 9 15 60.0%
src/realm/object-store/sync/sync_session.cpp 45 51 88.24%
src/realm/sync/noinst/client_impl_base.cpp 19 25 76.0%
Files with Coverage Reduction New Missed Lines %
test/object-store/util/test_file.cpp 1 89.22%
test/test_dictionary.cpp 1 99.85%
test/test_index_string.cpp 1 94.63%
src/realm/array_string.cpp 2 85.25%
src/realm/table_view.cpp 2 94.18%
src/realm/uuid.cpp 2 97.01%
src/realm/sync/noinst/client_impl_base.cpp 3 85.84%
src/realm/sync/noinst/protocol_codec.hpp 3 76.14%
src/realm/sync/noinst/server/server.cpp 3 76.8%
src/realm/sync/instructions.hpp 4 76.03%
Totals Coverage Status
Change from base Build 2147: 0.03%
Covered Lines: 243208
Relevant Lines: 264793

💛 - Coveralls

@michael-wb michael-wb marked this pull request as ready for review March 14, 2024 13:12
src/realm/object-store/sync/app.cpp Outdated Show resolved Hide resolved
src/realm/object-store/sync/sync_manager.cpp Outdated Show resolved Hide resolved
src/realm/object-store/sync/sync_session.hpp Show resolved Hide resolved
@jbreams jbreams requested a review from tgoyne March 14, 2024 21:18
src/realm/object-store/sync/app.cpp Outdated Show resolved Hide resolved
src/realm/object-store/sync/app.cpp Show resolved Hide resolved
std::unique_lock lock{m_mutex};
E cur_state = m_cur_state;
lock.unlock();
std::optional<E> new_state = func(cur_state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this function was to guarantee that you were updating the state of the state machine while holding the mutex so nobody else could change the state while you were evaluating whether to move to a new state. So this change indicates to me that there's something broken in one of your tests. What kind of deadlock were you hitting that led you to making this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a TSAN failure around checking the base URL inside the transition_with() func handler. I reverted this code and updated the test.

@@ -3250,6 +3151,98 @@ TEST_CASE("app: sync integration", "[sync][pbs][app][baas]") {
}
}

#if 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this test isn't used in this commit?

edit: nm, i see this gets enabled in a later commit.

std::unique_ptr<sync::WebSocketInterface> connect(std::unique_ptr<sync::WebSocketObserver> observer,
sync::WebSocketEndpoint&& endpoint) override
{
sync::WebSocketEndpoint ep{std::move(endpoint)};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compiler probably optimizes this away, but you could also just rename the parameter endpoint to ep if you want to call it ep throughout the function rather than move-constructing a new endpoint.

SyncTestFile r_config(user, partition, schema);

std::vector<SharedRealm> realms;
int i;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason why i is declared outside of the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to use the same var for both loops, but it's probably more efficient to declare it in each loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there's only one loop that uses a non-range-based for loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah - there used to be two until I updated the second loop.

// to make sure the websocket URL is correct
if (!m_server_endpoint.is_verified) {
SessionErrorInfo error_info(
{ErrorCodes::SyncConnectFailed, util::format("Failed to connect to sync: %1", msg)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we typically get this error? I think ErrorCodes::ConnectionClosed is could be the more accurate error code but RefreshLocation is definitely the right action to take.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used here, when the websocket handshake response isn't one of the expected status codes (e.g. 404).
https://github.com/realm/realm-core/blob/mwb/location-failure/src/realm/sync/network/default_socket.cpp#L122

Should I just update this to always refresh the location, regardless of whether or not the sync route has been verified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. So it really is a fatal error during connection rather than a fatal error just somewhere along the lifetime of the websocket. What if we make the error code SyncConnectFailed both here and on line 597 then.

CHANGELOG.md Outdated Show resolved Hide resolved
SessionErrorInfo error_info(
{ErrorCodes::SyncConnectFailed, util::format("Failed to connect to sync: %1", msg)},
IsFatal{m_server_endpoint.is_verified});
ConnectionTerminationReason reason = ConnectionTerminationReason::http_response_says_fatal_error;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the correct reason if the errors is not fatal (i.e, is_verified = false)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - if the location hasn't been updated since the App was created, then we want this to not be a fatal error and perform a location update to try to request the correct sync route from the base URL address. Until the location info has been requested successfully, this error will continue to be treated as a nonfatal error and will retry to update the location info using the normal backoff delay.

Until the location info is verified, we're not sure if we got the failure because the URL was generated incorrectly or if the server is actually down/misconfigured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment to make it more clear

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. My comment was mainly about http_response_says_fatal_error being used even if the error is non-fatal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason value is being updated to connect_operation_failed on line 596 if the sync route has not been verified (i.e. not fatal).

@michael-wb michael-wb merged commit 255cb33 into master Mar 20, 2024
4 of 5 checks passed
@michael-wb michael-wb deleted the mwb/location-failure branch March 20, 2024 17:33
nirinchev added a commit that referenced this pull request Mar 21, 2024
* master:
  update release note
  prepare v14.4.0
  sets not allowed at storage level inside mixed (#7502)
  🔄 Synced file(s) with realm/ci-actions (#7481)
  RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection (#7469)
  RCORE-2008 Bump baas version (#7499)
  RCORE-2027: Setting log callback should not override existing log level hierarchy (#7494)
  Derive correct ubuntu version on linuxmint (#7471)
  Include nested path in 'OutOfBounds' error message (#7489)
nirinchev added a commit that referenced this pull request Mar 22, 2024
* master:
  update release note
  prepare v14.4.0
  sets not allowed at storage level inside mixed (#7502)
  🔄 Synced file(s) with realm/ci-actions (#7481)
  RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection (#7469)
  RCORE-2008 Bump baas version (#7499)
  RCORE-2027: Setting log callback should not override existing log level hierarchy (#7494)
  Derive correct ubuntu version on linuxmint (#7471)
  Include nested path in 'OutOfBounds' error message (#7489)
  fix depth for nested collections set to 4 in debug mode (#7486)
  Set the minimum buffer size in Group::write() to be equal to the page size (#7492)
  Update the parent's content version when bumping it in a nested collection (#7470)
  release notes
  core v14.3.0 (#7482)
  RCORE-2007 Added Resumption delay configuration to SyncClientTimeouts (#7441)
  Improve performance of aggregate operations on empty dictionaries (#7418)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Opening realm with cached user while offline results in fatal error and session does not retry connection
4 participants