-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection #7469
Conversation
…and querying on first connect attempt
Pull Request Test Coverage Report for Build michael.wilkersonbarker_1007Details
💛 - Coveralls |
…ified flag to prevent updating location too much
std::unique_lock lock{m_mutex}; | ||
E cur_state = m_cur_state; | ||
lock.unlock(); | ||
std::optional<E> new_state = func(cur_state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this function was to guarantee that you were updating the state of the state machine while holding the mutex so nobody else could change the state while you were evaluating whether to move to a new state. So this change indicates to me that there's something broken in one of your tests. What kind of deadlock were you hitting that led you to making this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a TSAN failure around checking the base URL inside the transition_with()
func handler. I reverted this code and updated the test.
test/object-store/sync/app.cpp
Outdated
@@ -3250,6 +3151,98 @@ TEST_CASE("app: sync integration", "[sync][pbs][app][baas]") { | |||
} | |||
} | |||
|
|||
#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this test isn't used in this commit?
edit: nm, i see this gets enabled in a later commit.
std::unique_ptr<sync::WebSocketInterface> connect(std::unique_ptr<sync::WebSocketObserver> observer, | ||
sync::WebSocketEndpoint&& endpoint) override | ||
{ | ||
sync::WebSocketEndpoint ep{std::move(endpoint)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the compiler probably optimizes this away, but you could also just rename the parameter endpoint
to ep
if you want to call it ep
throughout the function rather than move-constructing a new endpoint.
test/object-store/sync/app.cpp
Outdated
SyncTestFile r_config(user, partition, schema); | ||
|
||
std::vector<SharedRealm> realms; | ||
int i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason why i
is declared outside of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking to use the same var for both loops, but it's probably more efficient to declare it in each loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there's only one loop that uses a non-range-based for loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah - there used to be two until I updated the second loop.
// to make sure the websocket URL is correct | ||
if (!m_server_endpoint.is_verified) { | ||
SessionErrorInfo error_info( | ||
{ErrorCodes::SyncConnectFailed, util::format("Failed to connect to sync: %1", msg)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we typically get this error? I think ErrorCodes::ConnectionClosed
is could be the more accurate error code but RefreshLocation
is definitely the right action to take.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used here, when the websocket handshake response isn't one of the expected status codes (e.g. 404).
https://github.com/realm/realm-core/blob/mwb/location-failure/src/realm/sync/network/default_socket.cpp#L122
Should I just update this to always refresh the location, regardless of whether or not the sync route has been verified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. So it really is a fatal error during connection rather than a fatal error just somewhere along the lifetime of the websocket. What if we make the error code SyncConnectFailed both here and on line 597 then.
SessionErrorInfo error_info( | ||
{ErrorCodes::SyncConnectFailed, util::format("Failed to connect to sync: %1", msg)}, | ||
IsFatal{m_server_endpoint.is_verified}); | ||
ConnectionTerminationReason reason = ConnectionTerminationReason::http_response_says_fatal_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the correct reason if the errors is not fatal (i.e, is_verified = false)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - if the location hasn't been updated since the App
was created, then we want this to not be a fatal error and perform a location update to try to request the correct sync route from the base URL address. Until the location info has been requested successfully, this error will continue to be treated as a nonfatal error and will retry to update the location info using the normal backoff delay.
Until the location info is verified, we're not sure if we got the failure because the URL was generated incorrectly or if the server is actually down/misconfigured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the comment to make it more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. My comment was mainly about http_response_says_fatal_error
being used even if the error is non-fatal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason
value is being updated to connect_operation_failed
on line 596 if the sync route has not been verified (i.e. not fatal).
* master: update release note prepare v14.4.0 sets not allowed at storage level inside mixed (#7502) 🔄 Synced file(s) with realm/ci-actions (#7481) RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection (#7469) RCORE-2008 Bump baas version (#7499) RCORE-2027: Setting log callback should not override existing log level hierarchy (#7494) Derive correct ubuntu version on linuxmint (#7471) Include nested path in 'OutOfBounds' error message (#7489)
* master: update release note prepare v14.4.0 sets not allowed at storage level inside mixed (#7502) 🔄 Synced file(s) with realm/ci-actions (#7481) RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection (#7469) RCORE-2008 Bump baas version (#7499) RCORE-2027: Setting log callback should not override existing log level hierarchy (#7494) Derive correct ubuntu version on linuxmint (#7471) Include nested path in 'OutOfBounds' error message (#7489) fix depth for nested collections set to 4 in debug mode (#7486) Set the minimum buffer size in Group::write() to be equal to the page size (#7492) Update the parent's content version when bumping it in a nested collection (#7470) release notes core v14.3.0 (#7482) RCORE-2007 Added Resumption delay configuration to SyncClientTimeouts (#7441) Improve performance of aggregate operations on empty dictionaries (#7418)
What, How & Why?
Updated the sync_route provided to then
SyncManager
so it receives a generated initial value that should be correct most of the time. If the websocket fails to connect using this value, the Session will force a location update attempt (via updating the access token) to request the proper websocket address using the current base URL. If that fails, the Session will use the current connection reconnect timer to try again after the usual holdoff.A "verified" flag was added to determine if the websocket address was either provided via a location request or has been successfully connected to prevent excessively querying the location endpoint.
In addition, when the sync route is updated via a call to
SyncManager::set_sync_route()
, this function will also restart the existing sessions to ensure they are using the updated sync route value.Fixes #7349
☑️ ToDos
[ ] C-API, if public C++ API changed[ ]bindgen/spec.yml
, if public C++ API changed