RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection #7469

michael-wb · 2024-03-12T21:20:56Z

What, How & Why?

Updated the sync_route provided to then SyncManager so it receives a generated initial value that should be correct most of the time. If the websocket fails to connect using this value, the Session will force a location update attempt (via updating the access token) to request the proper websocket address using the current base URL. If that fails, the Session will use the current connection reconnect timer to try again after the usual holdoff.

A "verified" flag was added to determine if the websocket address was either provided via a location request or has been successfully connected to prevent excessively querying the location endpoint.

In addition, when the sync route is updated via a call to SyncManager::set_sync_route(), this function will also restart the existing sessions to ensure they are using the updated sync route value.

Fixes #7349

☑️ ToDos

📝 Changelog update
🚦 Tests (or not relevant)
~~[ ] C-API, if public C++ API changed~~
~~[ ] bindgen/spec.yml, if public C++ API changed~~

…and querying on first connect attempt

…n-failure

coveralls-official · 2024-03-12T22:08:41Z

Pull Request Test Coverage Report for Build michael.wilkersonbarker_1007

Details

550 of 573 (95.99%) changed or added relevant lines in 16 files are covered.
42 unchanged lines in 13 files lost coverage.
Overall coverage increased (+0.03%) to 91.848%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
test/object-store/util/test_file.cpp	37	42	88.1%
src/realm/object-store/sync/sync_manager.cpp	9	15	60.0%
src/realm/object-store/sync/sync_session.cpp	45	51	88.24%
src/realm/sync/noinst/client_impl_base.cpp	19	25	76.0%

Files with Coverage Reduction	New Missed Lines	%
test/object-store/util/test_file.cpp	1	89.22%
test/test_dictionary.cpp	1	99.85%
test/test_index_string.cpp	1	94.63%
src/realm/array_string.cpp	2	85.25%
src/realm/table_view.cpp	2	94.18%
src/realm/uuid.cpp	2	97.01%
src/realm/sync/noinst/client_impl_base.cpp	3	85.84%
src/realm/sync/noinst/protocol_codec.hpp	3	76.14%
src/realm/sync/noinst/server/server.cpp	3	76.8%
src/realm/sync/instructions.hpp	4	76.03%

Totals
Change from base Build 2147:	0.03%
Covered Lines:	243208
Relevant Lines:	264793

💛 - Coveralls

…n-failure

…ified flag to prevent updating location too much

…n-failure

…ssions param to set_sync_route()

…n-failure

src/realm/object-store/sync/app.cpp

src/realm/object-store/sync/sync_manager.cpp

src/realm/object-store/sync/sync_session.hpp

src/realm.h

src/realm/object-store/sync/app.cpp

src/realm/object-store/sync/sync_manager.hpp

src/realm/sync/noinst/client_impl_base.cpp

test/object-store/sync/app.cpp

…n-failure

src/realm/object-store/sync/app.cpp

jbreams · 2024-03-16T23:45:28Z

test/object-store/util/test_utils.hpp

+            std::unique_lock lock{m_mutex};
+            E cur_state = m_cur_state;
+            lock.unlock();
+            std::optional<E> new_state = func(cur_state);


The point of this function was to guarantee that you were updating the state of the state machine while holding the mutex so nobody else could change the state while you were evaluating whether to move to a new state. So this change indicates to me that there's something broken in one of your tests. What kind of deadlock were you hitting that led you to making this change?

It was a TSAN failure around checking the base URL inside the transition_with() func handler. I reverted this code and updated the test.

… common use

…n-failure

jbreams · 2024-03-19T19:21:42Z

test/object-store/sync/app.cpp

@@ -3250,6 +3151,98 @@ TEST_CASE("app: sync integration", "[sync][pbs][app][baas]") {
        }
    }

+#if 0


It looks like this test isn't used in this commit?

edit: nm, i see this gets enabled in a later commit.

jbreams · 2024-03-19T19:36:11Z

test/object-store/util/sync/sync_test_utils.hpp

+    std::unique_ptr<sync::WebSocketInterface> connect(std::unique_ptr<sync::WebSocketObserver> observer,
+                                                      sync::WebSocketEndpoint&& endpoint) override
+    {
+        sync::WebSocketEndpoint ep{std::move(endpoint)};


the compiler probably optimizes this away, but you could also just rename the parameter endpoint to ep if you want to call it ep throughout the function rather than move-constructing a new endpoint.

jbreams · 2024-03-19T20:00:46Z

test/object-store/sync/app.cpp

+        SyncTestFile r_config(user, partition, schema);
+
+        std::vector<SharedRealm> realms;
+        int i;


any reason why i is declared outside of the loop?

I was thinking to use the same var for both loops, but it's probably more efficient to declare it in each loop

But there's only one loop that uses a non-range-based for loop?

Oh yeah - there used to be two until I updated the second loop.

jbreams · 2024-03-19T20:10:54Z

src/realm/sync/noinst/client_impl_base.cpp

+            // to make sure the websocket URL is correct
+            if (!m_server_endpoint.is_verified) {
+                SessionErrorInfo error_info(
+                    {ErrorCodes::SyncConnectFailed, util::format("Failed to connect to sync: %1", msg)},


When do we typically get this error? I think ErrorCodes::ConnectionClosed is could be the more accurate error code but RefreshLocation is definitely the right action to take.

It is used here, when the websocket handshake response isn't one of the expected status codes (e.g. 404).
https://github.com/realm/realm-core/blob/mwb/location-failure/src/realm/sync/network/default_socket.cpp#L122

Should I just update this to always refresh the location, regardless of whether or not the sync route has been verified?

Ah, okay. So it really is a fatal error during connection rather than a fatal error just somewhere along the lifetime of the websocket. What if we make the error code SyncConnectFailed both here and on line 597 then.

…n-failure

CHANGELOG.md

danieltabacaru · 2024-03-19T23:10:27Z

src/realm/sync/noinst/client_impl_base.cpp

+            SessionErrorInfo error_info(
+                {ErrorCodes::SyncConnectFailed, util::format("Failed to connect to sync: %1", msg)},
+                IsFatal{m_server_endpoint.is_verified});
+            ConnectionTerminationReason reason = ConnectionTerminationReason::http_response_says_fatal_error;


is this the correct reason if the errors is not fatal (i.e, is_verified = false)?

Yes - if the location hasn't been updated since the App was created, then we want this to not be a fatal error and perform a location update to try to request the correct sync route from the base URL address. Until the location info has been requested successfully, this error will continue to be treated as a nonfatal error and will retry to update the location info using the normal backoff delay.

Until the location info is verified, we're not sure if we got the failure because the URL was generated incorrectly or if the server is actually down/misconfigured.

I updated the comment to make it more clear

I see. My comment was mainly about http_response_says_fatal_error being used even if the error is non-fatal.

The reason value is being updated to connect_operation_failed on line 596 if the sync route has not been verified (i.e. not fatal).

…n-failure

* master: update release note prepare v14.4.0 sets not allowed at storage level inside mixed (#7502) 🔄 Synced file(s) with realm/ci-actions (#7481) RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection (#7469) RCORE-2008 Bump baas version (#7499) RCORE-2027: Setting log callback should not override existing log level hierarchy (#7494) Derive correct ubuntu version on linuxmint (#7471) Include nested path in 'OutOfBounds' error message (#7489)

* master: update release note prepare v14.4.0 sets not allowed at storage level inside mixed (#7502) 🔄 Synced file(s) with realm/ci-actions (#7481) RCORE-1982 Opening realm with cached user while offline results in fatal error and session does not retry connection (#7469) RCORE-2008 Bump baas version (#7499) RCORE-2027: Setting log callback should not override existing log level hierarchy (#7494) Derive correct ubuntu version on linuxmint (#7471) Include nested path in 'OutOfBounds' error message (#7489) fix depth for nested collections set to 4 in debug mode (#7486) Set the minimum buffer size in Group::write() to be equal to the page size (#7492) Update the parent's content version when bumping it in a nested collection (#7470) release notes core v14.3.0 (#7482) RCORE-2007 Added Resumption delay configuration to SyncClientTimeouts (#7441) Improve performance of aggregate operations on empty dictionaries (#7418)

Michael Wilkerson-Barker added 2 commits March 12, 2024 17:11

Moved to using a pre-initialized sync-route instead of leaving empty …

ff43531

…and querying on first connect attempt

Merge branch 'master' of github.com:realm/realm-core into mwb/locatio…

ee21683

…n-failure

michael-wb self-assigned this Mar 12, 2024

cla-bot bot added the cla: yes label Mar 12, 2024

Michael Wilkerson-Barker added 3 commits March 12, 2024 23:23

Merge branch 'master' of github.com:realm/realm-core into mwb/locatio…

a0fdcc7

…n-failure

Added base url update logic when session connection fails - added ver…

82f9e49

…ified flag to prevent updating location too much

Merge branch 'master' of github.com:realm/realm-core into mwb/locatio…

ad7d068

…n-failure

michael-wb requested review from jbreams, tgoyne and danieltabacaru March 14, 2024 13:12

michael-wb marked this pull request as ready for review March 14, 2024 13:12

Michael Wilkerson-Barker added 4 commits March 14, 2024 10:44

Updated changelog; added default_base_url for C_API; added restart_se…

cdda1d4

…ssions param to set_sync_route()

Fixed c_api compile error

5706e02

Silly uninitialized variable

3fad39c

Merge branch 'master' of github.com:realm/realm-core into mwb/locatio…

bcdb8f5

…n-failure

tgoyne reviewed Mar 14, 2024

View reviewed changes

src/realm/object-store/sync/app.cpp Outdated Show resolved Hide resolved

src/realm/object-store/sync/sync_manager.cpp Outdated Show resolved Hide resolved

src/realm/object-store/sync/sync_session.hpp Show resolved Hide resolved

Michael Wilkerson-Barker added 2 commits March 14, 2024 14:35

Updates from review; fixed deadlock in test

e4fe219

Updated test

8431a1d

jbreams requested a review from tgoyne March 14, 2024 21:18

Changed default base url to function in CAPI

963ae15