
Use a single DB instance rather than two for synchronized Realms #4839

Merged (6 commits, Aug 30, 2021)

Conversation


@tgoyne tgoyne commented Aug 5, 2021

Similar to #4384, but a somewhat different approach (and up to date). This generally simplifies things, deletes some code, cuts the file descriptors and address space used in half, and should lay the foundation for some more architectural simplifications.

A rundown of the changes here:

  1. RealmCoordinator::set_transaction_callback() was there specifically for the global notifier and has never been used for anything else, so I deleted it. This turned out to not actually simplify anything much because the complexity it introduced got resolved in other ways, but it's still dead code.

  2. SyncManager::get_session() now takes a DB instead of a path, and RealmCoordinator now always opens the DB before creating the SyncSession. This is sort of a breaking change, but it appears that no one was calling get_session() from outside ObjectStore anymore anyway; it was originally there for async open in Java, but we pushed that into ObjectStore a while ago.

  3. We now don't know if a History object will be the sync agent until after we open the DB, so I reworked the whole flow around that to make SyncSession try to claim the sync agent slot when it's created. This ended up simplifying things quite a bit as it cut out a lot of layers.

  4. DB now (optionally) owns the History object used to open it, because SyncSession can outlive the RealmCoordinator. It can still take a non-owning reference, to avoid having to rewrite all the tests.

  5. The sync session now never opens the file, so the cache of files it opened is gone, along with all of the code related to passing an encryption key to the sync session.

  6. ClientHistoryImpl was aggressively thread-unsafe as it was designed around the pre-core-6 design of history objects being thread-confined. It now adopts the rule that m_group, m_arrays, and everything derived from those are guarded by the write lock. The functions which are called on the sync worker thread without holding the write lock (get_status(), find_uploadable_changesets()) construct local Array accessors rather than going through set_group() to avoid stomping on member variables. This doesn't appear to have performance implications.

  7. SessionWrapperQueue::clear() (and thus its destructor) has always been broken and leaked all but the first object in it. This was never hit until now because we happened to always call pop_front() in a loop first and never destroyed a non-empty queue. While I was fixing this I went ahead and just turned it into a stack, because that made the implementation simpler and the order doesn't matter (see the first sketch after this description).

  8. There was a data race on m_mapping_version: it's written to while holding a lock, but read from multiple threads without holding a lock. Stale reads are fine here, so I changed it to an atomic rather than adding more locking (see the second sketch after this description).

  9. This deletes a handful of tests that are no longer applicable:

    • "sync: encrypt local realm file", Sync_EncryptClientRealmFiles: sync sessions no longer open the file so we no longer need to verify that it uses the encryption key
    • ServerHistory_MaxOneOwnedByServer: The one-sync-agent-per-file restriction is now one sync session rather than one sync History. There are other tests which verify the remaining restriction.
    • Sync_ClientFileIdentSpoofing, Sync_DisabledSession: These tests should have been deleted in https://github.com/realm/realm-sync/pull/993 but for some reason were merely disabled instead (even though there's a comment saying they should be deleted...).
    • Sync_ManySessions: This tested the client file cache, which no longer exists.

    No new tests are added to replace these as I'm not sure what, if anything, related to this would need to be tested which isn't already tested. All of the changes to the existing sync tests are just updating them to the new API for creating a session.

  10. Some of the App tests failed inconsistently when running with tsan enabled (which makes everything a lot slower and isn't currently run on CI). The culprit appeared to be cleanup code in the tests which replicated what TestSyncManager already does, except incorrectly. Cleaning that up turned into refactoring the whole file to eliminate the vast quantities of duplicated code.

All ObjectStore and Cocoa sync tests pass for me locally with these changes under ThreadSanitizer. The sync test suite currently doesn't compile because there are a lot of places to update to pass in the DB when constructing a session (and making the fixture construct a new DB would fail to test the changes). There's also some remaining cleanup work to do if this design is what we want to move forward with.
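
To make items 7 and 8 more concrete, here are two hedged sketches. First, a minimal illustration of the clear()/destructor leak described in item 7, using a hypothetical intrusive singly-linked queue rather than the actual SessionWrapperQueue code: a clear() that deletes only the head node leaks everything behind it, while the stack-shaped replacement frees every node.

```cpp
#include <cassert>

struct Node {
    int value = 0;
    Node* next = nullptr;
};

// Hypothetical queue with the kind of bug described in item 7:
// clear() frees only the first node and leaks the rest.
struct LeakyQueue {
    Node* m_head = nullptr;

    void push_back(Node* n) {
        n->next = nullptr;
        if (!m_head) {
            m_head = n;
            return;
        }
        Node* tail = m_head;
        while (tail->next)
            tail = tail->next;
        tail->next = n;
    }

    void clear() {
        delete m_head; // BUG: every node after m_head is leaked
        m_head = nullptr;
    }

    ~LeakyQueue() { clear(); }
};

// Stack-shaped replacement: simpler, and clear() frees every node.
struct NodeStack {
    Node* m_top = nullptr;

    void push(Node* n) {
        n->next = m_top;
        m_top = n;
    }

    Node* pop() {
        Node* n = m_top;
        if (n)
            m_top = n->next;
        return n;
    }

    void clear() {
        while (Node* n = pop())
            delete n;
    }

    ~NodeStack() { clear(); }
};

int main() {
    NodeStack s;
    s.push(new Node{1});
    s.push(new Node{2});
    s.clear(); // frees both nodes
    assert(s.m_top == nullptr);
}
```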
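
Second, a minimal sketch of the item 8 fix, again with hypothetical names rather than the real realm code: a version counter that is written under a mutex but also read from other threads without the lock becomes a std::atomic, so unlocked readers get a stale-but-valid value instead of a data race.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

class MappingState {
public:
    // Writer side: still serialized by the mutex guarding the rest of
    // the mapping state; bumping the version is now an atomic store.
    void remap() {
        std::lock_guard<std::mutex> lock(m_mutex);
        // ... rebuild the mapping ...
        m_mapping_version.fetch_add(1, std::memory_order_release);
    }

    // Reader side: called from other threads without taking the lock.
    // A stale value is acceptable; the atomic just removes the race.
    bool changed_since(std::uint64_t last_seen) const {
        return m_mapping_version.load(std::memory_order_acquire) != last_seen;
    }

private:
    std::mutex m_mutex;
    std::atomic<std::uint64_t> m_mapping_version{0};
};
```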

@tgoyne tgoyne self-assigned this Aug 5, 2021
@jbreams jbreams self-requested a review August 6, 2021 02:16

jbreams commented Aug 10, 2021

@tgoyne, this is still marked as a draft and doesn't currently compile; are you still working on getting it ready for review?


tgoyne commented Aug 10, 2021

Yes. I have the sync tests building and mostly passing locally but haven't pushed it because one of the tests is hanging forever.

@tgoyne tgoyne force-pushed the tg/sync-session-db branch 6 times, most recently from a79de16 to 364065a on August 12, 2021 22:50
@tgoyne tgoyne marked this pull request as ready for review August 17, 2021 02:49
@tgoyne tgoyne requested a review from finnschiermer August 17, 2021 02:49

@jbreams jbreams left a comment


First pass through with just some nits; this is definitely a huge improvement in many places. I especially appreciate the refactor of test/object-store/app.cpp 🙌. Since this PR is so big and touches so many places, I'm going to go through it again tomorrow to make sure I haven't missed anything, but it's a really good effort overall.

Resolved review threads on: CHANGELOG.md, src/realm/db.cpp, src/realm/db.hpp, src/realm/sync/noinst/client_history_impl.hpp (two threads), test/object-store/sync/app.cpp

jbreams commented Aug 25, 2021

@finnschiermer, did you get a chance to look at this yet?

Resolved review thread on: src/realm/sync/noinst/client_reset.cpp

@finnschiermer finnschiermer left a comment


This is really nice. :-)

@nirinchev

I'm observing peculiar behavior with this change when using a fake user. It's possible that the .NET mechanics for obtaining a user with a hardcoded access/refresh token were relying on a bug that is now fixed, but it could also point to a legitimate issue.

// GetFakeUser calls app->sync_manager()->get_user(...) with hardcoded
// refresh and access token. It doesn't perform an actual login against
// an integration server
var user = GetFakeUser();

var config = new SyncConfiguration(Guid.NewGuid().ToString(), user)
{
    ObjectClasses = new[] { typeof(CollectionsClass) },
    SessionStopPolicy = SessionStopPolicy.Immediately
};

var realm = Realm.GetInstance(config);

// Dispose calls SharedRealm::Close
realm.Dispose();

var sw = new Stopwatch();
sw.Start();
while (sw.ElapsedMilliseconds < 30_000)
{
    try
    {
        // Calls Realm::delete_files
        Realm.DeleteRealm(realm.Config);
        break;
    }
    catch
    {
        Task.Delay(50).Wait();
    }
}

Assert.That(sw.ElapsedMilliseconds, Is.LessThan(5000));

The peculiarity is that on Windows, the time between SharedRealm::Close and Core/Sync actually releasing the file so that it can be deleted is much longer than before: the test above often runs for more than 30 seconds because Realm::delete_files throws a "realm is in use" error. Occasionally it completes faster, which implies there might be a race between establishing the session and terminating it when the Realm gets closed, but the changes in this PR go way over my head, so I don't have any evidence to prove it. Since the session stop policy is set to Immediately, the expectation is that the file would be released immediately, or at least shortly after the dispose, but that doesn't appear to be the case.

Two other observations on the above:

  1. If we replace the hardcoded user with an actual integration user, the test passes.
  2. It appears that the schema may have an effect on the outcome of the test, as using a less complex class (e.g. one with 1-2 properties) doesn't reproduce the file lock issue. I haven't been able to determine if there's a particular property type or schema size that triggers the issue.

I pushed the code for this test to realm/realm-dotnet#2589 and would be happy to walk someone through the .NET code or help someone on the Core team get the .NET unit tests running so they can repro and debug the issue locally.


tgoyne commented Aug 30, 2021

There is an expected change in when exactly the Realm file will be closed. Session shutdown has two separate steps: the session is deactivated, and then torn down. Previously the file was closed when the session was deactivated and then reopened if the session was reactivated later. With these changes, the session never opens and closes the DB and so holds onto a reference until teardown.

What this points at is a pre-existing problem: for whatever reason the session is staying in the deactivated-but-not-yet-torn-down state for an extended period of time, which previously wasn't causing problems but now is.
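
A minimal sketch of the lifetime difference being described, using hypothetical stand-ins rather than the real SyncSession/DB types: once the session keeps a shared reference to the database for its whole lifetime, the file stays open until the session object is torn down, not merely deactivated.

```cpp
#include <memory>

// Hypothetical stand-ins; not the actual realm::DB / SyncSession API.
struct Database {
    // Holding any shared_ptr<Database> keeps the underlying file open.
};
using DBRef = std::shared_ptr<Database>;

class Session {
public:
    explicit Session(DBRef db) : m_db(std::move(db)) {}

    // Old behavior (roughly): deactivation dropped the file handle and
    // the session reopened the file if it was reactivated later.
    void deactivate_old() { m_db.reset(); }

    // New behavior: deactivation stops sync activity but keeps the DB
    // reference; the file is only released when the Session itself is
    // destroyed (torn down).
    void deactivate_new() { m_active = false; }

private:
    DBRef m_db;
    bool m_active = true;
};
```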


tgoyne commented Aug 30, 2021

SessionStopPolicy.Immediately is perhaps a misleading name, as stopping the session is still not synchronous. A hypothesis about what's happening here:

  1. Realm opens, initializes schema locally etc. and creates the sync session
  2. Sync session starts uploading the schema init instructions
  3. Realm is closed. This drops the external ref to the session and tells the session to stop. That just pushes a task onto the sync worker thread's task queue.
  4. Sync session continues to perform its already-queued work uploading the schema init, which for some reason takes 30 seconds?
  5. That finally completes, and the worker thread runs the task which closes and tears down the sync session
  6. Delete finally completes

The race condition here is that 2 and 3 are happening simultaneously. If the Realm is closed before the sync session actually connects to the server, then whatever long-running thing is happening on the worker thread never gets a chance to run and everything closes quickly. Perhaps using a real user makes the connection process take a bit longer, so the Realm more often gets closed before the session connects?

If this is due to the sync worker thread being busy processing something, then the next question is what exactly it's spending a long time processing. Is it possibly getting a large DOWNLOAD message from the server? If the test is using CPU time during the 30 second wait then throwing a profiler at it should answer that question easily. If it's not using CPU time then maybe the server is sending a message oddly slowly or something?
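
A minimal sketch of the queuing behavior in step 3, with hypothetical names (this is not the sync client's actual event loop): "stop" is just another task pushed onto the worker's queue, so it only runs after whatever long-running work was already queued ahead of it.

```cpp
#include <deque>
#include <functional>
#include <iostream>

// Hypothetical single-threaded task queue standing in for the sync
// worker thread's event loop.
class WorkerQueue {
public:
    void post(std::function<void()> task) {
        m_tasks.push_back(std::move(task));
    }

    void drain() {
        while (!m_tasks.empty()) {
            auto task = std::move(m_tasks.front());
            m_tasks.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> m_tasks;
};

int main() {
    WorkerQueue worker;
    // Step 2: long-running work already on the queue (e.g. uploading
    // the schema-init changesets).
    worker.post([] { std::cout << "long-running upload...\n"; });
    // Step 3: closing the Realm just posts a stop task; it does not
    // preempt the work queued ahead of it.
    worker.post([] { std::cout << "tear down session, release the file\n"; });
    worker.drain(); // the stop task runs only after the upload finishes
}
```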


tgoyne commented Aug 30, 2021

Depending on which step it's spending all that time on, it may even be happening before the point where the old code would have first opened the Realm on the sync thread.

@nirinchev

The interesting part here is that this test is using a bogus user: GetFakeUser calls app->sync_manager()->get_user(...) with some hardcoded refresh and access tokens, so a sync connection is never established with the server (which may or may not even exist; on some platforms we're testing against a bogus server and on others against the integration docker image). @jbreams's hypothesis is that the sync session may be stuck in the token refresh loop here:

std::this_thread::sleep_for(milliseconds(10000));
if (session_user) {
    session_user->refresh_custom_data(handle_refresh(session));
}

I'll try to verify if that is the case and post an update later today.


tgoyne commented Aug 30, 2021

Oh, it'd make sense if the token refresh handler holds a strong reference to the session while the refresh is in progress: if you close the Realm before that starts then everything is fine, but otherwise closing the Realm won't close the sync session until the refresh handler is no longer holding a ref.
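
A minimal sketch of that lifetime-extension pattern, with hypothetical types rather than the actual SyncSession/SyncUser code: a pending callback that captures a shared_ptr keeps the session alive even after the last external reference has been dropped.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Hypothetical stand-in for the sync session.
struct Session {
    ~Session() { std::cout << "session destroyed, file released\n"; }
};

int main() {
    auto session = std::make_shared<Session>();

    // A refresh handler that captures the session by shared_ptr, the way
    // a token-refresh completion callback might.
    std::function<void()> pending_refresh = [session] {
        std::cout << "refresh completed, session alive: "
                  << std::boolalpha << (session != nullptr) << "\n";
    };

    // Closing the Realm drops the external reference...
    session.reset();
    std::cout << "external ref dropped; session still alive\n";

    // ...but the session only goes away once the pending callback has run
    // and been released.
    pending_refresh();
    pending_refresh = nullptr; // captured shared_ptr destroyed here
}
```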


jbreams commented Aug 30, 2021

I'm trying to reproduce this in a C++ integration test now. Will update if/when I know more.

@tgoyne tgoyne force-pushed the tg/sync-session-db branch from 019b517 to 029186c on August 30, 2021 19:30
@tgoyne tgoyne force-pushed the tg/sync-session-db branch from 029186c to 7b2d101 on August 30, 2021 19:39
@nirinchev

Hey, sorry for the red herring. I was able to track that down to the .NET sync error handling code extending the lifetime of the object store's sync session. The race outcome was determined by whether the sync session managed to contact the server quickly enough to trigger an error that needed to be reported to the SDK. My best guess for why this bug was exposed by these changes is that removing the need to open the file reduced the sync session bootstrap time, so it became faster to contact the server and get an error than before.

I've fixed the .NET issue and now all sync tests are passing.


tgoyne commented Aug 30, 2021

Ah yeah, that makes sense. I think the faster initial connection may also have been why some of the object store App tests previously were passing consistently but started failing sometimes with these changes.

m_remote_versions = std::move(remote_versions);
m_origin_file_idents = std::move(origin_file_idents);
m_origin_timestamps = std::move(origin_timestamps);
m_arrays.emplace(m_db->get_alloc(), *m_group, ref);
Collaborator

@tgoyne This change assumes there is always a group, but that was not the case before. We have a crash for this exact reason (#7041), and I'm trying to figure out if this change introduced a bug, or if it existed before (and we perhaps don't call set_group() accordingly).

Member Author

set_group() is called by Replication::initiate_transact() as part of beginning a write transaction, and prepare_for_write() can only be called inside a write transaction. m_group being nullptr here suggests that a function which writes to the Realm is being called without the DB having an active write transaction.

Contributor

update_from_ref_and_version is called from internal_advance_read after the write lock is acquired but before initiate_transact in the linked issue #7041. The code in Transaction was doing the same thing back then. Was that also taken into account somehow? From what I can tell, this may fail if internal_advance_read actually moves to the latest version available.
