
Use a single DB instance rather than two for synchronized Realms #4839

Merged (6 commits, Aug 30, 2021)

Conversation


@tgoyne tgoyne commented Aug 5, 2021

Similar to #4384, but a somewhat different approach (and up to date). This generally simplifies things, deletes some code, cuts the file descriptors and address space used in half, and should lay the foundation for some more architectural simplifications.

A rundown of the changes here:

  1. RealmCoordinator::set_transaction_callback() was there specifically for the global notifier and has never been used for anything else, so I deleted it. This turned out to not actually simplify anything much because the complexity it introduced got resolved in other ways, but it's still dead code.

  2. SyncManager::get_session() now takes a DB instead of a path, and RealmCoordinator now always opens the DB before creating the SyncSession. This is sort of a breaking change, but it appears that no one was calling get_session() from outside ObjectStore anymore anyway; it was originally there for async open in Java, but we pushed that into ObjectStore a while ago.

  3. We now don't know if a History object will be the sync agent until after we open the DB, so I reworked the whole flow around that to make SyncSession try to claim the sync agent slot when it's created. This ended up simplifying things quite a bit as it cut out a lot of layers.

  4. DB now (optionally) owns the History object used to open it, because SyncSession can outlive the RealmCoordinator. It can still take a non-owning reference, to avoid having to rewrite all the tests.

  5. The sync session now never opens the file, so the cache of files it opened is gone, along with all of the code related to passing an encryption key to the sync session.

  6. ClientHistoryImpl was aggressively thread-unsafe as it was designed around the pre-core-6 design of history objects being thread-confined. It now adopts the rule that m_group, m_arrays, and everything derived from those are guarded by the write lock. The functions which are called on the sync worker thread without holding the write lock (get_status(), find_uploadable_changesets()) construct local Array accessors rather than going through set_group() to avoid stomping on member variables. This doesn't appear to have performance implications.

  7. SessionWrapperQueue::clear() (and thus its destructor) has always been broken and leaked all but the first object in it. This was never hit until now because we happened to always call pop_front() in a loop first and never destroyed a non-empty queue. While I was fixing this I went ahead and just turned it into a stack, because that made the implementation simpler and the order doesn't matter (see the first sketch after this description).

  8. There was a data race on m_mapping_version: it's written to while holding a lock, but read from multiple threads without holding a lock. Stale reads are fine here, so I changed it to an atomic rather than adding more locking (see the second sketch after this description).

  9. This deletes a handful of tests that are no longer applicable:

    • "sync: encrypt local realm file", Sync_EncryptClientRealmFiles: sync sessions no longer open the file so we no longer need to verify that it uses the encryption key
    • ServerHistory_MaxOneOwnedByServer: The one-sync-agent-per-file restriction is now one sync session rather than one sync History. There are other tests which verify the remaining restriction.
    • Sync_ClientFileIdentSpoofing, Sync_DisabledSession: These tests should have been deleted in https://github.com/realm/realm-sync/pull/993 but for some reason were merely disabled instead (even though there's a comment saying they should be deleted...).
    • Sync_ManySessions: This tested the client file cache, which no longer exists.

    No new tests are added to replace these as I'm not sure what, if anything, related to this would need to be tested which isn't already tested. All of the changes to the existing sync tests are just updating them to the new API for creating a session.

  10. Some of the App tests failed inconsistently when running with tsan enabled (which makes everything a lot slower and isn't currently run on CI). The culprit appeared to be cleanup code in the tests which replicated what TestSyncManager already does, except incorrectly. Cleaning that up turned into refactoring the whole file to eliminate the vast quantities of duplicated code.

All ObjectStore and Cocoa sync tests pass for me locally with these changes under ThreadSanitizer. The sync test suite currently doesn't compile because there are a lot of places to update to pass in the DB when constructing a session (and making the fixture construct a new DB would fail to test the changes). There's also some remaining cleanup work to do if this design is what we want to move forward with.
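
To make items 7 and 8 more concrete, here are two hedged sketches. First, a minimal illustration of the clear()/destructor leak described in item 7, using a hypothetical intrusive singly-linked queue rather than the actual SessionWrapperQueue code: a clear() that deletes only the head node leaks everything behind it, while the stack-shaped replacement frees every node.

```cpp
#include <cassert>

struct Node {
    int value = 0;
    Node* next = nullptr;
};

// Hypothetical queue with the kind of bug described in item 7:
// clear() frees only the first node and leaks the rest.
struct LeakyQueue {
    Node* m_head = nullptr;

    void push_back(Node* n) {
        n->next = nullptr;
        if (!m_head) {
            m_head = n;
            return;
        }
        Node* tail = m_head;
        while (tail->next)
            tail = tail->next;
        tail->next = n;
    }

    void clear() {
        delete m_head; // BUG: every node after m_head is leaked
        m_head = nullptr;
    }

    ~LeakyQueue() { clear(); }
};

// Stack-shaped replacement: simpler, and clear() frees every node.
struct NodeStack {
    Node* m_top = nullptr;

    void push(Node* n) {
        n->next = m_top;
        m_top = n;
    }

    Node* pop() {
        Node* n = m_top;
        if (n)
            m_top = n->next;
        return n;
    }

    void clear() {
        while (Node* n = pop())
            delete n;
    }

    ~NodeStack() { clear(); }
};

int main() {
    NodeStack s;
    s.push(new Node{1});
    s.push(new Node{2});
    s.clear(); // frees both nodes
    assert(s.m_top == nullptr);
}
```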
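
Second, a minimal sketch of the item 8 fix, again with hypothetical names rather than the real realm code: a version counter that is written under a mutex but also read from other threads without the lock becomes a std::atomic, so unlocked readers get a stale-but-valid value instead of a data race.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

class MappingState {
public:
    // Writer side: still serialized by the mutex guarding the rest of
    // the mapping state; bumping the version is now an atomic store.
    void remap() {
        std::lock_guard<std::mutex> lock(m_mutex);
        // ... rebuild the mapping ...
        m_mapping_version.fetch_add(1, std::memory_order_release);
    }

    // Reader side: called from other threads without taking the lock.
    // A stale value is acceptable; the atomic just removes the race.
    bool changed_since(std::uint64_t last_seen) const {
        return m_mapping_version.load(std::memory_order_acquire) != last_seen;
    }

private:
    std::mutex m_mutex;
    std::atomic<std::uint64_t> m_mapping_version{0};
};
```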

@tgoyne tgoyne self-assigned this Aug 5, 2021
@jbreams jbreams self-requested a review August 6, 2021 02:16

jbreams commented Aug 10, 2021

@tgoyne, this is still marked as a draft and doesn't currently compile; are you still working on getting it ready for review?


tgoyne commented Aug 10, 2021

Yes. I have the sync tests building and mostly passing locally but haven't pushed it because one of the tests is hanging forever.

@tgoyne tgoyne force-pushed the tg/sync-session-db branch 6 times, most recently from a79de16 to 364065a on August 12, 2021 22:50
@tgoyne tgoyne marked this pull request as ready for review August 17, 2021 02:49
@tgoyne tgoyne requested a review from finnschiermer August 17, 2021 02:49

@jbreams jbreams left a comment


First pass through with just some nits; this is definitely a huge improvement in many places. I especially appreciate the refactor of test/object-store/app.cpp 🙌. Since this PR is so big and touches so many places, I'm going to go through it again tomorrow to make sure I haven't missed anything, but it's a really good effort overall.

Resolved review threads on: CHANGELOG.md, src/realm/db.cpp, src/realm/db.hpp, src/realm/sync/noinst/client_history_impl.hpp (two threads), test/object-store/sync/app.cpp

jbreams commented Aug 25, 2021

@finnschiermer, did you get a chance to look at this yet?

Resolved review thread on: src/realm/sync/noinst/client_reset.cpp

@finnschiermer finnschiermer left a comment


This is really nice. :-)

@nirinchev

I'm observing peculiar behavior with this change when using a fake user. It's possible that the .NET mechanics for obtaining a user with a hardcoded access/refresh token were relying on a bug that is now fixed, but it could also point to a legitimate issue.

// GetFakeUser calls app->sync_manager()->get_user(...) with hardcoded
// refresh and access token. It doesn't perform an actual login against
// an integration server
var user = GetFakeUser();

var config = new SyncConfiguration(Guid.NewGuid().ToString(), user)
{
    ObjectClasses = new[] { typeof(CollectionsClass) },
    SessionStopPolicy = SessionStopPolicy.Immediately
};

var realm = Realm.GetInstance(config);

// Dispose calls SharedRealm::Close
realm.Dispose();

var sw = new Stopwatch();
sw.Start();
while (sw.ElapsedMilliseconds < 30_000)
{
    try
    {
        // Calls Realm::delete_files
        Realm.DeleteRealm(realm.Config);
        break;
    }
    catch
    {
        Task.Delay(50).Wait();
    }
}

Assert.That(sw.ElapsedMilliseconds, Is.LessThan(5000));

The peculiarity is that on Windows, the time between SharedRealm::Close and Core/Sync actually releasing the file so that it can be deleted is much longer than before: the test above often runs for more than 30 seconds because Realm::delete_files throws a "realm is in use" error. Occasionally it completes faster, which implies there might be a race between establishing the session and terminating it when the Realm gets closed, but the changes in this PR go way over my head, so I don't have any evidence to prove it. Since the session stop policy is set to Immediately, the expectation is that the file would be released immediately, or at least shortly after the dispose, but that doesn't appear to be the case.

Two other observations on the above:

  1. If we replace the hardcoded user with an actual integration user, the test passes.
  2. It appears that the schema may have an effect on the outcome of the test, as using a less complex class (e.g. one with 1-2 properties) doesn't reproduce the file lock issue. I haven't been able to determine if there's a particular property type or schema size that triggers the issue.

I pushed the code for this test to realm/realm-dotnet#2589 and would be happy to walk someone through the .NET code or help someone on the Core team get the .NET unit tests running so they can repro and debug the issue locally.


tgoyne commented Aug 30, 2021

There is an expected change in when exactly the Realm file will be closed. Session shutdown has two separate steps: the session is deactivated, and then torn down. Previously the file was closed when the session was deactivated and then reopened if the session was reactivated later. With these changes, the session never opens and closes the DB and so holds onto a reference until teardown.

What this points at is a pre-existing problem: for whatever reason the session is staying in the deactivated-but-not-yet-torn-down state for an extended period of time, which previously wasn't causing problems but now is.
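
A minimal sketch of the lifetime difference being described, using hypothetical stand-ins rather than the real SyncSession/DB types: once the session keeps a shared reference to the database for its whole lifetime, the file stays open until the session object is torn down, not merely deactivated.

```cpp
#include <memory>

// Hypothetical stand-ins; not the actual realm::DB / SyncSession API.
struct Database {
    // Holding any shared_ptr<Database> keeps the underlying file open.
};
using DBRef = std::shared_ptr<Database>;

class Session {
public:
    explicit Session(DBRef db) : m_db(std::move(db)) {}

    // Old behavior (roughly): deactivation dropped the file handle and
    // the session reopened the file if it was reactivated later.
    void deactivate_old() { m_db.reset(); }

    // New behavior: deactivation stops sync activity but keeps the DB
    // reference; the file is only released when the Session itself is
    // destroyed (torn down).
    void deactivate_new() { m_active = false; }

private:
    DBRef m_db;
    bool m_active = true;
};
```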


tgoyne commented Aug 30, 2021

SessionStopPolicy.Immediately is perhaps a misleading name, as stopping the session is still not synchronous. A hypothesis about what's happening here:

  1. Realm opens, initializes schema locally etc. and creates the sync session
  2. Sync session starts uploading the schema init instructions
  3. Realm is closed. This drops the external ref to the session and tells the session to stop. That just pushes a task onto the sync worker thread's task queue.
  4. Sync session continues to perform its already-queued work uploading the schema init, which for some reason takes 30 seconds?
  5. That finally completes, and the worker thread runs the task which closes and tears down the sync session
  6. Delete finally completes

The race condition here is that 2 and 3 are happening simultaneously. If the Realm is closed before the sync session actually connects to the server, then whatever long-running thing is happening on the worker thread never gets a chance to run and everything closes quickly. Perhaps using a real user makes the connection process take a bit longer, so the Realm more often gets closed before the session connects?

If this is due to the sync worker thread being busy processing something, then the next question is what exactly it's spending a long time processing. Is it possibly getting a large DOWNLOAD message from the server? If the test is using CPU time during the 30 second wait then throwing a profiler at it should answer that question easily. If it's not using CPU time then maybe the server is sending a message oddly slowly or something?
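
A minimal sketch of the queuing behavior in step 3, with hypothetical names (this is not the sync client's actual event loop): "stop" is just another task pushed onto the worker's queue, so it only runs after whatever long-running work was already queued ahead of it.

```cpp
#include <deque>
#include <functional>
#include <iostream>

// Hypothetical single-threaded task queue standing in for the sync
// worker thread's event loop.
class WorkerQueue {
public:
    void post(std::function<void()> task) {
        m_tasks.push_back(std::move(task));
    }

    void drain() {
        while (!m_tasks.empty()) {
            auto task = std::move(m_tasks.front());
            m_tasks.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> m_tasks;
};

int main() {
    WorkerQueue worker;
    // Step 2: long-running work already on the queue (e.g. uploading
    // the schema-init changesets).
    worker.post([] { std::cout << "long-running upload...\n"; });
    // Step 3: closing the Realm just posts a stop task; it does not
    // preempt the work queued ahead of it.
    worker.post([] { std::cout << "tear down session, release the file\n"; });
    worker.drain(); // the stop task runs only after the upload finishes
}
```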


tgoyne commented Aug 30, 2021

Depending on which step it's spending all that time on, it may even be happening before the point where the old code would have first opened the Realm on the sync thread.

@nirinchev

The interesting part here is that this test is using a bogus user: GetFakeUser calls app->sync_manager()->get_user(...) with some hardcoded refresh and access tokens, so a sync connection is never established with the server (which may or may not even exist; on some platforms we're testing against a bogus server and on others against the integration docker image). @jbreams's hypothesis is that the sync session may be stuck in the token refresh loop here:

std::this_thread::sleep_for(milliseconds(10000));
if (session_user) {
    session_user->refresh_custom_data(handle_refresh(session));
}

I'll try to verify if that is the case and post an update later today.


tgoyne commented Aug 30, 2021

Oh, it'd make sense if the token refresh handler holds a strong reference to the session while the refresh is in progress: if you close the Realm before that starts then everything is fine, but otherwise closing the Realm won't close the sync session until the refresh handler is no longer holding a ref.
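
A minimal sketch of that lifetime-extension pattern, with hypothetical types rather than the actual SyncSession/SyncUser code: a pending callback that captures a shared_ptr keeps the session alive even after the last external reference has been dropped.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Hypothetical stand-in for the sync session.
struct Session {
    ~Session() { std::cout << "session destroyed, file released\n"; }
};

int main() {
    auto session = std::make_shared<Session>();

    // A refresh handler that captures the session by shared_ptr, the way
    // a token-refresh completion callback might.
    std::function<void()> pending_refresh = [session] {
        std::cout << "refresh completed, session alive: "
                  << std::boolalpha << (session != nullptr) << "\n";
    };

    // Closing the Realm drops the external reference...
    session.reset();
    std::cout << "external ref dropped; session still alive\n";

    // ...but the session only goes away once the pending callback has run
    // and been released.
    pending_refresh();
    pending_refresh = nullptr; // captured shared_ptr destroyed here
}
```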


jbreams commented Aug 30, 2021

I'm trying to reproduce this in a C++ integration test now. Will update if/when I know more.

@tgoyne tgoyne force-pushed the tg/sync-session-db branch from 019b517 to 029186c on August 30, 2021 19:30
@tgoyne tgoyne force-pushed the tg/sync-session-db branch from 029186c to 7b2d101 on August 30, 2021 19:39
@nirinchev

Hey, sorry for the red herring. I was able to track that down to the .NET sync error handling code extending the lifetime of the object store's sync session. The race outcome was determined by whether the sync session managed to contact the server quickly enough to trigger an error that needed to be reported to the SDK. My best guess for why this bug was exposed by these changes is that removing the need to open the file reduced the sync session bootstrap time, so it became faster to contact the server and get an error than before.

I've fixed the .NET issue and now all sync tests are passing.


tgoyne commented Aug 30, 2021

Ah yeah, that makes sense. I think the faster initial connection may also have been why some of the object store App tests previously were passing consistently but started failing sometimes with these changes.

m_remote_versions = std::move(remote_versions);
m_origin_file_idents = std::move(origin_file_idents);
m_origin_timestamps = std::move(origin_timestamps);
m_arrays.emplace(m_db->get_alloc(), *m_group, ref);
Collaborator

@tgoyne This change assumes there is always a group, but that was not the case before. We have a crash for this exact reason (#7041), and I'm trying to figure out if this change introduced a bug, or if it existed before (and we perhaps don't call set_group() accordingly).

Member Author

set_group() is called by Replication::initiate_transact() as part of beginning a write transaction, and prepare_for_write() can only be called inside a write transaction. m_group being nullptr here suggests that a function which writes to the Realm is being called without the DB having an active write transaction.

Contributor

update_from_ref_and_version is called from internal_advance_read after the write lock is acquired but before initiate_transact in the linked issue #7041. The code in Transaction was doing the same thing back then. Was that also taken into account somehow? From what I can tell, this may fail if internal_advance_read actually moves to the latest version available.
