Use a single DB instance rather than two for synchronized Realms #4839
Conversation
@tgoyne, this is still marked as a draft and doesn't currently compile. Are you still working on getting it ready for review?
Yes. I have the sync tests building and mostly passing locally, but haven't pushed it because one of the tests is hanging forever.
(force-pushed from a79de16 to 364065a)
First pass through with just some nits; this is definitely a huge improvement in many places. I especially appreciate the refactor of test/object-store/app.cpp 🙌. Since this PR is so big and touches so many places, I'm going to go through this again tomorrow to make sure I haven't missed anything, but a really good effort overall.
@finnschiermer, did you get a chance to look at this yet?
This is really nice. :-)
I'm observing peculiar behavior with this change when using a fake user. It's possible that the .NET mechanics for obtaining a user with a hardcoded access/refresh token were relying on a bug that is now fixed, but it could also point to a legitimate issue.

```csharp
// GetFakeUser calls app->sync_manager()->get_user(...) with hardcoded
// refresh and access tokens. It doesn't perform an actual login against
// an integration server.
var user = GetFakeUser();
var config = new SyncConfiguration(Guid.NewGuid().ToString(), user)
{
    ObjectClasses = new[] { typeof(CollectionsClass) },
    SessionStopPolicy = SessionStopPolicy.Immediately
};
var realm = Realm.GetInstance(config);

// Dispose calls SharedRealm::Close.
realm.Dispose();

var sw = new Stopwatch();
sw.Start();
while (sw.ElapsedMilliseconds < 30_000)
{
    try
    {
        // Calls Realm::delete_files.
        Realm.DeleteRealm(realm.Config);
        break;
    }
    catch
    {
        Task.Delay(50).Wait();
    }
}

Assert.That(sw.ElapsedMilliseconds, Is.LessThan(5000));
```

The peculiarity is that on Windows, the time between [...] Two other observations on the above: [...]
I pushed the code for this test to realm/realm-dotnet#2589 and would be happy to walk someone through the .NET code, or help someone on the Core team get the .NET unit tests running so they can repro and debug the issue locally.
There is an expected change in when exactly the Realm file will be closed. Session shutdown has two separate steps: the session is deactivated, and then torn down. Previously the file was closed when the session was deactivated, and reopened if the session was reactivated later. With these changes, the session never opens and closes the DB itself, and so holds onto a reference until teardown. What this points at is a pre-existing problem where, for whatever reason, the session is staying in the deactivated-but-not-yet-torn-down state for an extended period of time, which previously wasn't causing problems but now is.
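To make that lifetime difference concrete, here's a minimal sketch of the new behavior. All names here are hypothetical stand-ins, not realm-core's actual types:

```cpp
#include <memory>

// Hypothetical sketch (not realm-core's actual code) of the lifecycle
// difference described above.
struct DB {};
using DBRef = std::shared_ptr<DB>;

class SyncSession {
public:
    explicit SyncSession(DBRef db) : m_db(std::move(db)) {}

    // Deactivation used to close the file, and reactivation reopened it.
    // Now the session just flips state and keeps its DB reference alive.
    void deactivate() { m_active = false; }

    // Only teardown (destruction) drops the reference; if it's the last
    // one, the file is closed at this point.
    ~SyncSession() = default;

private:
    DBRef m_db;
    bool m_active = true;
};
```

Under this shape, a session parked in the deactivated state pins the file open for however long it lingers there, which is exactly the symptom described.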
The race condition here is that 2 and 3 are happening simultaneously. If the Realm is closed before the sync session actually connects to the server, then whatever long-running thing is happening on the worker thread never gets a chance to run and everything closes quickly. Using a real user perhaps makes the connection process a bit longer, so that the Realm is closed before it connects much more often? If this is due to the sync worker thread being busy processing something, then the next question is what exactly it's spending a long time processing. Is it possibly getting a large DOWNLOAD message from the server? If the test is using CPU time during the 30-second wait, then throwing a profiler at it should answer that question easily. If it's not using CPU time, then maybe the server is sending a message oddly slowly or something?
Depending on what step it's spending a bunch of time on, it may even be before the point where the old code would have first opened the Realm on the sync thread.
The interesting part here is that this test is using a bogus user; see realm-core/src/realm/object-store/sync/sync_session.cpp, lines 405 to 408 at 7d60f58. I'll try to verify whether that is the case and post an update later today.
Oh, it'd make sense if the token refresh handler holds a strong reference to the session while it's in progress: if you close the Realm before that happens then everything is fine, but otherwise closing the Realm won't close the sync session until the refresh handler is no longer holding a ref.
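That failure mode is the classic one where an async completion handler captures a strong pointer to its owner. A minimal sketch of the shape of the problem, with hypothetical names (the actual realm-core refresh path is more involved):

```cpp
#include <functional>
#include <memory>

// Hypothetical sketch of how an in-flight token refresh can keep a
// session alive after every external reference to it has been dropped.
struct SyncSession : std::enable_shared_from_this<SyncSession> {
    using Work = std::function<void()>;

    // `post` stands in for whatever queues work onto the event loop.
    void refresh_access_token(std::function<void(Work)> post)
    {
        // The completion handler captures a strong reference, so dropping
        // the Realm's reference to the session doesn't destroy it until
        // the refresh finishes and this lambda is released.
        post([strong_self = shared_from_this()] {
            // ... install the refreshed token on strong_self ...
            (void)strong_self;
        });
    }
};
```

(For `shared_from_this()` to work, the session must be owned by a `shared_ptr`, e.g. created via `std::make_shared`.)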
I'm trying to reproduce this in a C++ integration test now. Will update if/when I know more.
This was only used by the global notifier.
(force-pushed from 019b517 to 029186c, then from 029186c to 7b2d101)
Hey, sorry for the red herring. I was able to track that down to the .NET sync error handling code extending the lifetime of the object store's sync session. The race outcome was determined by whether the sync session managed to contact the server quickly enough to trigger an error that needed to be reported to the SDK. My best guess for why this bug was exposed by these changes: removing the need to open the file reduced the sync session bootstrap time, so it became faster to contact the server and get an error than before. I've fixed the .NET issue and now all sync tests are passing.
Ah yeah, that makes sense. I think the faster initial connection may also be why some of the object store App tests previously passed consistently but started failing intermittently with these changes.
```cpp
m_remote_versions = std::move(remote_versions);
m_origin_file_idents = std::move(origin_file_idents);
m_origin_timestamps = std::move(origin_timestamps);
m_arrays.emplace(m_db->get_alloc(), *m_group, ref);
```
`set_group()` is called by `Replication::initiate_transact()` as part of beginning a write transaction, and `prepare_for_write()` can only be called inside a write transaction. `m_group` being nullptr here suggests that a function which writes to the Realm is being called without the DB having an active write transaction.
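A sketch of the invariant being described, using the names from this thread but with an illustrative body rather than the actual realm-core source:

```cpp
#include <cassert>

struct Group; // stand-in for realm::Group

// Illustrative sketch only: m_group is set by set_group(), which
// Replication::initiate_transact() calls when a write transaction
// begins, and that is the only way it becomes non-null.
class ClientHistory {
public:
    void set_group(Group* group) { m_group = group; }

    void prepare_for_write()
    {
        // A null m_group here means something is trying to write to the
        // Realm without an active write transaction on the DB.
        assert(m_group != nullptr);
        // ... mutate history arrays under the write lock ...
    }

private:
    Group* m_group = nullptr;
};
```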
`update_from_ref_and_version` is called from `internal_advance_read` after the write lock is acquired but before `initiate_transact` in the linked issue #7041. The code in `Transaction` was doing the same thing back then. Was that also taken into account somehow? From what I can tell, this may fail if `internal_advance_read` actually moves to the latest version available.
Similar to #4384, but a somewhat different approach (and up to date). This generally simplifies things, deletes some code, cuts the file descriptors and address space used in half, and should set the foundation for some more architectural simplifications.
A rundown of the changes here:
- `RealmCoordinator::set_transaction_callback()` was there specifically for the global notifier and has never been used for anything else, so I deleted it. This turned out to not actually simplify anything much because the complexity it introduced got resolved in other ways, but it's still dead code.
- `SyncManager::get_session()` now takes a DB instead of a path, and RealmCoordinator now always opens the DB before creating the SyncSession. This is sort of a breaking change, but it appears that no one was calling `get_session()` from outside ObjectStore any more anyway; it was originally there for async open in Java, but we pushed that into ObjectStore a while ago.
- We now don't know if a History object will be the sync agent until after we open the DB, so I reworked the whole flow around that to make SyncSession try to claim the sync agent slot when it's created. This ended up simplifying things quite a bit, as it cut out a lot of layers.
- DB now (optionally) owns the History object used to open it, because SyncSession can outlive the RealmCoordinator. It can still take a non-owning reference, to avoid having to rewrite all the tests.
- The sync session now never opens the file, so the cache of files it opened is gone, along with all of the code related to passing an encryption key to the sync session.
- ClientHistoryImpl was aggressively thread-unsafe, as it was designed around the pre-core-6 design of history objects being thread-confined. It now adopts the rule that `m_group`, `m_arrays`, and everything derived from those are guarded by the write lock. The functions which are called on the sync worker thread without holding the write lock (`get_status()`, `find_uploadable_changesets()`) construct local Array accessors rather than going through `set_group()`, to avoid stomping on member variables. This doesn't appear to have performance implications.
- `SessionWrapperQueue::clear()`, and thus its destructor, has always been broken and leaked all but the first object in it. This was never hit until now because we happened to always call `pop_front()` in a loop first and never destroyed a non-empty queue. While I was fixing this I went ahead and turned it into a stack, because that made the implementation simpler and the order doesn't matter (see the first sketch after this list).
- There was a data race on `m_mapping_version`: it's written to while holding a lock, but read from multiple threads without holding a lock. Stale reads are fine here, so I changed it to an atomic rather than adding more locking (also sketched below).
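As a rough illustration of the queue fix mentioned above (the real class holds reference-counted session wrappers; this raw-pointer version is only a sketch of the shape of the bug and the fix):

```cpp
// Hypothetical reconstruction of the leak described above: an intrusive
// singly-linked container whose clear() destroyed only the head node,
// orphaning the rest of the chain.
struct SessionWrapper {
    SessionWrapper* m_next = nullptr;
    // ... session state ...
};

class SessionWrapperStack {
public:
    void push(SessionWrapper* w) noexcept
    {
        w->m_next = m_head;
        m_head = w;
    }

    SessionWrapper* pop() noexcept
    {
        SessionWrapper* w = m_head;
        if (w)
            m_head = w->m_next;
        return w;
    }

    void clear() noexcept
    {
        // The broken clear() effectively did a single `delete m_head`;
        // walking the whole chain releases every node.
        while (SessionWrapper* w = pop())
            delete w;
    }

    ~SessionWrapperStack() { clear(); }

private:
    SessionWrapper* m_head = nullptr;
};
```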
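And the shape of the `m_mapping_version` fix, as a sketch; only the field name comes from the PR, the surrounding class is illustrative:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Sketch of the data-race fix: writes still happen under the lock, but
// unlocked reads from other threads no longer race because the field is
// atomic. Stale reads are acceptable here, so relaxed ordering suffices.
class Mappings {
public:
    void remap()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        // ... rebuild the mappings ...
        m_mapping_version.fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t mapping_version() const noexcept
    {
        // Safe to call without holding m_mutex.
        return m_mapping_version.load(std::memory_order_relaxed);
    }

private:
    std::mutex m_mutex;
    std::atomic<std::uint64_t> m_mapping_version{0};
};
```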
This deletes a handful of tests that are no longer applicable. No new tests are added to replace them, as I'm not sure what, if anything, related to this would need to be tested that isn't already covered. All of the changes to the existing sync tests just update them to the new API for creating a session.
Some of the App tests failed inconsistently when running with tsan enabled (which makes everything a lot slower, and isn't currently run on CI). The culprit appeared to be some of the cleanup code in tests, which was replicating what TestSyncManager already does, except incorrectly. Cleaning that up turned into refactoring the whole file to eliminate the vast quantities of duplicated code.
All ObjectStore and Cocoa sync tests pass for me locally with these changes under ThreadSanitizer. The sync test suite currently doesn't compile because there are a lot of places that need updating to pass in the DB when constructing a session (and making the fixture construct a new DB would fail to test the changes). There's also some remaining cleanup work to do if this design is what we want to move forward with.