Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring for safer olm and megolm session usage #5380

Merged
merged 28 commits into from
Mar 10, 2022

Conversation

BillCarsonFr
Copy link
Member

@BillCarsonFr BillCarsonFr commented Feb 28, 2022

Rework following investigations on strange rolling UISIs errors.
Logs contained a lot of BAD_MAC errors on both olm and megolm.

Olm bad macs can pretty much be anything, like using wrong/desync session or concurrency, or just false positive logs when iterating other known sessions (though proliferation of session might be suspicious)

Megolm could be more probably concurrent decryption, as well as desync with cache/store, as the current code was not very safe.

  • The code has been refactored to create spefic classes to handle cache of sessions, and move this responsability out of the realm crypto store. Access are guarded by synchronized directive.
  • Concurrent decryption attempt are guarded with mutex (per olm/megolm session)
  • Access to olm account has been protected against concurrent access.
  • Refactor the addInboundroupSession logic (there was some crash report of NPE there) and code could cause cache beeing out of sync
  • Better logging
  • EDIT Cleaner unwedging with better logs. Unwedging done at the end of the sync now, some synchronisation added.

Added a new E2E sanity test that the following is working:

  • Proper key sharing to a set of user in e2e room
  • Proper key backup restore from a new session
  • Proper key forwarding
  • Accept and replace existing session with a better forwarded one

EDIT Resurrected unwedging test

@github-actions
Copy link

github-actions bot commented Feb 28, 2022

Unit Test Results

  96 files  +  4    96 suites  +4   1m 15s ⏱️ -5s
172 tests +10  172 ✔️ +10  0 💤 ±0  0 ±0 
564 runs  +40  564 ✔️ +40  0 💤 ±0  0 ±0 

Results for commit 96b5174. ± Comparison against base commit bcdf004.

This pull request removes 2 and adds 12 tests. Note that renamed tests count towards both.
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given failure when handling display name updates action then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling display name updates action then updates user display name and emits name updated event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given a selected picture when handling save selected profile picture then updates upstream avatar and completes personalization
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given no selected picture when saving selected profile picture then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given upstream failure when handling display name update then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given upstream update avatar fails when saving selected profile picture then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling display name update then updates upstream user display name
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling profile picture selected then updates selected picture state
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling profile picture skipped then completes personalization
im.vector.app.features.onboarding.UriFilenameResolverTest ‑ given a non content schema Uri when querying file name then returns last segment
im.vector.app.features.onboarding.UriFilenameResolverTest ‑ given a non hierarchical Uri when querying file name then is null
im.vector.app.features.onboarding.UriFilenameResolverTest ‑ given content schema Uri with backing content when querying file name then returns display name column
…

♻️ This comment has been updated with latest results.

@BillCarsonFr BillCarsonFr marked this pull request as ready for review March 1, 2022 16:14
@BillCarsonFr BillCarsonFr force-pushed the feature/bca/crypto_fix_rolling_uisi branch from 1fbe434 to 44599ed Compare March 2, 2022 08:59
Copy link
Member

@bmarty bmarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's amazing work. I am not sure I understand all the code related to the crypto part. I am glad that we have some unit test. And you added lots of unit test which is also very nice. Thanks!
I would be more confident if we can have a review from someone in the crypto team. WDYT? @poljar maybe?

@BillCarsonFr BillCarsonFr requested a review from poljar March 4, 2022 08:40
Copy link
Contributor

@poljar poljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some small nits, logic wise nothing seems to be wrong.

return internalGetAllSessions(deviceKey)
}

private fun internalGetAllSessions(deviceKey: String): MutableList<String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a point of this being a separate method? Why not inline it into getDeviceSessionIds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, initialy it was called also in another method (that was already synchronized), so I created this. But looks like there is no point now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

/**
* Retrieve an end-to-end session between the logged-in user and another
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, olm sessions are between devices.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed (and some more in crypto store interface)

@BillCarsonFr BillCarsonFr force-pushed the feature/bca/crypto_fix_rolling_uisi branch from 9172e74 to 3c931d6 Compare March 4, 2022 18:21
Copy link
Contributor

@michaelkaye michaelkaye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recently commented out some flaky integration tests that are in this area: #5449 - they seemed to be some sort of issue between two things opening a realm db at the same time. Do you think these concurrent access changes you mention might have helped there - is it worth un-ignoring those tests?

Also, it might be good to run the nightly integration tests against a PR like this that introduces more integration tests - otherwise we won't compile and run them until the night after merge.

You can do that in Actions -> "Nightly build" (in left sidebar) -> then in the blue bar, select this branch to run the workflow against.

It doesn't run them automatically any more as it takes a good long time and we don't want to slow down the PR merge time too much.

@@ -23,7 +23,7 @@ object TestConstants {
const val TESTS_HOME_SERVER_URL = "http://10.0.2.2:8080"

// Time out to use when waiting for server response.
private const val AWAIT_TIME_OUT_MILLIS = 30_000
Copy link
Contributor

@michaelkaye michaelkaye Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found myself doing this in response to slow integration tests in #5459 - do you have any more information on why 60s is better?

(we changed 30s -> 60s here, but the github is not showing me both sides of the diff atm)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When doing test with several sessions everything is getting a lot slower (like E2E sanity test), each session is slower to sync, etc...
Everytime we test something with await/waitWithLatch it's using that as default timeout, and test are then failing when waiting from something coming back from the sync.
It also happens when running all tests, I wonder if we don't clean properly between test?
Previously I used the ANDROIDX_TEST_ORCHESTRATOR option to ensure that all test start with a clear state, but looks like it's not working anymore.

Copy link
Contributor

@michaelkaye michaelkaye Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely don't clean up the server between tests in a single junit run, but will clean up between CI builds generally (and if you just run a demo server locally, then that won't be cleaned up unless you explicitly do so)

Copy link
Member

@bmarty bmarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this PR, 1.4.4 has been released.

@bmarty bmarty merged commit ce4ad88 into develop Mar 10, 2022
@bmarty bmarty deleted the feature/bca/crypto_fix_rolling_uisi branch March 10, 2022 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants