Skip to content

Conversation

@procr1337
Copy link
Contributor

@procr1337 procr1337 commented Apr 18, 2025

Previously, is_session_overshared_for_user did not take into account that shared_with_set also contains withheld device IDs who explicitly have never received the session keys. This would lead to it mistakenly determining oversharing for those devices for every event being sent in the presence of blacklisted/withheld devices in the room, and rotating the group session accordingly.

The fix is to correctly exclude devices with ShareInfo::Withheld from the enumeration.

Signed-off-by: Niklas Baumstark [email protected]

Previously, `is_session_overshared_for_user` did not take into account
that `shared_with_set` also contains withheld device IDs who explicitly have never
received the session keys. This would lead to it mistakenly determining
oversharing for those devices for every event being sent in the presence
of blacklisted/withheld devices in the room, and rotating the group session accordingly.

The fix is to correctly exclude devices with ShareInfo::Withheld from
the enumeration.
@procr1337 procr1337 requested review from a team as code owners April 18, 2025 15:12
@procr1337 procr1337 requested review from Hywan and poljar and removed request for a team April 18, 2025 15:12
@procr1337
Copy link
Contributor Author

procr1337 commented Apr 18, 2025

I don't really understand the failing test: It seems to withhold room keys for Dan's devices because of m.no_olm, so it doesn't actually share the keys with Dan, and hence correctly determines that no rotation is necessary after Dan's second device logs out.

The correct test case should be to actually do share the keys with both devices first, so that a rotation is in fact necessary.

@procr1337
Copy link
Contributor Author

procr1337 commented Apr 22, 2025

#2729 (which is kind of the opposite of the behavior I was trying to fix here) would indicate that there are probably other considerations about which devices may actually eventually get the key, e.g. to_share_with_set. As mentioned in the dev matrix channel, it's a bit hard for me to understand exactly the implcations but I'd be willing to dig deeper if somebody can point me to the right places

Copy link
Contributor

@poljar poljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the failing test: It seems to withhold room keys for Dan's devices because of m.no_olm, so it doesn't actually share the keys with Dan, and hence correctly determines that no rotation is necessary after Dan's second device logs out.

I think the test might have been written before the support for the m.room_key.withheld messages was added. It therefore didn't set up the required Olm sessions for the room key to be shared even though it obviously wanted Dan's devices to be in the shared set.

To fix the test we'll probably want to rewrite the test so it doesn't call share_room_key() and mark_request_as_sent() to modify the state of the outbound group session. Instead we can just call OutboundGroupSession::mark_shared_with(). This won't require any setup of the Olm sessions.

The bug you discovered was likely introduced with the .withheld messages as well.

I think that this PR makes sense, could you just fix the failing test and introduce a regression test which checks the case where we don't want to rotate as well.

@procr1337
Copy link
Contributor Author

procr1337 commented Apr 23, 2025

@poljar Thank you for your feedback. Added the two tests.

I believe there was an additional bug in the previous implementation for not taking into account pending to-device requests. I can't prove to myself why it should not be possible for pending requests to exist before share_room_key is called, especially because it specifically handles this case after calling collect_session_recipients. I could be wrong and be missing another invariant at the API boundaries here. I guess another way to fix this would be to cancel pending requests to share keys with devices who are no longer in the set of recipients. This seems more complex to me and I'm not sure if races will be possible in this case (between some task picking up a request to send and it being canceled).

There is some potential for refactoring, especially to properly encapsulate the to_share_with_set field of OutboundGroupSession, and also to make OutboundGroupSession::is_shared_with accessible without full DeviceData

@codecov
Copy link

codecov bot commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.94%. Comparing base (d2874af) to head (2563605).
Report is 55 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4954      +/-   ##
==========================================
- Coverage   85.95%   85.94%   -0.02%     
==========================================
  Files         325      325              
  Lines       35649    35651       +2     
==========================================
- Hits        30641    30639       -2     
- Misses       5008     5012       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@poljar poljar removed the request for review from Hywan April 24, 2025 07:03
@procr1337
Copy link
Contributor Author

procr1337 commented Apr 24, 2025

On second read, I believe the other bug is pretty much exactly #2729, so this should potentially be a fix for that IIUC

@poljar
Copy link
Contributor

poljar commented Apr 24, 2025

@poljar Thank you for your feedback. Added the two tests.

I believe there was an additional bug in the previous implementation for not taking into account pending to-device requests. I can't prove to myself why it should not be possible for pending requests to exist before share_room_key is called, especially because it specifically handles this case after calling collect_session_recipients.

They can exists if you call share_room_key() multiple times, this is what #2729 is about.

They can't exists if you call share_room_key() the first time. This can bee seen if you inspect how the pending requests are added.

The way to add the pending requests is to call OutboundGroupSession::add_request(). There are exactly two places where this is happening, once for the withheld to-device requests and once for the requests containing the room key.

Both of those cases happen after the call to collect_session_recipients().

I could be wrong and be missing another invariant at the API boundaries here.

You are right. The reason why this works is that we discard the room key if we don't manage to mark all the pending requests as sent:

let response = self.share_room_key().await;
// If one of the responses failed invalidate the group
// session as using it would end up in undecryptable
// messages.
if let Err(r) = response {
let machine = self.client.olm_machine().await;
if let Some(machine) = machine.as_ref() {
machine.discard_room_key(self.room_id()).await?;
}
return Err(r);
}

Now, there is still an edge case where #2729 is important, if the future gets cancelled and we fail to discard the room key. This can happen if the application closes unexpectedly, then we are left with an OutboundGroupSession in the store that might contain pending requests.

I guess another way to fix this would be to cancel pending requests to share keys with devices who are no longer in the set of recipients. This seems more complex to me and I'm not sure if races will be possible in this case (between some task picking up a request to send and it being canceled).

I was imagining that we should have a stateful API for this, i.e. share_room_key() returns an object that contains some state.

For example it would contain a per-room lock, users can't attempt to share a room key for the same room concurrently, it would also contain logic that would only persist and "activate" the room key if we managed to send out all the pending requests.

So something like:

async fn share_room_key() -> ShareRoomKeyState {
   ...
} 

impl Drop for ShareRoomKeyState {
    fn drop(&mut self) {
        if !self.is_done() {
            self.session_manager.discard_room_key(&self.room_id);
        }
    }
}

let state = olm_machine.share_room_key().await;

for request in state.pending_requests {
    // The `?` is fine because `drop()` will discard the room key if we fail.    
    let response = client.send_request(request).await?;
    // This will set the `done` flag for the state if this is our final request.
    state.mark_request_as_sent(response).await;
}

This still has a problem with cancel safety, unless we also change the way the OutboundGroupSession is stored, it should be only stored once the all the request have been shared. Furthermore expressing this over the FFI is pretty tricky.

All that being said, your approach might be better since it removes the cancel-safety problem.

Copy link
Contributor

@poljar poljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please split out the fix for #2729 into a separate PR?

Your original fix and the modified tests look good to me, so we can merge that part right away.

The fix for #2729 might need some additional tests.

Copy link
Contributor

@poljar poljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good.

@procr1337
Copy link
Contributor Author

@poljar Done.

You are right. The reason why this works is that we discard the room key if we don't manage to mark all the pending requests as sent

FYI there is at least one test which fails an assert!(outbound.pending_requests().is_empty()) placed before the call to collect_session_recipients:

thread 'session_manager::group_sessions::tests::test_no_olm_sent_once' panicked at crates/matrix-sdk-crypto/src/session_manager/group_sessions/mod.rs:677:9:
assertion failed: outbound.pending_requests().is_empty()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    session_manager::group_sessions::tests::test_no_olm_sent_once

@poljar
Copy link
Contributor

poljar commented Apr 24, 2025

FYI there is at least one test which fails an assert!(outbound.pending_requests().is_empty()) placed before the call to collect_session_recipients:

That doesn't surprise me, as mentioned, the API of the crypto crate doesn't prevent this state.

The matrix-sdk crate, which uses the crypto crate, does guard against this via room key discarding.

@poljar poljar merged commit afb6627 into matrix-org:main Apr 24, 2025
42 checks passed
richvdh added a commit to matrix-org/matrix-js-sdk that referenced this pull request Jun 21, 2025
For js-sdk users, this includes the following:

    -   Send stable identifier `sender_device_keys` for MSC4147 (Including device keys with Olm-encrypted events).
        ([#4964](matrix-org/matrix-rust-sdk#4964))

    -   Check the `sender_device_keys` field on _all_ incoming Olm-encrypted to-device messages and ignore any to-device messages which include the field but whose data is invalid (as per [MSC4147](matrix-org/matrix-spec-proposals#4147)).
        ([#4922](matrix-org/matrix-rust-sdk#4922))

    -   Fix bug which caused room keys to be unnecessarily rotated on every send in the presence of blacklisted/withheld devices in the room.
        ([#4954](matrix-org/matrix-rust-sdk#4954))

    -   Fix [matrix-rust-sdk#2729](matrix-org/matrix-rust-sdk#2729) which in rare cases can cause room key oversharing.
        ([#4975](matrix-org/matrix-rust-sdk#4975))
github-merge-queue bot pushed a commit to matrix-org/matrix-js-sdk that referenced this pull request Jun 23, 2025
For js-sdk users, this includes the following:

    -   Send stable identifier `sender_device_keys` for MSC4147 (Including device keys with Olm-encrypted events).
        ([#4964](matrix-org/matrix-rust-sdk#4964))

    -   Check the `sender_device_keys` field on _all_ incoming Olm-encrypted to-device messages and ignore any to-device messages which include the field but whose data is invalid (as per [MSC4147](matrix-org/matrix-spec-proposals#4147)).
        ([#4922](matrix-org/matrix-rust-sdk#4922))

    -   Fix bug which caused room keys to be unnecessarily rotated on every send in the presence of blacklisted/withheld devices in the room.
        ([#4954](matrix-org/matrix-rust-sdk#4954))

    -   Fix [matrix-rust-sdk#2729](matrix-org/matrix-rust-sdk#2729) which in rare cases can cause room key oversharing.
        ([#4975](matrix-org/matrix-rust-sdk#4975))
github-merge-queue bot pushed a commit to matrix-org/matrix-js-sdk that referenced this pull request Jun 23, 2025
For js-sdk users, this includes the following:

    -   Send stable identifier `sender_device_keys` for MSC4147 (Including device keys with Olm-encrypted events).
        ([#4964](matrix-org/matrix-rust-sdk#4964))

    -   Check the `sender_device_keys` field on _all_ incoming Olm-encrypted to-device messages and ignore any to-device messages which include the field but whose data is invalid (as per [MSC4147](matrix-org/matrix-spec-proposals#4147)).
        ([#4922](matrix-org/matrix-rust-sdk#4922))

    -   Fix bug which caused room keys to be unnecessarily rotated on every send in the presence of blacklisted/withheld devices in the room.
        ([#4954](matrix-org/matrix-rust-sdk#4954))

    -   Fix [matrix-rust-sdk#2729](matrix-org/matrix-rust-sdk#2729) which in rare cases can cause room key oversharing.
        ([#4975](matrix-org/matrix-rust-sdk#4975))
github-merge-queue bot pushed a commit to matrix-org/matrix-js-sdk that referenced this pull request Jun 23, 2025
For js-sdk users, this includes the following:

    -   Send stable identifier `sender_device_keys` for MSC4147 (Including device keys with Olm-encrypted events).
        ([#4964](matrix-org/matrix-rust-sdk#4964))

    -   Check the `sender_device_keys` field on _all_ incoming Olm-encrypted to-device messages and ignore any to-device messages which include the field but whose data is invalid (as per [MSC4147](matrix-org/matrix-spec-proposals#4147)).
        ([#4922](matrix-org/matrix-rust-sdk#4922))

    -   Fix bug which caused room keys to be unnecessarily rotated on every send in the presence of blacklisted/withheld devices in the room.
        ([#4954](matrix-org/matrix-rust-sdk#4954))

    -   Fix [matrix-rust-sdk#2729](matrix-org/matrix-rust-sdk#2729) which in rare cases can cause room key oversharing.
        ([#4975](matrix-org/matrix-rust-sdk#4975))
toger5 pushed a commit to matrix-org/matrix-js-sdk that referenced this pull request Jul 18, 2025
For js-sdk users, this includes the following:

    -   Send stable identifier `sender_device_keys` for MSC4147 (Including device keys with Olm-encrypted events).
        ([#4964](matrix-org/matrix-rust-sdk#4964))

    -   Check the `sender_device_keys` field on _all_ incoming Olm-encrypted to-device messages and ignore any to-device messages which include the field but whose data is invalid (as per [MSC4147](matrix-org/matrix-spec-proposals#4147)).
        ([#4922](matrix-org/matrix-rust-sdk#4922))

    -   Fix bug which caused room keys to be unnecessarily rotated on every send in the presence of blacklisted/withheld devices in the room.
        ([#4954](matrix-org/matrix-rust-sdk#4954))

    -   Fix [matrix-rust-sdk#2729](matrix-org/matrix-rust-sdk#2729) which in rare cases can cause room key oversharing.
        ([#4975](matrix-org/matrix-rust-sdk#4975))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants