Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: fixes to device list tracking #12993

Closed
Tracked by #14030
richvdh opened this issue Jun 9, 2022 · 11 comments
Closed
Tracked by #14030

Faster joins: fixes to device list tracking #12993

richvdh opened this issue Jun 9, 2022 · 11 comments
Assignees
Labels
A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Jun 9, 2022

The device-list-tracking code makes lots of requests to methods like get_users_in_room and get_rooms_for_user, which are likely to return completely bogus results during a faster join. This brings danger of bugs where we think we have up-to-date device lists for particular users, but actually don't, leading to E2EE failures.

So, the work needed here is to (a) understand how device list tracking works; (b) ensure that it will end up correct (even if it's not entirely correct during a faster join).

@richvdh richvdh added A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Jun 9, 2022
@richvdh richvdh changed the title Faster joins: support E2E rooms Faster joins: fix device list tracking Jul 20, 2022
@richvdh richvdh added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Jul 22, 2022
@erikjohnston
Copy link
Member

erikjohnston commented Jul 29, 2022

I had a quick look at the code, and I think there are broadly three flows we care about here.

First, when we receive a device list update for a remote user we check if any of local users share a room with them, dropping the update if they don't. If we're doing a partial join with a room for that user, either:

  1. we already share a room with that user and the update will be handled properly anyway; or
  2. we don't share a room and so drop the update. This is fine as the database won't have marked the fact we're tracking the remote user's devices.

Secondly, when a local user queries a remote user's devices if the remote user is in the room we're partially joining then one of:

  1. the server already has the remote user's devices cached, which we know must be up to date as to have cached those devices the server must already share a room with those users, and so we can just return the cached devices; or
  2. the server already shares a room with the remote user, but has no cached devices, at which point it fetches the remote user's devices, caches them in the DB and returns them; or
  3. the server doesn't already share a room with the remote user, and so always fetches the remote user's devices, doesn't cache them, and returns them to the user. Note that the local user's device also won't think it shares a room with the given user, so won't think its tracking the remote user.

In all three of those cases I think the current handling results in the correct behaviour.

Finally, when a local user adds a new device we need to send out a device update to all remote servers that think they share a room with that local user. For servers that are in the partially joined room we need to ensure that they get the device list update. If the user already share a room with the remote server then we'll already correctly share the device list update. Otherwise, we need to do either:

  1. if we know the set of servers in the room then use that to send out device list updates; or
  2. once the server has fully joined the room send out any device list updates for the local users in the room that happened while we were joining (we have a handy table for tracking local device list updates in a room: device_lists_changes_in_room).

TL;DR: I think the only thing we need to change is the last point, where we need to correctly send out any device list updates for the local user to servers in the partially joined room. The place where we currently do the logic is:

# Ignore any users that aren't ours
if self.hs.is_mine_id(user_id):
joined_user_ids = await self.store.get_users_in_room(room_id)
hosts = {get_domain_from_id(u) for u in joined_user_ids}
hosts.discard(self.server_name)

@richvdh
Copy link
Member Author

richvdh commented Jul 29, 2022

I think when you say "we don't share a room with a user" you mean "we don't share a non-partial-state room with a user"?

we don't share a room and so drop the update. This is fine as the database won't have marked the fact we're tracking the remote user's devices.

This is ok provided the logic for deciding if we're tracking the remote user's devices is actually consistent. Are you sure there's no way for the database to show we are tracking devices while only sharing partial-state rooms with the user?

Also: are there not flows that happen when a local user joins or leaves a room? What happens if a local user joins a partial-state room? Sounds like a similar situation to the "local user adds new device" flow?

@erikjohnston
Copy link
Member

erikjohnston commented Aug 1, 2022

I think when you say "we don't share a room with a user" you mean "we don't share a non-partial-state room with a user"?

Yup. Actually no, I think partial rooms still count if we have persisted any remote joins, as those will appear in get_rooms_for_user? Don't think that changes anything though.

we don't share a room and so drop the update. This is fine as the database won't have marked the fact we're tracking the remote user's devices.

This is ok provided the logic for deciding if we're tracking the remote user's devices is actually consistent. Are you sure there's no way for the database to show we are tracking devices while only sharing partial-state rooms with the user?

We only store cached devices if the server "shares a room with the remote user", which can include partial joins (as we determine this via calling get_rooms_for_user). We then check if we need to clear that table every time the remote user leaves a room (or the server leaves a room that the remote user is in). So, I think that logic is all fine, except we need to be careful of...

Also: are there not flows that happen when a local user joins or leaves a room? What happens if a local user joins a partial-state room? Sounds like a similar situation to the "local user adds new device" flow?

Yeah, I forgot about these flows. The crucial thing here is to ensure that if the servers stops sharing a room with a remote user (i.e. get_rooms_for_user used to return a non-empty set of rooms and now returns the empty set) we correctly clear out any cached devices in the DB.

In particular: if the server fails to join the room and so sends a leave for local users, we must ensure that we check if we need to clear the cached devices of any remote users.

@richvdh
Copy link
Member Author

richvdh commented Aug 1, 2022

We then check if we need to clear that table every time the remote user leaves a room

... or, presumably, that we realise that we incorrectly accepted a join event for a remote user which should have been rejected. I think it should be fine, but we'll want test cases for this stuff.

@richvdh richvdh changed the title Faster joins: fix device list tracking Faster joins: test device list tracking Aug 15, 2022
@squahtx squahtx self-assigned this Aug 15, 2022
@squahtx
Copy link
Contributor

squahtx commented Sep 14, 2022

Writing this up so it stops falling out of my head. It's largely a recap of the notes above.

There are 3 4 bits of API that need to work:

  • /keys/query requests on a remote homeserver must not return stale data.
    We must send out device list updates to remote homeservers in partially joined rooms, either during the partial join phase or the transition to fully joined.
    We can only estimate the remote homeservers in a partially joined room, based on the servers_in_room from the /send_join response and new member events in the room.

    • As a first pass, it's easy enough to use the estimate of participating remote homeservers, so that encryption mostly works during the partial join phase.
    • However, we may incorrectly accept or reject new member events, which implies we must do something during the transition to a fully joined room.
      As Erik says, we can use the device_lists_changes_in_room table for this.
      But we can still neglect to send a device list update to a user if they join a different room we are in and leave the partially joined room.
      When we transition to fully joined, the user will no longer be in the room and we'll think we have no device list updates to send to that user.
  • /keys/query requests on the local homeserver must not return stale data.
    We must not cache device lists of users whose homeserver is not sending us device list updates.

    As Erik says, this is largely fine, since we won't cache the device lists for users on homeservers only in a partially joined room.

    • There is a case where we can erroneously think a homeserver is in a room during the partial join phase, then decide they aren't when transitioning to fully joined.
      We have to ensure that we clear any cached device lists when that happens. It's probably a matter of calling _handle_potentially_left_users when transitioning to a fully joined room.
      • Note to self: What happens if we crash after updating the current state but before calling _handle_potentially_left_users?
  • /sync's device_lists.changed/left must correctly signal to clients when a remote user may have a device list update.
    During the partial join phase, this is going to be broken, since we'll think most users aren't in the room.

    • When we transition to fully joined, we may have to ignore lazy_load_members and signal that all members in the room have a device list update.
  • /keys/changes must also work. Annoyingly, this uses different code to /sync and will need testing too.

@erikjohnston
Copy link
Member

  • When we transition to fully joined, the user will no longer be in the room and we'll think we have no device list updates to send to that user.

Why won't we see the join/leave in the partially joined room? I'd expect us to get the join in the initial state, and then the leave in the subsequent timelines?

@erikjohnston
Copy link
Member

  • When we transition to fully joined, we may have to ignore lazy_load_members and signal that all members in the room have a device list update.

Do we have a good way of detecting when this happens in a /sync stream?

Also, I don't think any of the current device list tracking takes into account lazy members?

@erikjohnston
Copy link
Member

erikjohnston commented Sep 21, 2022

After chatting with @squahtx the plan to fix these issues is:

@squahtx
Copy link
Contributor

squahtx commented Sep 23, 2022

Turns out there's a bug where we don't emit device_lists.left entries when we complete a partial state join and discover that a user is actually not in the room: #13886

I'm minded to ignore #13886 for now under the assumption that left entries are less important to get right than changed, since the latter will come down once users share a room again.

@squahtx
Copy link
Contributor

squahtx commented Sep 23, 2022

Also filed #13887 to capture an edge case where /keys/query could return stale data, which we have decided to ignore for now.

erikjohnston added a commit that referenced this issue Sep 28, 2022
c.f. #12993 (comment), point 3

This stores all device list updates that we receive while partial joins are ongoing, and processes them once we have the full state.

Note: We don't actually process the device lists in the same ways as if we weren't partially joined. Instead of updating the device list remote cache, we simply notify local users that a change in the remote user's devices has happened. I think this is safe as if the local user requests the keys for the remote user and we don't have them we'll simply fetch them as normal.
@richvdh richvdh changed the title Faster joins: test device list tracking Faster joins: fixes to device list tracking Sep 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

4 participants