Cache invalidation #294

Closed
kegsay opened this issue Sep 7, 2023 · 3 comments · Fixed by #363
@kegsay
Member

kegsay commented Sep 7, 2023

Cache invalidation in SS

This is the process of informing downstream API caches that what they have remembered for a room is incorrect and needs refetching from the database. We (dmr & kegan) propose invalidation is room ID scoped, and not more fine-grained for now. This keeps things simple but means we ask the DB for more information than we strictly need to.

Cases when invalidation is needed:

  • Redactions. Whilst the proxy doesn't cache the event content of messages, it definitely does cache things like the room name. If the room name is redacted, we need to inform the global/user/connection caches that the name is now unset. This also applies to member display names (for room.name) and avatar URLs (for room.avatar), as well as the canonical alias (for room.name).

  • State resolution. This is part of a broader strategy for refreshing entire room state snapshots. The downstream room state caches are currently loaded on startup then strictly updated when live timeline events arrive from the poller. This is a problem when:

    • the state changes without a timeline event (state resets),
    • the proxy has a very old copy of a room which is updated when a poller joins that room.

    In the latter case, we fudged a solution by incorrectly prepending state block events to the timeline to get the caches to update downstream correctly. This makes the timeline incorrect when the client asks for a sufficiently high timeline_limit though.

We propose to add a new payload type V2InvalidateRoom which will contain the room ID to invalidate and the new snapshot ID to load from. This will cause downstream API processes to reload caches from that snapshot. The precise data that needs to be reloaded is the entire list of all joined members and invited members in addition to populating this struct:

type RoomMetadata struct {
	RoomID         string
	Heroes         []Hero
	NameEvent      string // the content of m.room.name, NOT the calculated name
	AvatarEvent    string // the content of m.room.avatar, NOT the resolved avatar
	CanonicalAlias string
	JoinCount      int
	InviteCount    int
	// LastMessageTimestamp is the origin_server_ts of the event most recently seen in
	// this room. Because events arrive at the upstream homeserver out-of-order (and
	// because origin_server_ts is an untrusted event field), this timestamp can
	// _decrease_ as new events come in.
	LastMessageTimestamp uint64
	// LatestEventsByType tracks timing information for the latest event in the room,
	// grouped by event type.
	LatestEventsByType map[string]EventMetadata
	Encrypted          bool
	PredecessorRoomID  *string
	UpgradedRoomID     *string
	RoomType           *string
	// if this room is a space, which rooms are m.space.child state events. This is the same for all users hence is global.
	ChildSpaceRooms map[string]struct{}
	// The latest m.typing ephemeral event for this room.
	TypingEvent json.RawMessage
}

For redactions, we strictly only need Heroes, NameEvent, AvatarEvent, CanonicalAlias as they contain redactable user input.

For state resets, we need all state related fields. This is all fields with the exception of LatestEventsByType, LastMessageTimestamp and TypingEvent, though in practice in most cases we would have a new latest event in the room, and the typing event would be nullified as too old.
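The two reload scopes above can be sketched roughly as follows. This is only an illustration, not the proxy's implementation: the RoomMetadata here is a trimmed, simplified copy of the struct above, the field names on V2InvalidateRoom are guesses based on the description ("room ID" and "snapshot ID"), and applyRedaction / applyStateReset are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RoomMetadata is a trimmed copy of the struct above with simplified field
// types (assumption: Heroes as plain strings, LatestEventsByType as timestamps).
type RoomMetadata struct {
	RoomID               string
	Heroes               []string
	NameEvent            string
	AvatarEvent          string
	CanonicalAlias       string
	JoinCount            int
	Encrypted            bool
	LastMessageTimestamp uint64
	LatestEventsByType   map[string]uint64
	TypingEvent          json.RawMessage
}

// V2InvalidateRoom is the proposed payload. The field names are assumptions.
type V2InvalidateRoom struct {
	RoomID     string
	SnapshotID int64
}

// applyRedaction (hypothetical) overwrites only the fields that carry
// redactable user input and keeps everything else from the cached copy.
func applyRedaction(cached, fresh RoomMetadata) RoomMetadata {
	out := cached
	out.Heroes = fresh.Heroes
	out.NameEvent = fresh.NameEvent
	out.AvatarEvent = fresh.AvatarEvent
	out.CanonicalAlias = fresh.CanonicalAlias
	return out
}

// applyStateReset (hypothetical) takes every state-derived field from the
// fresh snapshot and carries over the three non-state fields from the cache.
func applyStateReset(cached, fresh RoomMetadata) RoomMetadata {
	out := fresh
	out.LastMessageTimestamp = cached.LastMessageTimestamp
	out.LatestEventsByType = cached.LatestEventsByType
	out.TypingEvent = cached.TypingEvent
	return out
}

func main() {
	payload := V2InvalidateRoom{RoomID: "!room:example.org", SnapshotID: 42}
	cached := RoomMetadata{RoomID: payload.RoomID, NameEvent: "Old name", JoinCount: 3, LastMessageTimestamp: 1000}
	// In the real proxy this would be loaded from the snapshot named in the payload.
	fresh := RoomMetadata{RoomID: payload.RoomID, NameEvent: "", JoinCount: 2}

	fmt.Printf("redaction:   %+v\n", applyRedaction(cached, fresh))
	fmt.Printf("state reset: %+v\n", applyStateReset(cached, fresh))
}
```

Since the proposal is to keep invalidation room ID scoped, a first version could simply always take the state-reset path and accept the extra database work.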

Even if we populate downstream caches effectively, we need to communicate this information to the client. The room list sorting logic assumes 1 update affects at most 1 room and therefore causes at most one move operation. In the invalidation scenario, this assumption holds if we invalidate 1 room per payload. This also matches reality well, as we are usually talking about a single room being joined, redacted or state reset, not many. In theory, it should be enough to just send an update to connections with the newly updated RoomConnMetadata values and have it do the right thing, but the code will need to be checked so that it does not assume only 1 action can take place per payload (e.g. it is possible for the room name to change AND space children to change AND typing to change, etc). We probably also need to resend any required_state events with their new values. E.g. if m.room.name was requested and we just redacted it, we should resend it with content: {}.
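To make the "more than 1 action per payload" concern concrete, here is a rough sketch of diffing a room's cached view before and after invalidation. The roomSnapshot type, changedFields and nameEventContent are all made up for illustration; they are not the proxy's RoomConnMetadata or its actual update path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roomSnapshot is a hypothetical, cut-down view of what a connection tracks.
type roomSnapshot struct {
	Name            string
	ChildSpaceRooms map[string]struct{}
	Typing          string // raw m.typing content, simplified to a string
}

// sameKeys reports whether two sets contain exactly the same keys.
func sameKeys(a, b map[string]struct{}) bool {
	if len(a) != len(b) {
		return false
	}
	for k := range a {
		if _, ok := b[k]; !ok {
			return false
		}
	}
	return true
}

// changedFields lists every aspect that differs. A single invalidation
// payload can legitimately produce more than one entry.
func changedFields(before, after roomSnapshot) []string {
	var changes []string
	if before.Name != after.Name {
		changes = append(changes, "name")
	}
	if !sameKeys(before.ChildSpaceRooms, after.ChildSpaceRooms) {
		changes = append(changes, "space_children")
	}
	if before.Typing != after.Typing {
		changes = append(changes, "typing")
	}
	return changes
}

// nameEventContent builds the content to resend for a requested m.room.name.
// A redacted name is resent as an empty object.
func nameEventContent(name string) string {
	if name == "" {
		return "{}"
	}
	b, _ := json.Marshal(map[string]string{"name": name})
	return string(b)
}

func main() {
	before := roomSnapshot{
		Name:            "Lobby",
		ChildSpaceRooms: map[string]struct{}{"!a:example.org": {}},
		Typing:          `{"user_ids":["@alice:example.org"]}`,
	}
	after := roomSnapshot{Name: "", ChildSpaceRooms: map[string]struct{}{}, Typing: ""}

	fmt.Println(changedFields(before, after))
	fmt.Println(nameEventContent(after.Name))
}
```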

Testing-wise, we need to engineer several failure scenarios:

  • Redactions: Room with name + canonical alias. Redact room name => room.name updates to canonical alias.
  • Gappy state: whilst sync v2 is not capable of expressing state resets, we can inject gappy state responses which roughly mimic this. For this, send 5 msgs, then send a gappy state response (room name state event in the state block, setting limited: true, and sending a single latest timeline event with a prev_batch token). Then do a sync for timeline_limit: 20. We should only get 1 event, the latest event, the correct prev_batch token and the correct updated room name.
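The expected outcome of the redaction scenario can be sketched as a name fallback. This is a simplified version of the usual Matrix rule (name, then canonical alias, then heroes), not the proxy's real calculation. calculateRoomName is a hypothetical helper, and it takes the already-extracted name rather than the raw m.room.name content.

```go
package main

import (
	"fmt"
	"strings"
)

// calculateRoomName (hypothetical, simplified) picks the first non-empty of:
// the m.room.name value, the canonical alias, then the hero display names.
func calculateRoomName(name, canonicalAlias string, heroes []string) string {
	if name != "" {
		return name
	}
	if canonicalAlias != "" {
		return canonicalAlias
	}
	if len(heroes) > 0 {
		return strings.Join(heroes, ", ")
	}
	return "Empty Room"
}

func main() {
	alias := "#lobby:example.org"

	// Before the redaction the explicit name wins.
	fmt.Println(calculateRoomName("Lobby", alias, nil))

	// After the redaction m.room.name has empty content, so the cached name
	// must be unset and the calculated name falls back to the alias.
	fmt.Println(calculateRoomName("", alias, nil))
}
```

If the cache is not invalidated, the first call's result is what a client would keep seeing after the redaction, which is the failure the test is meant to catch.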

Redactions work was done in #296.

Detecting cache invalidation in pollers:

@DMRobertson
Contributor

start simple: e2e test that makes a room, sets a name, syncs and gets name. Then redacts name, sends sentinel, syncs until sentinel, check room name has reset to something sensible

@DMRobertson
Contributor

> For state resets, we need all state related fields. This is all fields with the exception of LatestEventsByType, LastMessageTimestamp and TypingEvent

RoomType shouldn't change here: it's a property of the create event, which is unique and immutable.

@DMRobertson
Contributor

DMRobertson commented Sep 13, 2023

We do not store timelines in-memory, so the invalidation is purely database-backed (probably via a flag on the event row itself).

Could also have a NID field on the rooms table tracking the oldest event NID after which we have a contiguous timeline. That would mean we don't have to update 5000 bools if we see a gap after 5000 contiguous timeline events.

Or maybe your point is that we have a flag "parentUnknown" or something, default false and set to true in the situation we describe. Then any SQL queries that load events paginating backwards have to stop when they see this flag is true.
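A rough in-memory sketch of the NID-on-the-rooms-table idea, to show where backwards pagination would stop. The real thing would be a SQL query; the event type, paginateBackwards and the oldestContiguousNID parameter are all hypothetical names.

```go
package main

import "fmt"

// event is a hypothetical, minimal stand-in for an event row.
type event struct {
	NID int64
	ID  string
}

// paginateBackwards walks events (sorted by ascending NID) from newest to
// oldest, returning up to limit events with NID < fromNID. It stops at
// oldestContiguousNID, the oldest event known to have a contiguous timeline
// after it, so nothing from before the gap is returned.
func paginateBackwards(events []event, fromNID, oldestContiguousNID int64, limit int) []event {
	var out []event
	for i := len(events) - 1; i >= 0 && len(out) < limit; i-- {
		ev := events[i]
		if ev.NID >= fromNID {
			continue // newer than the pagination start point
		}
		if ev.NID < oldestContiguousNID {
			break // crossing the gap: stop here
		}
		out = append(out, ev)
	}
	return out
}

func main() {
	var events []event
	for nid := int64(1); nid <= 6; nid++ {
		events = append(events, event{NID: nid, ID: fmt.Sprintf("$ev%d", nid)})
	}

	// Suppose a gap was seen before NID 4, so the rooms table records 4.
	for _, ev := range paginateBackwards(events, 7, 4, 20) {
		fmt.Println(ev.NID, ev.ID)
	}
}
```

Compared with a per-event flag, a gap only requires updating one value per room, at the cost of an extra bound in every backwards query.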


2 participants