Cache invalidation #294

kegsay · 2023-09-07T13:56:47Z

Cache invalidation in SS

This is the process of informing downstream API caches that what they have remembered for a room is incorrect and needs refetching from the database. We (dmr & kegan) propose invalidation is room ID scoped, and not more fine-grained for now. This keeps things simple but means we ask the DB for more information than we strictly need to.

Cases when invalidation is needed:

- Redactions. Whilst the proxy doesn't cache the event content of messages, it definitely does cache things like the room name. If the room name is redacted, we need to inform the global/user/connection caches that the name is now unset. This also applies to member display names (for room.name) and avatar URLs (for room.avatar), as well as the canonical alias (for room.name).
- State resolution. This is part of a broader strategy for refreshing entire room state snapshots. The downstream room state caches are currently loaded on startup then strictly updated when live timeline events arrive from the poller. This is a problem when:
  - the state changes without a timeline event (state resets),
  - the proxy has a very old copy of a room which is updated when a poller joins that room.
  In the latter case, we fudged a solution by incorrectly prepending state block events to the timeline to get the caches to update downstream correctly. This makes the timeline incorrect when the client asks for a sufficiently high timeline_limit though.

We propose to add a new payload type V2InvalidateRoom which will contain the room ID to invalidate and the new snapshot ID to load from. This will cause downstream API processes to reload caches from that snapshot. The precise data that needs to be reloaded is the entire list of all joined members and invited members in addition to populating this struct:

type RoomMetadata struct {
	RoomID         string
	Heroes         []Hero
	NameEvent      string // the content of m.room.name, NOT the calculated name
	AvatarEvent    string // the content of m.room.avatar, NOT the resolved avatar
	CanonicalAlias string
	JoinCount      int
	InviteCount    int
	// LastMessageTimestamp is the origin_server_ts of the event most recently seen in
	// this room. Because events arrive at the upstream homeserver out-of-order (and
	// because origin_server_ts is an untrusted event field), this timestamp can
	// _decrease_ as new events come in.
	LastMessageTimestamp uint64
	// LatestEventsByType tracks timing information for the latest event in the room,
	// grouped by event type.
	LatestEventsByType map[string]EventMetadata
	Encrypted          bool
	PredecessorRoomID  *string
	UpgradedRoomID     *string
	RoomType           *string
	// if this room is a space, which rooms are m.space.child state events. This is the same for all users hence is global.
	ChildSpaceRooms map[string]struct{}
	// The latest m.typing ephemeral event for this room.
	TypingEvent json.RawMessage
}

For redactions, we strictly only need Heroes, NameEvent, AvatarEvent, CanonicalAlias as they contain redactable user input.

For state resets, we need all state-related fields. This is all fields with the exception of LatestEventsByType, LastMessageTimestamp and TypingEvent, though in practice we would in most cases have a new latest event in the room, and the typing event would be nullified as too old.

Even if we populate downstream caches effectively, we still need to communicate this information to the client. The room list sorting logic assumes that one update affects at most one room and therefore causes at most one move operation. In the invalidation scenario, this assumption holds if we invalidate one room per payload. This also matches reality well, as we are usually talking about a single room being joined, redacted or state reset, not many.

In theory, it should be enough to send an update to connections with the newly updated RoomConnMetadata values and have them do the right thing, but the code will need to be checked so that it does not assume only one action can take place per payload (e.g. it is possible for the room name to change AND space children to change AND typing to change, etc). We probably also need to resend any required_state events with their new values. E.g. if m.room.name was requested and we just redacted it, we should resend it with content: {}.
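To illustrate the "one update, at most one move" assumption, here is a small sketch. The room and moveOp types and the applyUpdate function are hypothetical and much simpler than the real sorting code; it sorts by timestamp only, most recent first.

```go
package main

import (
	"fmt"
	"sort"
)

// room is a hypothetical, minimal room list entry.
type room struct {
	ID        string
	Timestamp uint64
}

// moveOp describes one room moving from one list index to another.
type moveOp struct {
	From, To int
}

// applyUpdate changes the timestamp of a single room, re-sorts the list
// (most recent first) and reports where that room ended up. Because only
// one room changed, every other room keeps its relative order, so a single
// move operation is enough to describe the change.
// moved is false if the room is not in the list or stayed at its index.
func applyUpdate(rooms []room, roomID string, newTS uint64) (op moveOp, moved bool) {
	from := -1
	for i := range rooms {
		if rooms[i].ID == roomID {
			from = i
			rooms[i].Timestamp = newTS
			break
		}
	}
	if from == -1 {
		return moveOp{}, false
	}
	sort.SliceStable(rooms, func(i, j int) bool {
		return rooms[i].Timestamp > rooms[j].Timestamp
	})
	for i := range rooms {
		if rooms[i].ID == roomID {
			return moveOp{From: from, To: i}, i != from
		}
	}
	return moveOp{}, false
}

func main() {
	rooms := []room{{"!a", 30}, {"!b", 20}, {"!c", 10}}
	op, moved := applyUpdate(rooms, "!c", 40)
	fmt.Println(op, moved, rooms)
}
```

If a payload invalidated several rooms at once, a single From/To pair could no longer describe the change, which is why one room per payload keeps the existing logic valid.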

Testing-wise, we need to engineer several failure scenarios:

- Redactions: Room with name + canonical alias. Redact room name => room.name updates to canonical alias.
- Gappy state: whilst sync v2 is not capable of expressing state resets, we can inject gappy state responses which roughly mimic this. For this, send 5 messages, then send a gappy state response (room name state event in the state block, setting limited: true, and sending a single latest timeline event with a prev_batch token). Then do a sync with timeline_limit: 20. We should only get 1 event (the latest event), the correct prev_batch token and the correct updated room name.

Redactions work was done in #296.

Detecting cache invalidation in pollers:

- Whenever a timeline response is limited and timeline[0] is unknown, then we don't know if timeline[-1] is known or unknown, so we cannot safely respond to timeline_limits which span these events. We should "invalidate" the timeline to only return the span we know to be safe (the latest response). We do not store timelines in-memory, so the invalidation is purely database-backed (probably via a flag on the event row itself).
  - Issue: "Token expirations cause incorrect timelines (exacerbated by OIDC refreshing tokens)" #283
  - Fix: "Don't load events when there's a gap between known events" #300
- Whenever there are events in the state block for a room, and we don't know some of them (i.e. the prepend state events logic). At this point, we should make a new snapshot and invalidate caches downstream.
  - Issue: Making the snapshot fixes "Checkpoint room state" #232 and obviates the need to do "Consider purging redundant data when the proxy leaves a room" #270 (unless we want to do the latter for storage space reasons).
  - Fix: TBD

NOTE: these 2 points are often correlated but not always, e.g. sending 1000 messages gives limited: true but no state block.
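The two detection rules can be sketched as a pure function. Everything here is hypothetical: the action type, the decide function and the known lookup stand in for whatever the poller and database layer actually provide.

```go
package main

import "fmt"

// action records what the poller should do for one room in one response.
type action struct {
	InvalidateTimeline bool // only serve the span from the latest response
	NewSnapshot        bool // make a new snapshot and invalidate downstream caches
}

// decide applies the two rules independently. known reports whether an
// event ID is already stored (a hypothetical lookup).
func decide(limited bool, timelineIDs, stateIDs []string, known func(string) bool) action {
	var a action
	// Rule 1: limited timeline whose first event we have never seen.
	if limited && len(timelineIDs) > 0 && !known(timelineIDs[0]) {
		a.InvalidateTimeline = true
	}
	// Rule 2: the state block contains at least one event we don't know.
	for _, id := range stateIDs {
		if !known(id) {
			a.NewSnapshot = true
			break
		}
	}
	return a
}

func main() {
	stored := map[string]bool{"$a": true}
	known := func(id string) bool { return stored[id] }
	// Many new messages: limited timeline, but no state block.
	fmt.Printf("%+v\n", decide(true, []string{"$x", "$y"}, nil, known))
}
```

Keeping the two checks separate reflects the note above: a limited timeline and an unknown state block often arrive together, but either can occur without the other.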


DMRobertson · 2023-09-07T14:55:02Z

Start simple: an e2e test that makes a room, sets a name, syncs and gets the name. It then redacts the name, sends a sentinel, syncs until the sentinel, and checks that the room name has reset to something sensible.

DMRobertson · 2023-09-08T10:23:06Z

> For state resets, we need all state related fields. This is all fields with the exception of LatestEventsByType, LastMessageTimestamp and TypingEvent

RoomType shouldn't change here---it's a property of the create event, which is unique and immutable.

DMRobertson · 2023-09-13T10:53:39Z

> We do not store timelines in-memory, so the invalidation is purely database-backed (probably via a flag on the event row itself).

Could also have a NID field on the rooms table tracking the oldest event NID after which we have a contiguous timeline. That would mean we don't have to update 5000 bools if we see a gap after 5000 contiguous timeline events.

Or maybe your point is that we have a flag "parentUnknown" or something, default false and set to true in the situation we describe. Then any SQL queries that load events paginating backwards have to stop when they see this flag is true.
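A sketch of the first option (a per-room NID marking the start of the contiguous timeline), done in memory rather than in SQL to keep it self-contained. The latestEvents function and the contiguousFrom parameter are hypothetical names, and event NIDs are assumed to be sorted ascending.

```go
package main

import "fmt"

// latestEvents returns up to limit of the most recent event NIDs (oldest
// first), never reaching back past contiguousFrom, the oldest NID after
// which the stored timeline is known to have no gaps.
// nids must be sorted in ascending order.
func latestEvents(nids []int64, contiguousFrom int64, limit int) []int64 {
	start := len(nids)
	for start > 0 && len(nids)-start < limit && nids[start-1] >= contiguousFrom {
		start--
	}
	return nids[start:]
}

func main() {
	// Events 1-3 were stored before a gap; 10-12 arrived after it.
	nids := []int64{1, 2, 3, 10, 11, 12}
	fmt.Println(latestEvents(nids, 10, 20))
}
```

Seeing a new gap then means updating one value per room rather than a flag on every event row, which is the saving described above.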

kegsay added roadmap poller labels Sep 7, 2023

kegsay assigned DMRobertson Sep 7, 2023

DMRobertson mentioned this issue Sep 7, 2023

Invalidate the global cache after a redaction #296

Merged

This was referenced Sep 13, 2023

Don't load events when there's a gap between known events #300

Merged

Checkpoint room state #232

Closed

DMRobertson mentioned this issue Sep 21, 2023

Initialise: make snapshots instead of prepending state events #310

Closed

This was referenced Oct 26, 2023

Cancel outstanding requests when destroying conns #355

Merged

Add a separate payload for redacting state #359

Merged

DMRobertson mentioned this issue Nov 3, 2023

Handle gappy state with a single snapshot, attempt 3 #363

Merged

DMRobertson closed this as completed in #363 Nov 10, 2023
