Faster room joins: avoid blocking when pulling events with missing prevs #13355
Changes from 7 commits
@@ -0,0 +1 @@
+Faster room joins: skip soft fail checks while Synapse only has partial room state, since the current membership of event senders may not be accurately known.
@@ -0,0 +1 @@
+Faster room joins: avoid blocking when pulling events with partially missing prev events.
@@ -278,15 +278,19 @@ async def on_receive_pdu(self, origin: str, pdu: EventBase) -> None:
             )
 
         try:
-            await self._process_received_pdu(origin, pdu, state_ids=None)
+            await self._process_received_pdu(
+                origin, pdu, state_ids=None, partial_state=None
+            )
         except PartialStateConflictError:
             # The room was un-partial stated while we were processing the PDU.
             # Try once more, with full state this time.
             logger.info(
                 "Room %s was un-partial stated while processing the PDU, trying again.",
                 room_id,
             )
-            await self._process_received_pdu(origin, pdu, state_ids=None)
+            await self._process_received_pdu(
+                origin, pdu, state_ids=None, partial_state=None
+            )
 
     async def on_send_membership_event(
         self, origin: str, event: EventBase
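The hunk above relies on a retry-exactly-once pattern: if the room is un-partial-stated while the PDU is being processed, the processing is repeated once with full state. A minimal sketch of that pattern, using illustrative names rather than Synapse's actual classes:

```python
# Minimal sketch (illustrative names, not Synapse's classes) of the
# retry-exactly-once pattern used in on_receive_pdu: if the room is
# un-partial-stated while we are processing, retry once with full state.
import asyncio


class PartialStateConflictError(Exception):
    """Raised when the room was un-partial stated mid-processing."""


async def process_with_one_retry(process) -> str:
    try:
        return await process()
    except PartialStateConflictError:
        # The room was un-partial stated; try once more, with full state.
        return await process()


calls = {"n": 0}


async def fake_process() -> str:
    # First attempt conflicts; the retry succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise PartialStateConflictError()
    return "persisted"


result = asyncio.run(process_with_one_retry(fake_process))
print(result)  # persisted
```

Note that the pattern deliberately does not loop: a second conflict would be a logic error, so it is allowed to propagate.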
@@ -534,14 +538,20 @@ async def update_state_for_partial_state_event(
         #
         # This is the same operation as we do when we receive a regular event
         # over federation.
-        state_ids = await self._resolve_state_at_missing_prevs(destination, event)
+        state_ids, partial_state = await self._resolve_state_at_missing_prevs(
+            destination, event
+        )
 
-        # build a new state group for it if need be
-        context = await self._state_handler.compute_event_context(
-            event,
-            state_ids_before_event=state_ids,
-        )
-        if context.partial_state:
+        context = None
+        if not partial_state:
+            # build a new state group for it if need be
+            context = await self._state_handler.compute_event_context(
+                event,
+                state_ids_before_event=state_ids,
+                partial_state=partial_state,
+            )
+
+        if context is None or context.partial_state:
             # this can happen if some or all of the event's prev_events still have
             # partial state - ie, an event has an earlier stream_ordering than one
             # or more of its prev_events, so we de-partial-state it before its
@@ -806,14 +816,32 @@ async def _process_pulled_event(
             return
 
         try:
-            state_ids = await self._resolve_state_at_missing_prevs(origin, event)
-            # TODO(faster_joins): make sure that _resolve_state_at_missing_prevs does
-            #   not return partial state
-            #   https://github.com/matrix-org/synapse/issues/13002
-
-            await self._process_received_pdu(
-                origin, event, state_ids=state_ids, backfilled=backfilled
-            )
+            state_ids, partial_state = await self._resolve_state_at_missing_prevs(
+                origin, event
+            )
+            await self._process_received_pdu(
+                origin,
+                event,
+                state_ids=state_ids,
+                partial_state=partial_state,
+                backfilled=backfilled,
+            )
+        except PartialStateConflictError:
+            # The room was un-partial stated while we were processing the event.
+            # Try once more, with full state this time.
+            state_ids, partial_state = await self._resolve_state_at_missing_prevs(
+                origin, event
+            )
+            # We ought to have full state now, barring some unlikely race where we
+            # left and rejoined the room in the background.
Review thread on lines +853 to +854:

should we check that

sure, keeping in mind that it won't catch the "calculate-the-state-yourself" case. I can't think of how to handle a second partial state apart from raising an exception.

sorry yes, I'd envisaged raising an exception, on the grounds we could give a clearer idea of what's going wrong. But if you'd expect it to behave correctly (ish) in the case where we've rejoined, then I guess that's not appropriate and we should just leave this.

I've yet to figure out what happens when we rejoin (not that we can even leave right now!) so am happy to raise an exception.
+            await self._process_received_pdu(
+                origin,
+                event,
+                state_ids=state_ids,
+                partial_state=partial_state,
+                backfilled=backfilled,
+            )
         except FederationError as e:
             if e.code == 403:
                 logger.warning("Pulled event %s failed history check.", event_id)
@@ -822,7 +850,7 @@ async def _process_pulled_event(
 
     async def _resolve_state_at_missing_prevs(
         self, dest: str, event: EventBase
-    ) -> Optional[StateMap[str]]:
+    ) -> Tuple[Optional[StateMap[str]], Optional[bool]]:
         """Calculate the state at an event with missing prev_events.
 
         This is used when we have pulled a batch of events from a remote server, and
@@ -849,8 +877,10 @@ async def _resolve_state_at_missing_prevs(
             event: an event to check for missing prevs.
 
         Returns:
-            if we already had all the prev events, `None`. Otherwise, returns
-            the event ids of the state at `event`.
+            if we already had all the prev events, `None, None`. Otherwise, returns a
+            tuple containing:
+            * the event ids of the state at `event`.
+            * a boolean indicating whether the state may be partial.
 
         Raises:
             FederationError if we fail to get the state from the remote server after any
@@ -864,7 +894,7 @@ async def _resolve_state_at_missing_prevs(
         missing_prevs = prevs - seen
 
         if not missing_prevs:
-            return None
+            return None, None
 
         logger.info(
             "Event %s is missing prev_events %s: calculating state for a "
@@ -876,9 +906,15 @@ async def _resolve_state_at_missing_prevs(
         # resolve them to find the correct state at the current event.
 
         try:
+            # Determine whether we may be about to retrieve partial state
+            # Events may be un-partial stated right after we compute the partial state
+            # flag, but that's okay, as long as the flag errs on the conservative side.
+            partial_state_flags = await self._store.get_partial_state_events(seen)
+            partial_state = any(partial_state_flags.values())
+
             # Get the state of the events we know about
             ours = await self._state_storage_controller.get_state_groups_ids(
-                room_id, seen
+                room_id, seen, await_full_state=False
             )
 
             # state_maps is a list of mappings from (type, state_key) to event_id
@@ -924,7 +960,7 @@ async def _resolve_state_at_missing_prevs(
                 "We can't get valid state history.",
                 affected=event_id,
             )
-        return state_map
+        return state_map, partial_state
 
     async def _get_state_ids_after_missing_prev_event(
         self,
@@ -1094,6 +1130,7 @@ async def _process_received_pdu(
         origin: str,
         event: EventBase,
         state_ids: Optional[StateMap[str]],
+        partial_state: Optional[bool],
         backfilled: bool = False,
     ) -> None:
         """Called when we have a new non-outlier event.
|
@@ -1117,21 +1154,29 @@ async def _process_received_pdu( | |
|
||
state_ids: Normally None, but if we are handling a gap in the graph | ||
(ie, we are missing one or more prev_events), the resolved state at the | ||
event. Must not be partial state. | ||
event | ||
|
||
partial_state: | ||
`True` if `state_ids` is partial and omits non-critical membership | ||
events. | ||
`False` if `state_ids` is the full state. | ||
`None` if `state_ids` is not provided. In this case, the flag will be | ||
calculated based on `event`'s prev events. | ||
|
||
backfilled: True if this is part of a historical batch of events (inhibits | ||
notification to clients, and validation of device keys.) | ||
|
||
PartialStateConflictError: if the room was un-partial stated in between | ||
computing the state at the event and persisting it. The caller should retry | ||
exactly once in this case. Will never be raised if `state_ids` is provided. | ||
exactly once in this case. | ||
""" | ||
logger.debug("Processing event: %s", event) | ||
assert not event.internal_metadata.outlier | ||
|
||
context = await self._state_handler.compute_event_context( | ||
event, | ||
state_ids_before_event=state_ids, | ||
partial_state=partial_state, | ||
) | ||
try: | ||
await self._check_event_auth(origin, event, context) | ||
|
@@ -1664,11 +1709,21 @@ async def _check_for_soft_fail(
         """Checks if we should soft fail the event; if so, marks the event as
         such.
 
+        Does nothing for events in rooms with partial state, since we may not have an
+        accurate membership event for the sender in the current state.
+
         Args:
             event
             state_ids: The state at the event if we don't have all the event's prev events
             origin: The host the event originates from.
         """
+        if await self._store.is_partial_state_room(event.room_id):
+            # We might not know the sender's membership in the current state, so don't
+            # soft fail anything. Even if we do have a membership for the sender in the
+            # current state, it may have been derived from state resolution between
+            # partial and full state and may not be accurate.
+            return
+
         extrem_ids_list = await self._store.get_latest_event_ids_in_room(event.room_id)
         extrem_ids = set(extrem_ids_list)
         prev_event_ids = set(event.prev_event_ids())
@@ -82,13 +82,15 @@ async def get_state_group_delta(
         return state_group_delta.prev_group, state_group_delta.delta_ids
 
     async def get_state_groups_ids(
Review thread:

as a random point for future reference: there is a convention that the subject line of a git commit message should be limited to 50 characters. That helps avoid this sort of thing: [screenshot] There's an art to writing such a pithy summary, of course. In this case I would probably just say: ... and put "so that it can return partial state" in the body of the commit message.

interesting, I wasn't aware that the subject line had a shorter recommended length. Up until now I've been using a limit of 72 characters in merge commits (and ignoring it in commits that get squashed, like the one above). 50 characters certainly makes things tricky!

Yeah, the problem comes when we're doing a commit-by-commit review: it's much nicer if there's a pithy summary for each commit. It's certainly not worth stressing overmuch over though.

Indeed. github actually seems to truncate at 66 chars, so that seems a more reasonable compromise.
-        self, _room_id: str, event_ids: Collection[str]
+        self, _room_id: str, event_ids: Collection[str], await_full_state: bool = True
     ) -> Dict[int, MutableStateMap[str]]:
         """Get the event IDs of all the state for the state groups for the given events
 
         Args:
             _room_id: id of the room for these events
             event_ids: ids of the events
+            await_full_state: if `True`, will block if we do not yet have complete
+                state at these events.
 
         Returns:
             dict of state_group_id -> (dict of (type, state_key) -> event id)
@@ -100,7 +102,9 @@ async def get_state_groups_ids(
         if not event_ids:
             return {}
 
-        event_to_groups = await self.get_state_group_for_events(event_ids)
+        event_to_groups = await self.get_state_group_for_events(
+            event_ids, await_full_state=await_full_state
+        )
 
         groups = set(event_to_groups.values())
         group_to_state = await self.stores.state._get_state_for_groups(groups)
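The `await_full_state` flag above controls whether the lookup blocks until the room has been un-partial stated. A toy model of that behaviour (not Synapse's storage API; all names below are illustrative):

```python
# Toy model (not Synapse's storage API) of an await_full_state flag:
# True blocks until the room has full state; False returns immediately
# with whatever (possibly partial) state exists.
import asyncio
from typing import Dict, Tuple


class ToyStateStore:
    def __init__(self) -> None:
        self._full_state_arrived = asyncio.Event()
        # Only the critical events so far; membership events still missing.
        self._state: Dict[Tuple[str, str], str] = {
            ("m.room.create", ""): "$create",
        }

    async def get_state(
        self, await_full_state: bool = True
    ) -> Dict[Tuple[str, str], str]:
        if await_full_state:
            # Block until the room is un-partial stated.
            await self._full_state_arrived.wait()
        return dict(self._state)


async def demo() -> Dict[Tuple[str, str], str]:
    store = ToyStateStore()
    # With await_full_state=False we do not block on the missing state.
    return await store.get_state(await_full_state=False)


partial = asyncio.run(demo())
print(partial)  # {('m.room.create', ''): '$create'}
```

With the default `await_full_state=True`, the same call would hang until the event was set, which is why pulling events with missing prevs passes `False`.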
Review thread:

I'd find it helpful to enumerate the possibilities here. AIUI, they are:

* `state_ids` and `partial_state` are both `None`, if we had all the prev_events (some of which may have partial state).
* `state_ids` is non-empty, and `partial_state` is `False`, meaning we were missing some or all prev_events, and have calculated the state after the prev_events.
* `state_ids` is non-empty, and `partial_state` is `True`, meaning we were missing some prev_events, but others had partial state. We have calculated the state after all the prev_events, but it is still partial. There is no point building an event context in this case.

... I can't help feeling that the interface for `_resolve_state_at_missing_prevs` is excessively baroque and coupled to that of `compute_event_context`. (Should we replace `_resolve_state_at_missing_prevs` with something that returns an `EventContext`, and we pass an `EventContext` into `_process_received_pdu` instead of `state_ids` and `partial_state`?) But that probably requires a bunch of refactoring that isn't justified right now.

I've added a big comment listing the cases, borrowing some of your wording here.

I agree that `_resolve_state_at_missing_prevs` is really unwieldy after this change. Is the proposal to have it *always* return an `EventContext`? I'll try that refactor in a follow-on PR.