
Race condition with replication means /messages backfill lacks read-after-write consistency between workers #14211

matrixbot opened this issue Dec 20, 2023 · 0 comments


matrixbot commented Dec 20, 2023

This issue has been migrated from #14211.


We have a problem: because events handled by a client_reader worker are put on a persist queue (with the actual write happening on another worker), there is no guarantee that they are readable on other workers afterwards. So when we fire off a backfill request from /messages, the backfilled messages aren't necessarily available to paginate over once the backfill completes, not even on the worker that put them in the persister queue.
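To make the failure mode concrete, here is a minimal, self-contained sketch of the race (plain Python with threads, not Synapse code; the shared dicts stand in for the database and for a worker-local cache that is only invalidated when replication traffic arrives):

```python
import threading
import time

# Shared store (think Postgres) plus client_reader1's worker-local cache.
database = {"event1": {"outlier": True}}
reader_cache = {"event1": {"outlier": True}}

def master_persist_and_replicate():
    # The master de-outliers event1 in the database immediately...
    database["event1"] = {"outlier": False}
    # ...but the cache invalidation only reaches client_reader1 after
    # some replication lag.
    time.sleep(0.5)
    reader_cache.pop("event1", None)

def reader_get_event(event_id):
    # client_reader1 consults its local cache before the database.
    if event_id in reader_cache:
        return reader_cache[event_id]
    return database[event_id]

t = threading.Thread(target=master_persist_and_replicate)
t.start()
print(reader_get_event("event1"))  # stale: {'outlier': True}
t.join()
print(reader_get_event("event1"))  # fresh: {'outlier': False}
```

The first read lands in the window between the database write and the replication-driven invalidation, which is exactly the window the /messages request falls into below.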

CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from the discussion). This specific CI flake was addressed in matrix-org/complement#492. The failing test can be reproduced locally with:

```
WORKERS=1 POSTGRES=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
```

Here is what happens:

  1. serverB has event1 stored as an outlier from previous requests (specifically from MSC3030 jump to date pulling in a missing prev_event after backfilling)
  2. Client on serverB calls /messages?dir=b
  3. serverB:client_reader1 accepts the request and drives things
  4. serverB:client_reader1 has some backward extremities in range and requests /backfill from serverA
  5. serverB:client_reader1 processes the events from the backfill, including event1, and puts them in the _event_persist_queue
  6. serverB:master picks the events up from the _event_persist_queue, persists them to the database, de-outliers event1, invalidates its own cache, and sends the invalidations over replication
  7. serverB:client_reader1 starts assembling the /messages response and gets event1 out of its stale cache, still marked as an outlier
  8. serverB:client_reader1 responds to the /messages request without event1 because outliers are filtered out
  9. serverB:client_reader1 finally receives the replication data and invalidates its own cache entry for event1 (too late: we already read the events from the stale cache and responded)

In a nutshell, we've written the test expecting "read-after-write consistency" but we don't have that.
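One way to get that consistency, sketched below under assumptions rather than as the actual Synapse fix, is for the requesting worker to block until its own view of the replication stream has reached the stream position the persister assigned to the backfilled events; at that point the cache invalidations have been applied locally and the read is safe. Here FakeReplication and its wait_for_stream_position are stand-ins for Synapse's replication machinery, and position 42 is an arbitrary example:

```python
import asyncio

class FakeReplication:
    """Stand-in for a worker's view of the replication stream position."""
    def __init__(self):
        self._position = 0
        self._cond = asyncio.Condition()

    async def advance_to(self, position):
        # Called when replication rows (and cache invalidations) arrive.
        async with self._cond:
            self._position = max(self._position, position)
            self._cond.notify_all()

    async def wait_for_stream_position(self, position):
        # Block until this worker has processed replication up to `position`.
        async with self._cond:
            await self._cond.wait_for(lambda: self._position >= position)

async def main():
    replication = FakeReplication()

    async def persister():
        # Simulate the master persisting the backfilled events and the
        # replication traffic taking a while to get back to the reader.
        await asyncio.sleep(0.5)
        await replication.advance_to(42)

    task = asyncio.create_task(persister())
    # Instead of reading straight from a possibly-stale cache, the
    # /messages handler waits for the max stream position the persister
    # assigned to the backfilled events:
    await replication.wait_for_stream_position(42)
    print("caught up; cache invalidations applied, safe to read event1")
    await task

asyncio.run(main())
```

The trade-off is latency: the /messages request now stalls for as long as replication lags, rather than returning quickly with an incomplete timeline.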

-- matrix-org/synapse#13185 (comment)

It's exactly this, but it really sucks that calling /messages doesn't include events we just backfilled for that very request. This is a general problem with Synapse though; see the issues labeled with https://github.com/matrix-org/synapse/labels/Z-Read-After-Write. In this case it's all within the same /messages request, so it's a little more insidious.

The fact that this is possible is even more of a reason that we should indicate gaps in /messages, as proposed in MSC3871.
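For illustration only, this is one hypothetical shape (emphatically not the actual MSC3871 schema) for how a /messages chunk could flag an unfilled gap to clients:

```python
# Hypothetical only: field names here are made up for illustration and
# do NOT match MSC3871. The idea is that the response carries an
# explicit marker saying "events are missing here; retry later".
messages_response = {
    "chunk": [
        {"type": "m.room.message", "event_id": "$event2"},
        # hypothetical gap marker between the two real events:
        {"type": "example.gap", "note": "events missing here, retry later"},
        {"type": "m.room.message", "event_id": "$event0"},
    ],
    "start": "t52-example",
    "end": "t45-example",
}
```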

matrixbot changed the title from "Dummy issue" to "Race condition with replication means /messages backfill lacks read-after-write consistency between workers" on Dec 21, 2023
matrixbot reopened this on Dec 21, 2023