This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Race condition with replication means `/messages` backfill lacks read-after-write consistency between workers
#14211
Labels
A-Testing
Issues related to testing in complement, synapse, etc
A-Workers
Problems related to running Synapse in Worker Mode (or replication)
O-Uncommon
Most users are unlikely to come across this or unexpected workflow
S-Tolerable
Minor significance, cosmetic issues, low or no impact to users.
T-Defect
Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Z-Read-After-Write
A lack of read-after-write consistency, usually due to cache invalidation races with workers
We have a problem where, because events are persisted in a queue in a `client_reader` worker, there is no guarantee that they are available to read on other workers. So when we fire off a backfill request from `/messages`, those backfilled messages aren't necessarily available to paginate with after the backfill completes (even on the worker that put them in the persister queue).

CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from discussion). This specific CI flake was addressed in matrix-org/complement#492.
Here is what happens:

1. `serverB` has `event1` stored as an `outlier` from previous requests (specifically from MSC3030 jump to date pulling in a missing `prev_event` after backfilling)
2. `serverB` calls `/messages?dir=b`
3. `serverB:client_reader1` accepts the request and drives things
4. `serverB:client_reader1` has some backward extremities in range and requests `/backfill` from `serverA`
5. `serverB:client_reader1` processes the events from backfill including `event1` and puts them in the `_event_persist_queue`
6. `serverB:master` picks up the events from the `_event_persist_queue` and persists them to the database, de-outliers `event1` and invalidates its own cache and sends them over replication
7. `serverB:client_reader1` starts assembling the `/messages` response and gets `event1` out of the stale cache still as an `outlier`
8. `serverB:client_reader1` responds to the `/messages` request without `event1` because `outliers` are filtered out
9. `serverB:client_reader1` finally gets the replication data and invalidates its own cache for `event1` (too late, we already got the events from the stale cache and responded)

It's exactly this but it really sucks that calling `/messages` doesn't include events we just backfilled for that request. This is a general problem with Synapse though, see issues labeled with Z-Read-After-Write. In this case, it's all within the same `/messages` request so it's a little more insidious.

Having this be possible makes it even more of a reason that we should indicate gaps in `/messages`, MSC3871.
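
To make the ordering problem concrete, here is a minimal toy model of the sequence described in this issue. It is only a sketch and not Synapse code: `Database`, `Worker`, `simulate` and the `process_replication_before_read` flag are hypothetical names invented for illustration, and the "replication stream" is just a Python list. It assumes each worker keeps its own event cache in front of a shared database, and that a worker's cache entry is only dropped when that worker processes the replication data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    outlier: bool


class Database:
    """Shared store that every worker reads from (hypothetical stand-in)."""

    def __init__(self):
        self.events = {}

    def persist(self, event):
        self.events[event.event_id] = event

    def get(self, event_id):
        return self.events[event_id]


class Worker:
    """A worker with its own cache in front of the shared database."""

    def __init__(self, name, db):
        self.name = name
        self.db = db
        self.cache = {}

    def get_event(self, event_id):
        # Serve from the local cache if present, even if it is stale.
        if event_id not in self.cache:
            self.cache[event_id] = self.db.get(event_id)
        return self.cache[event_id]

    def invalidate(self, event_id):
        self.cache.pop(event_id, None)

    def messages(self, event_ids):
        # Outliers are filtered out of the response.
        return [e for e in event_ids if not self.get_event(e).outlier]


def simulate(process_replication_before_read):
    db = Database()
    master = Worker("master", db)
    client_reader1 = Worker("client_reader1", db)
    replication_stream = []  # invalidations not yet seen by client_reader1

    def drain_replication():
        for event_id in replication_stream:
            client_reader1.invalidate(event_id)
        replication_stream.clear()

    # event1 already exists as an outlier and is cached on client_reader1.
    db.persist(Event("event1", outlier=True))
    client_reader1.get_event("event1")

    # client_reader1 queues the backfilled event; master persists it,
    # de-outliers it, invalidates its own cache and sends it over replication.
    persist_queue = [Event("event1", outlier=False)]
    for event in persist_queue:
        db.persist(event)
        master.invalidate(event.event_id)
        replication_stream.append(event.event_id)

    if process_replication_before_read:
        # Hypothetical ordering in which replication wins the race.
        drain_replication()

    # client_reader1 assembles the /messages response.
    response = client_reader1.messages(["event1"])

    # Replication data finally arrives (too late in the racy ordering).
    drain_replication()
    return response


if __name__ == "__main__":
    print(simulate(process_replication_before_read=False))
    print(simulate(process_replication_before_read=True))
```

In this model the racy ordering returns an empty response because `client_reader1` still sees the cached outlier, while the ordering in which the invalidation is processed first returns `event1`. The flag only illustrates which ordering is needed; it does not claim to describe how a fix would be implemented in Synapse.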