
Race condition with replication means /messages backfill lacks read-after-write consistency between workers #14211

matrixbot opened this issue Dec 20, 2023 · 0 comments


matrixbot commented Dec 20, 2023

This issue has been migrated from #14211.


We have a problem: because events handled by a client_reader worker are put on a persist queue (with the actual write happening on another worker), there is no guarantee that they are readable on other workers afterwards. So when we fire off a backfill request from /messages, the backfilled messages aren't necessarily available to paginate over once the backfill completes, not even on the worker that put them in the persister queue.
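To make the failure mode concrete, here is a minimal, self-contained sketch of the race (plain Python with threads, not Synapse code; the shared dicts stand in for the database and for a worker-local cache that is only invalidated when replication traffic arrives):

```python
import threading
import time

# Shared store (think Postgres) plus client_reader1's worker-local cache.
database = {"event1": {"outlier": True}}
reader_cache = {"event1": {"outlier": True}}

def master_persist_and_replicate():
    # The master de-outliers event1 in the database immediately...
    database["event1"] = {"outlier": False}
    # ...but the cache invalidation only reaches client_reader1 after
    # some replication lag.
    time.sleep(0.5)
    reader_cache.pop("event1", None)

def reader_get_event(event_id):
    # client_reader1 consults its local cache before the database.
    if event_id in reader_cache:
        return reader_cache[event_id]
    return database[event_id]

t = threading.Thread(target=master_persist_and_replicate)
t.start()
print(reader_get_event("event1"))  # stale: {'outlier': True}
t.join()
print(reader_get_event("event1"))  # fresh: {'outlier': False}
```

The first read lands in the window between the database write and the replication-driven invalidation, which is exactly the window the /messages request falls into below.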

CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from the discussion). This specific CI flake was addressed in matrix-org/complement#492. The failing test can be reproduced locally with:

```
WORKERS=1 POSTGRES=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
```

Here is what happens:

  1. serverB has event1 stored as an outlier from previous requests (specifically from MSC3030 jump to date pulling in a missing prev_event after backfilling)
  2. Client on serverB calls /messages?dir=b
  3. serverB:client_reader1 accepts the request and drives things
  4. serverB:client_reader1 has some backward extremities in range and requests /backfill from serverA
  5. serverB:client_reader1 processes the events from the backfill, including event1, and puts them in the _event_persist_queue
  6. serverB:master picks the events up from the _event_persist_queue, persists them to the database, de-outliers event1, invalidates its own cache, and sends the invalidations over replication
  7. serverB:client_reader1 starts assembling the /messages response and gets event1 out of its stale cache, still marked as an outlier
  8. serverB:client_reader1 responds to the /messages request without event1 because outliers are filtered out
  9. serverB:client_reader1 finally receives the replication data and invalidates its own cache entry for event1 (too late: we already read the events from the stale cache and responded)

In a nutshell, we've written the test expecting "read-after-write consistency" but we don't have that.
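One way to get that consistency, sketched below under assumptions rather than as the actual Synapse fix, is for the requesting worker to block until its own view of the replication stream has reached the stream position the persister assigned to the backfilled events; at that point the cache invalidations have been applied locally and the read is safe. Here FakeReplication and its wait_for_stream_position are stand-ins for Synapse's replication machinery, and position 42 is an arbitrary example:

```python
import asyncio

class FakeReplication:
    """Stand-in for a worker's view of the replication stream position."""
    def __init__(self):
        self._position = 0
        self._cond = asyncio.Condition()

    async def advance_to(self, position):
        # Called when replication rows (and cache invalidations) arrive.
        async with self._cond:
            self._position = max(self._position, position)
            self._cond.notify_all()

    async def wait_for_stream_position(self, position):
        # Block until this worker has processed replication up to `position`.
        async with self._cond:
            await self._cond.wait_for(lambda: self._position >= position)

async def main():
    replication = FakeReplication()

    async def persister():
        # Simulate the master persisting the backfilled events and the
        # replication traffic taking a while to get back to the reader.
        await asyncio.sleep(0.5)
        await replication.advance_to(42)

    task = asyncio.create_task(persister())
    # Instead of reading straight from a possibly-stale cache, the
    # /messages handler waits for the max stream position the persister
    # assigned to the backfilled events:
    await replication.wait_for_stream_position(42)
    print("caught up; cache invalidations applied, safe to read event1")
    await task

asyncio.run(main())
```

The trade-off is latency: the /messages request now stalls for as long as replication lags, rather than returning quickly with an incomplete timeline.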

-- matrix-org/synapse#13185 (comment)

It's exactly this, but it really sucks that calling /messages doesn't include events we just backfilled for that very request. This is a general problem with Synapse though; see the issues labeled with https://github.com/matrix-org/synapse/labels/Z-Read-After-Write. In this case it's all within the same /messages request, so it's a little more insidious.

The fact that this is possible is even more of a reason that we should indicate gaps in /messages, as proposed in MSC3871.
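For illustration only, this is one hypothetical shape (emphatically not the actual MSC3871 schema) for how a /messages chunk could flag an unfilled gap to clients:

```python
# Hypothetical only: field names here are made up for illustration and
# do NOT match MSC3871. The idea is that the response carries an
# explicit marker saying "events are missing here; retry later".
messages_response = {
    "chunk": [
        {"type": "m.room.message", "event_id": "$event2"},
        # hypothetical gap marker between the two real events:
        {"type": "example.gap", "note": "events missing here, retry later"},
        {"type": "m.room.message", "event_id": "$event0"},
    ],
    "start": "t52-example",
    "end": "t45-example",
}
```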

matrixbot changed the title from "Dummy issue" to "Race condition with replication means /messages backfill lacks read-after-write consistency between workers" on Dec 21, 2023
matrixbot reopened this on Dec 21, 2023