Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Refactor MSC3030 /timestamp_to_event to move away from our snowflake pull from destination pattern #14096

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Oct 7, 2022

Refactor MSC3030 /timestamp_to_event:

  1. federation_client.timestamp_to_event(...) now handles all destination looping and uses our generic _try_destination_list(...) helper.
  2. Consistently handling NotRetryingDestination and FederationDeniedError across get_pdu , backfill, and the generic _try_destination_list which is used for many places we use this pattern.
  3. get_pdu(...) now returns PulledPduInfo so we know which destination we ended up pulling the PDU from

Todo

Dev notes

COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint
COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

synapse/http/servlet.py Outdated Show resolved Hide resolved
@MadLittleMods MadLittleMods added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Oct 7, 2022
@MadLittleMods MadLittleMods force-pushed the madlittlemods/13944-fix-msc3030-jump-to-date-complement-backfill-test-flake branch 2 times, most recently from e755f41 to 82535ff Compare October 7, 2022 04:19
@MadLittleMods MadLittleMods changed the title Fix Complement MSC3030 can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint test flake Fix Complement MSC3030 test flake - Make sure we backfill the closest local event if it's an outlier Oct 7, 2022
@MadLittleMods MadLittleMods changed the title Fix Complement MSC3030 test flake - Make sure we backfill the closest local event if it's an outlier Fix MSC3030 Complement test flake - Make sure we backfill the closest local event if it's an outlier Oct 7, 2022
@MadLittleMods MadLittleMods force-pushed the madlittlemods/13944-fix-msc3030-jump-to-date-complement-backfill-test-flake branch 2 times, most recently from 63489a9 to af3fecc Compare October 7, 2022 06:34
Comment on lines 441 to 446
except NotRetryingDestination as e:
logger.info(f"get_pdu(event_id={event_id}): {e}")
continue
except FederationDeniedError:
logger.info(
"Failed to get PDU %s from %s because %s",
event_id,
destination,
e,
f"get_pdu(event_id={event_id}): Not attempting to fetch PDU from {destination} because the homeserver is not on our federation whitelist"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistently handling these two errors across get_pdu , backfill, and the generic _try_destination_list which is used for many places we use this pattern.

Comment on lines +1665 to +1666
# Loop through each homeserver candidate until we get a succesful response
timestamp_to_event_response = await self._try_destination_list(
Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this logic from synapse/handlers/room.py and refactored to use the generic _try_destination_list which handles this pattern of looping over destinations

@MadLittleMods MadLittleMods force-pushed the madlittlemods/13944-fix-msc3030-jump-to-date-complement-backfill-test-flake branch from af3fecc to 7ea5136 Compare October 7, 2022 06:40
@@ -352,11 +366,11 @@ async def _record_failure_callback(
@tag_args
async def get_pdu(
self,
destinations: Iterable[str],
destinations: Collection[str],
Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that we iterate over destinations twice so if it was actually use an Iterable anywhere it would get exhausted the first time we iterate in the log before we actually went over each destination.

@@ -51,7 +51,7 @@ def __init__(self, retry_last_ts: int, retry_interval: int, destination: str):
destination: the domain in question
"""

msg = "Not retrying server %s." % (destination,)
msg = f"Not retrying server {destination} because we tried it recently retry_last_ts={retry_last_ts} and we won't check for another retry_interval={retry_interval}ms."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a more verbose error message to spell things out

…timestamp_to_event_endpoint` test flake. Make sure we backfill the closest local event if it's an `outlier` so we can use it with `/context` and paginate `/messages` from this point. Previously we only did this for a remote event.

Fix  #13944

Why did this fail before? Why was it flakey?
====================================================

Sleuthing the server logs on the CI failure, it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.

Future changes
==========================

In a future PR, it would be nice if `/context` would just backfill the event for us. This would also help with #3848
@MadLittleMods MadLittleMods force-pushed the madlittlemods/13944-fix-msc3030-jump-to-date-complement-backfill-test-flake branch from 7ea5136 to da87def Compare October 7, 2022 06:53
@erikjohnston
Copy link
Member

Is this an actual bug user visible bug? It seems odd to have to have Synapse side changes to fix a flakey complement test, unless Synapse is actually hitting a bug.

@DMRobertson
Copy link
Contributor

Is this an actual bug user visible bug? It seems odd to have to have Synapse side changes to fix a flakey complement test, unless Synapse is actually hitting a bug.

Possibly related: matrix-org/complement#492 (review)

@erikjohnston
Copy link
Member

If we find an outlier locally as a close event, then that gives us the answer right away. The alternative is to not see an event locally, so we go out to federation and ask the same question and get the same outlier we already knew about back. Then we end up backfilling the event anyway.

I guess its not clear to me that we would get anywhere near the same answer. In particular, if the query returns an outlier than (pretty much by definition) we don't actually have that part of the DAG locally, and so we don't know if the outlier is the "closest" event to the timestamp.

As a concrete example: let's take a really busy room that has many messages a day, but relative infrequent joins (say one a week). A new server joins the room (and so persists all joins as outliers) and their user wants to jump to a point several months ago. The query as it stands will return the nearest join event, which is, say, several weeks before the requested point.

I think what this PR will do is cause the client to render a section of the timeline around that join, which is not what the user requested.

Instead, if we didn't pick that join and instead hit over federation, then we're at least more likely to get the right answer back?

@MadLittleMods
Copy link
Contributor Author

I guess its not clear to me that we would get anywhere near the same answer. In particular, if the query returns an outlier than (pretty much by definition) we don't actually have that part of the DAG locally, and so we don't know if the outlier is the "closest" event to the timestamp.

Keep in mind, it also has checks to make sure the local event we found is not next to a forward or backward gap. So chances are the outlier we find is next to a gap and we will go ask federation.

But I think the code/my assumptions are wrong. Re-reading docs/development/room-dag-concepts.md#outliers, we don't store anything in event_edges, event_backward_extremities, event_forward_extremeties for outliers which is_event_next_to_backward_gap(...) and is_event_next_to_forward_gap(...) rely on so our gap checks are inaccurate.

We probably need to add some assertions to those functions that the event is not an outlier.

And we can make the shortcut to not check for outliers in get_event_id_for_timestamp(...) 👍

I'll make these changes in a separate PR and leave these refactor changes until after we merge that. Thanks for chatting through this 🙂

@MadLittleMods MadLittleMods added T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. and removed T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Oct 18, 2022
@MadLittleMods MadLittleMods marked this pull request as draft October 18, 2022 04:43
@MadLittleMods MadLittleMods changed the title Fix MSC3030 Complement test flake - Make sure we backfill the closest local event if it's an outlier Refactor MSC3030 /timestamp_to_event to move away from our snowflake pull from destination pattern Oct 18, 2022
@MadLittleMods MadLittleMods removed the request for review from a team October 18, 2022 04:45
@MadLittleMods MadLittleMods marked this pull request as ready for review October 20, 2022 19:33
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than a couple of nits like using f-strings in log lines I think this looks good

synapse/federation/federation_client.py Outdated Show resolved Hide resolved
synapse/federation/federation_client.py Outdated Show resolved Hide resolved
synapse/federation/federation_client.py Outdated Show resolved Hide resolved
@MadLittleMods MadLittleMods merged commit 40fa829 into develop Oct 26, 2022
@MadLittleMods MadLittleMods deleted the madlittlemods/13944-fix-msc3030-jump-to-date-complement-backfill-test-flake branch October 26, 2022 21:10
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @erikjohnston 🐈

bradtgmurray added a commit to beeper/synapse-legacy-fork that referenced this pull request Nov 15, 2022
Synapse 1.71.0 (2022-11-08)
===========================

Please note that, as announced in the release notes for Synapse 1.69.0, legacy Prometheus metric names are now disabled by default.
They will be removed altogether in Synapse 1.73.0.
If not already done, server administrators should update their dashboards and alerting rules to avoid using the deprecated metric names.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.71/upgrade.html#upgrading-to-v1710) for more details.

**Note:** in line with our [deprecation policy](https://matrix-org.github.io/synapse/latest/deprecation_policy.html) for platform dependencies, this will be the last release to support PostgreSQL 10, which reaches upstream end-of-life on November 10th, 2022. Future releases of Synapse will require PostgreSQL 11+.

No significant changes since 1.71.0rc2.

Synapse 1.71.0rc2 (2022-11-04)
==============================

Improved Documentation
----------------------

- Document the changes to monthly active user metrics due to deprecation of legacy Prometheus metric names. ([\matrix-org#14358](matrix-org#14358), [\matrix-org#14360](matrix-org#14360))

Deprecations and Removals
-------------------------

- Disable legacy Prometheus metric names by default. They can still be re-enabled for now, but they will be removed altogether in Synapse 1.73.0. ([\matrix-org#14353](matrix-org#14353))

Internal Changes
----------------

- Run unit tests against Python 3.11. ([\matrix-org#13812](matrix-org#13812))

Synapse 1.71.0rc1 (2022-11-01)
==============================

Features
--------

- Support back-channel logouts from OpenID Connect providers. ([\matrix-org#11414](matrix-org#11414))
- Allow use of Postgres and SQLlite full-text search operators in search queries. ([\matrix-org#11635](matrix-org#11635), [\matrix-org#14310](matrix-org#14310), [\matrix-org#14311](matrix-org#14311))
- Implement [MSC3664](matrix-org/matrix-spec-proposals#3664), Pushrules for relations. Contributed by Nico. ([\matrix-org#11804](matrix-org#11804))
- Improve aesthetics of HTML templates. Note that these changes do not retroactively apply to templates which have been [customised](https://matrix-org.github.io/synapse/latest/templates.html#templates) by server admins. ([\matrix-org#13652](matrix-org#13652))
- Enable write-ahead logging for SQLite installations. Contributed by [@asymmetric](https://github.com/asymmetric). ([\matrix-org#13897](matrix-org#13897))
- Show erasure status when [listing users](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html#query-user-account) in the Admin API. ([\matrix-org#14205](matrix-org#14205))
- Provide a specific error code when a `/sync` request provides a filter which doesn't represent a JSON object. ([\matrix-org#14262](matrix-org#14262))

Bugfixes
--------

- Fix a long-standing bug where the `update_synapse_database` script could not be run with multiple databases. Contributed by @thefinn93 @ Beeper. ([\matrix-org#13422](matrix-org#13422))
- Fix a bug which prevented setting an avatar on homeservers which have an explicit port in their `server_name` and have `max_avatar_size` and/or `allowed_avatar_mimetypes` configuration. Contributed by @ashfame. ([\matrix-org#13927](matrix-org#13927))
- Check appservice user interest against the local users instead of all users in the room to align with [MSC3905](matrix-org/matrix-spec-proposals#3905). ([\matrix-org#13958](matrix-org#13958))
- Fix a long-standing bug where Synapse would accidentally include extra information in the response to [`PUT /_matrix/federation/v2/invite/{roomId}/{eventId}`](https://spec.matrix.org/v1.4/server-server-api/#put_matrixfederationv2inviteroomideventid). ([\matrix-org#14064](matrix-org#14064))
- Fix a bug introduced in Synapse 1.64.0 where presence updates could be missing from `/sync` responses. ([\matrix-org#14243](matrix-org#14243))
- Fix a bug introduced in Synapse 1.60.0 which caused an error to be logged when Synapse received a SIGHUP signal if debug logging was enabled. ([\matrix-org#14258](matrix-org#14258))
- Prevent history insertion ([MSC2716](matrix-org/matrix-spec-proposals#2716)) during an partial join ([MSC3706](matrix-org/matrix-spec-proposals#3706)). ([\matrix-org#14291](matrix-org#14291))
- Fix a bug introduced in Synapse 1.34.0 where device names would be returned via a federation user key query request when `allow_device_name_lookup_over_federation` was set to `false`. ([\matrix-org#14304](matrix-org#14304))
- Fix a bug introduced in Synapse 0.34.0 where logs could include error spam when background processes are measured as taking a negative amount of time. ([\matrix-org#14323](matrix-org#14323))
- Fix a bug introduced in Synapse 1.70.0 where clients were unable to PUT new [dehydrated devices](matrix-org/matrix-spec-proposals#2697). ([\matrix-org#14336](matrix-org#14336))

Improved Documentation
----------------------

- Explain how to disable the use of [`trusted_key_servers`](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#trusted_key_servers). ([\matrix-org#13999](matrix-org#13999))
- Add workers settings to [configuration manual](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#individual-worker-configuration). ([\matrix-org#14086](matrix-org#14086))
- Correct the name of the config option [`encryption_enabled_by_default_for_room_type`](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#encryption_enabled_by_default_for_room_type). ([\matrix-org#14110](matrix-org#14110))
- Update docstrings of `SynapseError` and `FederationError` to bettter describe what they are used for and the effects of using them are. ([\matrix-org#14191](matrix-org#14191))

Internal Changes
----------------

- Remove unused `@lru_cache` decorator. ([\matrix-org#13595](matrix-org#13595))
- Save login tokens in database and prevent login token reuse. ([\matrix-org#13844](matrix-org#13844))
- Refactor OIDC tests to better mimic an actual OIDC provider. ([\matrix-org#13910](matrix-org#13910))
- Fix type annotation causing import time error in the Complement forking launcher. ([\matrix-org#14084](matrix-org#14084))
- Refactor [MSC3030](matrix-org/matrix-spec-proposals#3030) `/timestamp_to_event` endpoint to loop over federation destinations with standard pattern and error handling. ([\matrix-org#14096](matrix-org#14096))
- Add initial power level event to batch of bulk persisted events when creating a new room. ([\matrix-org#14228](matrix-org#14228))
- Refactor `/key/` endpoints to use `RestServlet` classes. ([\matrix-org#14229](matrix-org#14229))
- Switch to using the `matrix-org/backend-meta` version of `triage-incoming` for new issues in CI. ([\matrix-org#14230](matrix-org#14230))
- Build wheels on macos 11, not 10.15. ([\matrix-org#14249](matrix-org#14249))
- Add debugging to help diagnose lost device list updates. ([\matrix-org#14268](matrix-org#14268))
- Add Rust cache to CI for `trial` runs. ([\matrix-org#14287](matrix-org#14287))
- Improve type hinting of `RawHeaders`. ([\matrix-org#14303](matrix-org#14303))
- Use Poetry 1.2.0 in the Twisted Trunk CI job. ([\matrix-org#14305](matrix-org#14305))

<details>
<summary>Dependency updates</summary>

Runtime:

- Bump anyhow from 1.0.65 to 1.0.66. ([\matrix-org#14278](matrix-org#14278))
- Bump jinja2 from 3.0.3 to 3.1.2. ([\matrix-org#14271](matrix-org#14271))
- Bump prometheus-client from 0.14.0 to 0.15.0. ([\matrix-org#14274](matrix-org#14274))
- Bump psycopg2 from 2.9.4 to 2.9.5. ([\matrix-org#14331](matrix-org#14331))
- Bump pysaml2 from 7.1.2 to 7.2.1. ([\matrix-org#14270](matrix-org#14270))
- Bump sentry-sdk from 1.5.11 to 1.10.1. ([\matrix-org#14330](matrix-org#14330))
- Bump serde from 1.0.145 to 1.0.147. ([\matrix-org#14277](matrix-org#14277))
- Bump serde_json from 1.0.86 to 1.0.87. ([\matrix-org#14279](matrix-org#14279))

Tooling and CI:

- Bump black from 22.3.0 to 22.10.0. ([\matrix-org#14328](matrix-org#14328))
- Bump flake8-bugbear from 21.3.2 to 22.9.23. ([\matrix-org#14042](matrix-org#14042))
- Bump peaceiris/actions-gh-pages from 3.8.0 to 3.9.0. ([\matrix-org#14276](matrix-org#14276))
- Bump peaceiris/actions-mdbook from 1.1.14 to 1.2.0. ([\matrix-org#14275](matrix-org#14275))
- Bump setuptools-rust from 1.5.1 to 1.5.2. ([\matrix-org#14273](matrix-org#14273))
- Bump twine from 3.8.0 to 4.0.1. ([\matrix-org#14332](matrix-org#14332))
- Bump types-opentracing from 2.4.7 to 2.4.10. ([\matrix-org#14133](matrix-org#14133))
- Bump types-requests from 2.28.11 to 2.28.11.2. ([\matrix-org#14272](matrix-org#14272))
</details>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants