Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Investigate why /messages is slow #13356

Open
Tracked by #15182
MadLittleMods opened this issue Jul 22, 2022 · 3 comments
Open
Tracked by #15182

Investigate why /messages is slow #13356

MadLittleMods opened this issue Jul 22, 2022 · 3 comments
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing T-Other Questions, user support, anything else.

Comments

@MadLittleMods
Copy link
Contributor

MadLittleMods commented Jul 22, 2022

Background

For Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically /messages seems to be the slow culprit and I haven't really heard anything about optimizing it like /join. From a clean Synapse cache, /messages often takes 80s - 120s to respond and many times timing out itself from Synapse.

Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver which is sampled every 61 minutes to be outside of the state cache expiry, /_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}

  • ?limit=500: 111.81s (492 events / 129 state events)
  • ?limit=100: 119.504s (92 events / 50 state events)
  • ?limit=20: 112.541s (20 events / 11 state events)

After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s

What can we do to make the /messages requests faster? Has any thought already been put in this? For existing issues, I see that we might be backfilling when we shouldn't which aligns with the request being slow regardless how big the limit is, #10742

Caching the response on our end in the archive doesn't make the request fast in the first place.

From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms) and not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just to total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.

-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com

Traces for reference

These include a bunch of the /messages instrumentation added:

Why is it slow?

Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11

This part is WIP. Just documenting some of the log diving I've done for /messages being slow. Still want to do this on some more requests and hopefully get access to Jaeger to also compare and investigate that way too.

1. Backfill linearizer lock takes forever

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: 117.808s
  • Logs in Kibana: GET-6890
    • For reference on adjusting the logging config for gitter.ems.host (set DEBUG), see this comment.

Notes:

  • Why does linearizer lock take ~45s to aquire lock?
  • It doesn't seem to take a lot of time but we don't need to fetch 4.6k backfill points out of the database
  • For me on the outside, the request took 117.808s. Where did the rest of the time go?
    • Delay in connecting to the server?
    • Delay in serializing the response?
    • Need to time the requests more precisely when the started and ended to compare against the Synapse logs

Timing summary from looking at the logs:

  1. 03:42:43.129 (t+0s ): Received request
  2. 03:42:43.712 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 03:43:26.806 (t+43s) Acquired linearizer lock 'room_backfill'
  4. 03:43:26.825 (t+43s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
    • get_oldest_event_ids_with_depth_in_room only took .12s to get the 4.6k events
  5. 03:43:28.716 (t+45s) Asking t2bot.io for backill
  6. 03:43:28.985 (t+45s) t2bot.io responded with 100 events
  7. 03:43:29.009 (t+46s) Starting to verify content hashes
  8. 03:43:29.105 (t+46s) Done verifying content hashes
  9. 03:43:29.106 (t+46s) Processing the pulled events
  10. 03:43:29.147 (t+46s) Done processing 100 backfilled except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 03:43:29.348 (t+46s) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 03:43:34.026 (t+51s) _get_state_groups_from_groups start
  13. 03:43:38.470 (t+55s) _get_state_groups_from_groups end
  14. 03:43:41.112 (t+58s) Response sent

This isn't a fluke, here is another request I went through the logs on (GET-196). This time the duration matched the Synapse logs pretty well:

  1. 20:38:38.049 (t+0s ) Received request
  2. 20:38:38.062 (t+0s ) Waiting to acquire linearizer lock 'room_backfill'
  3. 20:39:23.622 (t+45s) Acquired linearizer lock 'room_backfill'
  4. 20:39:23.640 (t+45s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  5. 20:39:25.625 (t+47s) Asking t2bot.io for backill
  6. 20:39:35.262 (t+57s) t2bot.io responded with 100 events
  7. 20:39:35.283 (t+...) Starting to verify content hashes
  8. 20:39:35.382 (t+...) Done verifying content hashes
  9. 20:39:35.383 (t+...) Processing the pulled events
  10. 20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 20:39:43.668 (t+65s) Response sent

2. Loading tons of events

  • Room: #matrix:matrix.org
  • API URL: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
  • Request -> response duration: ~60s
  • Logs in Kibana: GET-5541

Notes:

  • Tons of get_aggregation_groups_for_event of get_recent_references_for_event calls (2k). Why?
  • Why did we load 79k events?
    • This lines up with the number of state_events in the room
  • Why did we redact some events? Is this redacting events in our response or in general?
  • Why didn't these things happen in the other calls above?

Timing summary from looking at the logs:

  1. 02:09:51.026 (t+0s ) Received request
  2. 02:09:51.959 (t+1s ) _maybe_backfill_inner backfill and pulled 4.6k events out of the database as potential backfill points
  3. 02:09:52.726 (t+2s) synapse.storage.databases.main.events_worker Loading 79277 events (why?)
  4. 02:10:10.462 (t+19s) Also fetching redaction events
  5. 02:10:10.462 (t+19s) Loaded 79277 events
  6. 02:10:23.246 (t+31s) Done redacting 105 events (why?)
  7. 02:10:23.779 (t+31s) Asking t2bot.io for backill
  8. 02:10:33.290 (t+41s) t2bot.io responded with 100 events
  9. ... (t+...) 2k calls to get_recent_references_for_event and get_aggregation_groups_for_event (why?)
  10. 02:10:54.261 (t+63s) Response sent

3. Slow /state_id requests but also really slow processing

524 timeout: 123.26s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm
524 timeout
2022-07-22 @4-43

123.26s
https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab

  • 19.24s /backfill request
  • 53.45s /state_ids
524 timeout: 117.96s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c
524 timeout
2022-07-22 @4:56

117.96s
https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9

  • 921ms /backfill request
  • 47.24s /state_ids request
524 timeout: 115.64s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g
524 timeout
2022-7-22 @5:02:33

{
    "ok": false,
    "timing": 99185.70000004768,
    "startTime": 1658527353492,
    "endTime": 1658527452680,
    "requestId": "p8c09g"
}

115.64s
https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae

  • 14.49s /backfill request
  • 54.47s /state_ids request
200 ok: 83.51s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw
200 ok

{
    "ok": true,
    "status": 200,
    "timing": 83771.90000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658527887072,
    "endTime": 1658527970847,
    "requestId": "cjfpw",
    "traceId": "ae7c694e57113282"
}

83.51s
https://jaeger.proxy.matrix.org/trace/ae7c694e57113282

  • 526ms /backfill request
  • 42.77s /state_ids request
200 ok: 75.7s

https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k
2022-07-22 @5-27pm

{
    "ok": true,
    "status": 200,
    "timing": 76131.40000009537,
    "numEventsReturned": 491,
    "numStateEventsReturned": 129,
    "startTime": 1658528812471,
    "endTime": 1658528888604,
    "requestId": "wg6g8k"
    "traceId": "d048d9f59c20555c"
}

75.7s
https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c

  • 549ms /backfill request
  • 32.58s /state_ids request

/messages timing script

Every 61 minutes (to be outside of the state cache expiry), will call and time /messages for each room defined. And will do this with ?limit= 500, 100, 20 to see if the amount of messages matters on the timing.

Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.

matrix-messages-timing.js
let resultantTimingData = {};
  
async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
      headers: {
        'Authorization': `Bearer ${MATRIX_TOKEN}`
      }
    });
  } catch(err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
    }
  
  const traceId = res?.headers?.['synapse-trace-id'];
  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [
    500,
    100,
    20
  ];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];

        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);


        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }

        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);

        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      })
    }
  }
})()
function getAvg(roomId, numMessages) {
    const timingEntries = data1[roomId][numMessages];
    return timingEntries
        .filter((timingEntry) => timingEntry.ok)
        .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / timingEntries.length;
}

getAvg('matrix:matrix.org', "500")
@MadLittleMods MadLittleMods added A-Performance Performance, both client-facing and admin-facing T-Other Questions, user support, anything else. A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) labels Jul 22, 2022
MadLittleMods added a commit that referenced this issue Aug 16, 2022
The after timing can be influenced by slow `/state_ids` requests
when proessing the pulled events.

Part of #13356
MadLittleMods added a commit that referenced this issue Aug 16, 2022
Instrument the federation/backfill part of `/messages` so it's easier to follow what's going on in Jaeger when viewing a trace.

Split out from #13440

Follow-up from #13368

Part of #13356
MadLittleMods added a commit that referenced this issue Aug 16, 2022
MadLittleMods added a commit that referenced this issue Aug 17, 2022
MadLittleMods added a commit that referenced this issue Aug 18, 2022
Track number of hosts affected by the rate limiter so we can differentiate one really noisy homeserver from a general ratelimit tuning problem across the federation.

Follow-up to #13534

Part of #13356
MadLittleMods added a commit that referenced this issue Aug 24, 2022
…ild concurrent calls (#13588)

Instrument `_check_sigs_and_hash_and_fetch` to trace time spent in child concurrent calls because I've see `_check_sigs_and_hash_and_fetch` take [10.41s to process 100 events](#13587)

Fix #13587

Part of #13356
MadLittleMods added a commit that referenced this issue Sep 14, 2022
We can follow-up this PR with:

 1. Only try to backfill from an event if we haven't tried recently -> #13622
 1. When we decide to backfill that event again, process it in the background so it doesn't block and make `/messages` slow when we know it will probably fail again -> #13623
 1. Generally track failures everywhere we try and fail to pull an event over federation -> #13700

Fix #13621

Part of #13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.qv7cj51sv9i5)
MadLittleMods added a commit that referenced this issue Sep 15, 2022
MadLittleMods added a commit that referenced this issue Sep 23, 2022
Only try to backfill event if we haven't tried before recently (exponential backoff). No need to keep trying the same backfill point that fails over and over.

Fix #13622
Fix #8451

Follow-up to #13589

Part of #13356
MadLittleMods added a commit that referenced this issue Sep 28, 2022
…se (#13879)

There is no need to grab thousands of backfill points when we only need 5 to make the `/backfill` request with. We need to grab a few extra in case the first few aren't visible in the history.

Previously, we grabbed thousands of backfill points from the database, then sorted and filtered them in the app. Fetching the 4.6k backfill points for `#matrix:matrix.org` from the database takes ~50ms - ~570ms so it's not like this saves a lot of time 🤷. But it might save us more time now that `get_backfill_points_in_room`/`get_insertion_event_backward_extremities_in_room` are more complicated after #13635 

This PR moves the filtering and limiting to the SQL query so we just have less data to work with in the first place.

Part of #13356
MadLittleMods added a commit that referenced this issue Oct 3, 2022
Because we're doing the recording in `_check_sigs_and_hash_for_pulled_events_and_fetch` (previously named `_check_sigs_and_hash_and_fetch`), this means we will track signature failures for `backfill`, `get_room_state`, `get_event_auth`, and `get_missing_events` (all pulled event scenarios). And we also record signature failures from `get_pdu`.

Part of #13700

Part of #13676 and #13356

This PR will be especially important for #13816 so we can avoid the costly `_get_state_ids_after_missing_prev_event` down the line when `/messages` calls backfill.
@MadLittleMods
Copy link
Contributor Author

Progress

We've made some good progress on these slow 180s type requests. You can see the big drop in these long-running requests around Sept. 26 after we shipped #13635 (possible with all of the PRs before it). This metric can be influenced by other servers being slow to respond to /backfill, /state_ids, /state, /event from federation so it’s not like we can eliminate all slow requests but it seems obvious that they happen a lot less now.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1663806404570&to=1665016004570&viewPanel=230
Grafana graph showing: Of requests that take more than 5s, what percentage take 180s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 120s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 60s or longer

The effects don't really show for the RoomMessageListRestServlet overall. Perhaps because they have a low ?limit, they don’t have to backfill, maybe they are going forwards instead of backwards, etc. Another influencing factor is that I have only focused on the big requests that took 180s+ before which now take around 30s with the optimizations and this isn’t within range of the Apdex 2.5s satisfied, 10s tolerable thresholds to show progress. I tried playing with those thresholds but didn’t see much. Too many normal fast requests 🤷

What's left?

/messages is still not anywhere close to fast enough as it still takes 30s for my benchmark request I have been running that used to take 180s. A great goal would be to get all of these down to 5s or less.

Here is the latest trace now that we're down in the ~30s range, https://gist.github.com/MadLittleMods/c3ace249ce2aec77cb78d19a4f85a776

~30s /messages trace in Jaeger

As you can see, we need to:

  1. Speed-up filter_events_for_client (this takes 7.83s in that request), filter_events_for_client is time-consuming and expensive #14108
  2. Speed-up get_bundled_aggregations (this takes 3.84s in that request), get_bundled_aggregations takes a few seconds even when there is no relations #13624
  3. Speed up the maybe_backfill part (this takes 15.29s in that request),
    • Speed up _process_pulled_events in backfill (this takes 5.15s in that request), Process previously failed backfill events in the background #13623
    • 7.24s is from the /backfill outgoing federation request which we have less control over unless we can figure out that we don't actually need to backfill in that situation.
    • That still leaves 2.9s of random stuff which should surely be able to go away 🤞

If we can get rid of the time in the bolded numbers, that saves us 20s gets us down to 10s /messages requests. We will have to look at the traces again once we get /messages down to the 10s mark.

MadLittleMods added a commit that referenced this issue Oct 15, 2022
…ure is invalid (#13816)

While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to
 - #13622
    - #13635
 - #13676

Part of #13356

Follow-up to #13815 which tracks event signature failures.

With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now.

For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
MadLittleMods added a commit that referenced this issue Nov 22, 2022
Seems like it will make it a magnitude faster.

Fix #14108

Part of #13356 (comment)
MadLittleMods added a commit that referenced this issue May 12, 2023
Fix #13623

Follow-up to #13621 and #13622

Part of making `/messages` faster: #13356
MadLittleMods added a commit that referenced this issue May 19, 2023
…#15610)

Instrument `state` and `state_group` storage related things (tracing) so it's a little more clear where these database transactions are coming from as there is a lot of wires crossing in these functions.

Part of `/messages` performance investigation: #13356
MadLittleMods added a commit that referenced this issue May 25, 2023
Process previously failed backfill events in the background because they are bound to fail again and we don't need to waste time holding up the request for something that is bound to fail again.

Fix #13623

Follow-up to #13621 and #13622

Part of making `/messages` faster: #13356
@MadLittleMods
Copy link
Contributor Author

Progress report

The benchmark /messages?dir=b&limit=500 request is down to ~4s now (we originally started at over 185s+)

Full Jaeger trace JSON

Benchmark request: https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D

What's left?

The big things:

  1. get_state_for_events/get_state_ids_for_event -> _get_state_for_groups -> _get_state_groups_from_groups took 1.94s in that request
  2. paginate_room_events -> get_events_as_list took 551ms in that request
    • ⭐ Seems like lots of individual lookups that we could simplify

The smaller things:

  • get_bundled_aggregations took 200ms from that request
  • get_state_for_events/get_state_ids_for_event -> get_state_group_for_events took 229ms from that request
  • get_state_for_events/get_state_ids_for_event -> _get_state_for_groups_using_cache took 137ms from that request

But all of that notable stuff only totals up 3057ms so there is still 373ms of random baggage.


In terms of waterfalls:

  • We could save 490ms by running get_state_ids_for_event in parallel with filter_events_for_client since it only needs state at the first event. And we don't need to worry whether the first event would have been filtered because the state just builds on top.
    • ⭐ Can we instead just re-use the work to get all the state out for filter_events_for_client when we try to get_state_ids_for_event at the end to return in the /messages response? That would eliminate the 490ms lookup.
  • ⭐ We could save 200ms by running get_bundled_aggregations in parallel with filter_events_for_client. We might grab a few more aggregations but can always filter them down.
  • We could save 30ms by removing the ignored_users -> are_users_erased sequential call waterfall from filter_events_for_client
    • This sometimes runs fast and is negligible so I'm not sure the overhead of running in parallel is worth it.
  • ⭐ Instead of waiting for us to process the whole 500 events before moving on to the next step, could we break it up into a pipeline and stream results through as we go (perhaps 100 at a time).

There is a 160ms mystery gap after encode_json_response

What's achievable?

Feels do-able to reduce the response time by another 1s (I've starred some things above). And for example if we get another /messages response that looks like this trace, it could get us down to that target 1.85s mark to say "we've improved /messages by 100x"

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing T-Other Questions, user support, anything else.
Projects
None yet
Development

No branches or pull requests

3 participants