This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve performance of background update populate_user_directory_process_rooms #15264

Closed
H-Shay opened this issue Mar 13, 2023 · 9 comments
Labels
  • A-Background-Updates: Filling in database columns, making the database eventually up-to-date
  • A-Database: DB stuff like queries, migrations, new/removed columns, indexes, unexpected entries in the db
  • O-Occasional: Affects or can be seen by some users regularly or most users rarely
  • S-Minor: Blocks non-critical functionality, workarounds exist.
  • T-Task: Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

H-Shay (Contributor) commented Mar 13, 2023

Relevant code begins here:

self.db_pool.updates.register_background_update_handler(

This process can cause excess load on the DB; we perhaps need to reconfigure the batch size to make it less resource-intensive.
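
For context, Synapse background updates register a handler that is called with a batch size and returns how many items it processed; the updater uses that return value to pace the next batch, which is why the batch size is the natural knob here. A minimal sketch of that shape, with hypothetical helpers standing in for the real room-processing logic:

```python
from synapse.types import JsonDict


async def _process_rooms_sketch(self, progress: JsonDict, batch_size: int) -> int:
    """Illustrative handler shape for populate_user_directory_process_rooms.

    `batch_size` is chosen by the background updater to try to hit a target
    duration per iteration; returning the number of items actually processed
    is what lets it tune the next batch.
    """
    # Hypothetical helper: fetch the next `batch_size` rooms still to process.
    room_ids = await self._get_next_rooms_to_process(progress, limit=batch_size)
    for room_id in room_ids:
        # Hypothetical helper: rebuild the directory entries for this room.
        await self._handle_room(room_id)
    return len(room_ids)
```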

H-Shay added the A-Background-Updates, A-Database, T-Task, O-Occasional, and S-Minor labels on Mar 13, 2023
clokep (Member) commented Mar 13, 2023

The main issue (from my understanding) is that it reports progress based on the number of rooms processed, but rooms vary in how "expensive" they are. Maybe using something like rooms * users_in_room would be better?

That might allow the background update algorithm to process it a bit more reasonably.
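
As a rough illustration of that weighting idea (purely a sketch; the joined-member lookup and helpers below are assumed, not existing code), the handler could report work in units of roughly rooms * users_in_room rather than a raw room count:

```python
async def _process_rooms_weighted_sketch(self, progress, batch_size: int) -> int:
    """Treat `batch_size` as a budget of work units (~ users touched)."""
    work_done = 0
    # Hypothetical helper, as in the previous sketch.
    room_ids = await self._get_next_rooms_to_process(progress, limit=batch_size)
    for room_id in room_ids:
        # Assumed lookup of the room's joined-member count.
        num_users = await self.get_number_joined_users_in_room(room_id)
        await self._handle_room(room_id)  # hypothetical helper
        # A big room consumes more of the budget than a tiny one.
        work_done += max(1, num_users)
        if work_done >= batch_size:
            break
    return work_done
```

Returning the weighted count would let the background update algorithm shrink the batch when it hits a run of very large rooms.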

richvdh (Member) commented Mar 14, 2023

For background: we saw a significant slowdown on matrix.org:

[graph: event persistence times on matrix.org, showing spikes]

This appears to have aligned with some work done by the populate_user_directory_process_rooms background database update. In particular, the update starts with a big delete:

async def delete_all_from_user_dir(self) -> None:
    """Delete the entire user directory"""

    def _delete_all_from_user_dir_txn(txn: LoggingTransaction) -> None:
        txn.execute("DELETE FROM user_directory")
        txn.execute("DELETE FROM user_directory_search")
        txn.execute("DELETE FROM users_in_public_rooms")
        txn.execute("DELETE FROM users_who_share_private_rooms")
        txn.call_after(self.get_user_in_directory.invalidate_all)

    await self.db_pool.runInteraction(
        "delete_all_from_user_dir", _delete_all_from_user_dir_txn
    )

... which, based on postgres's slow query logs, completed at 21:12 UTC on 2023-03-13, exactly coinciding with the first spike in the above graph.

> The main issue (from my understanding) is that it reports progress based on the number of rooms processed, but rooms vary in how "expensive" they are. Maybe using something like rooms * users_in_room would be better?

I'm not convinced that's the main issue. Certainly the updates are quite expensive, but empirically the delete caused a problem too - and it's much harder to break down into small steps.

To be determined here:

  • why does a delete in users_who_share_private_rooms cause a significant slowdown in event persistence? (see the diagnostic sketch after this list)
  • this was far from the only spike in event persistence times over the course of the evening: there were others at 20:15 and a long slowdown between 18:00 and 18:25. It is currently unknown what caused them.
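
For the first question, one generic way to investigate (standard Postgres tooling, not something this issue established) is to check which backends are waiting behind the delete while it runs:

```python
# Illustrative diagnostic, run against the same Postgres database while the
# delete is in flight. pg_stat_activity and pg_blocking_pids() are standard
# Postgres catalog facilities, not part of the Synapse schema.
BLOCKED_BACKENDS_SQL = """
    SELECT blocked.pid    AS blocked_pid,
           blocked.query  AS blocked_query,
           blocking.pid   AS blocking_pid,
           blocking.query AS blocking_query
      FROM pg_stat_activity AS blocked
      JOIN pg_stat_activity AS blocking
        ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
"""
```

If the event persisters showed up as blocked with the DELETE as the blocking query, that would point at lock contention rather than raw I/O load.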

reivilibre (Contributor) commented

We should use TRUNCATE instead of an unconditional DELETE FROM; TRUNCATE is faster on Postgres (no table scan).
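
A rough sketch of what that could look like inside delete_all_from_user_dir (the Postgres/SQLite branching is an assumption for illustration, not necessarily how the eventual fix was written; SQLite has no TRUNCATE, so it would keep the DELETEs):

```python
from synapse.storage.engines import PostgresEngine


def _delete_all_from_user_dir_txn(txn: LoggingTransaction) -> None:
    # `self` is available because this is a closure inside the method, as in
    # the snippet quoted above; assumes the store exposes its engine as
    # self.database_engine.
    if isinstance(self.database_engine, PostgresEngine):
        # TRUNCATE removes every row without scanning the table, so it is
        # much cheaper than an unconditional DELETE on large tables.
        txn.execute(
            "TRUNCATE user_directory, user_directory_search, "
            "users_in_public_rooms, users_who_share_private_rooms"
        )
    else:
        # SQLite has no TRUNCATE; fall back to the existing DELETEs.
        txn.execute("DELETE FROM user_directory")
        txn.execute("DELETE FROM user_directory_search")
        txn.execute("DELETE FROM users_in_public_rooms")
        txn.execute("DELETE FROM users_who_share_private_rooms")
    txn.call_after(self.get_user_in_directory.invalidate_all)
```

One trade-off worth noting: TRUNCATE takes an ACCESS EXCLUSIVE lock on the tables for the (short) time it runs, so concurrent readers of those tables briefly block.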

clokep (Member) commented Apr 4, 2023

Do we think #15316 is enough here or is there more to do?

reivilibre (Contributor) commented

I don't know of anything more to do here for now — do we still see trouble on Matrix.org?

erikjohnston (Member) commented

Empirically it's doing ~1 Hz of room processing on matrix.org, which is going to take months?

We should have a look at what queries it's doing etc. and see if we can speed it up. We can just turn on DEBUG logging on the background worker to see what it's doing.
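
As a sketch of what that means in practice (the logger names are my assumption about where Synapse emits per-query logging; on a real deployment this belongs in the worker's logging config rather than code):

```python
import logging

# Bump the storage-layer loggers to DEBUG so individual queries and
# transactions get logged. Logger names are assumptions; adjust to whatever
# the running Synapse actually uses.
logging.getLogger("synapse.storage.SQL").setLevel(logging.DEBUG)
logging.getLogger("synapse.storage.txn").setLevel(logging.DEBUG)
```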

erikjohnston reopened this Apr 6, 2023
erikjohnston added a commit that referenced this issue Apr 14, 2023
c.f. #15264

The two changes are:
1. Add indexes so that the select / deletes don't do sequential scans
2. Don't repeatedly call `SELECT count(*)` each iteration, as that's slow
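
For illustration, the two changes correspond roughly to the sketch below; the index, table, and progress-key names are stand-ins rather than what the actual commit used:

```python
async def _prepare_user_dir_update_sketch(self, progress: JsonDict) -> int:
    """Sketch of the two optimisations.

    1. Give the per-room SELECT/DELETE an index to use instead of a
       sequential scan over a large table.
    2. Compute the expensive total row count once and cache it in the
       update's progress dict instead of re-running `SELECT count(*)`
       on every iteration.
    """

    def _txn(txn: LoggingTransaction) -> int:
        # (1) Illustrative DDL; the real schema delta and index name differ.
        txn.execute(
            "CREATE INDEX IF NOT EXISTS users_who_share_private_rooms_room_idx "
            "ON users_who_share_private_rooms (room_id)"
        )
        # (2) Only count once; later iterations reuse the cached value.
        if "total_rooms" not in progress:
            # Hypothetical staging table holding the rooms left to process.
            txn.execute("SELECT count(*) FROM user_directory_rooms_to_process")
            progress["total_rooms"] = txn.fetchone()[0]
        return progress["total_rooms"]

    return await self.db_pool.runInteraction("prepare_user_dir_update", _txn)
```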
clokep (Member) commented May 18, 2023

Even after #15435 we still seem to be at ~1 Hz, FTR.

erikjohnston (Member) commented

Bah, that sounds slower than it was before :(

DMRobertson (Contributor) commented

Let's consider this good enough for now, in light of #15529 and #15665. We can re-open if this causes future pain.
