This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve performance of background update populate_user_directory_process_rooms #15264

Closed
H-Shay opened this issue Mar 13, 2023 · 9 comments
Labels
  • A-Background-Updates: Filling in database columns, making the database eventually up-to-date
  • A-Database: DB stuff like queries, migrations, new/removed columns, indexes, unexpected entries in the db
  • O-Occasional: Affects or can be seen by some users regularly or most users rarely
  • S-Minor: Blocks non-critical functionality, workarounds exist.
  • T-Task: Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

H-Shay (Contributor) commented Mar 13, 2023

Relevant code begins here:

self.db_pool.updates.register_background_update_handler(

This process can cause excess load on the DB; we perhaps need to reconfigure the batch size to make it less resource-intensive.
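
For context, Synapse background updates register a handler that is called with a batch size and returns how many items it processed; the updater uses that return value to pace the next batch, which is why the batch size is the natural knob here. A minimal sketch of that shape, with hypothetical helpers standing in for the real room-processing logic:

```python
from synapse.types import JsonDict


async def _process_rooms_sketch(self, progress: JsonDict, batch_size: int) -> int:
    """Illustrative handler shape for populate_user_directory_process_rooms.

    `batch_size` is chosen by the background updater to try to hit a target
    duration per iteration; returning the number of items actually processed
    is what lets it tune the next batch.
    """
    # Hypothetical helper: fetch the next `batch_size` rooms still to process.
    room_ids = await self._get_next_rooms_to_process(progress, limit=batch_size)
    for room_id in room_ids:
        # Hypothetical helper: rebuild the directory entries for this room.
        await self._handle_room(room_id)
    return len(room_ids)
```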

H-Shay added the A-Background-Updates, A-Database, T-Task, O-Occasional, and S-Minor labels on Mar 13, 2023
clokep (Member) commented Mar 13, 2023

The main issue (from my understanding) is that it reports progress based on the number of rooms processed, but rooms vary in how "expensive" they are. Maybe using something like rooms * users_in_room would be better?

That might allow the background update algorithm to process it a bit more reasonably.
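
As a rough illustration of that weighting idea (purely a sketch; the joined-member lookup and helpers below are assumed, not existing code), the handler could report work in units of roughly rooms * users_in_room rather than a raw room count:

```python
async def _process_rooms_weighted_sketch(self, progress, batch_size: int) -> int:
    """Treat `batch_size` as a budget of work units (~ users touched)."""
    work_done = 0
    # Hypothetical helper, as in the previous sketch.
    room_ids = await self._get_next_rooms_to_process(progress, limit=batch_size)
    for room_id in room_ids:
        # Assumed lookup of the room's joined-member count.
        num_users = await self.get_number_joined_users_in_room(room_id)
        await self._handle_room(room_id)  # hypothetical helper
        # A big room consumes more of the budget than a tiny one.
        work_done += max(1, num_users)
        if work_done >= batch_size:
            break
    return work_done
```

Returning the weighted count would let the background update algorithm shrink the batch when it hits a run of very large rooms.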

richvdh (Member) commented Mar 14, 2023

For background: we saw a significant slowdown on matrix.org:

[graph: event persistence times on matrix.org, showing spikes]

This appears to have aligned with some work done by the populate_user_directory_process_rooms background database update. In particular, the update starts with a big delete:

async def delete_all_from_user_dir(self) -> None:
    """Delete the entire user directory"""

    def _delete_all_from_user_dir_txn(txn: LoggingTransaction) -> None:
        txn.execute("DELETE FROM user_directory")
        txn.execute("DELETE FROM user_directory_search")
        txn.execute("DELETE FROM users_in_public_rooms")
        txn.execute("DELETE FROM users_who_share_private_rooms")
        txn.call_after(self.get_user_in_directory.invalidate_all)

    await self.db_pool.runInteraction(
        "delete_all_from_user_dir", _delete_all_from_user_dir_txn
    )

... which, based on postgres's slow query logs, completed at 21:12 UTC on 2023-03-13, exactly coinciding with the first spike in the above graph.

> The main issue (from my understanding) is that it reports progress based on the number of rooms processed, but rooms vary in how "expensive" they are. Maybe using something like rooms * users_in_room would be better?

I'm not convinced that's the main issue. Certainly the updates are quite expensive, but empirically the delete caused a problem too - and it's much harder to break down into small steps.

To be determined here:

  • why does a delete in users_who_share_private_rooms cause a significant slowdown in event persistence? (see the diagnostic sketch after this list)
  • this was far from the only spike in event persistence times over the course of the evening: there were others at 20:15 and a long slowdown between 18:00 and 18:25. It is currently unknown what caused them.
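
For the first question, one generic way to investigate (standard Postgres tooling, not something this issue established) is to check which backends are waiting behind the delete while it runs:

```python
# Illustrative diagnostic, run against the same Postgres database while the
# delete is in flight. pg_stat_activity and pg_blocking_pids() are standard
# Postgres catalog facilities, not part of the Synapse schema.
BLOCKED_BACKENDS_SQL = """
    SELECT blocked.pid    AS blocked_pid,
           blocked.query  AS blocked_query,
           blocking.pid   AS blocking_pid,
           blocking.query AS blocking_query
      FROM pg_stat_activity AS blocked
      JOIN pg_stat_activity AS blocking
        ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
"""
```

If the event persisters showed up as blocked with the DELETE as the blocking query, that would point at lock contention rather than raw I/O load.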

reivilibre (Contributor) commented

We should use TRUNCATE instead of an unconditional DELETE FROM; TRUNCATE is faster on Postgres (no table scan).
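
A rough sketch of what that could look like inside delete_all_from_user_dir (the Postgres/SQLite branching is an assumption for illustration, not necessarily how the eventual fix was written; SQLite has no TRUNCATE, so it would keep the DELETEs):

```python
from synapse.storage.engines import PostgresEngine


def _delete_all_from_user_dir_txn(txn: LoggingTransaction) -> None:
    # `self` is available because this is a closure inside the method, as in
    # the snippet quoted above; assumes the store exposes its engine as
    # self.database_engine.
    if isinstance(self.database_engine, PostgresEngine):
        # TRUNCATE removes every row without scanning the table, so it is
        # much cheaper than an unconditional DELETE on large tables.
        txn.execute(
            "TRUNCATE user_directory, user_directory_search, "
            "users_in_public_rooms, users_who_share_private_rooms"
        )
    else:
        # SQLite has no TRUNCATE; fall back to the existing DELETEs.
        txn.execute("DELETE FROM user_directory")
        txn.execute("DELETE FROM user_directory_search")
        txn.execute("DELETE FROM users_in_public_rooms")
        txn.execute("DELETE FROM users_who_share_private_rooms")
    txn.call_after(self.get_user_in_directory.invalidate_all)
```

One trade-off worth noting: TRUNCATE takes an ACCESS EXCLUSIVE lock on the tables for the (short) time it runs, so concurrent readers of those tables briefly block.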

clokep (Member) commented Apr 4, 2023

Do we think #15316 is enough here or is there more to do?

reivilibre (Contributor) commented

I don't know of anything more to do here for now — do we still see trouble on Matrix.org?

erikjohnston (Member) commented

Empirically it's doing ~1 Hz of room processing on matrix.org, which is going to take months?

We should have a look at what queries it's doing etc. and see if we can speed it up. We can just turn on DEBUG logging on the background worker to see what it's doing.
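
As a sketch of what that means in practice (the logger names are my assumption about where Synapse emits per-query logging; on a real deployment this belongs in the worker's logging config rather than code):

```python
import logging

# Bump the storage-layer loggers to DEBUG so individual queries and
# transactions get logged. Logger names are assumptions; adjust to whatever
# the running Synapse actually uses.
logging.getLogger("synapse.storage.SQL").setLevel(logging.DEBUG)
logging.getLogger("synapse.storage.txn").setLevel(logging.DEBUG)
```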

erikjohnston reopened this Apr 6, 2023
erikjohnston added a commit that referenced this issue Apr 14, 2023
c.f. #15264

The two changes are:
1. Add indexes so that the select / deletes don't do sequential scans
2. Don't repeatedly call `SELECT count(*)` each iteration, as that's slow
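
For illustration, the two changes correspond roughly to the sketch below; the index, table, and progress-key names are stand-ins rather than what the actual commit used:

```python
async def _prepare_user_dir_update_sketch(self, progress: JsonDict) -> int:
    """Sketch of the two optimisations.

    1. Give the per-room SELECT/DELETE an index to use instead of a
       sequential scan over a large table.
    2. Compute the expensive total row count once and cache it in the
       update's progress dict instead of re-running `SELECT count(*)`
       on every iteration.
    """

    def _txn(txn: LoggingTransaction) -> int:
        # (1) Illustrative DDL; the real schema delta and index name differ.
        txn.execute(
            "CREATE INDEX IF NOT EXISTS users_who_share_private_rooms_room_idx "
            "ON users_who_share_private_rooms (room_id)"
        )
        # (2) Only count once; later iterations reuse the cached value.
        if "total_rooms" not in progress:
            # Hypothetical staging table holding the rooms left to process.
            txn.execute("SELECT count(*) FROM user_directory_rooms_to_process")
            progress["total_rooms"] = txn.fetchone()[0]
        return progress["total_rooms"]

    return await self.db_pool.runInteraction("prepare_user_dir_update", _txn)
```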
clokep (Member) commented May 18, 2023

Even after #15435 we still seem to be at ~1 Hz, FTR.

erikjohnston (Member) commented

Bah, that sounds slower than it was before :(

DMRobertson (Contributor) commented

Let's consider this good enough for now, in light of #15529 and #15665. We can re-open if this causes future pain.
