
Investigate scheduler queue saturation #3765
Open · jpbruinsslot opened this issue Oct 30, 2024 · 5 comments
Labels: bug (Something isn't working), mula (Issues related to the scheduler)

jpbruinsslot (Contributor)

No description provided.

jpbruinsslot added the bug and mula labels on Oct 30, 2024
jpbruinsslot self-assigned this on Oct 30, 2024
jpbruinsslot (Contributor, Author) commented Oct 30, 2024

Possible sources:

  1. Long start-up times: bootstrapping organisations and creating plugin caches for each of them can result in potentially long wait times.
  2. Flushing the plugin caches does so for all organisations at once, which can also result in long wait times (see the sketch after this list).
  3. Referencing plugin caches of organisations that don't have any plugins might result in strange behaviour.
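To make sources 1 and 2 more concrete, here is a minimal, hypothetical sketch (not the actual mula code; the katalogus client interface is assumed) of a plugin-cache flush that rebuilds the cache for every organisation while holding a single lock. Any thread that needs the cache, including the ones pushing tasks onto the queue, blocks for the full duration of the flush:

```python
import threading


class PluginCacheSketch:
    """Hypothetical per-organisation plugin cache guarded by a single lock."""

    def __init__(self, katalogus_client, organisations: list[str]):
        self.katalogus = katalogus_client   # assumed to expose get_plugins(org_id)
        self.organisations = organisations
        self.cache: dict[str, list] = {}
        self.lock = threading.Lock()

    def flush(self) -> None:
        # The whole rebuild happens under one lock: with N organisations and a
        # slow katalogus call per organisation, every consumer of the cache
        # waits roughly N * call_latency before it can continue.
        with self.lock:
            self.cache.clear()
            for org_id in self.organisations:
                self.cache[org_id] = self.katalogus.get_plugins(org_id)  # network call per org

    def get(self, org_id: str) -> list:
        # Blocked for the entire flush above, even for organisations that
        # were already refreshed.
        with self.lock:
            return self.cache.get(org_id, [])
```

Under that assumption, both the first fill at start-up and a full flush scale linearly with the number of organisations, which matches the long wait times described in points 1 and 2.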

Local investigation:

  • For point 1, it is possible that a large number of organisations leads to long start-up times for the application, which would leave the scheduler unresponsive and unable to handle requests. However, this does not explain why the queue was at full capacity: for that to happen, the scheduler must already have been running and pushing items onto the queue (a sketch of what a saturated, bounded queue looks like follows after this list).

    Indeed, the long wait times for creating caches for many organisations are sub-optimal (see Katalogus caching in the scheduler #3357), but this does not yet explain the queue saturation. Local testing with 300+ organisations did not result in long wait times.

  • ...
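For context on what "full capacity" means here, a small illustrative sketch (assumptions: a bounded priority queue with a fixed maxsize, in the spirit of the scheduler's queue but not its actual implementation):

```python
import queue


class QueueFullError(Exception):
    """Raised when a push is attempted while the queue is at capacity."""


class BoundedTaskQueue:
    def __init__(self, maxsize: int = 1000):
        self._pq: queue.PriorityQueue = queue.PriorityQueue(maxsize=maxsize)

    def push(self, priority: int, task_id: str) -> None:
        try:
            self._pq.put_nowait((priority, task_id))
        except queue.Full:
            # Saturation: producers keep scheduling tasks while nothing pops
            # them, e.g. because the runners stopped fetching jobs.
            raise QueueFullError("queue is at full capacity") from None

    def pop(self) -> str:
        _, task_id = self._pq.get_nowait()
        return task_id
```

If the runners stop popping (as discussed in the comments below), push() starts failing once maxsize items have accumulated, which is the saturation that was observed.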

jpbruinsslot added this to KAT on Oct 30, 2024
github-project-automation bot moved this to Incoming features / Need assessment in KAT on Oct 30, 2024
jpbruinsslot moved this from Incoming features / Need assessment to In Progress in KAT on Oct 30, 2024
underdarknl (Contributor) commented:

Long locks (for example, how the plugin cache refresh was implemented before PR #3752) would also probably stall the threads that create jobs.

Another long-running task might be the endpoint that dispenses all the queues to the runners. Do we really need that list just to pop a job (any job) from the queue?
Could we instead periodically refresh the list of organisations from an endpoint that returns just that (and is therefore fast), rather than gathering the queues? We know a boefje queue should exist; if the scheduler can't find it, we could retry a few times and then remove that organisation from the list in the job runners.
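A rough sketch of that suggestion, with hypothetical endpoint paths and method names (not the current scheduler or runner API): the runner periodically refreshes a cheap list of organisation IDs and pops directly from the per-organisation boefje queue, retrying a few times and dropping the organisation when its queue cannot be found.

```python
import time

import requests


class SchedulerClientSketch:
    """Hypothetical client; the endpoint paths are assumptions, not the real API."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session = requests.Session()

    def list_organisation_ids(self) -> list[str]:
        # Cheap endpoint returning only organisation IDs, not full queue objects.
        resp = self.session.get(f"{self.base_url}/organisations/ids", timeout=5)
        resp.raise_for_status()
        return resp.json()

    def pop_boefje_task(self, org_id: str) -> dict | None:
        resp = self.session.post(f"{self.base_url}/queues/boefje-{org_id}/pop", timeout=5)
        if resp.status_code == 404:
            return None  # queue not (yet) known to the scheduler
        resp.raise_for_status()
        return resp.json()


def run_loop(client: SchedulerClientSketch, refresh_interval: float = 60.0, max_retries: int = 3) -> None:
    org_ids: list[str] = []
    misses: dict[str, int] = {}  # org_id -> consecutive failed pops
    last_refresh = 0.0

    while True:
        if time.monotonic() - last_refresh > refresh_interval:
            org_ids = client.list_organisation_ids()  # fast, so no long stall here
            misses.clear()
            last_refresh = time.monotonic()

        for org_id in list(org_ids):
            task = client.pop_boefje_task(org_id)
            if task is None:
                misses[org_id] = misses.get(org_id, 0) + 1
                if misses[org_id] >= max_retries:
                    org_ids.remove(org_id)  # drop until the next refresh
                continue
            misses.pop(org_id, None)
            print("handling task", task)  # stand-in for the real job handler

        time.sleep(1)
```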

underdarknl (Contributor) commented:

> It does not yet explain the queue saturation. Local testing with 300+ organisations did not result in long wait times.

It looks like the boefje runner would try to ingest the list of queues for all 300 organisations and would time out while receiving this list (as the scheduler itself was probably still very busy fetching all katalogi), resulting in no queues being present in the runner, so it would stop fetching jobs.
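That failure mode could look roughly like this (hypothetical sketch; the URL, path, and timeout are assumptions):

```python
import requests


def fetch_all_queues(base_url: str, timeout: float = 30.0) -> list[dict]:
    """Hypothetical: fetch the full queue list, one queue per organisation."""
    try:
        # With 300+ organisations, and a scheduler that is itself still busy
        # filling its katalogus caches, this single call can exceed the timeout.
        resp = requests.get(f"{base_url}/queues", timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        return []


queues = fetch_all_queues("http://scheduler:8004")
if not queues:
    # With zero queues registered, the runner never pops a job, while the
    # scheduler keeps pushing tasks until its queues hit full capacity.
    pass
```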

jpbruinsslot (Contributor, Author) commented:

> Long locks (for example, how the plugin cache refresh was implemented before PR #3752) would also probably stall the threads that create jobs.

Exactly. That could indicate that the queue was already full, a restart happened, and the restart then took substantial time to bootstrap all the caches.

> Another long-running task might be the endpoint that dispenses all the queues to the runners. Do we really need that list just to pop a job (any job) from the queue? Could we instead periodically refresh the list of organisations from an endpoint that returns just that (and is therefore fast), rather than gathering the queues? We know a boefje queue should exist; if the scheduler can't find it, we could retry a few times and then remove that organisation from the list in the job runners.

We can optimize the endpoint that relays the available queues a runner can pop jobs from; issue #3358 addresses this. Filtering parameters can also be added to give the task runner a narrower view of what's available.
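As an illustration only (issue #3358 tracks the actual change), a filtered variant of such an endpoint could look roughly like this FastAPI-style sketch; the parameter names and the in-memory QUEUES view are made up:

```python
from fastapi import FastAPI, Query

app = FastAPI()

# Hypothetical in-memory view of the scheduler's queues: queue id -> item count.
QUEUES = {
    "boefje-org-a": 12,
    "normalizer-org-a": 0,
    "boefje-org-b": 3,
}


@app.get("/queues")
def list_queues(
    queue_type: str | None = Query(default=None, description="e.g. 'boefje'"),
    min_size: int = Query(default=0, description="only queues with at least this many items"),
) -> list[dict]:
    # Return only ids and sizes that match the filters, instead of full queue
    # objects for every organisation.
    return [
        {"id": queue_id, "size": size}
        for queue_id, size in QUEUES.items()
        if (queue_type is None or queue_id.startswith(queue_type)) and size >= min_size
    ]
```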

underdarknl (Contributor) commented Oct 30, 2024

> Long locks (for example, how the plugin cache refresh was implemented before PR #3752) would also probably stall the threads that create jobs.

> Exactly. That could indicate that the queue was already full, a restart happened, and the restart then took substantial time to bootstrap all the caches.

It could, but if the second problem existed it would mean job-popping would stop and the queues would fill up regardless of the katalogus locks being slow or broad. Continuously refreshing the katalogus caches because they take longer to fill adds a lot of load to the system, but as far as I can see it does not stop functionality (since the code stops the cache timers while updating); there would at least be a small (default 30 s) window of valid plugin caches.
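A minimal sketch of that timer behaviour, under the assumption of a 30-second TTL whose clock only restarts once a refresh has completed (hypothetical code, not the scheduler's actual cache):

```python
import threading
import time


class ExpiringCacheSketch:
    """Hypothetical TTL cache whose expiry window restarts after each refresh."""

    def __init__(self, loader, ttl: float = 30.0):
        self.loader = loader        # callable that rebuilds the cached data
        self.ttl = ttl
        self.data = None
        self.expires_at = 0.0       # monotonic timestamp; 0.0 means "never loaded"
        self.lock = threading.Lock()

    def get(self):
        with self.lock:
            if self.data is None or time.monotonic() >= self.expires_at:
                # The expiry clock is effectively paused while the refresh runs:
                # the next TTL window only starts once the reload has finished,
                # so callers always see at least `ttl` seconds of valid data
                # between refreshes, even if the reload itself is slow.
                self.data = self.loader()
                self.expires_at = time.monotonic() + self.ttl
            return self.data
```

Under this assumption, a slow katalogus reload adds load but never leaves readers without a valid cache, which matches the observation above.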
