Skip to content

Trash overgrown schedulers with many usage queues#1672

Merged
ryoqun merged 2 commits intoanza-xyz:masterfrom
ryoqun:overgrown-scheduler-cleaning
Jun 11, 2024
Merged

Trash overgrown schedulers with many usage queues#1672
ryoqun merged 2 commits intoanza-xyz:masterfrom
ryoqun:overgrown-scheduler-cleaning

Conversation

@ryoqun
Copy link
Copy Markdown
Member

@ryoqun ryoqun commented Jun 10, 2024

Problem

From #1211:

Specifically, there are following several termination conditions which it must handle respectively:

...
4. Because UsageQueueLoader doesn't never evict unused entries. it can become too big with many entries. Its design relies on some external mechanism to mitigate the unbounded growth. Thus, the unified scheduler needs to drop the scheduler itself depending on the size of UsageQueueLoader.
...

Summary of Changes

This pr addresses (4).

Copy link
Copy Markdown

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit and clarification

Comment thread unified-scheduler-pool/src/lib.rs Outdated
const DEFAULT_POOL_CLEANER_INTERVAL: Duration = Duration::from_secs(10);
const DEFAULT_MAX_POOLING_DURATION: Duration = Duration::from_secs(180);
// Rough estimate of max UsageQueueLoader size in bytes:
// UsageFromTask * UsageQeueue's capacity * DEFAULT_MAX_USAGE_QUEUE_COUNT
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo:

Suggested change
// UsageFromTask * UsageQeueue's capacity * DEFAULT_MAX_USAGE_QUEUE_COUNT
// UsageFromTask * UsageQueue's capacity * DEFAULT_MAX_USAGE_QUEUE_COUNT

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops: e4bb046

Comment thread unified-scheduler-pool/src/lib.rs Outdated
// 16 bytes * 128 items * 262_144 entries == 512 MiB
// It's expected that there will be 2 or 3 pooled schedulers constantly when running against
// mainnnet-beta. That means the total memory consumption for the idle close-to-be-trashed pooled
// schedulers is set to high: 1.0 ~ 1.5 GiB. This value is chosen to maximize performance under the
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "high" reads a bit confusingly to me. maybe justs "is set to 1.0 ~ 1.5 GiB"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks: e4bb046

@@ -648,6 +671,10 @@ where
}

fn is_trashed(&self) -> bool {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so want to make a clarification here in the review, doesn't necessarily need a code comment.

This is only called after the scheduler is terminated due to abort or successful completion of the slot, is that right?
I want to make sure that the limits from is_overgrown will not cause us to skip a slot if such a large one comes through.

Copy link
Copy Markdown
Member Author

@ryoqun ryoqun Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only called after the scheduler is terminated due to abort or successful completion of the slot, is that right?

yes. it's only called by ::return_to_pool().

I want to make sure that the limits from is_overgrown will not cause us to skip a slot if such a large one comes through.

this is true, again.

@ryoqun ryoqun requested a review from apfitzge June 11, 2024 04:24
Copy link
Copy Markdown

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ryoqun ryoqun merged commit 56d1572 into anza-xyz:master Jun 11, 2024
samkim-crypto pushed a commit to samkim-crypto/agave that referenced this pull request Jul 31, 2024
* Trash overgrown schedulers with many usage queues

* Fix a typo and improve comment wording
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants