Await for a chain worker to be available for eviction #3037
base: main
Conversation
Force-pushed from 598e667 to 43fb237.
Prepare to move the chain worker cache handling out of the `WorkerState` type.
Simplify `WorkerState` by using the new type to handle how the chain worker cache is evicted.
Create a separate `handle_request` method with the code to handle a single request.
Prepare to change the eviction procedure.
Split the previous method to make it more readable.
Ensure that the synchronous `Mutex` is locked and used in a separate synchronous method.
Replace the old alias of the same name with `ChainRequestSender`. The goal is for the new type to also hold a permit to indicate it is using a chain worker.
Ensure that attempts to use chain workers when the cache is full are fair.
It's no longer needed, because eviction is now only performed once a permit is held, guaranteeing that there's at least one chain worker that's idle.
Force-pushed from 43fb237 to e5ffc04.
cache: Arc::new(Mutex::new(LruCache::new(limit))),
active_endpoints: Arc::new(Semaphore::new(limit.get())),
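(For context, a minimal sketch of the structure these two lines might belong to; `ChainWorkers`, `ChainId`, and `ChainRequestSender` are assumed placeholder names, not necessarily the actual ones.)

```rust
use std::sync::{Arc, Mutex};

use lru::LruCache;
use tokio::sync::{mpsc, Semaphore};

type ChainId = u64; // placeholder for the real chain identifier type
type ChainRequestSender = mpsc::UnboundedSender<()>; // placeholder request type

struct ChainWorkers {
    /// LRU cache of open channels to the running chain worker actors.
    cache: Arc<Mutex<LruCache<ChainId, ChainRequestSender>>>,
    /// `limit` permits for `limit` cache slots: a request must hold a permit
    /// before it may use (or create) an endpoint.
    active_endpoints: Arc<Semaphore>,
}
```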
Does using the same limit here mean that if `limit` requests are in flight to a single chain worker, and there's another incoming request for a different worker, we will now block that new request even though a slot would be free?
Good catch! It does seem like we're first trying to acquire the `self.active_endpoints` `Semaphore` even if there already exists a chain worker for the same `chain_id`.
Yes, that is true. I thought about having a map of permits, so that requests that get queued anyway would reuse the existing permit. However, that may lead to starvation because a chain worker that is heavily used will never release its permit.
The solution I thought of then was to have another semaphore for acquiring each permit, but I thought the permitception would be overengineering. I'm not sure how to proceed here, and would love some feedback and ideas.
I still think it's okay to improve (fix?) this in a follow-up PR because:
- this impacts a (hopefully) edge case (lots of requests for a single chain)
- worst outcome is degraded performance
- with the benefit that it would be fairer in extreme scenarios (all chains getting many requests)
I don't think it's an edge case to have several requests for the same chain worker. And every time we have that we underutilize at least one other chain worker.
But I don't have a better idea at the moment either.
let mut cache = self.cache.lock().unwrap();

if let Some(sender) = cache.get(&chain_id) {
    Ok(ChainActorEndpoint::new(sender.clone(), permit))
(See my previous comment.) Would it be better to clone the existing permit here?
.iter()
.rev()
.find(|(_, candidate_endpoint)| candidate_endpoint.strong_count() <= 1)
.expect("`stop_one` should only be called while holding a permit for an endpoint");
Can there be a race condition where the permit is already freed but the reference count hasn't been set to 0 yet?
Good point! I was lucky that I declared the fields in `ChainActorEndpoint` in the right order, so that the channel sender is always dropped before the permit 😅
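For illustration, a minimal sketch of that ordering (field and type names are assumptions, not the actual definition). Rust drops struct fields in declaration order, so declaring the sender before the permit guarantees the sender is gone before the permit frees a semaphore slot:

```rust
use tokio::sync::{mpsc, OwnedSemaphorePermit};

// Hypothetical shape of `ChainActorEndpoint`; the real type may differ.
struct ChainActorEndpoint<Request> {
    /// Declared first, therefore dropped first: the cached sender's strong
    /// count has already decreased by the time the permit below is released.
    sender: mpsc::UnboundedSender<Request>,
    /// Declared last, therefore dropped last: the semaphore slot is freed
    /// only after `sender` is gone.
    _permit: OwnedSemaphorePermit,
}
```

A comment in the real type documenting that the field order is load-bearing would guard against an accidental reordering.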
How does this PR affect the semantics of getting the chain worker endpoint? Previously we'd try to find a chain worker (for eviction) and if none was available we'd wait for 250 ms and repeat that for 3 seconds (approx. 12 times). Can you put that into the PR description?
OK, so since #3052 is pending, we don't know how much this helps, do we?
I don't think we need to measure that now. The goal is to remove the delays and to make it fairer.
Motivation
The `WorkerState` contains an LRU cache of chain workers (more specifically, it keeps open channels to the `ChainWorkerActor`s so that they keep running). Once the cache becomes full, requests for new chain workers randomly attempt to evict an idle chain worker.

Eviction was done by looking for an idle chain worker, in the order of least recently used endpoints. This meant iterating through the endpoints, starting with the one that was used least recently, and checking whether it had a strong count of one or less. If it did, it meant that the endpoint that sent the request had already been dropped, and that only happens after the response is received.

If no idle worker is found, that means that there is no worker that's a candidate for eviction, because they all seem to be handling a request. If we evict a chain worker before it finishes handling the request and later start another chain worker for the same chain while the request is still executing, a race condition is formed which could corrupt the chain state data. Therefore, if there's no eviction candidate the code would wait 250 milliseconds and try again repeatedly for up to three seconds before giving up with an error.

This whole eviction process may run concurrently between many requesters. This leads to possible unfairness, as there's no guarantee which one will succeed. It also introduces undesirable delays between retries.

Proposal

Use a `Semaphore` to establish a queue among the chain worker requests. This way there's certainty that once the permit is acquired, there will be at least one idle chain worker ready to be evicted.

In order to implement this change, a new `ChainWorkers` type was created to manage the cache. The `ChainActorEndpoint` also became a new type, in order to hold the semaphore permit.
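As a rough illustration of the flow this proposal describes (not the actual implementation; all names, signatures, and placeholder types below are assumptions), a requester first acquires a permit, then either reuses a cached endpoint or evicts an idle one before starting a new worker:

```rust
use std::{
    num::NonZeroUsize,
    sync::{Arc, Mutex},
};

use lru::LruCache;
use tokio::sync::{mpsc, OwnedSemaphorePermit, Semaphore};

type ChainId = u64; // placeholder for the real chain identifier type
type Request = (); // placeholder for the real chain worker request type

/// Placeholder sender type; the `Arc` lets this sketch count how many
/// requesters still hold the sender, mirroring the `strong_count() <= 1`
/// check quoted in the review above.
#[derive(Clone)]
struct ChainRequestSender(Arc<mpsc::UnboundedSender<Request>>);

impl ChainRequestSender {
    fn strong_count(&self) -> usize {
        Arc::strong_count(&self.0)
    }
}

/// The sender is declared before the permit so that it is dropped first.
struct ChainActorEndpoint {
    sender: ChainRequestSender,
    _permit: OwnedSemaphorePermit,
}

struct ChainWorkers {
    cache: Arc<Mutex<LruCache<ChainId, ChainRequestSender>>>,
    active_endpoints: Arc<Semaphore>,
}

impl ChainWorkers {
    fn new(limit: NonZeroUsize) -> Self {
        Self {
            cache: Arc::new(Mutex::new(LruCache::new(limit))),
            active_endpoints: Arc::new(Semaphore::new(limit.get())),
        }
    }

    /// Queue fairly behind other requesters. Once a permit is held, at least
    /// one cached endpoint is idle (or a cache slot is free), so no retry
    /// loop or sleep is needed before evicting.
    async fn endpoint_for(&self, chain_id: ChainId) -> ChainActorEndpoint {
        let permit = self
            .active_endpoints
            .clone()
            .acquire_owned()
            .await
            .expect("the semaphore is never closed");

        let mut cache = self.cache.lock().unwrap();

        // Reuse the existing endpoint for this chain if there is one.
        if let Some(sender) = cache.get(&chain_id) {
            return ChainActorEndpoint { sender: sender.clone(), _permit: permit };
        }

        // Evict the least recently used idle endpoint if the cache is full;
        // holding a permit guarantees that such an endpoint exists.
        if cache.len() == cache.cap().get() {
            let idle_chain = cache
                .iter()
                .rev()
                .find(|(_, sender)| sender.strong_count() <= 1)
                .map(|(id, _)| *id)
                .expect("a permit is held, so an idle endpoint exists");
            let _evicted = cache.pop(&idle_chain);
        }

        // In the real code this would spawn a new chain worker actor owning
        // the receiving end of the channel.
        let (raw_sender, _receiver) = mpsc::unbounded_channel();
        let sender = ChainRequestSender(Arc::new(raw_sender));
        let _previous = cache.push(chain_id, sender.clone());

        ChainActorEndpoint { sender, _permit: permit }
    }
}
```

Because the semaphore has exactly as many permits as the cache has slots, a requester that holds a permit while the cache is full knows that at most `limit - 1` other endpoints are in use, so an idle endpoint must exist and the eviction step never needs to sleep and retry.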
Test Plan

CI should catch any regressions to the happy path, and #3052 should add a new test for the extreme load case.
Release Plan
Links