Passenger Enterprise zero-downtime restarts & request routing can prolong outages and delay recovery #2551
A real-world example of Passenger's behaviour causing customer pain is https://www.intercomstatus.com/incidents/zv75pnhpgchh. We initiated a rolling restart to deploy new application code which moved one of our databases to an external database provider. The red vertical is an approximation of when our deployment tool finished its work initiating rolling restarts across our entire fleet. It's a useful guide, but not exact: for the first minute or so we saw no requests on the new code (purple). Once we began to serve requests with the new code, our latency alarms spiked while successful request throughput dropped. At this point the rolling restart was not fully finished; we still had requests being served by the old, good code.
We initiated a rollback in our deployment tool. This swaps the symlink for our app back to the previous version and initiates a rolling restart again, which cancels the ongoing one that's putting bad code into service. It appears to us that the behaviour of rolling restarts means that process killing happens as follows:
Some notes to help myself understand the issue better. You can ignore this since it's only for myself. A summary of the identified problem:
User-proposed solution:
Problem Statement
Passenger's request routing prioritises the least-busy process in an ordered list of processes. The worker pool presents its availability to serve requests oldest-process-first per application group, which means that among the available workers the selection will always favour the oldest one. This is explained as being for application-level caching, memory optimisation, JIT, etc.
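To illustrate, here's a minimal Ruby sketch of our reading of the selection behaviour; this is not Passenger's actual code, and the `Worker` struct and its fields are made up for illustration:

```ruby
# A minimal sketch of the routing behaviour as we understand it (not
# Passenger's actual implementation; Worker and its fields are made up).
Worker = Struct.new(:pid, :started_at, :busyness)

# The pool is kept ordered oldest-first; min_by keeps the first (oldest)
# element among ties, so an idle old worker always beats an idle new one.
def pick_worker(pool)
  pool.min_by(&:busyness)
end

pool = [
  Worker.new(101, Time.now - 3600, 0), # oldest, idle
  Worker.new(102, Time.now - 1800, 0), # idle
  Worker.new(103, Time.now - 60,   0), # newest, idle
]
pick_worker(pool).pid # => 101, the oldest idle worker always wins
```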
This starts to break down during a zero-downtime restart. Once we get past the swapping of the application preloader and ensure we're not going to resist the deployment (i.e. the new code can boot/ack), it's time to roll some processes. Passenger picks the oldest worker in the pool - our hottest worker - first. When it replaces this process, the new worker is put at the end of the worker pool. The cycle repeats until all old workers are replaced.
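Sketched the same way (again our reading, with made-up names, not Passenger's code), the restart cycle looks like:

```ruby
# Sketch of the restart cycle described above (our reading, not
# Passenger's code; Worker here carries a made-up version field).
Worker = Struct.new(:pid, :started_at, :version)

def rolling_restart(pool, new_version)
  next_pid = pool.map(&:pid).max
  until pool.all? { |w| w.version == new_version }
    pool.shift                        # the oldest (hottest) worker dies first
    next_pid += 1
    pool << Worker.new(next_pid, Time.now, new_version) # replacement joins the tail
  end
  pool
end

pool = (1..4).map { |pid| Worker.new(pid, Time.now - pid * 600, 1) }
rolling_restart(pool, 2)
# Until the very last swap, the head of the pool (the position routing
# favours) is still an old-version worker.
```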
Because the Passenger routing algorithm has no concept of new workers vs old workers and remains biased towards the oldest worker first, two things happen:
Expected behaviour
Given that the desired end state is that we want the newer application code running, I expect Passenger to bias towards new workers where available.
Actual behaviour
Passenger continues to serve requests to older processes running old code. These processes are potentially, or in our case very likely, cold in terms of application caching/YJIT etc.
Potential solution
If Passenger had a concept of an application version (monotonically increasing, starting at 1), then the order of the pool of workers could be
version DESC, age/PID ASC
when routing requests. Triggering a zero-downtime restart would increase the application version, and therefore processes started after the restart was triggered would be preferred, in oldest-PID order, to maintain the optimisations Passenger desires for application-level caching. For a little while during the restart Passenger probably won't have enough application processes on the new version to cope with the full request load, so some requests will spill over to the older version. I'm not married to this idea, but it might be possible to allow Passenger to maintain hot old processes if it chose the process to kill during restarts as
previous_version, age/PID descending
- the newest processes on the older version. This of course comes at a cost: the memory used by the hot older processes won't be freed up until they are killed last, and meanwhile the newer version's hot processes are warming up, so you could run into memory issues; but if you've got enough free memory to cope with it, it would prevent the churn on otherwise mostly idle older processes.
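To make the proposal concrete, here's a hedged Ruby sketch of both orderings. The `version` field and these helpers are assumptions for illustration; Passenger has no application-version concept today:

```ruby
# Hedged sketch of the two proposed orderings; `version` and these
# helpers are assumptions, not part of Passenger today.
Worker = Struct.new(:pid, :version)

# Routing preference: version DESC, PID ASC. Newest code first; within a
# version, the oldest process, to keep caches/JIT hot.
def routing_order(pool)
  pool.sort_by { |w| [-w.version, w.pid] }
end

# Kill order during a restart: previous version, newest PID first, so
# the hot old processes survive longest as a fallback.
def kill_candidate(pool)
  newest = pool.map(&:version).max
  pool.select { |w| w.version < newest }.max_by(&:pid)
end

pool = [Worker.new(10, 1), Worker.new(11, 1), Worker.new(20, 2)]
routing_order(pool).map(&:pid) # => [20, 10, 11]
kill_candidate(pool).pid       # => 11, the newest process on the old version
```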
Context
We're running a Rails monolith on Ruby 3.3.3 on EC2 using Flying Passenger. We run 96 Passenger processes in our main pool per instance. Generally, around peak, Passenger's busyness is ~50%:
We run more workers than we need in a static pool. This helps us to smooth out small request spikes as we've got some capacity ready to go and don't need to wait for workers to boot (cold caches notwithstanding). While we have request queues enabled, they're generally empty:
Here's a normal Passenger status output:
The last processes are 1066603 and 1068441, sitting bone idle with 0 requests processed. A rolling restart is triggered, selecting the first process for rotation at the start of the pool:
Towards the end of the restart you can see the request counts for PIDs 1066603 and 1068441 are now in the hundreds, and memory jumped up about 500MB apiece, since they're now the top processes for serving requests whenever they're free.