Panic threshold #83
Design notes: Such a panic check might happen when first pulling the list of healthy destinations. If it's empty, or below some threshold, consider using the entire list of destinations instead. See reverse-proxy/src/ReverseProxy/Middleware/DestinationInitializerMiddleware.cs, lines 51 to 59 in a6c755c.
Alternate proposal: The health check service could have that kind of logic built in when building the list of healthy destinations. |
Triage: In 1.0 we need panic response/policy |
Design notes:
Side note: Do we want to add a Degraded health state so it's not such a sharp cliff from Healthy to Unhealthy? Then an update algorithm could say "Only include the healthy ones, unless there aren't enough then include the degraded ones, unless there's not enough then include the unknown ones, unless there's not enough then include the unhealthy ones." |
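The tiered fallback in the side note could look roughly like the sketch below. Everything in it is hypothetical: the `Health` enum (including the proposed `Degraded` value), the `Destination` record, `TieredSelection`, and the `minCount` threshold are illustration only, not existing proxy types. It also assumes "include" means each tier is added on top of the previous ones rather than replacing them.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical health states; Degraded is the value proposed in the side note.
public enum Health { Healthy, Degraded, Unknown, Unhealthy }

// Hypothetical stand-in for a destination and its current health.
public sealed record Destination(string Name, Health Health);

public static class TieredSelection
{
    // Tiers in order of preference, as listed in the side note.
    private static readonly Health[] Tiers =
        { Health.Healthy, Health.Degraded, Health.Unknown, Health.Unhealthy };

    // Adds whole tiers, best first, until at least minCount destinations
    // are selected or every tier has been included.
    public static IReadOnlyList<Destination> Select(
        IReadOnlyList<Destination> all, int minCount)
    {
        var selected = new List<Destination>();
        foreach (var tier in Tiers)
        {
            if (selected.Count >= minCount)
            {
                break;
            }
            selected.AddRange(all.Where(d => d.Health == tier));
        }
        return selected;
    }
}
```

With one healthy, one degraded, and one unhealthy destination, a `minCount` of 1 selects only the healthy one, while a `minCount` of 3 ends up including all three.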
Thoughts on the above design suggestions:
public interface IPanicPolicy
{
    // Decides if it's time to start panic depending on how many destinations are available out of all of them
    bool IsInPanic(HealthCheckConfig config, IReadOnlyList<DestinationState> allDestinations, IReadOnlyList<DestinationState> availableDestinations);
}

public interface IAvaliableDestinationsCalculator
{
    // Selects available destinations based on active and passive health states as well as applies a panic policy
    IReadOnlyList<DestinationState> GetAvailalableDestinations(HealthCheckConfig config, IReadOnlyList<DestinationState> allDestinations);
}

public sealed class ClusterState
{
    ...
    public void UpdateDynamicState(IAvaliableDestinationsCalculator caculator)
    {
        UpdateDynamicStateInternal(caculator, force: false);
    }

    public void ProcessDestinationChanges(IAvaliableDestinationsCalculator caculator)
    {
        UpdateDynamicStateInternal(caculator, force: false);
    }

    private void UpdateDynamicStateInternal(IAvaliableDestinationsCalculator caculator, bool force)
    {
        ...
        lock (_stateLock)
        {
            // Most logic is moved into IAvaliableDestinationsCalculator
            var allDestinations = _destinationsSnapshot;
            var availableDestinations = caculator.GetAvailalableDestinations(_model?.Config.HealthCheck, allDestinations);
            _dynamicState = new ClusterDynamicState(allDestinations, availableDestinations);
        }
    }
}
Fair point, unknown can't be excluded for passive checks. That makes me think people may customize this policy for active vs passive checks. Right now, if both are enabled, then both must be healthy/unknown. Imagine a policy that started with the defaults, but if nothing was available, fell back to any destinations marked healthy/unknown by passive checks, ignoring active checks. If that also failed, it could opt to disable proxying (return an empty list) or fall back to the whole list.
That's a nice way to deal with the lock ownership issue. I'd prefer to move the methods off ClusterState entirely, but I don't know if that's practical yet. Could IAvaliableDestinationsCalculator use a weak reference table for cluster locks? As for the API, I think it should be more general than panic mode, it should let people compose the list themselves:
A simple panic policy could be implemented by deriving from the base implementation and only taking action if the base returned an empty list. A more advanced policy could completely customize the list generation. |
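As a rough sketch of that idea, a panic policy derived from a base implementation might look like this. All type and member names here (`HealthState`, `DestinationInfo`, `DefaultDestinationsPolicy`, `PanicDestinationsPolicy`) are hypothetical stand-ins so the example compiles on its own; they are not the proxy's actual API. The base rule assumes the current behaviour described above, where both active and passive states must be healthy or unknown.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the proxy's health and destination types.
public enum HealthState { Unknown, Healthy, Unhealthy }

public sealed record DestinationInfo(string Id, HealthState Active, HealthState Passive);

// Base behaviour: a destination is available when neither the active
// nor the passive check has marked it Unhealthy.
public class DefaultDestinationsPolicy
{
    public virtual IReadOnlyList<DestinationInfo> GetAvailableDestinations(
        IReadOnlyList<DestinationInfo> all)
    {
        return all
            .Where(d => d.Active != HealthState.Unhealthy
                     && d.Passive != HealthState.Unhealthy)
            .ToList();
    }
}

// Simple panic policy: only acts when the base policy returned nothing,
// in which case it falls back to the full list.
public class PanicDestinationsPolicy : DefaultDestinationsPolicy
{
    public override IReadOnlyList<DestinationInfo> GetAvailableDestinations(
        IReadOnlyList<DestinationInfo> all)
    {
        var available = base.GetAvailableDestinations(all);
        return available.Count > 0 ? available : all;
    }
}
```

A more advanced policy would override the same method and build the list from scratch instead of calling the base.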
I prefer the idea of separating this out into its own module/stage in the pipeline. |
@samsp-msft the downside of middleware is that the results have to be re-computed per request, adding allocations and CPU overhead. It's computed in the background right now because the results aren't expected to change per request, only when health or config changes. |
@Tratcher - doh. So it needs a callback when health status has changed for a node in a cluster, and that in turn will compute the AvailableDestinations? |
Right, today ClusterState.UpdateDynamicState() is that callback, and with this change it might move to IAvaliableDestinationsCalculator.UpdateAvailableDestinations. |
It seems a bit overengineered in my opinion. At least at this moment. I'd prefer to start with a simpler solution (i.e. keeping locking logic in Regarding |
The main goal for getting the methods off ClusterState is that ClusterState is not extensible and shouldn't own any functionality, only state.
Right, IAvaliableDestinationsCalculator will be pretty small. Its job is to do the per-cluster lookup of the policies (and maybe the locking). You need it mainly because ClusterState itself doesn't have access to DI to retrieve policies, and you don't want to implement the policy lookup in every health check. |
OK, understood. Maintaining a collection of weak references to lock objects is basically the same as the caching we already have, so it will be easy to implement. |
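One way to keep per-cluster locks outside the cluster object is a `ConditionalWeakTable`, sketched below. This is only an illustration of the weak-reference idea, not the approach the thread settled on: `ClusterKey` and `ClusterLocks` are hypothetical names, and the real cluster type would take the place of `ClusterKey`.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the real per-cluster state object.
public sealed class ClusterKey
{
    public string Id { get; init; } = "";
}

public sealed class ClusterLocks
{
    // Keys are held weakly: once a cluster object is garbage collected,
    // its lock entry becomes eligible for collection as well.
    private readonly ConditionalWeakTable<ClusterKey, object> _locks = new();

    // Returns the same lock object for the same cluster instance,
    // creating it on first use.
    public object GetLock(ClusterKey cluster)
    {
        return _locks.GetValue(cluster, _ => new object());
    }

    // Runs the recomputation while holding that cluster's lock.
    public void Update(ClusterKey cluster, Action recompute)
    {
        lock (GetLock(cluster))
        {
            recompute();
        }
    }
}
```

The table compares keys by reference, so two distinct cluster objects never share a lock even if their ids match.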
Thus, it seems we can completely remove available destinations calculation logic from |
Does anybody else have some more thoughts/objections regarding this? Otherwise, I will start implementing it since the design seems to be clear enough. |
Cluster's destination update logic is extracted into a separate service `IClusterDestinationsUpdater`, which is now responsible for updating `ClusterDynamicState` (now renamed to `ClusterDestinationsState`), specifically updating the full destination collection and filtering the destinations available for proxying requests to.

This PR also adds `IAvailableDestinationsPolicy`, which decides which destinations from the full collection should be available for requests. Two such policies are implemented:
- `HealtyOrUnknownDesitnationsPolicy` decides that a destination is available if active AND passive health states are either 'Healthy' or 'Unknown'.
- `HealthyOrPanicDestinationsPolicy` calls `HealtyOrUnknownDesitnationsPolicy` as the first step and then checks whether the returned available destination collection is empty. If it's **not** empty, this collection is returned as the result; otherwise the full collection containing all destinations is returned. This implements the 'panic' health evaluation strategy.

Fixes #83
#78 (comment)
@davidni: Something to keep in mind is what to do when a large portion of endpoints in a backend are unhealthy. Envoy for example has a panic mode where it will prefer to try to route to endpoints it knows to be unhealthy, because it may be better than the alternative of not routing at all. See e.g. Panic threshold