Panic threshold #83

Tratcher · 2020-04-23T17:15:30Z

#78 (comment)
@davidni: Something to keep in mind is what to do when a large portion of endpoints in a backend are unhealthy. Envoy for example has a panic mode where it will prefer to try to route to endpoints it knows to be unhealthy, because it may be better than the alternative of not routing at all. See e.g. Panic threshold

The text was updated successfully, but these errors were encountered:

Tratcher · 2020-07-09T22:17:39Z

Design notes: Such a panic check might happen here when first pulling the list of healthy destinations. If it's empty, or below some threshold, consider using the entire list of destinations instead.

reverse-proxy/src/ReverseProxy/Middleware/DestinationInitializerMiddleware.cs

Lines 51 to 59 in a6c755c

    
           if (dynamicState.HealthyDestinations.Count == 0) 
        
           { 
        
               Log.NoHealthyDestinations(_logger, routeConfig.Route.RouteId, cluster.ClusterId); 
        
               context.Response.StatusCode = 503; 
        
               return Task.CompletedTask; 
        
           } 
        
           context.Features.Set(cluster); 
        
           context.Features.Set<IAvailableDestinationsFeature>(new AvailableDestinationsFeature() { Destinations = dynamicState.HealthyDestinations });

Alternate proposal: The health check service could have that kind of logic built in when building the list of healthy destinations.

karelz · 2021-05-13T17:33:02Z

Triage: In 1.0 we need panic response/policy

Tratcher · 2021-05-19T23:31:20Z

Design notes:

If you tried to implement this as part of health checks today:
- For active health checks you could do it by replacing the IDestinationHealthUpdater, having it check the results and overriding them if it determined that too many were marked as unhealthy.
- Replacing the health state results in a loss of information and should be avoided.
- Even IDestinationHealthUpdater would have trouble doing this for passive health checks. They only check and update one destination at a time.
- I think this rules out trying to build a panic mode into the health check system.
Doing the panic check in middleware is pretty simple:
- This has the advantage of not changing any persistent state, only request state.
- This is still some per-request overhead, especially if you wanted to do any other kind of filtering.

public Task PanicMiddleware(HttpContext context, Func<Task> next)
{
    var proxyFeature = context.Features.Get<IReverseProxyFeature>();

    if (proxyFeature.AvailableDestinations.Count == 0) // or a percentage of AllDestinations
    {
        // Panic!
        proxyFeature.AvailableDestinations = proxyFeature.AllDestinations;
    }

    return next();
}

What if we made the AvailableDestinations generation code extensible?
- That logic is currently embedded in ClusterState and there's no way to override it.
- The current logic includes Healthy and Unknown endpoints, excluding Unhealthy ones.
- What if someone wanted to exclude Unknown endpoints as well? Or to only include them if there weren't enough Healthy endpoints?
- Would this be a global service, or would you want to apply different policies per cluster? Or both (configurable per cluster with a replaceable default)?
- It would be a nice cleanup to get this functional logic off of ClusterState
- ClusterState.DynamicState would have to be publicly settable.
- Figuring out the locking might be tricky.
- Possible API (with the existing bad names from ClusterState):

public interface IClusterUpdator
{
    // Triggers a refresh but doesn't block if one is already in progress. Called from health checks to notify of state changes.
    void UpdateDynamicState(ClusterState);
    // Triggers a refresh and always waits for it to finish. Called when config changes.
    void ProcessDestinationChanges(ClusterState cluster);
}

Side note: Do we want to add a Degraded health state so it's not such a sharp cliff from Healthy to Unhealthy? Then an update algorithm could say "Only include the healthy ones, unless there aren't enough then include the degraded ones, unless there's not enough then include the unknown ones, unless there's not enough then include the unhealthy ones."
(Filed as #1011)

alnikola · 2021-05-21T13:14:44Z

Thoughts on the above design suggestions:

Agree, that panic mode should be separated from the health checks
PanicMiddleware options seems goods if we make it extensible with the new IPanicPolicy interface
But, extracting AvailableDestinations update code into a new service seems even better because currently it seems as feature gap in YARP extensibility mechanism. Let's extract the most of current ClusterState.UpdateDynamicStateInternal() logic into IAvaliableDestinationsCalculator and extend it with IPanicPolicy
IPanicPolicy should be configurable per cluster
By design Unknown health state has 2 purposes: indicate that the actual health state is not calculated yet AND make reactivated destination eligible for receiving traffic to see if they actually recovered. Thus, I don't want to remove Unknown destinations from AvailableDestinations
One option to avoid locking on ClusterState problem is to invert control. Meaning, to pass an instance of IAvaliableDestinationsCalculator into ClusterState.UpdateDynamicStateInternal()

public interface IPanicPolicy
{
  // Decides if it's time to start panic depending on how many destinations are available out of all of them
  bool IsInPanic(HealthCheckConfig config, IReadOnlyList<DestinationState> allDestinations, IReadOnlyList<DestinationState> availableDestinations);
}

public interface IAvaliableDestinationsCalculator
{
  // Selects available destinations based on active and passive health states as well as applies a panic policy
  GetAvailalableDestinations(HealthCheckConfig config, IReadOnlyList<DestinationState> allDestinations)
}

public sealed class ClusterState
{
  ...

  public void UpdateDynamicState(IAvaliableDestinationsCalculator caculator)
  {
     UpdateDynamicStateInternal(caculator, force: false);
  }

  public void ProcessDestinationChanges(IAvaliableDestinationsCalculator caculator)
  {
     UpdateDynamicStateInternal(caculator, force: false);
  }

  private void UpdateDynamicStateInternal(IAvaliableDestinationsCalculator caculator, bool force)
  {
    ...

    lock (_stateLock)
    {
      // Most logic is moved into IAvaliableDestinationsCalculator
      var allDestinations = _destinationsSnapshot;
      var availableDestinations = caculator.GetAvailalableDestinations(_model?.Config.HealthCheck, allDestinations);
      _dynamicState = new ClusterDynamicState(allDestinations, availableDestinations);
    }
  }
}

Tratcher · 2021-05-21T18:55:52Z

By design Unknown health state has 2 purposes: indicate that the actual health state is not calculated yet AND make reactivated destination eligible for receiving traffic to see if they actually recovered. Thus, I don't want to remove Unknown destinations from AvailableDestinations

Fair point, unknown can't be excluded for passive checks. That makes me think people may customize this policy for active vs passive checks. Right now if both are enabled then both must be healthy/unknown. Imagine a policy that started with the defaults, but if nothing was available then it would fall back to any that were marked healthy/unknown by passive checks, excluding active checks. If that failed then may opt to disable proxying (return an empty list) or fall back to the whole list.

One option to avoid locking on ClusterState problem is to invert control. Meaning, to pass an instance of IAvaliableDestinationsCalculator into ClusterState.UpdateDynamicStateInternal()

That's a nice way to deal with the lock ownership issue. I'd prefer to move the methods off ClusterState entirely, but I don't know if that's practical yet. Could IAvaliableDestinationsCalculator use a weak reference table for cluster locks?

As for the API, I think it should be more general than panic mode, it should let people compose the list themselves:

public interface IAvaliableDestinationsPolicy
{
   // The name of the policy
   string Name { get; }
  // Reviews allDestinations and returns which should be considered as available for use.
  IReadOnlyList<DestinationState> GetAvailalableDestinations(ClusterConfig config, IReadOnlyList<DestinationState> allDestinations);
}

// This assumes ClusterState manages the locking:
public interface IAvaliableDestinationsCalculator
{
  // Looks up the per-cluster policy and invokes it
  IReadOnlyList<DestinationState> GetAvailalableDestinations(ClusterConfig config, IReadOnlyList<DestinationState> allDestinations)
}

// Alternative that manages the locking itself:
// Looks up the per-cluster policy, and invokes it,  and updates ClusterState.DynamicState directly.
public interface IAvaliableDestinationsCalculator
{
  // Called for health state changes, debounces
  UpdateAvailableDestinations(ClusterState cluster)
  // Called for config changes, blocking
  UpdateAllDestinations(ClusterState cluster)
}

A simple panic policy could be implemented by deriving from the base implementation and only taking action if the base returned an empty list. A more advanced policy could completely customize the list generation.

samsp-msft · 2021-05-21T19:02:13Z

I prefer the idea of separating this out into its own module/stage in the pipeline.
Is there a way of adding health information to each destination that is passed to this module. It could then have passive & active state and decide what it wants to do.

Tratcher · 2021-05-21T20:21:53Z

@samsp-msft the downside of middleware is that the results have to be re-computed per request, adding allocations and CPU overhead. It's computed in the background right now because the results aren't expected to change per request, only when health or config changes.

samsp-msft · 2021-05-21T20:36:48Z

@Tratcher - doh. So it needs a callback when health status has changed for a node in a cluster, and that in turn will compute the AvailableDestinations?

Tratcher · 2021-05-21T20:48:51Z

Right, today ClusterState.UpdateDynamicState() is that callback, and with this change it might move to IAvaliableDestinationsCalculator.UpdateAvailableDestinations.

alnikola · 2021-05-24T14:19:54Z

I'd prefer to move the methods off ClusterState entirely, but I don't know if that's practical yet. Could IAvaliableDestinationsCalculator use a weak reference table for cluster locks?

It seems a bit overengineered in my opinion. At least at this moment. I'd prefer to start with a simpler solution (i.e. keeping locking logic in ClusterState) and then make it advanced if necessary.

Regarding IAvaliableDestinationsPolicy and IAvaliableDestinationsCalculator. The proposal seems reasonable overall, but in this case IAvaliableDestinationsCalculator won't have too much logic because it will only call IAvaliableDestinationsPolicy and update AvailableDestinations on ClusterState. I'm a bit concerned by this fact.

Tratcher · 2021-05-24T15:45:07Z

I'd prefer to start with a simpler solution (i.e. keeping locking logic in ClusterState) and then make it advanced if necessary.

The main goal for getting the methods off ClusterState is that ClusterState is not extensible and shouldn't own any functionality, only state.

IAvaliableDestinationsCalculator won't have too much logic because it will only call IAvaliableDestinationsPolicy and update AvailableDestinations on ClusterState. I'm a bit concerned by this fact.

Right, IAvaliableDestinationsCalculator will be pretty small. It's job is to do the per-cluster lookup of the policies (and maybe the locking). You need it mainly because ClusterState itself doesn't have access to DI to retrieve policies, and you don't want to implement the policy lookup in every health check.

alnikola · 2021-05-24T15:54:05Z

The main goal for getting the methods off ClusterState is that ClusterState is not extensible and shouldn't own any functionality, only state.

OK, understood. Maintaining a collection of weak references to lock objects is basically the same as the caching we already have, so it will be easy to implement.

alnikola · 2021-05-24T15:55:55Z

Thus, it seems we can completely remove remove available destinations calculation logic from ClusterState.

alnikola · 2021-05-24T15:57:27Z

Does anybody else have some more thoughts/objections regarding this? Otherwise, I will start implementing it since the design seems to be clear enough.
cc: @davidfowl @MihaZupan

) Cluster's destination update logic is extracted into a separate service `IClusterDestinationsUpdater` which is now responsible for updating `ClusterDynamicState` (it's now renamed to `ClusterDestinationsState`), specifically updating the full destination collection and filtering the destinations available for proxying requests to. This PR also adds `IAvailableDestinationsPolicy` that decides which destinations from the full collection should be available for requests. There are 2 such policies are implemented: - `HealtyOrUnknownDesitnationsPolicy` decides that a destination is available if active AND passive health states are either 'Healthy' or 'Unknown' - `HealthyOrPanicDestinationsPolicy` calls `HealtyOrUnknownDesitnationsPolicy` as the first step and then checks if the returned available destination collection is empty. If it's **not** empty, this collection is returned as the result, otherwise the full collection containing all destinations is returned. This implements 'panic' health evaluation strategy. Fixes #83

Tratcher added the Type: Enhancement New feature or request label Apr 23, 2020

karelz added this to the Backlog milestone May 12, 2020

Tratcher mentioned this issue Jun 3, 2020

Request can be affinitized to destinations #174

Merged

alnikola mentioned this issue Oct 29, 2020

Passive health policy controlling request transport failures rate in proxying them to destinations #499

Merged

karelz mentioned this issue May 13, 2021

Policy for what to do if all destinations are unhealthy #992

Closed

karelz modified the milestones: Backlog, YARP 1.0.0 May 13, 2021

alnikola self-assigned this May 17, 2021

Tratcher changed the title ~~Load balancing panic threshold~~ Panic threshold May 20, 2021

Tratcher mentioned this issue May 20, 2021

LoadBalancingPolicy = First broken in preview 11 #980

Closed

alnikola mentioned this issue May 25, 2021

Cluster's destinations are updated by IClusterDestinationsUpdater #1024

Merged

alnikola closed this as completed in #1024 May 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic threshold #83

Panic threshold #83

Tratcher commented Apr 23, 2020

Tratcher commented Jul 9, 2020

karelz commented May 13, 2021

Tratcher commented May 19, 2021 •

edited

Loading

alnikola commented May 21, 2021

Tratcher commented May 21, 2021 •

edited

Loading

samsp-msft commented May 21, 2021

Tratcher commented May 21, 2021

samsp-msft commented May 21, 2021

Tratcher commented May 21, 2021 •

edited

Loading

alnikola commented May 24, 2021 •

edited

Loading

Tratcher commented May 24, 2021

alnikola commented May 24, 2021

alnikola commented May 24, 2021

alnikola commented May 24, 2021

Panic threshold #83

Panic threshold #83

Comments

Tratcher commented Apr 23, 2020

Tratcher commented Jul 9, 2020

karelz commented May 13, 2021

Tratcher commented May 19, 2021 • edited Loading

alnikola commented May 21, 2021

Tratcher commented May 21, 2021 • edited Loading

samsp-msft commented May 21, 2021

Tratcher commented May 21, 2021

samsp-msft commented May 21, 2021

Tratcher commented May 21, 2021 • edited Loading

alnikola commented May 24, 2021 • edited Loading

Tratcher commented May 24, 2021

alnikola commented May 24, 2021

alnikola commented May 24, 2021

alnikola commented May 24, 2021

Tratcher commented May 19, 2021 •

edited

Loading

Tratcher commented May 21, 2021 •

edited

Loading

Tratcher commented May 21, 2021 •

edited

Loading

alnikola commented May 24, 2021 •

edited

Loading