Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic threshold #83

Closed
Tratcher opened this issue Apr 23, 2020 · 14 comments · Fixed by #1024
Closed

Panic threshold #83

Tratcher opened this issue Apr 23, 2020 · 14 comments · Fixed by #1024
Assignees
Labels
Type: Enhancement New feature or request
Milestone

Comments

@Tratcher
Copy link
Member

#78 (comment)
@davidni: Something to keep in mind is what to do when a large portion of endpoints in a backend are unhealthy. Envoy for example has a panic mode where it will prefer to try to route to endpoints it knows to be unhealthy, because it may be better than the alternative of not routing at all. See e.g. Panic threshold

@Tratcher Tratcher added the Type: Enhancement New feature or request label Apr 23, 2020
@karelz karelz added this to the Backlog milestone May 12, 2020
@Tratcher
Copy link
Member Author

Tratcher commented Jul 9, 2020

Design notes: Such a panic check might happen here when first pulling the list of healthy destinations. If it's empty, or below some threshold, consider using the entire list of destinations instead.

if (dynamicState.HealthyDestinations.Count == 0)
{
Log.NoHealthyDestinations(_logger, routeConfig.Route.RouteId, cluster.ClusterId);
context.Response.StatusCode = 503;
return Task.CompletedTask;
}
context.Features.Set(cluster);
context.Features.Set<IAvailableDestinationsFeature>(new AvailableDestinationsFeature() { Destinations = dynamicState.HealthyDestinations });

Alternate proposal: The health check service could have that kind of logic built in when building the list of healthy destinations.

@karelz
Copy link
Member

karelz commented May 13, 2021

Triage: In 1.0 we need panic response/policy

@karelz karelz modified the milestones: Backlog, YARP 1.0.0 May 13, 2021
@alnikola alnikola self-assigned this May 17, 2021
@Tratcher
Copy link
Member Author

Tratcher commented May 19, 2021

Design notes:

  • If you tried to implement this as part of health checks today:
    • For active health checks you could do it by replacing the IDestinationHealthUpdater, having it check the results and overriding them if it determined that too many were marked as unhealthy.
    • Replacing the health state results in a loss of information and should be avoided.
    • Even IDestinationHealthUpdater would have trouble doing this for passive health checks. They only check and update one destination at a time.
    • I think this rules out trying to build a panic mode into the health check system.
  • Doing the panic check in middleware is pretty simple:
    • This has the advantage of not changing any persistent state, only request state.
    • This is still some per-request overhead, especially if you wanted to do any other kind of filtering.
public Task PanicMiddleware(HttpContext context, Func<Task> next)
{
    var proxyFeature = context.Features.Get<IReverseProxyFeature>();

    if (proxyFeature.AvailableDestinations.Count == 0) // or a percentage of AllDestinations
    {
        // Panic!
        proxyFeature.AvailableDestinations = proxyFeature.AllDestinations;
    }

    return next();
}
  • What if we made the AvailableDestinations generation code extensible?
    • That logic is currently embedded in ClusterState and there's no way to override it.
    • The current logic includes Healthy and Unknown endpoints, excluding Unhealthy ones.
    • What if someone wanted to exclude Unknown endpoints as well? Or to only include them if there weren't enough Healthy endpoints?
    • Would this be a global service, or would you want to apply different policies per cluster? Or both (configurable per cluster with a replaceable default)?
    • It would be a nice cleanup to get this functional logic off of ClusterState
    • ClusterState.DynamicState would have to be publicly settable.
    • Figuring out the locking might be tricky.
    • Possible API (with the existing bad names from ClusterState):
public interface IClusterUpdator
{
    // Triggers a refresh but doesn't block if one is already in progress. Called from health checks to notify of state changes.
    void UpdateDynamicState(ClusterState);
    // Triggers a refresh and always waits for it to finish. Called when config changes.
    void ProcessDestinationChanges(ClusterState cluster);
}

Side note: Do we want to add a Degraded health state so it's not such a sharp cliff from Healthy to Unhealthy? Then an update algorithm could say "Only include the healthy ones, unless there aren't enough then include the degraded ones, unless there's not enough then include the unknown ones, unless there's not enough then include the unhealthy ones."
(Filed as #1011)

@Tratcher Tratcher changed the title Load balancing panic threshold Panic threshold May 20, 2021
@alnikola
Copy link
Contributor

Thoughts on the above design suggestions:

  • Agree, that panic mode should be separated from the health checks
  • PanicMiddleware options seems goods if we make it extensible with the new IPanicPolicy interface
  • But, extracting AvailableDestinations update code into a new service seems even better because currently it seems as feature gap in YARP extensibility mechanism. Let's extract the most of current ClusterState.UpdateDynamicStateInternal() logic into IAvaliableDestinationsCalculator and extend it with IPanicPolicy
  • IPanicPolicy should be configurable per cluster
  • By design Unknown health state has 2 purposes: indicate that the actual health state is not calculated yet AND make reactivated destination eligible for receiving traffic to see if they actually recovered. Thus, I don't want to remove Unknown destinations from AvailableDestinations
  • One option to avoid locking on ClusterState problem is to invert control. Meaning, to pass an instance of IAvaliableDestinationsCalculator into ClusterState.UpdateDynamicStateInternal()
public interface IPanicPolicy
{
  // Decides if it's time to start panic depending on how many destinations are available out of all of them
  bool IsInPanic(HealthCheckConfig config, IReadOnlyList<DestinationState> allDestinations, IReadOnlyList<DestinationState> availableDestinations);
}

public interface IAvaliableDestinationsCalculator
{
  // Selects available destinations based on active and passive health states as well as applies a panic policy
  GetAvailalableDestinations(HealthCheckConfig config, IReadOnlyList<DestinationState> allDestinations)
}

public sealed class ClusterState
{
  ...

  public void UpdateDynamicState(IAvaliableDestinationsCalculator caculator)
  {
     UpdateDynamicStateInternal(caculator, force: false);
  }

  public void ProcessDestinationChanges(IAvaliableDestinationsCalculator caculator)
  {
     UpdateDynamicStateInternal(caculator, force: false);
  }

  private void UpdateDynamicStateInternal(IAvaliableDestinationsCalculator caculator, bool force)
  {
    ...

    lock (_stateLock)
    {
      // Most logic is moved into IAvaliableDestinationsCalculator
      var allDestinations = _destinationsSnapshot;
      var availableDestinations = caculator.GetAvailalableDestinations(_model?.Config.HealthCheck, allDestinations);
      _dynamicState = new ClusterDynamicState(allDestinations, availableDestinations);
    }
  }
}

@Tratcher
Copy link
Member Author

Tratcher commented May 21, 2021

  • By design Unknown health state has 2 purposes: indicate that the actual health state is not calculated yet AND make reactivated destination eligible for receiving traffic to see if they actually recovered. Thus, I don't want to remove Unknown destinations from AvailableDestinations

Fair point, unknown can't be excluded for passive checks. That makes me think people may customize this policy for active vs passive checks. Right now if both are enabled then both must be healthy/unknown. Imagine a policy that started with the defaults, but if nothing was available then it would fall back to any that were marked healthy/unknown by passive checks, excluding active checks. If that failed then may opt to disable proxying (return an empty list) or fall back to the whole list.

  • One option to avoid locking on ClusterState problem is to invert control. Meaning, to pass an instance of IAvaliableDestinationsCalculator into ClusterState.UpdateDynamicStateInternal()

That's a nice way to deal with the lock ownership issue. I'd prefer to move the methods off ClusterState entirely, but I don't know if that's practical yet. Could IAvaliableDestinationsCalculator use a weak reference table for cluster locks?

As for the API, I think it should be more general than panic mode, it should let people compose the list themselves:

public interface IAvaliableDestinationsPolicy
{
   // The name of the policy
   string Name { get; }
  // Reviews allDestinations and returns which should be considered as available for use.
  IReadOnlyList<DestinationState> GetAvailalableDestinations(ClusterConfig config, IReadOnlyList<DestinationState> allDestinations);
}

// This assumes ClusterState manages the locking:
public interface IAvaliableDestinationsCalculator
{
  // Looks up the per-cluster policy and invokes it
  IReadOnlyList<DestinationState> GetAvailalableDestinations(ClusterConfig config, IReadOnlyList<DestinationState> allDestinations)
}

// Alternative that manages the locking itself:
// Looks up the per-cluster policy, and invokes it,  and updates ClusterState.DynamicState directly.
public interface IAvaliableDestinationsCalculator
{
  // Called for health state changes, debounces
  UpdateAvailableDestinations(ClusterState cluster)
  // Called for config changes, blocking
  UpdateAllDestinations(ClusterState cluster)
}

A simple panic policy could be implemented by deriving from the base implementation and only taking action if the base returned an empty list. A more advanced policy could completely customize the list generation.

@samsp-msft
Copy link
Contributor

I prefer the idea of separating this out into its own module/stage in the pipeline.
Is there a way of adding health information to each destination that is passed to this module. It could then have passive & active state and decide what it wants to do.

@Tratcher
Copy link
Member Author

@samsp-msft the downside of middleware is that the results have to be re-computed per request, adding allocations and CPU overhead. It's computed in the background right now because the results aren't expected to change per request, only when health or config changes.

@samsp-msft
Copy link
Contributor

@Tratcher - doh. So it needs a callback when health status has changed for a node in a cluster, and that in turn will compute the AvailableDestinations?

@Tratcher
Copy link
Member Author

Tratcher commented May 21, 2021

Right, today ClusterState.UpdateDynamicState() is that callback, and with this change it might move to IAvaliableDestinationsCalculator.UpdateAvailableDestinations.

@alnikola
Copy link
Contributor

alnikola commented May 24, 2021

I'd prefer to move the methods off ClusterState entirely, but I don't know if that's practical yet. Could IAvaliableDestinationsCalculator use a weak reference table for cluster locks?

It seems a bit overengineered in my opinion. At least at this moment. I'd prefer to start with a simpler solution (i.e. keeping locking logic in ClusterState) and then make it advanced if necessary.

Regarding IAvaliableDestinationsPolicy and IAvaliableDestinationsCalculator. The proposal seems reasonable overall, but in this case IAvaliableDestinationsCalculator won't have too much logic because it will only call IAvaliableDestinationsPolicy and update AvailableDestinations on ClusterState. I'm a bit concerned by this fact.

@Tratcher
Copy link
Member Author

I'd prefer to start with a simpler solution (i.e. keeping locking logic in ClusterState) and then make it advanced if necessary.

The main goal for getting the methods off ClusterState is that ClusterState is not extensible and shouldn't own any functionality, only state.

IAvaliableDestinationsCalculator won't have too much logic because it will only call IAvaliableDestinationsPolicy and update AvailableDestinations on ClusterState. I'm a bit concerned by this fact.

Right, IAvaliableDestinationsCalculator will be pretty small. It's job is to do the per-cluster lookup of the policies (and maybe the locking). You need it mainly because ClusterState itself doesn't have access to DI to retrieve policies, and you don't want to implement the policy lookup in every health check.

@alnikola
Copy link
Contributor

The main goal for getting the methods off ClusterState is that ClusterState is not extensible and shouldn't own any functionality, only state.

OK, understood. Maintaining a collection of weak references to lock objects is basically the same as the caching we already have, so it will be easy to implement.

@alnikola
Copy link
Contributor

Thus, it seems we can completely remove remove available destinations calculation logic from ClusterState.

@alnikola
Copy link
Contributor

Does anybody else have some more thoughts/objections regarding this? Otherwise, I will start implementing it since the design seems to be clear enough.
cc: @davidfowl @MihaZupan

alnikola added a commit that referenced this issue May 28, 2021
)

Cluster's destination update logic is extracted into a separate service `IClusterDestinationsUpdater` which is now responsible for updating `ClusterDynamicState` (it's now renamed to `ClusterDestinationsState`), specifically updating the full destination collection and filtering the destinations available for proxying requests to.
This PR also adds `IAvailableDestinationsPolicy` that decides which destinations from the full collection should be available for requests. There are 2 such policies are implemented:
- `HealtyOrUnknownDesitnationsPolicy` decides that a destination is available if active AND passive health states are either 'Healthy' or 'Unknown'
- `HealthyOrPanicDestinationsPolicy` calls `HealtyOrUnknownDesitnationsPolicy` as the first step and then checks if the returned available destination collection is empty. If it's **not** empty, this collection is returned as the result, otherwise the full collection containing all destinations is returned. This implements 'panic' health evaluation strategy.

Fixes #83
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants