retry extensions: implement "other priority" extension #4529
alyssawilk merged 16 commits into envoyproxy:master from
Conversation
Implements a RetryPriority which will keep track of attempted priorities and attempt to route retry requests to other priorities. The update frequency is configurable, allowing multiple requests to hit each priority if desired. As a fallback, when no healthy priorities remain, the list of attempted priorities will be reset and a host will be selected again using the original priority load. Signed-off-by: Snow Pettersen <snowp@squareup.com>
64081ff to f91c5ce
@alyssawilk do you mind taking this one on?
message OtherPriorityConfig {
  // How often the priority load should be updated based on previously attempted priorities. Useful
  // to allow each priority to receive more than one request before being excluded.
  int32 update_frequency = 1;
There might be some context I'm missing here but it's not obvious to me what the units of this frequency are.
The unit here is "number of attempts". I'll try to clarify the comment
The comment definitely helps, but reading it I misread and thought this was number_of_requests_using_default_priority. Can we extend it just a bit to say that the fourth and fifth will then use the priority load excluding priorities for the first 4 attempts?
We should also comment what happens when we run out of priority levels
std::vector<uint32_t>& per_priority_health_);

// The percentage load (0-100) for each priority level
std::vector<uint32_t> per_priority_load_;
Did you mean to make these vectors public?
void recalculatePerPriorityState(uint32_t priority);
void static recalculatePerPriorityState(uint32_t priority, const PrioritySet& priority_set,
                                        PriorityLoad& priority_load,
                                        std::vector<uint32_t>& per_priority_health_);
If this becomes a static function that doesn't access the members, I don't think the argument names want the _ suffix (and the comment should also probably be updated).
}

const Upstream::PriorityLoad&
determinePriorityLoad(const Upstream::PrioritySet& priority_set,
This function persists the address of priority_set after exiting. I'm not as familiar with this area of the code, so I don't know if this is an expected/safe thing to do. But I think there should be a comment about the ownership semantics here if this is indeed what we want to do.
Thinking about this some more I don't think this is actually safe, since the set is owned by the LB which is owned by the cluster. If a cluster is removed during routing, we'd run into issues accessing the set. I'll try to figure out a better way to keep the priority load in sync with membership updates, or perhaps not handle that at all and document the limitation.
Changed it to no longer persist the priority_set address and instead documented the limitation. While watching cluster membership is possible, it gets tricky due to the possibility of a cluster being removed, so I'd rather tackle that separately from adding this basic implementation
  priority, *priority_set_, per_priority_load_, per_priority_health_);
}

// Distributes priority load between priorities that should be considered after
// excluding attempted priorities.
void adjustForAttemptedPriorities();

uint32_t update_frequency_;
*/
class RetryPriorityNameValues {
public:
  // Previous host predicate. Rejects hosts that have already been tried.
This description doesn't seem to match.
if (attempted_priorites_.size() < update_frequency_) {
  return original_priority_load;
} else if (attempted_priorites_.size() % update_frequency_ == 0) {
  for (auto priority : attempted_priorites_) {
  }
}

per_priority_load_ = per_priority_load;
It's unclear to me why this doesn't adjust the per_priority_load in place. If it can't be adjusted in place, this might be worth refactoring out so the call here looks more like:
per_priority_load_ = refactoredRebalancingFunction(adjusted_per_priority_health)
I think it was a remnant of a previous iteration where I was short-circuiting during the update to fall back to the original values. I'll update it to update in place
Alyssa asked me to take a look at this - hopefully some comments to get you started.
Signed-off-by: Snow Pettersen <snowp@squareup.com>
alyssawilk
left a comment
Very cool to have a good example of our new retry framework. I'm going to have to do at least 2 passes to page back in all the health stuff but looks great so far!
// to allow each priority to receive more than one request before being excluded.
// For example, by setting this to 2, the first two attempts (initial attempt and one retry)
// will use the unmodified priority load. The third and fourth attempt will use a priority load
// which excludes the priorities routed to with the first two attempts.
I think it would be useful to have a more detailed explanation either here in the proto or in the load balancing docs* detailing about which priorities will be selected on retry and why.
*https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/load_balancing#priority-levels
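For illustration, a route retry policy using this extension might look roughly like the sketch below. The `update_frequency` field comes from the proto under review; the registered extension name and the exact embedding under `retry_policy` are my assumptions, not taken from this PR:

```yaml
retry_policy:
  retry_on: "5xx"
  num_retries: 4
  retry_priority:
    name: envoy.retry_priorities.other_priority  # assumed registration name
    config:
      update_frequency: 2  # rebuild the priority load every 2 attempts
```

With `num_retries: 4` and `update_frequency: 2`, attempts 1-2 would use the unmodified priority load, and attempts 3-4 would use a load that excludes the priorities hit by the first two attempts.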
// Initialize our local priority_load_ and priority_health_,
// keeping them in sync with the member update cb.
for (auto& host_set : priority_set.hostSetsPerPriority()) {
  recalculatePerPriorityState(host_set->priority(), priority_set);
I think this is fine for 2-3 priority levels but this might get expensive fast if one has a larger number of host sets and priorities. I think by nature of this code it's always going to have a lot of O(#priorities) work per pick - maybe we should have a warning comment about the scaling constraint here? I suspect very few people will have N>3 but still better to not surprise anyone :-P
We use something like 6-8 priorities so I totally share that concern. I'll add a note about potential perf issues
void adjustForAttemptedPriorities();

const uint32_t update_frequency_;
std::vector<uint32_t> attempted_priorites_;
attempted_priorites_ -> attempted_priorities_
}
attempted_priorites_.clear();

adjustForAttemptedPriorities();
I'd prefer we avoid recursion in tricky code like this - I think it's too easy to have a refactor result in a loop.
Given we've already verified that we aren't in the state where everything is unhealthy, what if we just factored out the prior section into a helper function and reran the helper if we hit this case?
Yeah that sounds better, I was being a bit too cute about it
//
// Note that changes made to the cluster during retries will not be reflected in the priority
// load of retries, so care should be taken when using this with long running requests that
// might retry.
Can we add more detail about what care they should take?
I had a moment of being convinced this would cause crashes due to the priority set being resized in a way not reflected by attempted_priorities before I recalled that priority_set_ always grows but never shrinks :-P
Realizing that the limitation I had in mind is due to me having the if (!initialized_) piece where I only read from the PrioritySet on the first retry, so changes to host health wouldn't be picked up on subsequent attempts. I can avoid this altogether by running recalculatePerPriorityState on each attempt to pick up changes to PrioritySet, although it would come at additional runtime cost. Any thoughts on this trade off?
Thinking about it some I think I prefer the slower approach as it's less surprising, so I'll update the comment and code to use that
Signed-off-by: Snow Pettersen <snowp@squareup.com>
@alyssawilk Wanna give this another look? I believe I've addressed the feedback
alyssawilk
left a comment
Looking good - just 2 little nits left!
// Attempt 4: P2
//
// Using this PriorityFilter requires rebuilding the priority load, which runs in O(# of
// priorities), which might incur significant overhead for clusters with many priorities.
This is fantastic. 5***** would read again :-)
    std::min<uint32_t>(total_load, per_priority_health[i] * 100 / total_health);
total_load -= per_priority_load[i];
}
Oh nice improvement. Mind adding both a load balancer test and adding this to the commit notes?
Huh this was actually added in #4533, not sure how this ended up in this diff. I'll see if I can fix it
Haha, the joys of git :-P
I merged in a bunch of unrelated commits and the diff went away /shrug
  }
}

std::pair<std::vector<uint32_t>, uint32_t> OtherPriorityRetryPriority::adjustedHealth() const {
Can we get your comment back?
// create an adjusted health view of the priorities, where attempted priorities are
// given a zero weight.
Signed-off-by: Snow Pettersen <snowp@squareup.com>
Implements a RetryPriority which will keep track of attempted priorities and attempt to route retry requests to other priorities. The update frequency is configurable, allowing multiple requests to hit each priority if desired. As a fallback, when no healthy priorities remain, the list of attempted priorities will be reset and a host will be selected again using the original priority load. Extracts out the recalculatePerPriorityState from LoadBalancerBase to recompute the priority load with the same code used by the LB. Signed-off-by: Snow Pettersen snowp@squareup.com Risk Level: Medium, new extension Testing: unit tests Docs Changes: n/a Release Notes: n/a Signed-off-by: Snow Pettersen <snowp@squareup.com> Signed-off-by: Aaltan Ahmad <aa@stripe.com>