
retry extensions: implement "other priority" extension#4529

Merged
alyssawilk merged 16 commits into envoyproxy:master from snowp:other-priority
Oct 4, 2018

Conversation


@snowp snowp commented Sep 25, 2018

Implements a RetryPriority which will keep track of attempted
priorities and attempt to route retry requests to other priorities. The
update frequency is configurable, allowing multiple requests to hit each
priority if desired.

As a fallback, when no healthy priorities remain, the list of attempted
priorities will be reset and a host will be selected again using the
original priority load.

Extracts out the recalculatePerPriorityState from LoadBalancerBase to
recompute the priority load with the same code used by the LB.

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Medium, new extension
Testing: UTs
Docs Changes: n/a
Release Notes: n/a
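The behavior described above (track attempted priorities, rebuild the priority load every `update_frequency` attempts, and fall back to the original load once no healthy priorities remain) can be sketched as a standalone model. This is an illustrative simplification based on the discussion in this PR, not Envoy's actual `OtherPriorityRetryPriority` class; the names and arithmetic here are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical, simplified model of the "other priority" retry behavior.
// Load and health are percentages (0-100) per priority level.
using PriorityLoad = std::vector<uint32_t>;

class OtherPriorityModel {
public:
  explicit OtherPriorityModel(uint32_t update_frequency)
      : update_frequency_(update_frequency) {}

  // Record the priority level a request attempt was routed to.
  void onHostAttempted(uint32_t priority) { attempted_priorities_.push_back(priority); }

  // Return the priority load to use for the next attempt.
  PriorityLoad determinePriorityLoad(const PriorityLoad& original_load,
                                     const std::vector<uint32_t>& per_priority_health) {
    // Only rebuild the load every update_frequency attempts; otherwise reuse
    // whatever was computed last (or the original load before any rebuild).
    if (attempted_priorities_.size() < update_frequency_ ||
        attempted_priorities_.size() % update_frequency_ != 0) {
      return current_load_.empty() ? original_load : current_load_;
    }
    // Build an adjusted health view where attempted priorities get zero weight.
    std::vector<uint32_t> adjusted_health = per_priority_health;
    for (uint32_t priority : attempted_priorities_) {
      adjusted_health[priority] = 0;
    }
    const uint32_t total_health =
        std::accumulate(adjusted_health.begin(), adjusted_health.end(), 0u);
    // Fallback: no healthy priorities remain, so reset the attempted list and
    // go back to the original priority load.
    if (total_health == 0) {
      attempted_priorities_.clear();
      current_load_ = original_load;
      return current_load_;
    }
    // Distribute 100% of load across the remaining healthy priorities,
    // proportionally to health (integer math may leave a small remainder).
    current_load_.assign(adjusted_health.size(), 0);
    uint32_t total_load = 100;
    for (size_t i = 0; i < adjusted_health.size(); ++i) {
      current_load_[i] =
          std::min<uint32_t>(total_load, adjusted_health[i] * 100 / total_health);
      total_load -= current_load_[i];
    }
    return current_load_;
  }

private:
  const uint32_t update_frequency_;
  std::vector<uint32_t> attempted_priorities_;
  PriorityLoad current_load_;
};
```

With `update_frequency = 1` and three healthy priorities, each retry excludes every priority tried so far, giving the P0 → P1 → P2 walk discussed in the review, then resetting once all have been attempted.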

@snowp changed the title from "retry extensions: implement other priority extension" to "retry extensions: implement "other priority" extension" Sep 25, 2018
Snow Pettersen added 3 commits September 25, 2018 12:15

junr03 commented Sep 26, 2018

@alyssawilk do you mind taking this one on?

message OtherPriorityConfig {
// How often the priority load should be updated based on previously attempted priorities. Useful
// to allow each priorities to receive more than one request before being excluded.
int32 update_frequency = 1;
Contributor

There might be some context I'm missing here but it's not obvious to me what the units of this frequency are.

Contributor Author

The unit here is "number of attempts". I'll try to clarify the comment

Contributor

The comment definitely helps but reading it I misread and thought this was number_of_requests_using_default_priority. Can we extend it just a bit to say that the fourth and fifth will then use the priority load excluding priorities for the first 4 attempts?

We should also comment what happens when we run out of priority levels

std::vector<uint32_t>& per_priority_health_);

// The percentage load (0-100) for each priority level
std::vector<uint32_t> per_priority_load_;
Contributor

Did you mean to make these vectors public?

Contributor Author

No, I'll fix

void recalculatePerPriorityState(uint32_t priority);
void static recalculatePerPriorityState(uint32_t priority, const PrioritySet& priority_set,
PriorityLoad& priority_load,
std::vector<uint32_t>& per_priority_health_);
Contributor

If this becomes a static function that doesn't access the members, I don't think the argument names want the _ suffix (and the comment should also probably be updated).

}

const Upstream::PriorityLoad&
determinePriorityLoad(const Upstream::PrioritySet& priority_set,
Contributor

This function persists the address of priority_set after exiting. I'm not as familiar with this area of the code, so I don't know if this is an expected/safe thing to do. But I think there should be a comment about the ownership semantics here if this is indeed what we want to do.

Contributor Author

Thinking about this some more I don't think this is actually safe, since the set is owned by the LB which is owned by the cluster. If a cluster is removed during routing, we'd run into issues accessing the set. I'll try to figure out a better way to keep the priority load in sync with membership updates, or perhaps not handle that at all and document the limitation.

Contributor Author

Changed it to no longer persist the priority_set address and instead documented the limitation. While watching cluster membership is possible, it gets tricky due to the possibility of the cluster being removed, so I'd rather tackle that separately from adding this basic implementation.

priority, *priority_set_, per_priority_load_, per_priority_health_);
}

// Distributes priority load between priorities that should be consider after
Contributor

typo: considered?

// excluding attempted priorities.
void adjustForAttemptedPriorities();

uint32_t update_frequency_;
Contributor

Is it const?

*/
class RetryPriorityNameValues {
public:
// Previous host predicate. Rejects hosts that have already been tried.
Contributor

This description doesn't seem to match.

Contributor Author

Oops, bad copy paste

if (attempted_priorites_.size() < update_frequency_) {
return original_priority_load;
} else if (attempted_priorites_.size() % update_frequency_ == 0) {
for (auto priority : attempted_priorites_) {
Contributor

nit: const auto

}
}

per_priority_load_ = per_priority_load;
Contributor

It's unclear to me why this doesn't adjust the per_priority_load in place. If it can't be adjusted in place, this might be worth refactoring out so the call here looks more like:

per_priority_load_ = refactoredRebalancingFunction(adjusted_per_priority_health)

Contributor Author

I think it was a remnant of a previous iteration where I was short-circuiting during the update to fall back to the original values. I'll change it to update in place.

@dnoe dnoe self-assigned this Sep 27, 2018

dnoe commented Sep 27, 2018

Alyssa asked me to take a look at this - hopefully some comments to get you started.

Snow Pettersen added 4 commits October 2, 2018 10:33
@alyssawilk left a comment

Very cool to have a good example of our new retry framework. I'm going to have to do at least 2 passes to page back in all the health stuff but looks great so far!

// to allow each priorities to receive more than one request before being excluded.
// For example, by setting this to 2, then the first two attempts (initial attempt and one retry)
// will use the unmodified priority load. The third and fourth attempt will use priority load which
// excludes the priorities routed to with the first two attempts.
Contributor

I think it would be useful to have a more detailed explanation either here in the proto or in the load balancing docs* detailing about which priorities will be selected on retry and why.

*https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/load_balancing#priority-levels


// Initialize our local priority_load_ and priority_health_,
// keeping them in sync with the member update cb.
for (auto& host_set : priority_set.hostSetsPerPriority()) {
recalculatePerPriorityState(host_set->priority(), priority_set);
Contributor

I think this is fine for 2-3 priority levels but this might get expensive fast if one has a larger number of host sets and priorities. I think by nature of this code it's always going to have a lot of O(#priorities) work per pick - maybe we should have a warning comment about the scaling constraint here? I suspect very few people will have N>3 but still better to not surprise anyone :-P

Contributor Author

We use something like 6-8 priorities so I totally share that concern. I'll add a note about potential perf issues

void adjustForAttemptedPriorities();

const uint32_t update_frequency_;
std::vector<uint32_t> attempted_priorites_;
Contributor

attempted_priorites_ -> attempted_priorities_

}
attempted_priorites_.clear();

adjustForAttemptedPriorities();
Contributor

I'd prefer we avoid recursion in tricky code like this - I think it's too easy to have a refactor result in a loop.

Given we've already verified that we aren't in the state where everything is unhealthy, what if we just factored out the prior section into a helper function and rerun the helper if we hit this case?

Contributor Author

Yeah that sounds better, I was being a bit too cute about it

//
// Note that changes made to the cluster during retries will not be reflected in the priority
// load of retries, so care should be taken when using this with long running requests that
// might retry.
Contributor

Can we add more detail about what care they should take?

I had a moment of being convinced this would cause crashes due to the priority set being resized in a way not reflected by attempted_priorities before I recalled that priority_set_ always grows but never shrinks :-P

Contributor Author

Realizing that the limitation I had in mind is due to me having the if (!initialized_) piece where I only read from the PrioritySet on the first retry, so changes to host health wouldn't be picked up on subsequent attempts. I can avoid this altogether by running recalculatePerPriorityState on each attempt to pick up changes to PrioritySet, although it would come at additional runtime cost. Any thoughts on this trade off?

Contributor Author

Thinking about it some I think I prefer the slower approach as it's less surprising, so I'll update the comment and code to use that

Snow Pettersen added 5 commits October 2, 2018 10:18

snowp commented Oct 4, 2018

@alyssawilk Wanna give this another look? I believe I've addressed the feedback

@alyssawilk left a comment

Looking good - just 2 little nits left!

// Attempt 4: P2
//
// Using this PriorityFilter requires rebuilding the priority load, which runs in O(# of
// priorities), which might incur significant overhead for clusters with many priorities.
Contributor

This is fantastic. 5***** would read again :-)

std::min<uint32_t>(total_load, per_priority_health[i] * 100 / total_health);
total_load -= per_priority_load[i];
}

Contributor

Oh nice improvement. Mind adding both a load balancer test and adding this to the commit notes?

Contributor Author

Huh this was actually added in #4533, not sure how this ended up in this diff. I'll see if I can fix it

Contributor

Haha, the joys of git :-P

Contributor Author

I merged in a bunch of unrelated commits and the diff went away /shrug
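For reference, the load-distribution snippet quoted in this thread can be restated as a standalone helper (an illustrative sketch, not the PR's exact code): each priority receives load proportional to its share of total health, capped so the percentages never sum past 100.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Distribute 100% of traffic across priority levels in proportion to their
// health percentage (0-100 each). Integer division may leave a small
// remainder unassigned; the min() cap guarantees the sum never exceeds 100.
std::vector<uint32_t> distributeLoad(const std::vector<uint32_t>& per_priority_health) {
  const uint32_t total_health =
      std::accumulate(per_priority_health.begin(), per_priority_health.end(), 0u);
  std::vector<uint32_t> per_priority_load(per_priority_health.size(), 0);
  uint32_t total_load = 100;
  for (size_t i = 0; i < per_priority_health.size(); ++i) {
    per_priority_load[i] =
        std::min<uint32_t>(total_load, per_priority_health[i] * 100 / total_health);
    total_load -= per_priority_load[i];
  }
  return per_priority_load;
}
```

For example, two fully healthy priorities split the load 50/50, while healths of 100 and 25 split it 80/20 (100·100/125 and 25·100/125).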

}
}

std::pair<std::vector<uint32_t>, uint32_t> OtherPriorityRetryPriority::adjustedHealth() const {
Contributor

Can we get your comment back?

// create an adjusted health view of the priorities, where attempted priorities are
// given a zero weight.
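The restored comment describes building a zero-weighted view of priority health; a minimal standalone sketch of that idea (hypothetical names, not the PR's exact signature):

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Return a copy of the per-priority health where every already-attempted
// priority is zeroed out, plus the resulting total health. A total of zero
// signals the caller to fall back to the original priority load.
std::pair<std::vector<uint32_t>, uint32_t>
adjustedHealth(const std::vector<uint32_t>& per_priority_health,
               const std::vector<uint32_t>& attempted_priorities) {
  std::vector<uint32_t> adjusted = per_priority_health;
  for (uint32_t priority : attempted_priorities) {
    adjusted[priority] = 0; // already tried: exclude from load distribution
  }
  const uint32_t total = std::accumulate(adjusted.begin(), adjusted.end(), 0u);
  return {adjusted, total};
}
```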

Snow Pettersen added 2 commits October 4, 2018 07:01
@alyssawilk alyssawilk merged commit ba5d3f0 into envoyproxy:master Oct 4, 2018
aa-stripe pushed a commit to aa-stripe/envoy that referenced this pull request Oct 11, 2018