upstream: locality weighted load balancing (#2892)
Conversation
@alyssawilk @zuercher This is ready for review, I'm particularly interested in @zuercher's thoughts on how this plays with subset LB (I haven't given that a lot of thought, that's the only missing bit AFAICT). For context, this rounds out locality balancing for the non-affinity case. I'll be following up next week with support for affinity balancers.

I'll try to take a look at this tonight or tomorrow, but I'll be out of the office all next week.
source/common/upstream/subset_lb.cc
I think not passing any LocalityWeights here will end up disabling this feature. It seems sufficient to just pass along original_host_set_.localityWeights(), though. That way the underlying load balancers have all the information to make the locality weighted load balancing decisions.
alyssawilk left a comment:
Given today's meeting slate, review load, and rotation I did a first pass on code only - I owe you a review of tests.
Looks good overall - the usual passel of nits follow :-P
source/common/upstream/eds.cc
I seem to remember you offering to make more of this structured data rather than vectors of pairs and such? It's optional (you're not making things much worse than they were) but it'd be nice!
This one I'll leave as is, since it's local to this function, so it would probably be less readable to start defining new types. I've been generally trying to avoid the contagion of vectors at the interface level.
source/common/upstream/eds.cc
Do we consider a weight change to be a change to the host or should this comment be updated?
It's a bit more nuanced in that weight changes for hosts don't cause us to recompute, but weight changes for localities do cause us to rebuild the schedule. This has to do with plumbing more than anything; we can do the same lazy weight updates at the locality level. I've updated the log and will leave some TODOs to optimize this later.
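The lazy weight update being discussed can be illustrated outside Envoy. The sketch below is hypothetical (the `Host`, `Entry`, and `pickWithRefresh` names are invented and do not match Envoy's actual code): instead of rebuilding the queue when a weight changes, the stale weight is detected and corrected at pick time.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <vector>

// Hypothetical minimal host with a mutable load-balancing weight.
struct Host {
  double weight;
};

struct Entry {
  double deadline;
  double stored_weight;  // weight captured at insertion time; may go stale.
  std::shared_ptr<Host> host;
  bool operator>(const Entry& other) const { return deadline > other.deadline; }
};

using Queue = std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>>;

// Lazy weight update: rather than rebuilding the queue whenever a host's
// weight changes, detect the stale weight at pick time and reschedule the
// entry under its current weight before honoring the pick.
std::shared_ptr<Host> pickWithRefresh(Queue& q) {
  while (true) {
    Entry e = q.top();
    q.pop();
    if (e.stored_weight != e.host->weight) {
      // Stale: recompute the deadline as if the entry had been inserted
      // with the current weight, then retry the pick.
      q.push({e.deadline - 1.0 / e.stored_weight + 1.0 / e.host->weight,
              e.host->weight, e.host});
      continue;
    }
    // Fresh: advance the deadline by 1/weight and re-insert.
    q.push({e.deadline + 1.0 / e.stored_weight, e.stored_weight, e.host});
    return e.host;
  }
}
```

The design trade-off mentioned above shows up here: the refresh costs one extra pop/push per stale entry, which for a small number of localities may indeed cost about the same as just rebuilding the scheduler.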
include/envoy/upstream/upstream.h
if performing locality weighted balancing? If locality weighted balancing is configured?
This is such a nice clean change given the prior refactor - thanks!
TODO for follow-up totally reasonable but I'd expect the host list to change super frequently and the locality list to change infrequently, and the weights to change infrequently. Do we really want to rebuild this every time?
This also seems like a nice utility function to break out for unit testing.
I agree; as mentioned above, we should probably lazily rebuild as we age entries out of the scheduler during picks. I think it's better as a TODO to tune as this starts to matter (rebuilding with few localities is still really cheap). We would need to first verify that the locality indexes haven't changed (i.e. the map from host vector per locality to absolute locality), then we could just update the local map; this might cost more than just rebuilding the scheduler for a small number of localities.
The internal scheduling structures always auto-add after pick. I think it makes the code a bit easier to use but it's definitely not normal priority queue logic. Your call if you want to leave it this way for C++ consistency but I figured I'd at least mention it.
Yeah, this is actually deliberate, as we use it in host-level picking for the lazy weight updates. I'll leave it as is, since as discussed above, we'd like to do this at the locality level as well.
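For readers unfamiliar with the auto-add-after-pick behavior being discussed, here is a minimal, hypothetical EDF (earliest deadline first) scheduler sketch. The class and method names are illustrative and are not Envoy's actual scheduler API: each `pick()` pops the entry with the earliest deadline and immediately re-inserts it with the deadline advanced by `1/weight`, which is what makes repeated picks approximate weighted round-robin.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <queue>
#include <tuple>
#include <vector>

// Simplified EDF scheduler sketch (names are hypothetical).
template <class T> class EdfScheduler {
public:
  void add(double weight, std::shared_ptr<T> entry) {
    // An entry's deadline advances by 1/weight each round, so entries with
    // larger weights are picked proportionally more often.
    queue_.push({current_time_ + 1.0 / weight, order_++, weight, entry});
  }

  // Pops the entry with the earliest deadline and immediately re-inserts it
  // ("auto-add after pick"): callers never re-add entries themselves.
  std::shared_ptr<T> pick() {
    EdfEntry e = queue_.top();
    queue_.pop();
    current_time_ = e.deadline;
    add(e.weight, e.entry);
    return e.entry;
  }

private:
  struct EdfEntry {
    double deadline;
    uint64_t order;  // FIFO tie-breaker for equal deadlines.
    double weight;
    std::shared_ptr<T> entry;
    bool operator>(const EdfEntry& other) const {
      return std::tie(deadline, order) > std::tie(other.deadline, other.order);
    }
  };
  std::priority_queue<EdfEntry, std::vector<EdfEntry>, std::greater<EdfEntry>> queue_;
  double current_time_{0.0};
  uint64_t order_{0};
};
```

As noted above, the re-add-on-pick contract is not normal priority queue behavior, but it keeps call sites simple and is what enables the lazy weight updates at pick time.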
lazyreview: I'm sure we have coverage of this but do we have functional unit testing? It'd be nice to have, even if we're just querying the scheduler or some such.
We have this coverage from HostSetImplLocalityTest. I think these tests are fairly unit/functional level, but they do include the scheduler. LMK if you want me to make this method public and I can do the finer grained test.
Comments would be nice. I got confused reading this code, sanity checked against the priority weights, realized it was the same calculation and got sorted out by my comments over there ;-)
Also we should consider defining 1.4 as a kOverProvisioningFactor constant so if we make it configurable we remember to update it in both places.
Signed-off-by: Harvey Tuch <htuch@google.com>
Force-pushed 29fc19d to 118336d.
@zuercher I've left a TODO for subset LB, as I think the situation is a bit more complicated than just plumbing through the original locality weights. Here's the copy+paste of the comment:

builds look unhappy. Still worth me taking another look?

@alyssawilk yeah, please take a look, I'm sure it's something minor, will dig into this.
Signed-off-by: Harvey Tuch <htuch@google.com>
// Priority levels are considered overprovisioned with this factor. This means that we don't
// consider a priority level unhealthy until the ratio of healthy hosts multiplied by
// kOverProvisioningFactor drops below 1.0.
static constexpr double kOverProvisioningFactor = 1.4;
If we're defining these in 2 places can we have a unit test making sure they stay in sync?
Do we need them to stay in sync? I think you might conceivably want to use different setting for priority vs. locality level..
At the end of the day they're both talking about how many extra hosts you have. If we ever allow configuring per-hostset we'd have to then adapt the priority level over-provisioning to take that into account. So yes, I think they need to stay in sync.
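To make the over-provisioning semantics concrete, here is a small sketch (not Envoy's code; `effectiveHealthPercent` is an invented name) of how a single shared kOverProvisioningFactor could feed into the availability of a priority level or locality, per the quoted comment: the level stays fully available until the healthy ratio times the factor drops below 1.0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Over-provisioning factor from the quoted comment, defined once so that
// priority-level and locality-level calculations stay in sync.
constexpr double kOverProvisioningFactor = 1.4;

// Effective availability of a priority level (or locality), in percent.
// The level is treated as fully available until healthy/total drops below
// 1 / kOverProvisioningFactor (~71%); below that, availability degrades
// linearly.
uint32_t effectiveHealthPercent(uint32_t healthy, uint32_t total) {
  if (total == 0) {
    return 0;
  }
  const double ratio = static_cast<double>(healthy) / total;
  return static_cast<uint32_t>(
      std::round(std::min(100.0, 100.0 * ratio * kOverProvisioningFactor)));
}
```

For example, 80 healthy hosts out of 100 still yields 100% availability (0.8 × 1.4 > 1.0), while 50 out of 100 yields 70%.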
test/common/upstream/eds_test.cc
EXPECT_EQ(nullptr, cluster_->prioritySet().hostSetsPerPriority()[0]->localityWeights());
}

// Validate that onConfigUpdate() propagates locality weights to the host set when locality
Signed-off-by: Harvey Tuch <htuch@google.com>
…579) See envoyproxy/envoy#2892. Signed-off-by: Harvey Tuch <htuch@google.com>
Underlying issue: #2725
Risk Level: Low (only enabled when explicitly configured).
Testing: Unit tests for EDS, LoadBalancerImpl and UpstreamImpl. The load stats integration test
provides end-to-end validation.
Docs Changes: envoyproxy/data-plane-api#565
Release Notes: envoyproxy/data-plane-api#579
Signed-off-by: Harvey Tuch <htuch@google.com>