StaticStrideScheduler loops too many times #10366

@ejona86

Running WeightedRoundRobinLoadBalancerTest.pickByWeight_avgWeight_zeroCpuUtilization_withEps_customErrorUtilizationPenalty on my laptop once took 14.969s. That's surprising for this type of test; I'd have expected a max of ~1s. It seems the static stride isn't quite right and we have to loop over all the entries many times. The worst case for static stride should be looping through all addresses once. If we are doing more than that, something is wrong.

It appears the test was the first to run, so it will be slower than other tests because of class loading and the like. But I wouldn't expect it to be 15 seconds slow.

I did a quick hack to see if it was iterating too many times, and it was iterating way too many times.

diff --git a/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java b/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java
index d5d8c4d9e..792aa5450 100644
--- a/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java
+++ b/xds/src/main/java/io/grpc/xds/WeightedRoundRobinLoadBalancer.java
@@ -433,7 +433,9 @@ final class WeightedRoundRobinLoadBalancer extends RoundRobinLoadBalancer {
      * an offset that varies per backend index is also included to the calculation.
      */
     int pick() {
+      int i = 0;
       while (true) {
+        i++;
         long sequence = this.nextSequence();
         int backendIndex = (int) (sequence % this.sizeDivisor);
         long generation = sequence / this.sizeDivisor;
@@ -442,6 +444,8 @@ final class WeightedRoundRobinLoadBalancer extends RoundRobinLoadBalancer {
         if ((weight * generation + offset) % K_MAX_WEIGHT < K_MAX_WEIGHT - weight) {
           continue;
         }
+        if (i > 2*scaledWeights.length)
+          throw new RuntimeException(String.format("%d > 2*%d\n", i, scaledWeights.length));
         return backendIndex;
       }
     }
io.grpc.xds.WeightedRoundRobinLoadBalancerTest > pickByWeight_avgWeight_zeroCpuUtilization_withEps_customErrorUtilizationPenalty FAILED
    java.lang.RuntimeException: 98300 > 2*3

I wonder if we have an off-by-one caused by rounding, or some such.
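
To make the expectation concrete, here is a standalone sketch of the pick loop that counts iterations per pick. It is not the real class: `StrideSketch`, `lastIterations` and `maxIterations` are hypothetical names, and the values `K_MAX_WEIGHT = 0xFFFF` and `offset = K_MAX_WEIGHT / 2 * backendIndex` are assumptions, since the diff above does not show them. Under those assumptions, whenever some backend has a scaled weight equal to `K_MAX_WEIGHT` the skip condition can never be true for it, so a pick needs at most one pass over the backends. If every scaled weight is tiny, the loop can spin for on the order of `K_MAX_WEIGHT` generations.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the scheduler; only the loop condition comes from the diff. */
final class StrideSketch {
  // Assumed value; the diff only shows the name K_MAX_WEIGHT.
  static final int K_MAX_WEIGHT = 0xFFFF;

  private final int[] scaledWeights;
  private final AtomicLong sequence = new AtomicLong();
  long lastIterations; // loop iterations used by the most recent pick()

  StrideSketch(int[] scaledWeights) {
    this.scaledWeights = scaledWeights.clone();
  }

  int pick() {
    long iterations = 0;
    while (true) {
      iterations++;
      long seq = sequence.getAndIncrement();
      int backendIndex = (int) (seq % scaledWeights.length);
      long generation = seq / scaledWeights.length;
      long weight = scaledWeights[backendIndex];
      // Assumed per-backend offset; not visible in the diff.
      long offset = (long) K_MAX_WEIGHT / 2 * backendIndex;
      // Same skip condition as in the diff.
      if ((weight * generation + offset) % K_MAX_WEIGHT < K_MAX_WEIGHT - weight) {
        continue;
      }
      lastIterations = iterations;
      return backendIndex;
    }
  }

  /** Runs the given number of picks and returns the largest iteration count seen. */
  static long maxIterations(int[] scaledWeights, int picks) {
    StrideSketch sketch = new StrideSketch(scaledWeights);
    long max = 0;
    for (int p = 0; p < picks; p++) {
      sketch.pick();
      max = Math.max(max, sketch.lastIterations);
    }
    return max;
  }
}
```

In this sketch, `StrideSketch.maxIterations(new int[] {65535, 32768, 1}, 1000)` stays at or below 3, while `StrideSketch.maxIterations(new int[] {1, 1, 1}, 2)` comes out at 98300. That happens to match the count in the failure above, which hints that the scaled weights in the test may be ending up far below `K_MAX_WEIGHT`, but the sketch alone does not confirm that.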

CC @YifeiZhuang

I noticed because TSAN timed out after ~4 minutes. TSAN will be slow, but not 4 minutes slow.

Starting full thread dump ...

"main" Id=1 RUNNABLE
	at java.base/jdk.internal.misc.Unsafe.getIntVolatile(Native Method)
	at java.base/jdk.internal.misc.Unsafe.getAndAddInt(Unsafe.java:2343)
	at java.base/java.util.concurrent.atomic.AtomicInteger.getAndIncrement(AtomicInteger.java:182)
	at app//io.grpc.xds.WeightedRoundRobinLoadBalancer$StaticStrideScheduler.nextSequence(WeightedRoundRobinLoadBalancer.java:397)
	at app//io.grpc.xds.WeightedRoundRobinLoadBalancer$StaticStrideScheduler.pick(WeightedRoundRobinLoadBalancer.java:437)
	at app//io.grpc.xds.WeightedRoundRobinLoadBalancer$WeightedRoundRobinPicker.pickSubchannel(WeightedRoundRobinLoadBalancer.java:279)
	at app//io.grpc.xds.WeightedRoundRobinLoadBalancerTest.pickByWeight(WeightedRoundRobinLoadBalancerTest.java:328)
	at app//io.grpc.xds.WeightedRoundRobinLoadBalancerTest.pickByWeight_avgWeight_zeroCpuUtilization_withEps_customErrorUtilizationPenalty(WeightedRoundRobinLoadBalancerTest.java:463)
