
[Bug]: Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip #10295

Closed
fvaleri opened this issue Jul 3, 2024 · 11 comments · Fixed by #10339

@fvaleri
Contributor

fvaleri commented Jul 3, 2024

Bug Description

The system test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip often fails with a timeout when running the entire class locally, while it always passes when run in isolation. I think this was already observed by other people before.

Steps to reproduce

mvn verify -B -Pall -pl systemtest -Dit.test="TopicReplicasChangeST" \
  -DexcludedGroups=flaky,loadbalancer -Dmaven.javadoc.skip -Dskip.surefire.tests

Expected behavior

The test should always pass.

Strimzi version

main

Kubernetes version

1.30

@scholzj
Member

scholzj commented Jul 3, 2024

It does not seem to be a problem in the Azure pipelines.

@fvaleri
Contributor Author

fvaleri commented Jul 3, 2024

It does not seem to be a problem in the Azure pipelines.

I know, but have you tried locally? It would be good to have some confirmation besides me.

@scholzj
Member

scholzj commented Jul 3, 2024

Funnily enough, I saw it fail in the very next ST run after I wrote that it doesn't seem to fail. :-( That said, I still think it fails only rarely in the CI. There are several tests that are significantly more flaky. The general issue with this is that unless it is a real problem, nobody is going to jump on it and fix it. So if you think it is flaky, it will usually be up to you to deal with it.

I know, but have you tried locally? It would be good to have some confirmation besides me.

Interesting, I ran it 5 times and it failed twice. Normally my local is much more reliable than Azure, not the other way around.

@fvaleri
Contributor Author

fvaleri commented Jul 3, 2024

There are several tests that are significantly more flaky.

Sure, and we should log them so that they can eventually be fixed.

This one is tricky, as you also found. The test looks correct, and we already applied some optimizations to make the CC model generation faster. There must be some race condition going on, possibly related to the testing infrastructure.
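
For context, "making the CC model generation faster" means tuning Cruise Control's metrics sampling so that a valid cluster model is available sooner. A minimal sketch of the kind of overrides involved, assuming standard Cruise Control config keys; the values are illustrative only and not necessarily what the test class actually sets (in Strimzi they would go under spec.cruiseControl.config of the Kafka custom resource):

import java.util.Map;

public class CruiseControlModelTuningSketch {

    public static void main(String[] args) {
        // Illustrative Cruise Control overrides that shorten the metrics sampling and windows
        // so the cluster model becomes valid sooner. Keys are standard Cruise Control configs;
        // the values are examples, not the actual optimization referenced in this thread.
        Map<String, Object> cruiseControlConfig = Map.of(
            "metric.sampling.interval.ms", 30_000,          // sample more often than the 120 s default
            "partition.metrics.window.ms", 30_000,          // shorter partition metrics windows
            "broker.metrics.window.ms", 30_000,             // shorter broker metrics windows
            "min.samples.per.partition.metrics.window", 1   // accept a window with a single sample
        );
        cruiseControlConfig.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}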

@scholzj
Member

scholzj commented Jul 3, 2024

I think all of the flaky tests are related to Cruise Control. It is really unpredictable. Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal.

@fvaleri
Contributor Author

fvaleri commented Jul 4, 2024

That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that.

@see-quick
Member

see-quick commented Jul 4, 2024

That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that.

Correct. I'm not sure if increasing the timeout would solve a race there. There have been countless problems with CC, where many race conditions happen.

Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal.

This is true... and even with our optimizations it is still unpredictable sometimes. So I'm not sure how we could improve it further.

@fvaleri
Contributor Author

fvaleri commented Jul 4, 2024

The first thing we have to confirm is whether this issue is actually related to the cluster model not being ready in time, which I'm not sure about at the moment. I'll try to have a look if I find some time.

@Frawless
Member

Frawless commented Jul 9, 2024

The test executor should collect logs from the operands and the CO, so I guess the problem should be visible in the CC log, or maybe in the operator/Kafka logs.

@scholzj
Member

scholzj commented Jul 11, 2024

Discussed on the community call on 10.7.2024: @fvaleri plans to look at this and try to improve the stability.

@fvaleri
Contributor Author

fvaleri commented Jul 15, 2024

It took me many runs to hit the timeout a couple of times.
As we thought, it is related to Cruise Control's cluster model not being ready in time.
This is confirmed by the TO logs, where more than 5 minutes pass between the first pending state and the ongoing state.

# test logs
2024-07-13 08:05:07 [main] INFO  [KafkaTopicUtils:315] Starting wait for condition: replica change status not present
io.strimzi.test.WaitException: Timeout after 300000 ms waiting for replica change status not present
	at io.strimzi.test.TestUtils.waitFor(TestUtils.java:182)
	at io.strimzi.test.TestUtils.waitFor(TestUtils.java:133)
	at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForCondition(KafkaTopicUtils.java:316)
	at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForReplicaChangeStatusNotPresent(KafkaTopicUtils.java:377)
	at io.strimzi.systemtest.operators.topic.TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip(TopicReplicasChangeST.java:170)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
        ....	

# TO logs
2024-07-13 08:04:57,18740 INFO  [LoopRunnable-0] ReplicasChangeHandler:176 - Replicas change pending, Topics: [my-topic-1976557641-236539828]
...
2024-07-13 08:10:11,05290 INFO  [LoopRunnable-0] ReplicasChangeHandler:184 - Replicas change ongoing, Topics: [my-topic-1976557641-236539828]

# CC logs
2024-07-13 08:05:09 WARN  OperationRunnable:50 - Received exception when trying to execute runnable for "Update Topic Configuration"
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There are only 0 valid windows when aggregating in range [-1, 1720857909207] for aggregation options (minValidEntityRatio=1.00, minValidEntityGroupRatio=0.00, minValidWindows=1, numEntitiesToInclude=68, granularity=ENTITY)
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.computeResult(GoalBasedOperationRunnable.java:167) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:157) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:73) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.OperationRunnable.run(OperationRunnable.java:45) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.run(GoalBasedOperationRunnable.java:36) ~[cruise-control-2.5.137.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
	....

I opened a new PR which should help make the model generation faster.
If that doesn't work, we could try raising the timeout to 8 or 10 minutes.
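
For reference, raising that timeout would just mean passing a larger value to the wait helper that produced the failure above. A minimal sketch of the idea, using a generic stand-in for io.strimzi.test.TestUtils.waitFor; the constant names, the 10-minute value, and the readiness check are illustrative only, not the actual test code:

import java.time.Duration;
import java.util.function.BooleanSupplier;

public class ReplicaChangeWaitSketch {

    // Hypothetical constants: 10 minutes instead of the 300000 ms (5 minutes) seen in the failure above.
    private static final long REPLICA_CHANGE_TIMEOUT_MS = Duration.ofMinutes(10).toMillis();
    private static final long POLL_INTERVAL_MS = Duration.ofSeconds(5).toMillis();

    // Generic wait loop standing in for io.strimzi.test.TestUtils.waitFor; the real helper
    // throws io.strimzi.test.WaitException on timeout, a plain RuntimeException is used here.
    static void waitFor(String description, long pollIntervalMs, long timeoutMs, BooleanSupplier ready) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!ready.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new RuntimeException("Timeout after " + timeoutMs + " ms waiting for " + description);
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // Illustrative readiness check: in the real test this inspects the KafkaTopic status
        // and waits until the replicasChange section disappears.
        BooleanSupplier replicaChangeStatusGone = () -> true;
        waitFor("replica change status not present", POLL_INTERVAL_MS, REPLICA_CHANGE_TIMEOUT_MS, replicaChangeStatusGone);
    }
}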
