
[Bug]: Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip #10295

Closed
fvaleri opened this issue Jul 3, 2024 · 11 comments · Fixed by #10339

@fvaleri
Contributor

fvaleri commented Jul 3, 2024

Bug Description

The system test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip often fails with a timeout when running the entire class locally, while it always passes when run in isolation. I think this was already observed by other people before.

Steps to reproduce

mvn verify -B -Pall -pl systemtest -Dit.test="TopicReplicasChangeST" \
  -DexcludedGroups=flaky,loadbalancer -Dmaven.javadoc.skip -Dskip.surefire.tests

Expected behavior

The test should always pass.

Strimzi version

main

Kubernetes version

1.30

@scholzj
Member

scholzj commented Jul 3, 2024

It does not seem to be a problem in the Azure pipelines.

@fvaleri
Contributor Author

fvaleri commented Jul 3, 2024

It does not seem to be a problem in the Azure pipelines.

I know, but have you tried locally? It would be good to have some confirmation besides me.

@scholzj
Member

scholzj commented Jul 3, 2024

Funnily enough, I saw it fail in the very next ST run after I wrote that it doesn't seem to fail. :-( That said, I still think it fails only rarely in the CI. There are several tests that are significantly more flaky. The general issue with this is that unless it is a real problem, nobody is going to jump on it and fix it. So if you think it is flaky, it will usually be up to you to deal with it.

I know, but have you tried locally? It would be good to have some confirmation besides me.

Interesting, I ran it 5 times and it failed twice. Normally my local is much more reliable than Azure, not the other way around.

@fvaleri
Contributor Author

fvaleri commented Jul 3, 2024

There are several tests that are significantly more flaky.

Sure, and we should log them so that they can eventually be fixed.

This one is tricky, as you also found. The test looks correct, and we already applied some optimizations to make the CC model generation faster. There must be some race condition going on, possibly related to the testing infrastructure.
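
For context, "making the CC model generation faster" means tuning Cruise Control's metrics sampling so that a valid cluster model is available sooner. A minimal sketch of the kind of overrides involved, assuming standard Cruise Control config keys; the values are illustrative only and not necessarily what the test class actually sets (in Strimzi they would go under spec.cruiseControl.config of the Kafka custom resource):

import java.util.Map;

public class CruiseControlModelTuningSketch {

    public static void main(String[] args) {
        // Illustrative Cruise Control overrides that shorten the metrics sampling and windows
        // so the cluster model becomes valid sooner. Keys are standard Cruise Control configs;
        // the values are examples, not the actual optimization referenced in this thread.
        Map<String, Object> cruiseControlConfig = Map.of(
            "metric.sampling.interval.ms", 30_000,          // sample more often than the 120 s default
            "partition.metrics.window.ms", 30_000,          // shorter partition metrics windows
            "broker.metrics.window.ms", 30_000,             // shorter broker metrics windows
            "min.samples.per.partition.metrics.window", 1   // accept a window with a single sample
        );
        cruiseControlConfig.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}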

@scholzj
Member

scholzj commented Jul 3, 2024

I think all of the flaky tests are related to Cruise Control. It is really unpredictable. Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal.

@fvaleri
Contributor Author

fvaleri commented Jul 4, 2024

That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that.

@see-quick
Member

see-quick commented Jul 4, 2024

That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that.

Correct. I'm not sure if increasing the timeout would solve a race there. There have been countless problems with CC, where many race conditions happen.

Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal.

This is true... and even with our optimizations it is still unpredictable sometimes. So I'm not sure how we could improve it further.

@fvaleri
Contributor Author

fvaleri commented Jul 4, 2024

The first thing we have to confirm is whether this issue is actually related to the cluster model not being ready in time, which I'm not sure about at the moment. I'll try to have a look if I find some time.

@Frawless
Member

Frawless commented Jul 9, 2024

The test executor should collect logs from the operands and the CO, so I guess the problem should be visible in the CC log, or maybe in the operator/Kafka logs.

@scholzj
Member

scholzj commented Jul 11, 2024

Discussed on the community call on 10.7.2024: @fvaleri plans to look at this and try to improve the stability.

@fvaleri
Contributor Author

fvaleri commented Jul 15, 2024

It took me many runs to hit the timeout a couple of times.
As we thought, it is related to Cruise Control's cluster model not being ready in time.
This is confirmed by the TO logs, where more than 5 minutes pass between the first pending state and the ongoing state.

# test logs
2024-07-13 08:05:07 [main] INFO  [KafkaTopicUtils:315] Starting wait for condition: replica change status not present
io.strimzi.test.WaitException: Timeout after 300000 ms waiting for replica change status not present
	at io.strimzi.test.TestUtils.waitFor(TestUtils.java:182)
	at io.strimzi.test.TestUtils.waitFor(TestUtils.java:133)
	at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForCondition(KafkaTopicUtils.java:316)
	at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForReplicaChangeStatusNotPresent(KafkaTopicUtils.java:377)
	at io.strimzi.systemtest.operators.topic.TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip(TopicReplicasChangeST.java:170)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
        ....	

# TO logs
2024-07-13 08:04:57,18740 INFO  [LoopRunnable-0] ReplicasChangeHandler:176 - Replicas change pending, Topics: [my-topic-1976557641-236539828]
...
2024-07-13 08:10:11,05290 INFO  [LoopRunnable-0] ReplicasChangeHandler:184 - Replicas change ongoing, Topics: [my-topic-1976557641-236539828]

# CC logs
2024-07-13 08:05:09 WARN  OperationRunnable:50 - Received exception when trying to execute runnable for "Update Topic Configuration"
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There are only 0 valid windows when aggregating in range [-1, 1720857909207] for aggregation options (minValidEntityRatio=1.00, minValidEntityGroupRatio=0.00, minValidWindows=1, numEntitiesToInclude=68, granularity=ENTITY)
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.computeResult(GoalBasedOperationRunnable.java:167) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:157) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:73) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.OperationRunnable.run(OperationRunnable.java:45) ~[cruise-control-2.5.137.jar:?]
	at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.run(GoalBasedOperationRunnable.java:36) ~[cruise-control-2.5.137.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
	....

I opened a new PR which should help make the model generation faster.
If that doesn't work, we could try raising the timeout to 8 or 10 minutes.
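
For reference, raising that timeout would just mean passing a larger value to the wait helper that produced the failure above. A minimal sketch of the idea, using a generic stand-in for io.strimzi.test.TestUtils.waitFor; the constant names, the 10-minute value, and the readiness check are illustrative only, not the actual test code:

import java.time.Duration;
import java.util.function.BooleanSupplier;

public class ReplicaChangeWaitSketch {

    // Hypothetical constants: 10 minutes instead of the 300000 ms (5 minutes) seen in the failure above.
    private static final long REPLICA_CHANGE_TIMEOUT_MS = Duration.ofMinutes(10).toMillis();
    private static final long POLL_INTERVAL_MS = Duration.ofSeconds(5).toMillis();

    // Generic wait loop standing in for io.strimzi.test.TestUtils.waitFor; the real helper
    // throws io.strimzi.test.WaitException on timeout, a plain RuntimeException is used here.
    static void waitFor(String description, long pollIntervalMs, long timeoutMs, BooleanSupplier ready) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!ready.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new RuntimeException("Timeout after " + timeoutMs + " ms waiting for " + description);
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // Illustrative readiness check: in the real test this inspects the KafkaTopic status
        // and waits until the replicasChange section disappears.
        BooleanSupplier replicaChangeStatusGone = () -> true;
        waitFor("replica change status not present", POLL_INTERVAL_MS, REPLICA_CHANGE_TIMEOUT_MS, replicaChangeStatusGone);
    }
}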
