-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-15793 Fix ZkMigrationIntegrationTest#testMigrateTopicDeletions #17004
Conversation
core/src/test/scala/integration/kafka/zk/ZkMigrationIntegrationTest.scala
Show resolved
Hide resolved
After some timeout adjustments and removing the 3.4 case, this is looking a lot more stable I did a 10x deflake run and all the tests passed |
This reverts commit 635896f.
cc @soarez @divijvaidya who originally reported this flakiness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @mumrah.
Many of the new timeout values are oddly specific. How did you determine them?
core/src/test/scala/integration/kafka/zk/ZkMigrationIntegrationTest.scala
Outdated
Show resolved
Hide resolved
core/src/test/scala/integration/kafka/zk/ZkMigrationIntegrationTest.scala
Outdated
Show resolved
Hide resolved
core/src/test/scala/integration/kafka/zk/ZkMigrationIntegrationTest.scala
Outdated
Show resolved
Hide resolved
@soarez, thanks for taking a look!
That's a trick I've used before when debugging timeout issues. Sometimes we don't get a good stacktrace or log message on timeouts, so having a unique value lets you correlate it back to a particular line. |
@mumrah having a look at the test report, it seems that all attempts at running this test are being skipped. If any of the replicas is offline during the deletion, the topic is kept under |
This reverts commit 6c997a3.
@soarez I never could figure out a reliable way to ensure there were pending deletions. You're suggestion could work, but I'd like to address that in another PR |
@mumrah makes sense. Seems that was maybe an unlucky run anyway, as only one execution was skipped in the latest run:
However, there are 6 failures, all hitting the same 5 minute timeout to delete topics:
Does this mean there's some race condition wherein after the migration the topics cannot be deleted? I'm wondering if the test should timeout earlier and retry the deletion. If the topics cannot be deleted at all after the migration, that would be a serious bug. |
At this point in the test, we are waiting for ZK or KRaft to finalize one of the pending deletions. We're not actually doing an explicit delete. I wonder if the 30s timeout on the listTopics call is too short (others are 60s). |
@soarez I may have found a fix. In https://github.com/apache/kafka/actions/runs/10729396763 I was able to see the "Timed out waiting for topics to be deleted" failure case. In the logs, it looks like the topic hadn't finished being created by the time I manually put the deletion in ZK. I've added a condition to wait on a full ISR before continuing, and the next run (both deflake x10 and this PR) were successful. I've kicked off both again, if they both pass I'd like to merge this 🤞 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @mumrah.
LGTM
…17004) Reviewers: Igor Soarez <[email protected]>, Ajit Singh <>
Increases a few timeouts and fixes some retry logic. Also remove the technically unsupported 3.4 MV.