Replicated fetches timeouts #20088
Why would it timeout at |
It seems the ZK connections then timed out and never reconnected after the partition manager "healed" the network. 🤔
node3.query("SYSTEM START FETCHES t")

# Wait some time to give a chance for node3 to try replicating without success from node1.
time.sleep(10)
Long sleeps in tests make me sad :(
Can we create a small loop, like below, which reads system.replication_queue
and checks whether it has a fetch with an exception?
That would be much better indeed.
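A minimal sketch of the check suggested above, not the code from this PR. `has_failed_fetch` and `run_query` are hypothetical names; `run_query` stands for any callable that takes SQL text and returns the result as a string (for example, a thin wrapper around `node3.query`). The column names and the `GET_PART` entry type are assumptions about `system.replication_queue` and should be verified against the server version in use.

```python
def has_failed_fetch(run_query, table):
    """Return True if the replication queue holds a fetch entry with an exception.

    run_query is a hypothetical callable: SQL text in, result string out.
    """
    # Assumed columns (table, type, last_exception) and entry type ('GET_PART').
    sql = (
        "SELECT count() FROM system.replication_queue "
        f"WHERE table = '{table}' AND type = 'GET_PART' AND last_exception != ''"
    )
    # The result is expected to be a single integer, possibly with a trailing newline.
    return int(run_query(sql).strip()) > 0
```

Passing the query function in as an argument keeps the check independent of any particular test harness, so it can be exercised with a stub.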
if parts_fetched == "5":
    break

time.sleep(3)
Sleeping in a loop is much better, but I would rather use a small timeout, like 0.1.
Also, maybe we can use an infinite loop and throw if, say, 10 seconds have passed.
An infinite loop sounds better; my bad for copy-pasting this from another test (...replication limits...).
I'm not sure 10 seconds is enough for all test environments, since CI is sometimes slow. I think it should be fine to loop for the test timeout of 300s in the worst case...
300s is also ok
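The pattern agreed on in this thread (short sleep interval, loop until a deadline, throw on expiry) could look roughly like this. It is an illustrative sketch, not the rewritten test; `wait_until` and its parameters are hypothetical, with defaults taken from the 0.1 and 300s values mentioned above.

```python
import time


def wait_until(predicate, timeout=300.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the deadline passes."""
    # A monotonic clock is not affected by wall-clock adjustments.
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

On a fast machine the loop returns almost immediately, while a slow CI run still gets the full timeout before the test fails.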
CI behaviour is different from my test environment
Rewrote the test. Let's see how CI likes it.
@@ -80,6 +80,9 @@ struct Settings;
M(UInt64, replicated_max_parallel_fetches_for_host, DEFAULT_COUNT_OF_HTTP_CONNECTIONS_PER_ENDPOINT, "Limit parallel fetches from endpoint (actually pool size).", 0) \
M(UInt64, replicated_max_parallel_sends, 0, "Limit parallel sends.", 0) \
M(UInt64, replicated_max_parallel_sends_for_table, 0, "Limit parallel sends for one table.", 0) \
M(Seconds, replicated_fetches_http_connection_timeout, 0, "HTTP connection timeout for part fetch requests. Inherited from default profile `http_connection_timeout` if not set explicitly.", 0) \
Can someone please double-check whether this is true? I tried to re-read the code, and it seems that there is no way to actually change timeouts in a non-query context.
It should be possible to change it in the default profile (or in the system profile, if one is set).
I did not check it, though.
The test is a bit lousy. Wrote it to confirm that timeouts work and achieve the behavior I was looking for.
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Allow customizing timeouts for HTTP connections used for replication independently from other HTTP timeouts.
Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/