
Replicated fetches timeouts #20088

Merged
merged 9 commits into from
Mar 30, 2021

Conversation

nvartolomei
Contributor

The test is a bit lousy. I wrote it to confirm that the timeouts work and achieve the behavior I was looking for.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow customizing timeouts for HTTP connections used for replication independently from other HTTP timeouts.

Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Feb 4, 2021
@nvartolomei
Contributor Author

Why would it timeout at DROP TABLE? 🤔

@nvartolomei
Contributor Author

nvartolomei commented Feb 5, 2021

2021.02.05 01:06:38.857000 [ 36 ] {} <Error> DatabaseCatalog: Cannot drop table default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20). Will retry later.: Code: 242, e.displayText() = DB::Exception: Can't drop readonly replicated table (need to drop data in ZooKeeper as well)

It seems the ZK connections timed out and never reconnected after the partition manager "healed" the network.

🤔

2021.02.05 01:06:38.319840 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Debug> executeQuery: (from 172.18.0.1:47056, using production parser) DROP TABLE t SYNC
2021.02.05 01:06:38.320176 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> ContextAccess (default): Access granted: DROP TABLE ON default.t
2021.02.05 01:06:38.840110 [ 22 ] {} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Execution took 23476 ms.
2021.02.05 01:06:38.840280 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Restarting thread finished
2021.02.05 01:06:38.841237 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Waiting for threads to finish
2021.02.05 01:06:38.841309 [ 32 ] {} <Information> default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20): Became leader
2021.02.05 01:06:38.841396 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Information> default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20): Stopped being leader
2021.02.05 01:06:38.842423 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Threads finished
2021.02.05 01:06:38.842682 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Debug> DatabaseCatalog: Waiting for table bb18f622-0fc3-464f-b653-0b358c2f1c20 to be finally dropped
2021.02.05 01:06:38.842721 [ 36 ] {} <Information> DatabaseCatalog: Have 1 tables in drop queue (0 of them are in use), will try drop default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20)
2021.02.05 01:06:38.857000 [ 36 ] {} <Error> DatabaseCatalog: Cannot drop table default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20). Will retry later.: Code: 242, e.displayText() = DB::Exception: Can't drop readonly replicated table (need to drop data in ZooKeeper as well)

node3.query("SYSTEM START FETCHES t")

# Wait some time to give a chance for node3 to try replicating without success from node1.
time.sleep(10)
Member

Long sleeps in tests make me sad :(

Can we create a small loop, like below, which reads system.replication_queue and checks whether it has a fetch with an exception?
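A minimal sketch of such a polling loop (assuming the standard integration-test node fixture with a `query` method; the helper name and exact query are illustrative, not from the PR):

```python
import time


def wait_for_fetch_exception(node, table, timeout=300):
    """Poll system.replication_queue until a fetch entry reports an exception."""
    deadline = time.time() + timeout
    while True:
        # last_exception is non-empty once a fetch attempt has failed.
        last_exception = node.query(
            "SELECT last_exception FROM system.replication_queue "
            f"WHERE table = '{table}' AND type = 'GET_PART' AND last_exception != ''"
        )
        if last_exception.strip():
            return last_exception
        if time.time() > deadline:
            raise TimeoutError(f"No failed fetch observed for table {table}")
        time.sleep(0.1)
```

This replaces the fixed `time.sleep(10)` with a check that returns as soon as the expected failure is visible.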

Contributor Author

That would be much better indeed.

if parts_fetched == "5":
break

time.sleep(3)
Member

Sleep in a loop is much better, but I would rather use a small timeout, like 0.1s.
Also, maybe we can use an infinite loop and throw if, say, 10 seconds have passed.

Contributor Author

An infinite loop sounds better; my bad that I copy/pasted this from another test (...replication limits...).
Not sure 10 seconds is enough for all test environments; CI is sometimes slow. I think it should be fine to loop for the test timeout, 300s in the worst case...

Member

300s is also ok
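Putting the agreed pieces together (0.1s sleep, 300s deadline), the loop above could be reshaped along these lines; the part-counting query is an assumption, not taken from the PR:

```python
import time


def wait_for_parts_fetched(node, table, expected, timeout=300):
    """Retry with a short sleep, failing loudly once `timeout` seconds pass."""
    deadline = time.time() + timeout
    while True:
        # Hypothetical check; the real test may count fetched parts differently.
        parts_fetched = node.query(
            f"SELECT count() FROM system.parts WHERE table = '{table}' AND active"
        ).strip()
        if parts_fetched == str(expected):
            return
        if time.time() > deadline:
            raise TimeoutError(
                f"Expected {expected} parts in {table}, got {parts_fetched}"
            )
        time.sleep(0.1)
```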

@KochetovNicolai KochetovNicolai self-assigned this Feb 8, 2021
Contributor Author

@nvartolomei nvartolomei left a comment

Rewrote the test. Let's see how CI likes it.

@@ -80,6 +80,9 @@ struct Settings;
M(UInt64, replicated_max_parallel_fetches_for_host, DEFAULT_COUNT_OF_HTTP_CONNECTIONS_PER_ENDPOINT, "Limit parallel fetches from endpoint (actually pool size).", 0) \
M(UInt64, replicated_max_parallel_sends, 0, "Limit parallel sends.", 0) \
M(UInt64, replicated_max_parallel_sends_for_table, 0, "Limit parallel sends for one table.", 0) \
M(Seconds, replicated_fetches_http_connection_timeout, 0, "HTTP connection timeout for part fetch requests. Inherited from default profile `http_connection_timeout` if not set explicitly.", 0) \
Contributor Author

Can someone please double-check whether this is true? I tried to re-read the code, and it seems that there is no way to actually change timeouts in a non-query context.

Member

It should be possible to change it in default profile (or in system profile if it is set).
I did not check it, though.
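As a sketch of what that would look like (untested, per the comment above): the inherited value comes from `http_connection_timeout` in the default profile in users.xml, while the new MergeTree setting can be overridden server-wide in the `merge_tree` section of config.xml. Values are in seconds.

```xml
<!-- users.xml: profile value the new setting inherits when not set explicitly -->
<profiles>
    <default>
        <http_connection_timeout>2</http_connection_timeout>
    </default>
</profiles>

<!-- config.xml: explicit override for replication fetches only (sketch) -->
<merge_tree>
    <replicated_fetches_http_connection_timeout>1</replicated_fetches_http_connection_timeout>
</merge_tree>
```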

@alexey-milovidov alexey-milovidov self-assigned this Mar 29, 2021