
Replicated fetches timeouts #20088

Merged
merged 9 commits into from
Mar 30, 2021

Conversation

nvartolomei
Contributor

The test is a bit lousy. I wrote it to confirm that the timeouts work and achieve the behavior I was looking for.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow customizing timeouts for HTTP connections used for replication independently from other HTTP timeouts.

Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Feb 4, 2021
@nvartolomei
Contributor Author

Why would it timeout at DROP TABLE? 🤔

@nvartolomei
Contributor Author

nvartolomei commented Feb 5, 2021

2021.02.05 01:06:38.857000 [ 36 ] {} <Error> DatabaseCatalog: Cannot drop table default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20). Will retry later.: Code: 242, e.displayText() = DB::Exception: Can't drop readonly replicated table (need to drop data in ZooKeeper as well)

It seems the ZK connections timed out and never reconnected after the partition manager "healed" the network.

🤔

2021.02.05 01:06:38.319840 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Debug> executeQuery: (from 172.18.0.1:47056, using production parser) DROP TABLE t SYNC
2021.02.05 01:06:38.320176 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> ContextAccess (default): Access granted: DROP TABLE ON default.t
2021.02.05 01:06:38.840110 [ 22 ] {} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Execution took 23476 ms.
2021.02.05 01:06:38.840280 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Restarting thread finished
2021.02.05 01:06:38.841237 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Waiting for threads to finish
2021.02.05 01:06:38.841309 [ 32 ] {} <Information> default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20): Became leader
2021.02.05 01:06:38.841396 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Information> default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20): Stopped being leader
2021.02.05 01:06:38.842423 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Trace> default.t (ReplicatedMergeTreeRestartingThread): Threads finished
2021.02.05 01:06:38.842682 [ 11 ] {596c9b42-4a2f-4e2c-ba26-3232faeaf53a} <Debug> DatabaseCatalog: Waiting for table bb18f622-0fc3-464f-b653-0b358c2f1c20 to be finally dropped
2021.02.05 01:06:38.842721 [ 36 ] {} <Information> DatabaseCatalog: Have 1 tables in drop queue (0 of them are in use), will try drop default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20)
2021.02.05 01:06:38.857000 [ 36 ] {} <Error> DatabaseCatalog: Cannot drop table default.t (bb18f622-0fc3-464f-b653-0b358c2f1c20). Will retry later.: Code: 242, e.displayText() = DB::Exception: Can't drop readonly replicated table (need to drop data in ZooKeeper as well)

node3.query("SYSTEM START FETCHES t")

# Wait some time to give a chance for node3 to try replicating without success from node1.
time.sleep(10)
Member

Long sleeps in tests make me sad :(

Can we create a small loop, like below, which reads system.replication_queue and checks whether it has a fetch with an exception?
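A minimal sketch of such a polling loop (assuming the standard integration-test node fixture with a `query` method; the helper name and exact query are illustrative, not from the PR):

```python
import time


def wait_for_fetch_exception(node, table, timeout=300):
    """Poll system.replication_queue until a fetch entry reports an exception."""
    deadline = time.time() + timeout
    while True:
        # last_exception is non-empty once a fetch attempt has failed.
        last_exception = node.query(
            "SELECT last_exception FROM system.replication_queue "
            f"WHERE table = '{table}' AND type = 'GET_PART' AND last_exception != ''"
        )
        if last_exception.strip():
            return last_exception
        if time.time() > deadline:
            raise TimeoutError(f"No failed fetch observed for table {table}")
        time.sleep(0.1)
```

This replaces the fixed `time.sleep(10)` with a check that returns as soon as the expected failure is visible.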

Contributor Author

That would be much better indeed.

if parts_fetched == "5":
break

time.sleep(3)
Member

Sleep in a loop is much better, but I would rather use a small timeout, like 0.1s.
Also, maybe we can use an infinite loop and throw if, say, 10 seconds have passed.

Contributor Author

An infinite loop sounds better; my bad that I copy/pasted this from another test (...replication limits...).
Not sure 10 seconds is enough for all test environments; CI is sometimes slow. I think it should be fine to loop for the test timeout, 300s in the worst case...

Member

300s is also ok
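Putting the agreed pieces together (0.1s sleep, 300s deadline), the loop above could be reshaped along these lines; the part-counting query is an assumption, not taken from the PR:

```python
import time


def wait_for_parts_fetched(node, table, expected, timeout=300):
    """Retry with a short sleep, failing loudly once `timeout` seconds pass."""
    deadline = time.time() + timeout
    while True:
        # Hypothetical check; the real test may count fetched parts differently.
        parts_fetched = node.query(
            f"SELECT count() FROM system.parts WHERE table = '{table}' AND active"
        ).strip()
        if parts_fetched == str(expected):
            return
        if time.time() > deadline:
            raise TimeoutError(
                f"Expected {expected} parts in {table}, got {parts_fetched}"
            )
        time.sleep(0.1)
```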

@KochetovNicolai KochetovNicolai self-assigned this Feb 8, 2021
Contributor Author

@nvartolomei nvartolomei left a comment

Rewrote the test. Let's see how CI likes it.

@@ -80,6 +80,9 @@ struct Settings;
M(UInt64, replicated_max_parallel_fetches_for_host, DEFAULT_COUNT_OF_HTTP_CONNECTIONS_PER_ENDPOINT, "Limit parallel fetches from endpoint (actually pool size).", 0) \
M(UInt64, replicated_max_parallel_sends, 0, "Limit parallel sends.", 0) \
M(UInt64, replicated_max_parallel_sends_for_table, 0, "Limit parallel sends for one table.", 0) \
M(Seconds, replicated_fetches_http_connection_timeout, 0, "HTTP connection timeout for part fetch requests. Inherited from default profile `http_connection_timeout` if not set explicitly.", 0) \
Contributor Author

Can someone please double-check whether this is true? I tried to re-read the code, and it seems that there is no way to actually change timeouts in a non-query context.

Member

It should be possible to change it in default profile (or in system profile if it is set).
I did not check it, though.
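As a sketch of what that would look like (untested, per the comment above): the inherited value comes from `http_connection_timeout` in the default profile in users.xml, while the new MergeTree setting can be overridden server-wide in the `merge_tree` section of config.xml. Values are in seconds.

```xml
<!-- users.xml: profile value the new setting inherits when not set explicitly -->
<profiles>
    <default>
        <http_connection_timeout>2</http_connection_timeout>
    </default>
</profiles>

<!-- config.xml: explicit override for replication fetches only (sketch) -->
<merge_tree>
    <replicated_fetches_http_connection_timeout>1</replicated_fetches_http_connection_timeout>
</merge_tree>
```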

@alexey-milovidov alexey-milovidov self-assigned this Mar 29, 2021