Better distributed DDL queue cleanup #20448

Merged 5 commits on Feb 16, 2021

@tavplubix commented on Feb 12, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixed a race between the execution of distributed DDL tasks and the cleanup of the DDL queue. A DDL task can no longer be removed from ZooKeeper while it has active workers. Fixes #20016
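
The rule stated in the changelog entry can be illustrated with a minimal sketch. This is not the code from this PR: `TaskNode`, `canRemove`, and `cleanupQueue` are hypothetical names, and the queue is modeled as an in-memory map rather than real ZooKeeper nodes. It only shows the invariant that cleanup skips any task that still has active workers.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory stand-in for one task node in the DDL queue.
struct TaskNode
{
    std::set<std::string> active_workers;   // hosts currently executing the task
    std::set<std::string> finished_workers; // hosts that already reported a result
};

// A task may be removed only when no worker is executing it.
bool canRemove(const TaskNode & task)
{
    return task.active_workers.empty();
}

// Trim the oldest tasks (smallest keys first) while the queue holds more than
// max_tasks entries, skipping tasks that still have active workers.
// Returns the number of removed tasks.
std::size_t cleanupQueue(std::map<std::string, TaskNode> & queue, std::size_t max_tasks)
{
    std::size_t removed = 0;
    auto it = queue.begin();
    while (it != queue.end() && queue.size() > max_tasks)
    {
        if (canRemove(it->second))
        {
            it = queue.erase(it);
            ++removed;
        }
        else
        {
            ++it; // an active worker exists: leave the task in place
        }
    }
    return removed;
}
```

The sketch is single-threaded, so the check and the removal cannot interleave with a worker. Against a real coordination service the two steps would have to be made atomic in some way, which this model does not attempt to show.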

@robot-clickhouse added the pr-improvement label (Pull request with some product improvements) on Feb 12, 2021
@tavplubix

Functional stateless tests flaky check: Coordination::Exception: Operation timeout. This is some issue with TestKeeper; the subsequent failures are expected.
Stress test (thread): unrelated, probably fixed in #19516:

$ zgrep -Fa Fatal clickhouse-server.log.gz1
2021.02.15 17:24:38.141767 [ 374 ] {} <Fatal> BaseDaemon: (version 21.3.1.6024, build id: 2024D7E3F2212CCED45A2395529BD611334B8DFF) (from thread 4074) Terminate called for uncaught exception:
2021.02.15 17:24:38.558291 [ 22268 ] {} <Fatal> BaseDaemon: ########################################
2021.02.15 17:24:38.599318 [ 22268 ] {} <Fatal> BaseDaemon: (version 21.3.1.6024, build id: 2024D7E3F2212CCED45A2395529BD611334B8DFF) (from thread 4074) (query_id: 484b6f7c-c85e-4f6e-82b2-09871a9d9565) Received signal Aborted (6)
2021.02.15 17:24:38.798984 [ 22268 ] {} <Fatal> BaseDaemon: 
2021.02.15 17:24:38.929337 [ 22268 ] {} <Fatal> BaseDaemon: Stack trace: 0x7f1072dcf18b 0x7f1072dae859 0x89e992e 0x8c6a22d 0x17c2eb04 0x17c2ea07 0x8a7638b 0x8b9b8a5 0x8b9b8ca 0x1187d9a4 0x1187e35a 0x12e15435 0x12dffe3a 0x12e0346f 0x15a06350 0x15a42b53 0x15a4327f 0x15babc52 0x15baa1f0 0x15ba89f8 0x89e412d 0x7f1072f84609 0x7f1072eab293
2021.02.15 17:24:39.095939 [ 22268 ] {} <Fatal> BaseDaemon: 5. raise @ 0x4618b in /usr/lib/x86_64-linux-gnu/libc-2.31.so
2021.02.15 17:24:39.178763 [ 22268 ] {} <Fatal> BaseDaemon: 6. abort @ 0x25859 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
2021.02.15 17:25:06.846768 [ 369 ] {} <Fatal> Application: Child process was terminated by signal 6.

$ zgrep -Fa "[ 4074 ]" clickhouse-server.log.gz1 | tail -10
2021.02.15 17:22:54.551588 [ 4074 ] {55f928e1-410f-4e84-bef5-43fd3c05b10c} <Debug> test_9cxb35.t: Removing part from filesystem all_34_34_0
2021.02.15 17:22:55.412585 [ 4074 ] {55f928e1-410f-4e84-bef5-43fd3c05b10c} <Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
2021.02.15 17:22:55.446960 [ 4074 ] {} <Debug> TCPHandler: Processed in 3.117910787 sec.
2021.02.15 17:22:55.518182 [ 4074 ] {} <Debug> TCPHandler: Done processing connection.
2021.02.15 17:23:47.169652 [ 4074 ] {} <Trace> HTTPHandler-factory: HTTP Request for HTTPHandler-factory. Method: POST, Address: [::1]:39530, User-Agent: curl/7.68.0, Length: 28, Content Type: application/x-www-form-urlencoded, Transfer Encoding: identity, X-Forwarded-For: (none)
2021.02.15 17:23:47.175854 [ 4074 ] {} <Trace> DynamicQueryHandler: Request URI: /?database=test_qdgpir&log_comment=/usr/share/clickhouse-test/queries/0_stateless/00944_clear_index_in_partition.sh&enable_http_compression=1&http_zlib_compression_level=1
2021.02.15 17:23:47.260299 [ 4074 ] {484b6f7c-c85e-4f6e-82b2-09871a9d9565} <Debug> executeQuery: (from [::1]:39530, using production parser) (comment: /usr/share/clickhouse-test/queries/0_stateless/00944_clear_index_in_partition.sh) SELECT * FROM numbers(34534)
2021.02.15 17:23:47.261661 [ 4074 ] {484b6f7c-c85e-4f6e-82b2-09871a9d9565} <Trace> ContextAccess (default): Access granted: CREATE TEMPORARY TABLE ON *.*
2021.02.15 17:23:47.266168 [ 4074 ] {484b6f7c-c85e-4f6e-82b2-09871a9d9565} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2021.02.15 17:24:36.479027 [ 4074 ] {484b6f7c-c85e-4f6e-82b2-09871a9d9565} <Information> executeQuery: Read 34534 rows, 269.80 KiB in 49.197366973 sec., 701 rows/sec., 5.48 KiB/sec.

$ arr="0x7f1072dcf18b 0x7f1072dae859 0x89e992e 0x8c6a22d 0x17c2eb04 0x17c2ea07 0x8a7638b 0x8b9b8a5 0x8b9b8ca 0x1187d9a4 0x1187e35a 0x12e15435 0x12dffe3a 0x12e0346f 0x15a06350 0x15a42b53 0x15a4327f 0x15babc52 0x15baa1f0 0x15ba89f8 0x89e412d 0x7f1072f84609 0x7f1072eab293"
$ for a in ${arr}; do addr2line -aipsfC -e ./tmp/clickhouse1 $a; done
0x00007f1072dcf18b: ?? ??:0
0x00007f1072dae859: ?? ??:0
0x00000000089e992e: __interceptor_abort at ??:?
0x0000000008c6a22d: terminate_handler() at BaseDaemon.cpp:?
0x0000000017c2eb04: std::__terminate(void (*)()) at cxa_handlers.cpp:61
0x0000000017c2ea07: void (*std::__1::(anonymous namespace)::__libcpp_atomic_load<void (*)()>(void (* const*)(), int))() at atomic_support.h:78
 (inlined by) std::get_terminate() at cxa_handlers.cpp:49
 (inlined by) std::terminate() at cxa_handlers.cpp:92
0x0000000008a7638b: __clang_call_terminate at main.cpp:?
0x0000000008b9b8a5: DB::Memory<Allocator<false, false> >::~Memory() at BufferWithOwnMemory.h:48
 (inlined by) DB::BufferWithOwnMemory<DB::WriteBuffer>::~BufferWithOwnMemory() at BufferWithOwnMemory.h:137
 (inlined by) ~WriteBufferFromOStream at WriteBufferFromOStream.cpp:48
0x0000000008b9b8ca: ~WriteBufferFromOStream at WriteBufferFromOStream.cpp:44
0x000000001187d9a4: DB::BufferWithOwnMemory<DB::WriteBuffer>::~BufferWithOwnMemory() at BufferWithOwnMemory.h:137
 (inlined by) ~ZlibDeflatingWriteBuffer at ZlibDeflatingWriteBuffer.cpp:68
0x000000001187e35a: ~ZlibDeflatingWriteBuffer at ZlibDeflatingWriteBuffer.cpp:50
0x0000000012e15435: DB::WriteBufferFromHTTPServerResponse::finalize() at memory:?
0x0000000012dffe3a: DB::HTTPHandler::processQuery(DB::Context&, Poco::Net::HTTPServerRequest&, HTMLForm&, Poco::Net::HTTPServerResponse&, DB::HTTPHandler::Output&, std::__1::optional<DB::CurrentThread::QueryScope>&) at HTTPHandler.cpp:?
0x0000000012e0346f: DB::HTTPHandler::handleRequest(Poco::Net::HTTPServerRequest&, Poco::Net::HTTPServerResponse&) at HTTPHandler.cpp:763
0x0000000015a06350: Poco::Net::HTTPServerConnection::run() at AutoPtr.h:215
 (inlined by) Poco::Net::HTTPServerConnection::run() at HTTPServerConnection.cpp:90
0x0000000015a42b53: Poco::Net::TCPServerConnection::start() at TCPServerConnection.cpp:57
0x0000000015a4327f: Poco::Net::TCPServerDispatcher::run() at TCPServerDispatcher.cpp:?
0x0000000015babc52: Poco::ScopedLock<Poco::FastMutex>::ScopedLock(Poco::FastMutex&) at ScopedLock.h:36
 (inlined by) Poco::PooledThread::run() at ThreadPool.cpp:213
0x0000000015baa1f0: Poco::(anonymous namespace)::RunnableHolder::run() at Thread.cpp:56
0x0000000015ba89f8: Poco::ThreadImpl::runnableEntry(void*) at SharedPtr.h:277
 (inlined by) ?? at SharedPtr.h:156
 (inlined by) ?? at SharedPtr.h:208
 (inlined by) Poco::ThreadImpl::runnableEntry(void*) at Thread_POSIX.cpp:360
0x00000000089e412d: __tsan_thread_start_func at crtstuff.c:?
0x00007f1072f84609: ?? ??:0
0x00007f1072eab293: ?? ??:0
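
In the symbolized trace, `__clang_call_terminate` sits directly above `~WriteBufferFromOStream`, which suggests that an exception escaped a destructor. Destructors are implicitly `noexcept`, so such an exception ends in `std::terminate()`. The sketch below illustrates the general pattern of containing the exception inside the destructor. `FlushingBuffer` and its members are hypothetical names, not ClickHouse code, and this is not necessarily how #19516 addresses the problem.

```cpp
#include <exception>
#include <stdexcept>
#include <type_traits>

// Hypothetical buffer whose flush step can fail, e.g. when the peer has gone away.
struct FlushingBuffer
{
    bool fail_on_flush = false;
    bool * flush_failed = nullptr; // optional out-flag, set when flush() threw

    void flush()
    {
        if (fail_on_flush)
            throw std::runtime_error("cannot write to stream");
    }

    // The exception is handled here; if it escaped the destructor,
    // std::terminate() would be called.
    ~FlushingBuffer()
    {
        try
        {
            flush();
        }
        catch (const std::exception &)
        {
            if (flush_failed)
                *flush_failed = true;
        }
    }
};

// The destructor is noexcept even though nothing in its declaration says so.
static_assert(std::is_nothrow_destructible_v<FlushingBuffer>,
              "destructors are implicitly noexcept");
```

Swallowing the error in the destructor loses it unless it is recorded somewhere, which is why the sketch carries an out-flag. An explicit finalize step called before destruction is another common way to surface such failures.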

@tavplubix merged commit 68b427a into master on Feb 16, 2021
@tavplubix deleted the better_ddl_queue_cleanup branch on February 16, 2021 16:29
Successfully merging this pull request may close these issues.

Distributed ddl worker task loop forever for no znode error