
bug: simple aggregation does not return #17036

Closed
rad-pat opened this issue Dec 11, 2024 · 25 comments
Labels
C-bug Category: something isn't working

Comments

@rad-pat

rad-pat commented Dec 11, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.670-nightly

What's Wrong?

A simple aggregation query does not return. If we remove just one field from the aggregation, it runs quickly. Extracting the data to another server and running the query there also seems to work, so I'm not sure why the query will not run where it is. I was advised to record logs at the INFO level; they are attached.

Please see the attached logs; the query of interest is id 1adebd66-611e-41eb-9f2c-3bbf93931d36.
databend-query-logs.txt

If you require any further details, please let me know.

How to Reproduce?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@zhang2014
Member

Possibly related (only in version 670): #17037

@rad-pat
Author

rad-pat commented Dec 11, 2024

> Possibly related (only in version 670): #17037

It was happening prior to 670; we upgraded to see if it had maybe been fixed.

@zhang2014
Member

> Possibly related (only in version 670): #17037
>
> It was happening prior to 670; we upgraded to see if it had maybe been fixed.

Please wait for the version released after the merge of #17037.

@rad-pat
Author

rad-pat commented Dec 12, 2024

> Possibly related (only in version 670): #17037
>
> It was happening prior to 670; we upgraded to see if it had maybe been fixed.
>
> Please wait for the version released after the merge of #17037.

I used version v1.2.672 with #17037; it still does not return. Please see the revised logs attached.
databend-query-logs12Dec2024.txt

@inviscid

inviscid commented Jan 8, 2025

@zhang2014 any chance this could get some attention on the priority list? :)

This is causing significant problems in workflows that involve any sizable amount of data.

@sundy-li
Member

sundy-li commented Jan 8, 2025

Can you reproduce it with set enable_experimental_aggregate_hashtable = 0?
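
For reference, one way to try that is to apply the setting and re-run the aggregation inside a single session. A minimal sketch, assuming bendsql is the client in use and that it accepts semicolon-separated statements on stdin; the table and column names are placeholders, not the actual failing query:

    # The setting is session-scoped, so run it together with the query in one session.
    # Replace the placeholder SELECT with the real aggregation that hangs.
    echo "SET enable_experimental_aggregate_hashtable = 0;
    SELECT some_key, count(*) FROM some_table GROUP BY some_key;" | bendsql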

@rad-pat
Author

rad-pat commented Jan 8, 2025

I will try this and report back

@rad-pat
Author

rad-pat commented Jan 9, 2025

I have tried that just now, and it doesn't seem to make a difference.
databend-query-logs-09Jan25.txt

@rad-pat
Author

rad-pat commented Jan 9, 2025

Does the message below mean that it is waiting for a task for which no workers are assigned?
WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 210.051349435s, active sync workers: 0

@rad-pat
Author

rad-pat commented Jan 9, 2025

Eventually aborted with these final logs:

5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.505018Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(170), name: "ExchangeSourceReader", elapsed: 1195.281296491s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.817455Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 1195.260041322s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.859804Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 1195.281849182s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222264Z  INFO databend_common_pipeline_core::finished_chain: finished_chain.rs:217 Executor apply finished callback state:
├──✅:4.34µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:711ns - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222308Z  INFO databend_query::pipelines::executor::pipeline_executor: pipeline_executor.rs:173 Pipeline executor finished, elapsed: 1200.001358518s
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222469Z ERROR databend_query::servers::flight::v1::packets::packet_data: packet_data.rs:93 Got error code data packet: AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable>
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.224702Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.224747Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("StatisticsReceiver")) received shutdown signal, start to shut down
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.225725Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.225740Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("query-ctx")) received shutdown signal, start to shut down
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226255Z  INFO databend_query::interpreters::common::query_log: query_log.rs:72 query: 5cb349b3-d3d0-4126-94c9-776c52e96cce becomes Aborted
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226281Z  INFO databend_common_pipeline_core::finished_chain: finished_chain.rs:217 Executor apply finished callback state:
├──❌:1.1879ms - src/query/service/src/servers/flight/v1/exchange/exchange_manager.rs:484:41
├──❌:0ns - (interrupt) src/query/service/src/servers/flight/v1/exchange/exchange_manager.rs:499:22
├──❌:0ns - (interrupt) src/query/service/src/interpreters/hook/compact_hook.rs:80:14
├──❌:0ns - (interrupt) src/query/service/src/interpreters/hook/refresh_hook.rs:61:14
├──✅:3.56µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:480ns - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:9.22µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──❌:411.29µs - (always) src/query/service/src/interpreters/interpreter.rs:119:14
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226484Z  INFO databend_query::pipelines::executor::pipeline_executor: pipeline_executor.rs:173 Pipeline executor finished, elapsed: 1200.004236593s
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228367Z  INFO databend_query::interpreters::common::query_log: query_log.rs:72 query: 5cb349b3-d3d0-4126-94c9-776c52e96cce becomes Aborted
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228504Z  INFO databend_query::servers::http::v1::query::execute_state: execute_state.rs:297 5cb349b3-d3d0-4126-94c9-776c52e96cce: http query changing state from Running to Stopped, reason Err(AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable> )
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228522Z  INFO databend_query::servers::http::v1::query::execute_state: execute_state.rs:326 5cb349b3-d3d0-4126-94c9-776c52e96cce: http query has change state to Stopped, reason Err(AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable> )
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228588Z  INFO databend_query::servers::http::v1::query::page_manager: page_manager.rs:195 http query reach end of blocks
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228668Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228690Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("query-ctx")) received shutdown signal, start to shut down

@sundy-li
Member

sundy-li commented Jan 9, 2025

We still can't figure out the reason; it would be better if you could provide us with a case to reproduce the issue.

@rad-pat
Author

rad-pat commented Jan 9, 2025

It is difficult to provide a reproduction case; I have tried previously by extracting the data to a development machine, and the query appeared to run fine there. However, I will try that again in case I get a different result and can demonstrate the failure. If not, is there any other useful information I can provide from this instance?

@zhang2014
Member

> I have tried that just now, and it doesn't seem to make a difference. databend-query-logs-09Jan25.txt

It appears to be caused by other nodes. Could you provide the logs from all nodes in this cluster?

@rad-pat
Author

rad-pat commented Jan 9, 2025

Those are consolidated logs containing entries from both query pods; unfortunately, they do not indicate which node produced each entry. I will gather the individual logs.
databend-query-node-0.txt
databend-query-node-1.txt

@sundy-li
Member

sundy-li commented Jan 9, 2025

@rad-pat

Hi, we may have found some code that could be causing this issue. You can try two different approaches:

  1. Expand your cluster to three nodes and check whether the query works.
  2. Check the PR fix(cluster): try fix cluster agg hang #17235; you can build it from source and test it with the original cluster deployment (two nodes).

@rad-pat
Author

rad-pat commented Jan 9, 2025

Thanks @sundy-li, I will try Option 1 first as that is easy to try. Option 2 requires a little more work to get a build environment going; I don't suppose you could provide a link to a built Docker image of that PR?

@sundy-li
Member

sundy-li commented Jan 9, 2025

See the Makefile, which has:

build-in-docker:
	bash ./scripts/setup/run_build_tool.sh cargo build --target $(TARGET)
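
A minimal usage sketch, assuming the databend repository is checked out and Docker is available locally; the musl target triple below is an assumption, so substitute whichever target your deployment needs:

    # Build the binaries inside the project's build-tool container.
    # With cargo's default layout the output lands under target/<TARGET>/debug/.
    make build-in-docker TARGET=x86_64-unknown-linux-musl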

@rad-pat
Author

rad-pat commented Jan 9, 2025

Expanding to 3 nodes made the query work

@rad-pat
Author

rad-pat commented Jan 10, 2025

Apologies, I could not figure out how to build a custom query-service Docker image from the PR. There seem to be many interactions in the GitHub workflows/actions.
Is it possible for you to build a custom Docker image from the PR for me to try in our k8s environment, or to describe to me how to do it? In your cloud GitHub workflow, I can see it calls the build_linux action and then uses query.Dockerfile to create the image, but the process uploads/downloads artifacts somewhere, so it is a little too complicated for me to figure out easily.

@sundy-li
Member

sundy-li commented Jan 10, 2025

Hi, we found that it's not related to that PR. We will submit another PR in the coming days and give you an image to test this case.

Please keep the data unchanged; the bug is related to data distribution and is hard to reproduce.

@rad-pat
Author

rad-pat commented Jan 10, 2025

Great, thanks for the update. We found another place where this happened and, again, increasing from 2 to 3 query pods worked. I shall keep the test data in place and await a test image, thanks.

@sundy-li
Member

@rad-pat Please test with 2 nodes

PR: #17245
Docker musl image: sundyli/databend-query:agg
(see https://hub.docker.com/r/sundyli/databend-query/tags)
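
If it helps, swapping the test image into a Kubernetes deployment could look roughly like this; a sketch only, where the namespace, StatefulSet name, and container name are assumptions about your cluster rather than values from this thread:

    # Pull the test image named above.
    docker pull sundyli/databend-query:agg

    # Hypothetical resource and container names; adjust to your actual deployment.
    kubectl -n databend set image statefulset/databend-query \
      databend-query=sundyli/databend-query:agg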

@rad-pat
Author

rad-pat commented Jan 13, 2025

Thanks, I will test and report back 👍

@rad-pat
Author

rad-pat commented Jan 13, 2025

@sundy-li I performed the following test:
  1. Run the query with v1.2.680-nightly - the query failed to complete in 20 mins.
  2. Upgrade the image to sundyli/databend-query:agg - the query completed in 5 seconds.
  3. Downgrade to v1.2.680-nightly - the query failed to complete in 20 mins.
  4. Upgrade the image again to sundyli/databend-query:agg - the query completed in 5 seconds.

So, I think we can conclude that the revised image prevents the query failure in our case.

@sundy-li
Member

Please wait for the next official nightly release, because we fixed some memory leak bugs in cluster mode in #17252.
