
bug: simple aggregation does not return #17036

Closed
rad-pat opened this issue Dec 11, 2024 · 25 comments
Labels
C-bug Category: something isn't working

Comments

@rad-pat

rad-pat commented Dec 11, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.670-nightly

What's Wrong?

A simple aggregation query does not return. If we remove just one field from the aggregation, it runs quickly. Extracting the data to another server and running the query there also seems to work, so I'm not sure why the query will not run where it is. I was advised to record logs at the INFO level; they are attached.

Please see the attached logs; the query of interest is id 1adebd66-611e-41eb-9f2c-3bbf93931d36.
databend-query-logs.txt

If you require any further details, please let me know.

How to Reproduce?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@zhang2014
Member

Possibly related (only in version 670): #17037

@rad-pat
Author

rad-pat commented Dec 11, 2024

> Possibly related (only in version 670): #17037

It was happening prior to 670; we upgraded to see if it had maybe been fixed.

@zhang2014
Member

> Possibly related (only in version 670): #17037
>
> It was happening prior to 670; we upgraded to see if it had maybe been fixed.

Please wait for the version released after the merge of #17037.

@rad-pat
Author

rad-pat commented Dec 12, 2024

> Possibly related (only in version 670): #17037
>
> It was happening prior to 670; we upgraded to see if it had maybe been fixed.
>
> Please wait for the version released after the merge of #17037.

I used version v1.2.672 with #17037; it still does not return. Please see the revised logs attached.
databend-query-logs12Dec2024.txt

@inviscid

inviscid commented Jan 8, 2025

@zhang2014 any chance this could get some attention on the priority list? :)

This is causing significant problems in workflows that involve any sizable amount of data.

@sundy-li
Member

sundy-li commented Jan 8, 2025

Can you reproduce it with set enable_experimental_aggregate_hashtable = 0?
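
For reference, one way to try that is to apply the setting and re-run the aggregation inside a single session. A minimal sketch, assuming bendsql is the client in use and that it accepts semicolon-separated statements on stdin; the table and column names are placeholders, not the actual failing query:

    # The setting is session-scoped, so run it together with the query in one session.
    # Replace the placeholder SELECT with the real aggregation that hangs.
    echo "SET enable_experimental_aggregate_hashtable = 0;
    SELECT some_key, count(*) FROM some_table GROUP BY some_key;" | bendsql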

@rad-pat
Author

rad-pat commented Jan 8, 2025

I will try this and report back

@rad-pat
Author

rad-pat commented Jan 9, 2025

I have tried that just now, and it doesn't seem to make a difference.
databend-query-logs-09Jan25.txt

@rad-pat
Author

rad-pat commented Jan 9, 2025

Does the message below mean that it is waiting for a task for which no workers are assigned?
WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 210.051349435s, active sync workers: 0

@rad-pat
Author

rad-pat commented Jan 9, 2025

Eventually aborted with these final logs:

5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.505018Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(170), name: "ExchangeSourceReader", elapsed: 1195.281296491s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.817455Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 1195.260041322s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.859804Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 1195.281849182s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222264Z  INFO databend_common_pipeline_core::finished_chain: finished_chain.rs:217 Executor apply finished callback state:
├──✅:4.34µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:711ns - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222308Z  INFO databend_query::pipelines::executor::pipeline_executor: pipeline_executor.rs:173 Pipeline executor finished, elapsed: 1200.001358518s
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222469Z ERROR databend_query::servers::flight::v1::packets::packet_data: packet_data.rs:93 Got error code data packet: AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable>
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.224702Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.224747Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("StatisticsReceiver")) received shutdown signal, start to shut down
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.225725Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.225740Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("query-ctx")) received shutdown signal, start to shut down
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226255Z  INFO databend_query::interpreters::common::query_log: query_log.rs:72 query: 5cb349b3-d3d0-4126-94c9-776c52e96cce becomes Aborted
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226281Z  INFO databend_common_pipeline_core::finished_chain: finished_chain.rs:217 Executor apply finished callback state:
├──❌:1.1879ms - src/query/service/src/servers/flight/v1/exchange/exchange_manager.rs:484:41
├──❌:0ns - (interrupt) src/query/service/src/servers/flight/v1/exchange/exchange_manager.rs:499:22
├──❌:0ns - (interrupt) src/query/service/src/interpreters/hook/compact_hook.rs:80:14
├──❌:0ns - (interrupt) src/query/service/src/interpreters/hook/refresh_hook.rs:61:14
├──✅:3.56µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:480ns - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:9.22µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──❌:411.29µs - (always) src/query/service/src/interpreters/interpreter.rs:119:14
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226484Z  INFO databend_query::pipelines::executor::pipeline_executor: pipeline_executor.rs:173 Pipeline executor finished, elapsed: 1200.004236593s
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228367Z  INFO databend_query::interpreters::common::query_log: query_log.rs:72 query: 5cb349b3-d3d0-4126-94c9-776c52e96cce becomes Aborted
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228504Z  INFO databend_query::servers::http::v1::query::execute_state: execute_state.rs:297 5cb349b3-d3d0-4126-94c9-776c52e96cce: http query changing state from Running to Stopped, reason Err(AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable> )
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228522Z  INFO databend_query::servers::http::v1::query::execute_state: execute_state.rs:326 5cb349b3-d3d0-4126-94c9-776c52e96cce: http query has change state to Stopped, reason Err(AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable> )
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228588Z  INFO databend_query::servers::http::v1::query::page_manager: page_manager.rs:195 http query reach end of blocks
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228668Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228690Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("query-ctx")) received shutdown signal, start to shut down

@sundy-li
Member

sundy-li commented Jan 9, 2025

We still can't figure out the reason; it would be better if you could provide us with a case to reproduce the issue.

@rad-pat
Author

rad-pat commented Jan 9, 2025

It is difficult to provide a reproduction case; I have tried previously by extracting the data to a development machine, and the query appeared to run fine there. However, I will try that again in case I get a different result and can demonstrate the failure. If not, is there any other useful information I can provide from this instance?

@zhang2014
Member

> I have tried that just now, and it doesn't seem to make a difference. databend-query-logs-09Jan25.txt

It appears to be caused by other nodes. Could you provide the logs from all nodes in this cluster?

@rad-pat
Author

rad-pat commented Jan 9, 2025

Those are consolidated logs containing entries from both query pods; unfortunately, they do not indicate which node produced each entry. I will gather the individual logs.
databend-query-node-0.txt
databend-query-node-1.txt

@sundy-li
Member

sundy-li commented Jan 9, 2025

@rad-pat

Hi, we may have found some code that could be causing this issue. You can try two different approaches:

  1. Expand your cluster to three nodes and check whether the query works.
  2. Check the PR fix(cluster): try fix cluster agg hang #17235; you can build it from source and test it with the original cluster deployment (two nodes).

@rad-pat
Author

rad-pat commented Jan 9, 2025

Thanks @sundy-li, I will try Option 1 first as that is easy to try. Option 2 requires a little more work to get a build environment going; I don't suppose you could provide a link to a built Docker image of that PR?

@sundy-li
Member

sundy-li commented Jan 9, 2025

See the Makefile, which has:

build-in-docker:
	bash ./scripts/setup/run_build_tool.sh cargo build --target $(TARGET)
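
A minimal usage sketch, assuming the databend repository is checked out and Docker is available locally; the musl target triple below is an assumption, so substitute whichever target your deployment needs:

    # Build the binaries inside the project's build-tool container.
    # With cargo's default layout the output lands under target/<TARGET>/debug/.
    make build-in-docker TARGET=x86_64-unknown-linux-musl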

@rad-pat
Author

rad-pat commented Jan 9, 2025

Expanding to 3 nodes made the query work

@rad-pat
Author

rad-pat commented Jan 10, 2025

Apologies, I could not figure out how to build a custom query-service Docker image from the PR. There seem to be many interactions in the GitHub workflows/actions.
Is it possible for you to build a custom Docker image from the PR for me to try in our k8s environment, or to describe to me how to do it? In your cloud GitHub workflow, I can see it calls the build_linux action and then uses query.Dockerfile to create the image, but the process uploads/downloads artifacts somewhere, so it is a little too complicated for me to figure out easily.

@sundy-li
Member

sundy-li commented Jan 10, 2025

Hi, we found that it's not related to that PR. We will submit another PR in the coming days and give you an image to test this case.

Please keep the data unchanged; the bug is related to data distribution and is hard to reproduce.

@rad-pat
Author

rad-pat commented Jan 10, 2025

Great, thanks for the update. We found another place where this happened and, again, increasing from 2 to 3 query pods worked. I shall keep the test data in place and await a test image, thanks.

@sundy-li
Member

@rad-pat Please test with 2 nodes

PR: #17245
Docker musl image: sundyli/databend-query:agg
(see https://hub.docker.com/r/sundyli/databend-query/tags)
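
If it helps, swapping the test image into a Kubernetes deployment could look roughly like this; a sketch only, where the namespace, StatefulSet name, and container name are assumptions about your cluster rather than values from this thread:

    # Pull the test image named above.
    docker pull sundyli/databend-query:agg

    # Hypothetical resource and container names; adjust to your actual deployment.
    kubectl -n databend set image statefulset/databend-query \
      databend-query=sundyli/databend-query:agg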

@rad-pat
Author

rad-pat commented Jan 13, 2025

Thanks, I will test and report back 👍

@rad-pat
Author

rad-pat commented Jan 13, 2025

@sundy-li I performed the following test:
  1. Run the query with v1.2.680-nightly - the query failed to complete in 20 mins.
  2. Upgrade the image to sundyli/databend-query:agg - the query completed in 5 seconds.
  3. Downgrade to v1.2.680-nightly - the query failed to complete in 20 mins.
  4. Upgrade the image again to sundyli/databend-query:agg - the query completed in 5 seconds.

So, I think we can conclude that the revised image prevents the query failure in our case.

@sundy-li
Member

Please wait for the next official nightly release, because we fixed some memory leak bugs in cluster mode in #17252.
