
Setting max_threads_per_query = 12 leads to 99.9% CPU load for two threads on a 16-core box #1631

Closed
starinacool opened this issue Nov 26, 2023 · 8 comments
Labels
bug rel::6.3.0 Released in 6.3.0

Comments

@starinacool

Describe the bug
2 of 16 worker threads go to 99.9% CPU when I change max_threads_per_query from 10 to 12 on a 16-core box. Even after removing all workload from the server, these two threads keep consuming 99.9% CPU.
The server cannot be stopped with systemctl stop manticore; only kill -9 helps.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a 16-core, 32 GB box with an SSD and an RT index
  2. Load some data
  3. Change max_threads_per_query from 10 to 12 and restart
  4. Add workload

Expected behavior
All worker threads work normally.

Describe the environment:
Manticore 6.2.12 dc5144d@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)

Messages from log files:
[Sun Nov 26 06:31:34.042 2023] [634140] caught SIGTERM, shutting down
[Sun Nov 26 06:31:39.550 2023] [634140] WARNING: still 2 alive tasks during shutdown, after 5.508 sec
[Sun Nov 26 06:31:39.701 2023] [634153] rt: table listing_finished: ramchunk saved in 0.150 sec

Additional context
Config:
optimize_cutoff = 8
max_threads_per_query = 10
access_doclists=mmap
access_hitlists=mmap
network_timeout = 20
client_timeout = 300
seamless_rotate = 1
unlink_old = 1
max_packet_size = 64M
max_filter_values = 65535
listen_backlog = 255
max_batch_queries = 32
subtree_docs_cache = 16M
subtree_hits_cache = 32M
binlog_flush = 2
binlog_max_log_size = 128M
expansion_limit = 100
query_log_format = sphinxql
collation_server = utf8_general_ci
collation_libc_locale = ru_RU.UTF-8
query_log_min_msec = 200
predicted_time_costs = doc=64, hit=48, skip=2048, match=64

@sanikolaev
Collaborator

Even after removing all workload from the server, these two threads keep consuming 99.9% CPU.

Please show the following at this moment:

  • top
  • vmstat 5 during a minute
  • show threads option format=all
  • select * from @@system.sessions
  • show status
  • searchd log
  • query log
  • show table <name> status of your table(s)
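For reference, the SQL-level items above can be collected in a single session over the MySQL protocol (assuming the default SphinxQL listener on port 9306); a minimal sketch, using the table name listing_finished seen in the log above as the example table:

SHOW THREADS OPTION format=all;
SELECT * FROM @@system.sessions;
SHOW STATUS;
SHOW TABLE listing_finished STATUS;

top and vmstat 5 are run from the shell on the same box while the two threads are spinning.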

@sanikolaev sanikolaev added the waiting Waiting for the original poster (in most cases) or something else label Nov 26, 2023
@Korkman

Korkman commented Jan 10, 2024

I have observed a similar failure: workers would go to 100% CPU and the connection to the client would break (the client receives no response).

They were processing Sphinx-protocol requests querying a text field "all_childs", which can contain the words "child_1", "child_2", ... up to "child_18". These were the hanging queries I had to kill -9:

@(all_childs) child_4 | child_5
@(all_childs) child_4 | child_5 @(all_childs) child_4 | child_5 (yes, duplicate expression)
@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18
@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18
@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18

RT indices were present, but the queries ran against a non-RT index.
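For reference, the equivalent of the first hanging query issued over SphinxQL would look roughly like this (a sketch only; the table name my_index is a placeholder, since the original queries arrived over the Sphinx API protocol):

SELECT id FROM my_index WHERE MATCH('@(all_childs) child_4 | child_5');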

strace started from htop showed no syscall activity on the crashed worker processes.

These unspecific queries match a good portion of 230k documents. Other, more specific queries did not crash.

After reading this issue, I set max_threads_per_query = 4 to lower my threads per query. No failing workers so far.
UPDATE: this setting did not fix the issue for me.

Hardware: AMD Ryzen 9 5950X 16-Core Processor, 128 GB RAM
OS: Debian Bookworm within a KVM VM, 16 vcores assigned (hyperthreading is enabled, so this is 16 of 32 possible vcores) and 16 GB RAM
Config:

max_connections = 100
expansion_limit = 500
seamless_rotate = 1
collation_libc_locale = de_DE.UTF-8
network_timeout = 5m
qcache_max_bytes = 0

searchd: Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)


@sanikolaev
Collaborator

@Korkman if you can stably reproduce it by running one of the @(all_childs) queries, could you share your table files and your config with us by sending them to our write-only S3 storage - https://manual.manticoresearch.com/Reporting_bugs#Uploading-your-data ? If we can reproduce this issue on our side, we'll be able to fix it.

@tomatolog
Contributor

Could you try the head of the dev version? It includes fixes to the CPU limiting during full-text queries.

@Korkman

Korkman commented Jan 10, 2024

@tomatolog @sanikolaev 6.2.13 a2af06ca3@240110 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204) (knn 2.2.5 1d1e432@231204) seems to work fine.

@tomatolog Would a workaround be possible in 6.2.12 or can this only be fixed with the release of 6.2.13?

@tomatolog
Contributor

You could set max_threads_per_query for the full-text queries with multiple OR terms to keep CPU under control on 6.2.12, or use 6.2.13, since the dev version will soon be released into the main repository.
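A minimal sketch of that per-query workaround on 6.2.12, assuming this refers to Manticore's per-query threads option (the table name my_index is a placeholder and the limit value is illustrative):

SELECT id FROM my_index
WHERE MATCH('@(all_childs) child_1 | child_2 | child_3 | child_16 | child_17 | child_18')
OPTION threads=1;

Alternatively, the instance-wide max_threads_per_query directive in the searchd config section can be lowered, as tried above with max_threads_per_query = 4.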

@sanikolaev
Collaborator

seems to work fine.

Thanks. I'm closing this issue then.

@starinacool feel free to reopen in case it doesn't work for you in the dev version or the upcoming release.

@sanikolaev sanikolaev added rel::upcoming Upcoming release and removed waiting Waiting for the original poster (in most cases) or something else labels Jan 12, 2024
@sanikolaev sanikolaev added the bug label Feb 7, 2024
@sanikolaev sanikolaev added rel::6.3.0 Released in 6.3.0 and removed rel::upcoming Upcoming release labels May 23, 2024