Searchd 6.2.12 thread hangs when executing a SELECT query #1774

Closed
kzakhark opened this issue Jan 30, 2024 · 7 comments

kzakhark commented Jan 30, 2024

Describe the bug

Searchd 6.2.12 thread hangs indefinitely when executing a SELECT query and consumes 100% of a CPU core. The query cannot be killed with the kill CLI command; searchd must be killed with system tools to clear the stuck thread. Index checks run fine and show no issues.
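
For anyone trying to confirm which searchd thread is spinning, one possible approach on Linux is to compare per-thread CPU ticks from `/proc/<pid>/task/<tid>/stat` over a short interval. The sketch below is only an illustration and is not what produced the attached files; the helper names (`parse_stat_line`, `sample_ticks`, `busiest_threads`) are made up for this example, and it relies on `utime` and `stime` being fields 14 and 15 of the stat line, as documented in proc(5).

```python
import os
import time


def parse_stat_line(line):
    """Return (tid, comm, utime + stime) from one /proc/.../stat line.

    The thread name may itself contain spaces or parentheses, so split on
    the LAST ')' instead of splitting the whole line on whitespace.
    """
    head, _, tail = line.rpartition(")")
    tid_str, _, comm = head.partition("(")
    fields = tail.split()  # fields[0] is field 3 (state) of the stat line
    # utime is field 14 and stime is field 15, i.e. indexes 11 and 12 here.
    return int(tid_str), comm, int(fields[11]) + int(fields[12])


def sample_ticks(pid):
    """Map tid -> (comm, cumulative CPU ticks) for every thread of pid."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                t, comm, total = parse_stat_line(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
        ticks[t] = (comm, total)
    return ticks


def busiest_threads(pid, interval_s=1.0):
    """Return [(ticks_used, tid, comm), ...] sorted with the busiest first."""
    before = sample_ticks(pid)
    time.sleep(interval_s)
    after = sample_ticks(pid)
    deltas = [
        (after[t][1] - before[t][1], t, after[t][0])
        for t in after
        if t in before
    ]
    return sorted(deltas, reverse=True)
```

Calling `busiest_threads(<searchd pid>)` while the query is stuck should list the spinning thread first; its tid can then be matched against a gdb backtrace.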

To Reproduce
Steps to reproduce the behavior:

  1. Connect to searchd with MySQL client;
  2. Execute a specific query against our dataset:
SELECT id,length(concat(street,city)) as length,weight() FROM `coverage_autocomplete` WHERE MATCH('@(city,street) "karlstad"') AND number_and_letter='' AND country_code='SE' ORDER BY weight() DESC,length ASC LIMIT 1 OFFSET 0;
  3. The query hangs; the thread executing the query consumes 100% of one CPU core.
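
The steps above could be scripted so that a hang is reported instead of blocking the client forever. What follows is a minimal sketch under stated assumptions: `run_with_deadline` is a hypothetical helper, and `run_query` stands for any callable that opens a connection to searchd and executes the SELECT (for example through a MySQL client library). The sketch does not include that part and does not kill the stuck server-side thread; it only detects that the call did not return in time.

```python
import threading
import time


def run_with_deadline(run_query, timeout_s):
    """Run run_query() in a daemon thread and wait at most timeout_s seconds.

    Returns a (status, value) tuple:
      ("ok", result)    the call returned in time
      ("error", exc)    the call raised an exception
      ("hung", elapsed) the call was still running when the deadline passed
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = run_query()
        except Exception as exc:  # report client-side errors to the caller
            outcome["error"] = exc

    started = time.monotonic()
    # daemon=True so that a stuck call cannot keep the interpreter alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        return ("hung", time.monotonic() - started)
    if "error" in outcome:
        return ("error", outcome["error"])
    return ("ok", outcome["result"])
```

A wrapper like this makes it practical to run the query in a loop or from cron and log the moment it first stops returning.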

Expected behavior
The query is expected to complete and return results in finite time. We have another system with identical OS and software versions and a similarly sized index (with somewhat different data), where the same query returns results very quickly without any issues.

Describe the environment:

  • Manticore Search version:
Server version: 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822) git branch manticore-6.2.12...origin/manticore-6.2.12
  • OS version (uname -a if on a Unix-like system):
Linux systemname 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

CentOS Linux release 7.9.2009 (Core)
Derived from Red Hat Enterprise Linux 7.9 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Messages from log files:
There's nothing relevant in searchd.log and query.log.

Additional context
I'll provide additional information separately, as some files are too large to upload to GitHub.
threads-output.txt
lsof-output.txt
gdb-output.txt

@kzakhark
Author

I've uploaded additional data to s3.manticoresearch.com, manticore/write-only/issue-1774:

core.26109.gz - searchd process core dump;
coverage_autocomplete.tar.gz - complete tarball of the data against which we run the query;
Manticore.sql.gz - SQL dump of the above data.

@kzakhark
Author

Interestingly, after the first SELECT thread hangs, the same query issued repeatedly from a new connection works correctly:

| 32000 | work_0     | mysql | query | 127.0.0.1:43210 |   2986 | 1491.909407 | 31s       |    117316 | 25m           | No (working) | Mini Mini Mini Query Conn  | 5 ch 0: 5 ch 2: api-search query="@(city,street) "karlstad"" comment="" index="coverage_autocomplete" SELECT id,length(concat(street,city)) as length,weight() FROM `coverage_autocomplete` WHERE MATCH('@(city,street) "karlstad"') AND number_and_letter='' AND country_code='SE' ORDER BY weight() DESC,length ASC LIMIT 1 OFFSET 0 |
...
MySQL [(none)]> SELECT id,length(concat(street,city)) as length,weight() FROM `coverage_autocomplete` WHERE MATCH('@(city,street) "karlstad"') AND number_and_letter='' AND country_code='SE' ORDER BY weight() DESC,length ASC LIMIT 1 OFFSET 0;
+---------+--------+----------+
| id      | length | weight() |
+---------+--------+----------+
| 2147906 |     24 |     1625 |
+---------+--------+----------+
1 row in set (0.002 sec)

@sanikolaev
Collaborator

Would you like to try to reproduce it in the latest dev version (https://mnt.cr/nightly)?

sanikolaev added the bug and rel::upcoming (Upcoming release) labels Jan 30, 2024
@kzakhark
Author

We only face the issue on a production system, where we'd rather not use nightly builds unless absolutely necessary. We're unable to reproduce the issue on a test system, although the test system doesn't have the same load as the production environment.

@kzakhark
Author

kzakhark commented Feb 5, 2024

An update: unfortunately, we were unable to reproduce the issue on our test systems, including an exact clone of the affected system. After we rebooted the affected system, the issue disappeared, and we have not been able to reproduce it in the last few days. It is unclear what caused specific queries to get stuck, as the affected system didn't show any abnormal readings or other problems, but it is no longer happening after the reboot.

kzakhark closed this as completed Feb 5, 2024
@sanikolaev
Collaborator

BTW, I tried to reproduce this issue on our side on 6.2.12 by sending the SELECT query mentioned above with some concurrency, alongside a few threads doing concurrent inserts/updates, and it didn't hang or crash.
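
A load test of that kind could be sketched roughly as follows. This is an illustrative harness, not the script actually used: `hammer` and its parameters are hypothetical names, and `run_query` is assumed to be any callable that sends one query to searchd (the SELECT, or an insert/update for the writer side). Workers are daemon threads, so a worker stuck on a hung query is counted as stuck instead of blocking the test from finishing.

```python
import threading
import time
from collections import Counter


def hammer(run_query, workers=8, per_worker=100, timeout_s=30.0):
    """Call run_query() from several threads and summarise the outcome.

    Returns a dict with "ok" and "error" call counts (when non-zero) plus
    "stuck_workers": threads still running when timeout_s expired.
    """
    counts = Counter()
    lock = threading.Lock()

    def loop():
        for _ in range(per_worker):
            try:
                run_query()
                key = "ok"
            except Exception:
                key = "error"
            with lock:
                counts[key] += 1

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()

    # One shared deadline for all workers, not timeout_s per worker.
    deadline = time.monotonic() + timeout_s
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    with lock:
        counts["stuck_workers"] = sum(t.is_alive() for t in threads)
        return dict(counts)
```

Running one `hammer` call for the SELECT and another for writes in parallel would approximate the mixed load described above; a non-zero `stuck_workers` would point at a hang.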

sanikolaev removed the rel::upcoming (Upcoming release) label Feb 5, 2024
@kzakhark
Author

kzakhark commented Feb 5, 2024

Same here: we were unable to reproduce the issue even at a rather high request rate with high concurrency. We're rather puzzled, as it's unclear what could have been the culprit.
