
Solved - FATAL: failed to create rotation thread #126

Closed
sshilko opened this issue Oct 30, 2018 · 17 comments

@sshilko

sshilko commented Oct 30, 2018

Are you in the right place?

For general questions, post a question to the Forum or Slack.

If you open a GitHub issue, it must be a bug or feature request, and you need to provide the information below.

Describe the environment

Amazon Linux 2

Manticore Search version:
2.7.3

OS version:
Linux ip-172-30-0-217.ec2.internal 4.14.62-70.117.amzn2.x86_64 #1 SMP Fri Aug 10 20:14:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Describe the problem

When using workers = thread_pool I get this error; with workers = threads everything is fine.
I can't reproduce it locally, only on AWS EC2 inside Docker.
There are no other containers, and no Docker settings make any difference (ports, memory, host network).

Description of the issue:

Steps to reproduce:

Messages from log files:
FATAL: failed to create rotation thread
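
For context, the relevant searchd option looks roughly like this (a sketch of the toggle only, not the full config from this report):

    ...
    # fails on this EC2 box:
    workers = thread_pool
    # works fine:
    # workers = threads
    ...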

@manticoresearch
Contributor

Hi. Are you using our Docker image and the default config, or something else?

@sshilko
Author

sshilko commented Oct 30, 2018

I'm using your build image for building, except it's alpine-3.7 instead of 3.6; then I take the binaries and move them into my own Alpine image plus my config. Yes, I'm in Slack now if we can chat.

@sshilko
Author

sshilko commented Oct 30, 2018

Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Manticore 2.7.3 ad9b22d0@180928 release
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Copyright (c) 2001-2016, Andrew Aksyonoff
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: Copyright (c) 2017-2018, Manticore Software LTD (http://manticoresearch.com)
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: using config file '/etc/sphinxsearch/sphinx.conf'...
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: WARNING: TCP fast open unavailable (can't read /proc/sys/net/ipv4/tcp_fastopen, unsuppo
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: listening on all interfaces, port=9312
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: listening on all interfaces, port=9212
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_0'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_0.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_0.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_1'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_1.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_1.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_2'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: crond: crond (busybox 1.27.2) started, log level 6
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_2.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_2.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precaching index 'stream_test_3'
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_3.spl
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: lock /var/lib/sphinx/stream_test_3.spl success
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: precached 4 indexes in 0.007 sec
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: 'binlog_max_log_size' - nothing specified, using default value 268435456
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: 'read_unhinted' - nothing specified, using default value 0
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: DEBUG: 'read_buffer' - nothing specified, using default value 0
Oct 30 15:59:07 ip-172-30-0-217.ec2.internal docker[10879]: FATAL: failed to create rotation thread

@sshilko
Author

sshilko commented Oct 30, 2018

Cured by

vm.overcommit_memory=1
instead of
vm.overcommit_memory=2

https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

P.S. Be careful with memory usage: even if it says 90% free, sphinx tries to allocate a lot on start, I guess.
The error message could be improved, though.
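
For reference, a minimal sketch of applying that setting on the host (the value is from this thread; persisting it via /etc/sysctl.d is just a suggestion):

    # switch to heuristic overcommit so the daemon can reserve memory up front
    sudo sysctl -w vm.overcommit_memory=1

    # optionally persist across reboots (the file name is arbitrary)
    echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/99-overcommit.conf
    sudo sysctl --system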

@sshilko sshilko changed the title FATAL: failed to create rotation thread Solved - FATAL: failed to create rotation thread Oct 30, 2018
@tomatolog
Contributor

What AWS instance type do you use? How much memory does it have?
What are your config and index sizes?

@sshilko
Author

sshilko commented Oct 31, 2018

AWS t2.small - VPC

              total        used        free      shared  buff/cache   available
Mem:           1993         218        1304          69         470        1418
Swap:             0           0           0

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
1e738da8XXXX        4.32%               144.6MiB / 1.947GiB   7.25%               0B / 0B             12.5MB / 71.2MB     4107


And this is a beta instance: it uses 144 MB while having 0% real load (not production); the production instance with a million rows uses only 11.29 MiB (docker ps) under heavy load (4 CPU, 8 GB RAM, ~50% load).

Changing workers from thread_pool to threads reduces memory usage to previous levels:

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
53948e4ca4a8        2.19%               4.727MiB / 1.947GiB   0.24%               0B / 0B             32.8kB / 12.3kB     12

@sshilko
Author

sshilko commented Oct 31, 2018

With workers = thread_pool there are 4105 PIDs; with workers = threads only ~10 PIDs.

@sshilko
Author

sshilko commented Oct 31, 2018

Reproducing the memory issue:

workers=threads
sysctl vm.overcommit_memory=2

[root@ip-172-30-0-217 ec2-user]# free -m
              total        used        free      shared  buff/cache   available
Mem:           1993         131        1444          69         417        1554

Indexes list:

/var/lib/sphinx $ ls -lhS
total 248
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_0.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_1.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_2.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_3.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_delta_0.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_delta_1.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_delta_2.sph
-rw-r--r--    1 dalek    dalek       2.0K Oct 31 10:06 stream_test_delta_3.sph
-rw-r--r--    1 dalek    dalek       1.1K Oct 31 10:06 stream_test_0.spa
-rw-r--r--    1 dalek    dalek        968 Oct 31 10:06 stream_test_1.spa
-rw-r--r--    1 dalek    dalek        968 Oct 31 10:06 stream_test_2.spa
-rw-r--r--    1 dalek    dalek        834 Oct 31 10:06 stream_test_3.sps
-rw-r--r--    1 dalek    dalek        792 Oct 31 10:06 stream_test_3.spa
-rw-r--r--    1 dalek    dalek        773 Oct 31 10:06 stream_test_2.sps
-rw-r--r--    1 dalek    dalek        754 Oct 31 10:06 stream_test_0.sps
-rw-r--r--    1 dalek    dalek        720 Oct 31 10:06 stream_test_0.spm
-rw-r--r--    1 dalek    dalek        692 Oct 31 10:06 stream_test_2.spm
-rw-r--r--    1 dalek    dalek        616 Oct 31 10:06 stream_test_delta_2.spa
-rw-r--r--    1 dalek    dalek        588 Oct 31 10:06 stream_test_1.spm

              total        used        free      shared  buff/cache   available
Mem:           1993         129        1446          69         417        1556

Everything started OK

Now changing to
workers=thread_pool
sysctl vm.overcommit_memory=2

It immediately fails (while having 1.4 GB free), with the indexes taking only a few megabytes:


              total        used        free      shared  buff/cache   available
Mem:           1993         120        1454          68         418        1565


Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Manticore 2.7.3 ad9b22d0@180928 release
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Copyright (c) 2001-2016, Andrew Aksyonoff
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: Copyright (c) 2017-2018, Manticore Software LTD (http://manticoresearch.com)
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: using config file '/etc/sphinxsearch/sphinx.conf'...
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: listening on all interfaces, port=9312
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: listening on all interfaces, port=9212
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_0'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_0.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_0.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_1'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_1.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_1.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_2'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_2.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: crond: crond (busybox 1.27.2) started, log level 6
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_2.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precaching index 'stream_test_3'
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: Locking the index via file /var/lib/sphinx/stream_test_3.spl
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: lock /var/lib/sphinx/stream_test_3.spl success
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: precached 4 indexes in 0.006 sec
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: 'binlog_max_log_size' - nothing specified, using default value 268435456
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: 'read_unhinted' - nothing specified, using default value 0
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: DEBUG: 'read_buffer' - nothing specified, using default value 0
Oct 31 10:12:00 ip-172-30-0-217.ec2.internal docker[41316]: FATAL: failed to create rotation thread
Oct 31 10:12:01 ip-172-30-0-217.ec2.internal systemd[1]: Failed to start Docker
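
Condensed, the reproduction above boils down to this (the container name is a placeholder, not from the original setup):

    # strict overcommit accounting on the EC2 host
    sudo sysctl -w vm.overcommit_memory=2

    # in sphinx.conf: workers = thread_pool   (with workers = threads it starts fine)

    # restart searchd; it exits at once with "FATAL: failed to create rotation thread"
    docker restart <searchd-container>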

@sshilko
Author

sshilko commented Oct 31, 2018

There is nothing running inside the container except searchd and cron to reindex.
I suspect max_children=4096 might be related to the 4107 PIDs running.

@tomatolog
Contributor

For thread_pool workers, max_children sets the number of worker threads the daemon creates in the thread pool. You should set a value appropriate for your box, or comment out that option so the daemon auto-calculates it.
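
For example, a sketch of the searchd section along those lines (the value of 8 is only illustrative, not from the original config):

    ...
    workers = thread_pool

    # either leave max_children out so the daemon auto-calculates the pool size,
    # or set a value that fits the box, e.g.:
    # max_children = 8
    ...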

@sshilko
Author

sshilko commented Oct 31, 2018

Main Docker container (4k PIDs, not connected to directly by clients):

...
    max_children        = 4096
...

    index stream_test
    {
        type = distributed
        local = stream_test_0
        local = stream_test_1
        local = stream_test_2
        local = stream_test_3

    }


Relay Docker container (all clients connect via this instance; it's actually on the same 127.0.0.1 as the clients).
172.30.0.217 is the main Docker container:

...
    max_children        = 4096
...
    index stream_test
    {
        type = distributed
        agent_connect_timeout = 1500
        agent_query_timeout   = 3000
        agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_0
        agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_1
        agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_2
        agent_persistent = 172.30.0.217:9312|172.30.0.217:9212:stream_test_3

    }

@sshilko
Author

sshilko commented Oct 31, 2018

The PIDs issue is related to max_children.

max_children=4096
docker stats PIDS

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
89a37248b868        2.19%               138.4MiB / 1.947GiB   6.94%               0B / 0B             0B / 177kB          4105

max_children=1096
docker stats PIDS

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
34ecf5fac5bf        4.38%               40.46MiB / 1.947GiB   2.03%               0B / 0B             0B / 26.1kB         1107

But when running ps aux, I don't see those PIDs (neither on the host nor in the container).

@sshilko
Author

sshilko commented Oct 31, 2018

Docker version 17.06.2-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3


/ $ ps aux
PID   USER     TIME   COMMAND
    1 dalek      0:03 /usr/bin/searchd --nodetach --index stream_test --index stream_test_0 --index stream_test_1 --index stream_test_2 --index st
   15 root       0:00 crond -l 6 -d 6 -b -L /dev/console
 1292 dalek      0:00 /bin/sh
 1296 dalek      0:00 ps aux
/ $ exit


CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
34ecf5fac5bf        0.01%               40.63MiB / 1.947GiB   2.04%               0B / 0B             0B / 165kB          1105
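
One likely explanation (an assumption, not something verified in this thread): the PIDS column of docker stats counts kernel tasks, which includes threads, while ps aux only lists processes, so the thread pool is invisible there. Something like this should show the per-thread count inside the container:

    # total threads of the searchd process (PID 1 in this container)
    grep Threads /proc/1/status

    # or, if a procps-style ps is available (busybox ps may not support -L):
    ps -eLf | grep searchd | wc -l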

@sshilko
Author

sshilko commented Oct 31, 2018

@tomatolog not specifying max_children fixes the memory issue on start and the PIDs count.
Maybe this should be mentioned in the documentation; on start it probably tries to create all max_children threads (and runs out of memory with vm.overcommit_memory=2).

@sshilko
Author

sshilko commented Oct 31, 2018

The documentation could be fixed, or maybe there is an issue in the code, in that it tries to create max_children threads on start.

@tomatolog
Contributor

For workers = thread_pool the daemon always creates N threads for the thread pool. That N value is calculated as:

  • Max(3 * sphCpuThreadsCount() / 2, 2) in case no max_children option is set
  • the value of the max_children searchd option otherwise

That is why, when you explicitly set that option, the thread pool got started with 4096 threads (max_children = 4096), i.e. the daemon creates such a thread pool for you.
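
In other words (a rough sketch only, assuming sphCpuThreadsCount() reports the same value as nproc):

    # default thread_pool size when max_children is not set
    cpu=$(nproc)
    pool=$(( 3 * cpu / 2 ))
    [ "$pool" -lt 2 ] && pool=2
    echo "thread_pool size: $pool"   # e.g. 2 on a 1-vCPU box, vs 4096 when max_children = 4096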

That is why I see no issue here: just comment out that option, or set a value that fits your box.

You might also create a PR with a doc fix for that section https://manticoresearch.gitlab.io/dev/conf_options_reference/searchd_program_configuration_options.html#max-children to make this option clearer.

@sshilko
Author

sshilko commented Oct 31, 2018

Created PR #127

Thanks, closing.
