Manticore crashes with signal 11 when inserting data #1891
Comments
I can't reproduce a crash in 6.2.12 with the loading script based on your schema, even with a higher concurrency of 8.
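The loading script and its output did not survive in this thread. Purely as an illustration of the kind of load being discussed (not the original script), a minimal Python loader over Manticore's MySQL protocol might look like the sketch below; the table name, column names, batch size, and worker count are all hypothetical.

```python
# Sketch of a batch loader (NOT the original script): inserts multi-row
# batches into a Manticore RT table over the MySQL protocol on port 9306.
# Table/column names, batch size and worker count are hypothetical.
import random
import threading

import pymysql  # pip install pymysql

TABLE = "test_rt"        # hypothetical table: (id bigint, title text, gid int)
BATCH_SIZE = 500
BATCHES_PER_WORKER = 200
WORKERS = 8              # "higher concurrency of 8"


def escape(s: str) -> str:
    return s.replace("'", "\\'")


def worker(worker_id: int) -> None:
    conn = pymysql.connect(host="127.0.0.1", port=9306, user="", autocommit=True)
    cur = conn.cursor()
    for batch in range(BATCHES_PER_WORKER):
        rows = []
        for i in range(BATCH_SIZE):
            doc_id = worker_id * 10_000_000 + batch * BATCH_SIZE + i + 1
            title = escape(f"sample document {doc_id}")
            gid = random.randint(1, 1000)
            rows.append(f"({doc_id}, '{title}', {gid})")
        # one multi-row INSERT per batch, 500 documents at a time
        cur.execute(f"INSERT INTO {TABLE} (id, title, gid) VALUES " + ",".join(rows))
    conn.close()


threads = [threading.Thread(target=worker, args=(w,)) for w in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```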
There was a somewhat similar issue #1458 (comment) which has already been fixed. I suggest you check whether the crash persists in the latest dev version (https://mnt.cr/dev/nightly). You can also try modifying the script so it reproduces the crash, which would let us reproduce it on our end and fix it.
@sanikolaev It is happening very randomly: I can load 100GB of data without any problems at all, or have problems on a 15GB dataset at a random point. It is just like your comment on that issue:
I'm currently testing
I don't know if it helps, but I managed to get into a similar state with two vector fields and the columnar engine in the same table.
@MirosOwners Do you mean in the same table as in the script here: #1891 (comment)?
It stopped crashing on this index with the dev version, but started to crash on another index. Edit: as far as I can see, it just slowly overflows all available memory. Any ideas how to debug this? I've tried adding
@yharahuts So it doesn't crash in the dev version, but an OOM occurs instead?
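For reference, one way to watch whether memory keeps growing during loading is to poll the table's status counters from the SQL interface. A small sketch follows; the table name is hypothetical and the exact counter names returned can differ between Manticore versions.

```python
# Sketch: poll Manticore's per-table counters during loading to see whether
# the RAM chunk keeps growing. Assumes the MySQL protocol on port 9306 and a
# hypothetical table name; counter names (ram_bytes, disk_bytes, ...) may vary.
import time

import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="", autocommit=True)
cur = conn.cursor()

for _ in range(60):                         # sample once a minute for an hour
    cur.execute("SHOW TABLE test_rt STATUS")
    status = dict(cur.fetchall())           # rows come back as (name, value) pairs
    print(time.strftime("%H:%M:%S"),
          "ram_bytes:", status.get("ram_bytes"),
          "disk_bytes:", status.get("disk_bytes"))
    time.sleep(60)
```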
I've been dealing with this sporadic problem for weeks now. I finally found this thread and after review, one comment stood out:
Although I cannot attest to precisely when the problem started happening, I do know that I somewhat recently (weeks ago) added vector fields (3 of them, dim = 384, hnsw, l2) to my table. I cannot recall having this problem before doing so, although I am not positive.

The problem occurs sporadically during large-throughput indexing, whether it's via the bulk API or not: the server crashes with signal 11, and upon restart, while replaying the binlog, it crashes again (a perpetual crash loop from there). I immediately implemented a sleep mechanism between the batches, which may help, but it does not solve the issue. It does not occur when indexing small amounts, and I can use these 3 vector fields at search time. I seem to be able to reset it to a stable state with a rm -rf table_name and then re-indexing smaller amounts or with sleeps added between calls, but it appears I can re-introduce the bug by just throwing data at the instance long enough (I re-index in my developer environment frequently and sometimes test large datasets).

My initial test, which I don't see as confirmation, but which inspired me to post this message with this detail: I just removed the three vector fields from being initialized in the RT
If you have a crash loop, it would be better to upload your index files along with the binlog so we can reproduce that crash loop here and fix the issue.
If maybe I just got lucky and the problem happens again, I can think about how to safely send anonymized data - I need to assess the feasibility for the rest of the fields, and there is still data that I do not wish to send. Based on the details of my report, I just wanted to clarify again, although I can't make sense of it, that it points directly to the issue not being the data itself: I am using the entire dataset fine after removing the 3 vector fields during table creation. Finally, in either case, the sporadic bug happens during indexing without even passing values for those vector fields. I will test further, maybe run an even larger job, and report back only if the problem begins again. If the bad state does not happen, since the only change made was not adding these fields, I can immediately pass over the create table statement, although it's just three 384-dim, hnsw, l2 vector fields (and about ~25 other fields). @tomatolog or another: if you have a suspicion that the root cause is in fact the data - and I am thus missing something crucial about the code / architecture itself - I'd appreciate you clarifying that as well.
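The actual CREATE TABLE statement was not posted in the thread. A minimal sketch of a table with three such vector fields, using hypothetical field names and omitting the other ~25 fields, might look like this (the KNN attribute syntax follows Manticore's float_vector documentation):

```python
# Sketch of a table definition with three 384-dim HNSW / L2 vector attributes,
# mirroring the description above. Table and field names are hypothetical and
# the ~25 other regular fields are omitted for brevity.
import pymysql

DDL = """
CREATE TABLE docs_rt (
    title TEXT,
    body TEXT,
    vec_title   float_vector knn_type='hnsw' knn_dims='384' hnsw_similarity='L2',
    vec_body    float_vector knn_type='hnsw' knn_dims='384' hnsw_similarity='L2',
    vec_summary float_vector knn_type='hnsw' knn_dims='384' hnsw_similarity='L2'
)
"""

conn = pymysql.connect(host="127.0.0.1", port=9306, user="", autocommit=True)
conn.cursor().execute(DDL)
```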
All crashes that could be reproduced locally have already been fixed or are on their way into the master branch. Our team does not have any clue what could cause such a crash, as we have neither data that reproduces it nor a crash log from searchd.log where the crash stack could be checked.
Completely understand. Is there a single viable theory about the addition of vector fields to the table? I can try to go back and forth to gain further certainty that this is it. Other than that, I am not sure I can help at the moment with submitting data; I will think on that more.
It would be much easier if we had at least one of the following:
Without these, it may be hard to resolve the issue. It's best to have all of them, as this significantly improves the chances of finding a solution.
Describe the bug
Manticore crashes when inserting a large amount of data into an index.
Manticore is running in rt mode with the following tables:
Data is inserted via (rather large?) batches of 500 records per single insert, and the whole dataset contains about 100m rows split into 1-3 indexes. The crash happens randomly: data can be inserted without any problems at all, or it can crash at ~1-2% at a random line.
Since it is a prod instance, I'm afraid I cannot give you our datasets or test multiple (older?) Manticore versions.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It should not crash.
Describe the environment:
searchd -v: Manticore 6.2.12 dc5144d@230822
Docker image: manticoresearch/manticore:6.2.12
Messages from log files:
docker logs shows the following:
After that it restarts with:
Additional context
While writing this issue, I came up with two ideas:
decrease batch size to maybe 50 rows per single insert (didn't help). I'll try both options, but since the crash happens randomly, I can't guarantee whether it will work or not.
Any advice is greatly appreciated.
indextool --check on both indexes returns: