
Upgrading manticore from 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 results in significant cpu usage increase #1563

Open
digirave opened this issue Nov 5, 2023 · 16 comments

digirave commented Nov 5, 2023

Describe the bug
Upgrading manticore from 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 results in significant cpu usage increase

To Reproduce
Steps to reproduce the behavior:

  1. Run many ngram search queries against plain indexes with 20 shards on the same local server
  2. The conf file has (see the sketch below):
    max_threads_per_query = 20
    threads = 200
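
For reference, a minimal sketch of the kind of configuration involved, not the reporter's actual config: the threading directives live in the searchd section, and each of the 20 shards is a plain index with ngram settings (names, paths and the character range below are placeholders):

searchd {
    listen = 127.0.0.1:9306:mysql
    threads = 200
    max_threads_per_query = 20
}

index shard_01 {
    type = plain
    # a matching "source src_shard_01 { ... }" block is assumed
    source = src_shard_01
    path = /var/lib/manticore/shard_01
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
}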

Expected behavior
Similar performance is expected

Describe the environment:
Linux [REDACTED] 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa|grep manti
manticore-repo-0.1.0-1.noarch
manticore-buddy-1.0.18_23080408.2befdbe-1.noarch
manticore-common-6.2.12_230822.dc5144d35-1.el8.noarch
manticore-extra-0.7.8_23082209.810d7d3-1.noarch
manticore-columnar-lib-2.2.4_230822.5aec342-1.el8.x86_64
manticore-server-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-server-6.0.2_230210.89c7a5139-1.el8.x86_64
manticore-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-icudata-5.0.3_221123.d2d9e5e56-1.el7.centos.noarch
manticore-executor-0.7.8_23082210.810d7d3-1.x86_64
manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64
manticore-server-core-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-devel-6.2.12_230822.dc5144d35-1.el8.noarch
manticore-tools-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-backup-1.0.8_23080408.f7638f9-1.noarch

Messages from log files:

Additional context
We have two servers with identical hardware and an identical Manticore setup.
They build their indexes independently of each other but are configured the same.
They receive the same volume of queries.

ServerA was upgraded from manticore-server 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 during the black lines.
ServerB runs manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64 for the whole graph.

You can clearly see that ServerA's CPU load increased.
Query response times seem similar overall.
One difference between the servers is that ServerA was rebooted, while ServerB has the same RPMs for everything other than Manticore but was not rebooted.

serverA: (CPU load graph screenshot)

serverB: (CPU load graph screenshot)

@sanikolaev
Collaborator

Hi

Query response times seem similar overall.

Can you show the graph please?

@digirave
Author

digirave commented Nov 5, 2023

Query response times seem similar overall.

Can you show the graph please?

Unfortunately I don't have graphs for it. However, internally every search prints timing output for admins in the console, and overall that output seems similar right now.

@digirave
Author

digirave commented Nov 5, 2023

Regarding
"One difference between the servers is that ServerA was rebooted, while ServerB has the same RPMs for everything other than Manticore but was not rebooted."

I just rebooted ServerB to rule out kernel version differences causing the CPU usage difference.
After the reboot, the CPU usage difference between Manticore versions still appears to be the same.
This suggests the version difference is causing the issue, especially since both the before/after comparison on the same server and the comparison between the two servers show increased CPU usage with the newer version.

Of note, both servers are now running the same BIOS version, mainboard firmware version, hardware and network specs, OS, and kernel version.

@sanikolaev
Collaborator

Here's what you can do for a fair comparison:

  • stop both searchds
  • purge OS cache
  • start them
  • wait some time
  • inspect the load and query stats. For the latter you can calculate avg, 95p and 99p response times from the query logs.

If after that you can still see that the new version consumes more CPU not improving performance, it will be indeed a problem we'd like to look further into.
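
A sketch of that procedure, assuming the RPM-installed systemd unit is named manticore and the query log is at /var/log/manticore/query.log (adjust both to the real setup):

# on each server, stop the daemon
systemctl stop manticore

# purge the OS page cache (as root) so both servers start cold
sync
echo 3 > /proc/sys/vm/drop_caches

# start again and let the load stabilize for a while
systemctl start manticore

# then compare CPU load and compute avg/95p/99p response times from the query log
grep -oP 'real \K[0-9.]+(?= wall)' /var/log/manticore/query.log | sort -n > times.txt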

@sanikolaev sanikolaev added the "waiting" label ("Waiting for the original poster (in most cases) or something else") Nov 6, 2023
@digirave
Author

digirave commented Nov 29, 2023

Conclusion: I highly suspect 6.2.12_230822.dc5144d35-1.el8 is slower.

I have been running the servers with the same setup, hardware, BIOS/firmware versions, BIOS settings, OS version, and the same weighted load balancing for weeks now.
As mentioned, both had the same server load before the upgrade.
The upgraded server keeps showing increased server load.

To see whether the increased server load at least bought better query performance, I used the following commands to take the "real" times of the last 1,000,000 queries in the query log and average them.

tail -n 1000000 manticore_query.log > last_1000000_lines.log
grep -oP 'real \K[0-9.]+(?= wall)' last_1000000_lines.log | awk '{ total += $1; count++ } END { print total/count }'

ServerA was upgraded from manticore-server 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 during the black lines.
ServerB runs manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64 for the whole graph.

ServerA result: 0.141038
ServerB result: 0.13018

Of note:
These servers are using ngram searches on plain indexes.
These servers rebuild the whole plain index every 3 days (on different dates for each server), but this has never caused any visible performance difference in the past.
We have been running these servers for probably 2 years, with stable performance until the recent 6.2.12_230822.dc5144d35-1.el8 upgrade.

@sanikolaev
Collaborator

Thanks for providing the additional info @digirave

ServerA result: 0.141038
ServerB result: 0.13018

How do the loads differ? If you have older query logs from when both servers were running 6.0.4, can you run the same script to compare the response times for that period?

@digirave
Author

digirave commented Nov 29, 2023

Conclusion: before the upgrade to 6.2.12_230822.dc5144d35-1.el8, query times didn't differ much between the servers, and the non-upgraded server was actually marginally faster in the past. However, the upgrade to 6.2.12_230822.dc5144d35-1.el8 seems to have caused a significant increase in server load with slightly worse query performance.

How do the loads differ?

They shouldn't differ long term. We load balance the same amount to both servers.

If you have older query logs when the both servers were running 6.0.4, can you run the same script to compare the response times for that period?

First I checked that one million queries had been analyzed before further testing:
cat last_1000000_lines.log | wc -l
output: 1000000
confirming the statistics were computed over 1 million queries.

Commands run for further testing:
gunzip -k manticore_query.log-20231029.gz
tail -n 1000000 manticore_query.log-20231029 > last_1000000_lines_old.log
grep -oP 'real \K[0-9.]+(?= wall)' last_1000000_lines_old.log | awk '{ total += $1; count++ } END { print total/count }'

Checked that one million queries were analyzed:
cat last_1000000_lines_old.log | wc -l
output: 1000000

ServerA: 0.13555
ServerB: 0.136529

@sanikolaev
Collaborator

These servers are using ngram searches on plain indexes

Can you provide your table schema, a dozen sample documents and your typical queries? I'd like us to benchmark that on our side.
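
For instance (a hedged sketch, assuming the MySQL-protocol listener on 127.0.0.1:9306 and a hypothetical shard named shard_01; adjust to the real names):

# table schema of one shard
mysql -h127.0.0.1 -P9306 -e "DESCRIBE shard_01"

# a dozen sample documents
mysql -h127.0.0.1 -P9306 -e "SELECT * FROM shard_01 LIMIT 12"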

@digirave
Author

Can you provide your table schema, a dozen sample documents and your typical queries? I'd like us to benchmark that on our side.

This would take some time and resources.
If possible, I will get back to you by email, thank you.

@sanikolaev
Collaborator

If possible, I will get back to you by email, thank you

Sure. [email protected]

@digirave
Author

Can you provide your table schema, a dozen sample documents and your typical queries? I'd like us to benchmark that on our side.

At this moment this would take a lot of resources, so we wish to avoid it as much as possible.

I am not 100% sure, but I think I have found a workaround:

pseudo_sharding = 0

seems to decrease server load considerably, while query performance seems comparable to the other servers (too little difference to tell right now).
The server load drops massively (a noticeable, immediate huge drop).
We plan to run with pseudo_sharding = 0 for a while.

Additionally, in preliminary tests on prior versions, pseudo_sharding = 0 doesn't seem to have such a noticeable effect (so little that I can't tell right now).
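
For anyone else trying this, a minimal sketch of where the directive goes; restarting searchd afterwards is the safe way to apply it (the systemd unit name manticore is an assumption):

searchd {
    # other searchd settings unchanged
    pseudo_sharding = 0
}

systemctl restart manticore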

@digirave
Author

My current conclusion is that, in our case, pseudo_sharding has a significant negative performance impact in version 6.2.12_230822.dc5144d35-1.el8 which is not visible in 6.0.4_230314.1a3a4ea82-1.el8.

The workaround seems to be
pseudo_sharding = 0

@digirave
Author

digirave commented Nov 30, 2023

Here are screenshots.
ServerA runs version 6.2.12_230822.dc5144d35-1.el8; the black line marks when the server was switched to pseudo_sharding = 0.

ServerA: (CPU load graph screenshot)

ServerB runs 6.0.4_230314.1a3a4ea82-1.el8 for the whole graph, first with pseudo_sharding at its default and then after changing to pseudo_sharding = 0, with no significant difference (probably a small one, but hard to tell).

ServerB: (CPU load graph screenshot)

@sanikolaev
Collaborator

@digirave BTW regarding

awk '{ total += $1; count++ } END { print total/count }'

could you also check the 50p, 95p and 99p metrics?

@digirave
Author

digirave commented Nov 30, 2023

@sanikolaev

could you also check the 50p, 95p and 99p metrics?

Statistics for default pseudo_sharding

Server A version 6.2.12_230822.dc5144d35-1.el8 with default pseudo_sharding
Average: 0.146567
50th Percentile (Median): 0.104
95th Percentile: 0.413
99th Percentile: 0.746

Server B version 6.0.4_230314.1a3a4ea82-1.el8 with default pseudo_sharding
Average: 0.133866
50th Percentile (Median): 0.096
95th Percentile: 0.373
99th Percentile: 0.673


Statistics for pseudo_sharding = 0

Server A version 6.2.12_230822.dc5144d35-1.el8 with pseudo_sharding = 0
Average: 0.132263
50th Percentile (Median): 0.094
95th Percentile: 0.363
99th Percentile: 0.631

Server B version 6.0.4_230314.1a3a4ea82-1.el8 with pseudo_sharding = 0
Average: 0.131565
50th Percentile (Median): 0.097
95th Percentile: 0.356
99th Percentile: 0.584

Commands used (for the default pseudo_sharding setting, 1 million queries were analyzed; for pseudo_sharding = 0, 10,000 queries):

tail -n 1000000 manticore_query.log-20231119 > last_1000000_lines.log
grep -oP 'real \K[0-9.]+(?= wall)' last_1000000_lines.log | sort -n | awk '
{
    times[NR] = $1;
    total += $1;
}
END {
    average = total / NR;
    median = (NR % 2) ? times[(NR + 1) / 2] : (times[NR / 2] + times[NR / 2 + 1]) / 2;
    pct95 = times[int(NR * 0.95)];
    pct99 = times[int(NR * 0.99)];

    print "Average:", average;
    print "50th Percentile (Median):", median;
    print "95th Percentile:", pct95;
    print "99th Percentile:", pct99;
}'

@sanikolaev sanikolaev added the "est::size_S" label and removed the "waiting" label ("Waiting for the original poster (in most cases) or something else") Jan 19, 2024
@sanikolaev
Collaborator

As discussed with @glookka, even though we say in the docs that when you do manual sharding you can disable pseudo sharding, in this case (21 shards, presumably 25 CPU cores, max_threads_per_query=20) not disabling it should not have caused the overload and should not have negatively impacted performance. Since @digirave can't provide the data to reproduce it, @glookka will try to reproduce it with synthetic tests.
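
Not the team's actual benchmark, but a rough sketch of the kind of synthetic comparison that could expose the effect, assuming the MySQL-protocol listener on 9306 and a file query.sql holding one typical ngram SELECT (both hypothetical):

# in one terminal: run the same query 1000 times and record total elapsed time
/usr/bin/time -f 'elapsed: %e s' bash -c 'for i in $(seq 1 1000); do
    mysql -h127.0.0.1 -P9306 < query.sql > /dev/null
done'

# in another terminal: watch searchd CPU usage while the loop runs
pidstat -C searchd 5

Repeating the run with the default pseudo_sharding and with pseudo_sharding = 0 (restarting searchd in between) and comparing the elapsed time against searchd's CPU usage should show whether pseudo sharding burns extra CPU without improving latency.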
