
Upgrading manticore from 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 results in significant cpu usage increase #1563

Open
digirave opened this issue Nov 5, 2023 · 16 comments

digirave commented Nov 5, 2023

Describe the bug
Upgrading manticore from 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 results in significant cpu usage increase

To Reproduce
Steps to reproduce the behavior:

  1. Run many ngram search queries against plain indexes with 20 shards on the same local server
  2. The conf file has (see the sketch below):
    max_threads_per_query = 20
    threads = 200
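
For reference, a minimal sketch of the kind of configuration involved, not the reporter's actual config: the threading directives live in the searchd section, and each of the 20 shards is a plain index with ngram settings (names, paths and the character range below are placeholders):

searchd {
    listen = 127.0.0.1:9306:mysql
    threads = 200
    max_threads_per_query = 20
}

index shard_01 {
    type = plain
    # a matching "source src_shard_01 { ... }" block is assumed
    source = src_shard_01
    path = /var/lib/manticore/shard_01
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
}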

Expected behavior
Similar performance is expected

Describe the environment:
Linux [REDACTED] 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa|grep manti
manticore-repo-0.1.0-1.noarch
manticore-buddy-1.0.18_23080408.2befdbe-1.noarch
manticore-common-6.2.12_230822.dc5144d35-1.el8.noarch
manticore-extra-0.7.8_23082209.810d7d3-1.noarch
manticore-columnar-lib-2.2.4_230822.5aec342-1.el8.x86_64
manticore-server-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-server-6.0.2_230210.89c7a5139-1.el8.x86_64
manticore-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-icudata-5.0.3_221123.d2d9e5e56-1.el7.centos.noarch
manticore-executor-0.7.8_23082210.810d7d3-1.x86_64
manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64
manticore-server-core-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-devel-6.2.12_230822.dc5144d35-1.el8.noarch
manticore-tools-6.2.12_230822.dc5144d35-1.el8.x86_64
manticore-backup-1.0.8_23080408.f7638f9-1.noarch

Messages from log files:

Additional context
We have two servers with identical hardware and an identical Manticore setup.
They build their indexes independently of each other but are configured the same.
They receive the same volume of queries.

ServerA was upgraded from manticore-server 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 during the black lines.
ServerB runs manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64 for the whole graph.

You can clearly see that ServerA's CPU load increased.
Query response times seem similar overall.
One difference between the servers is that ServerA was rebooted, while ServerB has the same RPMs for everything other than Manticore but was not rebooted.

serverA: (CPU load graph screenshot)

serverB: (CPU load graph screenshot)

@sanikolaev
Collaborator

Hi

Query response times seem similar overall.

Can you show the graph please?

@digirave
Author

digirave commented Nov 5, 2023

Query response times seem similar overall.

Can you show the graph please?

Unfortunately I don't have graphs for it. However, internally every search prints timing output for admins in the console, and overall that output seems similar right now.

@digirave
Author

digirave commented Nov 5, 2023

Regarding
"One difference between the servers is that ServerA was rebooted, while ServerB has the same RPMs for everything other than Manticore but was not rebooted."

I just rebooted ServerB to rule out kernel version differences causing the CPU usage difference.
After the reboot, the CPU usage difference between Manticore versions still appears to be the same.
This suggests the version difference is causing the issue, especially since both the before/after comparison on the same server and the comparison between the two servers show increased CPU usage with the newer version.

Of note, both servers are now running the same BIOS version, mainboard firmware version, hardware and network specs, OS, and kernel version.

@sanikolaev
Collaborator

Here's what you can do for a fair comparison:

  • stop both searchds
  • purge OS cache
  • start them
  • wait some time
  • inspect the load and query stats. For the latter you can calculate avg, 95p and 99p response times from the query logs.

If after that you can still see that the new version consumes more CPU not improving performance, it will be indeed a problem we'd like to look further into.
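
A sketch of that procedure, assuming the RPM-installed systemd unit is named manticore and the query log is at /var/log/manticore/query.log (adjust both to the real setup):

# on each server, stop the daemon
systemctl stop manticore

# purge the OS page cache (as root) so both servers start cold
sync
echo 3 > /proc/sys/vm/drop_caches

# start again and let the load stabilize for a while
systemctl start manticore

# then compare CPU load and compute avg/95p/99p response times from the query log
grep -oP 'real \K[0-9.]+(?= wall)' /var/log/manticore/query.log | sort -n > times.txt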

@sanikolaev sanikolaev added the "waiting" label ("Waiting for the original poster (in most cases) or something else") Nov 6, 2023
@digirave
Author

digirave commented Nov 29, 2023

Conclusion: I highly suspect 6.2.12_230822.dc5144d35-1.el8 is slower.

I have been running the servers with the same setup, hardware, BIOS/firmware versions, BIOS settings, OS version, and the same weighted load balancing for weeks now.
As mentioned, both had the same server load before the upgrade.
The upgraded server keeps showing increased server load.

To see whether the increased server load at least bought better query performance, I used the following commands to take the "real" times of the last 1,000,000 queries in the query log and average them.

tail -n 1000000 manticore_query.log > last_1000000_lines.log
grep -oP 'real \K[0-9.]+(?= wall)' last_1000000_lines.log | awk '{ total += $1; count++ } END { print total/count }'

ServerA was upgraded from manticore-server 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 during the black lines.
ServerB runs manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64 for the whole graph.

ServerA result: 0.141038
ServerB result: 0.13018

Of note:
These servers are using ngram searches on plain indexes.
These servers rebuild the whole plain index every 3 days (on different dates for each server), but this has never caused any visible performance difference in the past.
We have been running these servers for probably 2 years, with stable performance until the recent 6.2.12_230822.dc5144d35-1.el8 upgrade.

@sanikolaev
Collaborator

Thanks for providing the additional info @digirave

ServerA result: 0.141038
ServerB result: 0.13018

How do the loads differ? If you have older query logs from when both servers were running 6.0.4, can you run the same script to compare the response times for that period?

@digirave
Author

digirave commented Nov 29, 2023

Conclusion: before the upgrade to 6.2.12_230822.dc5144d35-1.el8, query times didn't differ much between the servers, and the non-upgraded server was actually marginally faster in the past. However, the upgrade to 6.2.12_230822.dc5144d35-1.el8 seems to have caused a significant increase in server load with slightly worse query performance.

How do the loads differ?

They shouldn't differ long term. We load balance the same amount to both servers.

If you have older query logs when the both servers were running 6.0.4, can you run the same script to compare the response times for that period?

First I checked that one million queries had been analyzed before further testing:
cat last_1000000_lines.log | wc -l
output: 1000000
confirming the statistics were computed over 1 million queries.

Commands run for further testing:
gunzip -k manticore_query.log-20231029.gz
tail -n 1000000 manticore_query.log-20231029 > last_1000000_lines_old.log
grep -oP 'real \K[0-9.]+(?= wall)' last_1000000_lines_old.log | awk '{ total += $1; count++ } END { print total/count }'

Checked that one million queries were analyzed:
cat last_1000000_lines_old.log | wc -l
output: 1000000

ServerA: 0.13555
ServerB: 0.136529

@sanikolaev
Collaborator

These servers are using ngram searches on plain indexes

Can you provide your table schema, a dozen sample documents and your typical queries? I'd like us to benchmark that on our side.
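
For instance (a hedged sketch, assuming the MySQL-protocol listener on 127.0.0.1:9306 and a hypothetical shard named shard_01; adjust to the real names):

# table schema of one shard
mysql -h127.0.0.1 -P9306 -e "DESCRIBE shard_01"

# a dozen sample documents
mysql -h127.0.0.1 -P9306 -e "SELECT * FROM shard_01 LIMIT 12"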

@digirave
Author

Can you provide your table schema, a dozen sample documents and your typical queries? I'd like us to benchmark that on our side.

This would take some time and resources.
If possible, I will get back to you by email, thank you.

@sanikolaev
Collaborator

If possible, I will get back to you by email, thank you

Sure. [email protected]

@digirave
Author

Can you provide your table schema, a dozen sample documents and your typical queries? I'd like us to benchmark that on our side.

At this moment this would take a lot of resources, so we wish to avoid it as much as possible.

I am not 100% sure, but I think I have found a workaround:

pseudo_sharding = 0

seems to decrease server load considerably, while query performance seems comparable to the other servers (too little difference to tell right now).
The server load drops massively (a noticeable, immediate huge drop).
We plan to run with pseudo_sharding = 0 for a while.

Additionally, in preliminary tests on prior versions, pseudo_sharding = 0 doesn't seem to have such a noticeable effect (so little that I can't tell right now).
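
For anyone else trying this, a minimal sketch of where the directive goes; restarting searchd afterwards is the safe way to apply it (the systemd unit name manticore is an assumption):

searchd {
    # other searchd settings unchanged
    pseudo_sharding = 0
}

systemctl restart manticore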

@digirave
Author

My current conclusion is that, in our case, pseudo_sharding has a significant negative performance impact in version 6.2.12_230822.dc5144d35-1.el8 which is not visible in 6.0.4_230314.1a3a4ea82-1.el8.

The workaround seems to be
pseudo_sharding = 0

@digirave
Author

digirave commented Nov 30, 2023

Here are screenshots.
ServerA runs version 6.2.12_230822.dc5144d35-1.el8; the black line marks when the server was switched to pseudo_sharding = 0.

ServerA: (CPU load graph screenshot)

ServerB runs 6.0.4_230314.1a3a4ea82-1.el8 for the whole graph, first with pseudo_sharding at its default and then after changing to pseudo_sharding = 0, with no significant difference (probably a small one, but hard to tell).

ServerB: (CPU load graph screenshot)

@sanikolaev
Collaborator

@digirave BTW regarding

awk '{ total += $1; count++ } END { print total/count }'

could you also check the 50p, 95p and 99p metrics?

@digirave
Author

digirave commented Nov 30, 2023

@sanikolaev

could you also check the 50p, 95p and 99p metrics?

Statistics for default pseudo_sharding

Server A version 6.2.12_230822.dc5144d35-1.el8 with default pseudo_sharding
Average: 0.146567
50th Percentile (Median): 0.104
95th Percentile: 0.413
99th Percentile: 0.746

Server B version 6.0.4_230314.1a3a4ea82-1.el8 with default pseudo_sharding
Average: 0.133866
50th Percentile (Median): 0.096
95th Percentile: 0.373
99th Percentile: 0.673


Statistics for pseudo_sharding = 0

Server A version 6.2.12_230822.dc5144d35-1.el8 with pseudo_sharding = 0
Average: 0.132263
50th Percentile (Median): 0.094
95th Percentile: 0.363
99th Percentile: 0.631

Server B version 6.0.4_230314.1a3a4ea82-1.el8 with pseudo_sharding = 0
Average: 0.131565
50th Percentile (Median): 0.097
95th Percentile: 0.356
99th Percentile: 0.584

Commands used (for the default pseudo_sharding setting, 1 million queries were analyzed; for pseudo_sharding = 0, 10,000 queries):

tail -n 1000000 manticore_query.log-20231119 > last_1000000_lines.log
grep -oP 'real \K[0-9.]+(?= wall)' last_1000000_lines.log | sort -n | awk '
{
    times[NR] = $1;
    total += $1;
}
END {
    average = total / NR;
    median = (NR % 2) ? times[(NR + 1) / 2] : (times[NR / 2] + times[NR / 2 + 1]) / 2;
    pct95 = times[int(NR * 0.95)];
    pct99 = times[int(NR * 0.99)];

    print "Average:", average;
    print "50th Percentile (Median):", median;
    print "95th Percentile:", pct95;
    print "99th Percentile:", pct99;
}'

@sanikolaev sanikolaev added the "est::size_S" label and removed the "waiting" label ("Waiting for the original poster (in most cases) or something else") Jan 19, 2024
@sanikolaev
Collaborator

As discussed with @glookka, even though we say in the docs that when you do manual sharding you can disable pseudo sharding, in this case (21 shards, presumably 25 CPU cores, max_threads_per_query=20) not disabling it should not have caused the overload and should not have negatively impacted performance. Since @digirave can't provide the data to reproduce it, @glookka will try to reproduce it with synthetic tests.
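
Not the team's actual benchmark, but a rough sketch of the kind of synthetic comparison that could expose the effect, assuming the MySQL-protocol listener on 9306 and a file query.sql holding one typical ngram SELECT (both hypothetical):

# in one terminal: run the same query 1000 times and record total elapsed time
/usr/bin/time -f 'elapsed: %e s' bash -c 'for i in $(seq 1 1000); do
    mysql -h127.0.0.1 -P9306 < query.sql > /dev/null
done'

# in another terminal: watch searchd CPU usage while the loop runs
pidstat -C searchd 5

Repeating the run with the default pseudo_sharding and with pseudo_sharding = 0 (restarting searchd in between) and comparing the elapsed time against searchd's CPU usage should show whether pseudo sharding burns extra CPU without improving latency.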
