High CPU utilization due to large number of DELETE statements #1412

Closed
crirusu opened this issue Aug 26, 2024 · 6 comments · Fixed by #1443


crirusu commented Aug 26, 2024

Symptoms

When using the PostgreSQL transport, the database used as the messaging infrastructure is put under heavy load by an excessive number of unnecessary DELETE statements, i.e. statements that result in 0 rows being affected.

[image: database metrics showing the DELETE call rate far exceeding rows affected]

Who's affected

All PostgreSQL transport users.

Root cause

The problem is caused by the peek statement overestimating the number of messages ready to be received. The query calculates the difference between the "oldest" and "newest" messages (based on a sequence number), including messages that are already being received. As a result, a single receive transaction that takes longer than the others can cause a significant overestimation of the number of available messages.
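For illustration, a minimal sketch of a peek that estimates queue depth this way (the table name `endpoint_queue` and column `seq` are assumptions, not the actual transport query):

```sql
-- Hypothetical sketch of a sequence-based peek. MAX(seq) - MIN(seq) spans
-- every message between the oldest and newest rows, including rows currently
-- locked by in-flight receive transactions, so one long-running receive
-- inflates the estimate.
SELECT COALESCE(MAX(seq) - MIN(seq) + 1, 0) AS estimated_message_count
FROM endpoint_queue;
```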

Patched version

Original content

Describe the bug

Description

We deployed 9 .NET Core services using NSB with PostgreSQL, backed by a database on an Amazon Aurora PostgreSQL Serverless v2 cluster with 48 ACUs.
According to Amazon, 1 ACU = 2 GB RAM:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html

Even with little data to process, CPU usage did not drop, and we were unable to keep up with processing.
I am attaching a picture showing a very high number of commits and rollbacks.
[image: Aurora PostgreSQL metrics showing a very high number of commits and rollbacks]

With the database for the queues on an MSSQL Server with 8 cores and 32 GB of RAM, we can run all services with less than 30% CPU.

Versions

Please list the version of the relevant packages or applications in which the bug exists.

Steps to reproduce

Observe how many requests are actually executed against the database.
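One way to check (a sketch assuming the `pg_stat_statements` extension is enabled on the database) is to compare the number of DELETE calls with the rows they actually affected:

```sql
-- Requires the pg_stat_statements extension. A healthy receive loop should
-- show calls roughly equal to rows; a large gap indicates many DELETE
-- statements that affected 0 rows.
SELECT query,
       calls,
       rows,
       round(rows::numeric / NULLIF(calls, 0), 2) AS rows_per_call
FROM pg_stat_statements
WHERE query ILIKE 'DELETE%'
ORDER BY calls DESC
LIMIT 20;
```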

@crirusu crirusu added the Bug label Aug 26, 2024
@SzymonPobiega
Member

Hi

For some reason I can't open the picture in full resolution, but as far as I can tell, you get 10-20 times more DELETE calls than rows affected. Is that correct?

Could you also describe the specifics of your deployment, i.e. how many instances there are of each logical endpoint? To keep that data confidential, please send an email to the support address.


crirusu commented Aug 29, 2024

Hi,

We have 9 services but some have 2 NSB endpoints

Service 1 - 1 NSB endpoint with NServiceBusMaxConcurrency = 20
Service 2 - 1 NSB endpoint with NServiceBusMaxConcurrency = 32 and 1 NSB endpoint with NServiceBusMaxConcurrency = 18
Service 3 - 1 NSB endpoint with NServiceBusMaxConcurrency = 16
Service 4 - 1 NSB endpoint with NServiceBusMaxConcurrency = 16 and 1 NSB endpoint with NServiceBusMaxConcurrency = 25
Service 5 - 1 NSB endpoint with NServiceBusMaxConcurrency = 18
Service 6 - 1 NSB endpoint with NServiceBusMaxConcurrency = 1 and 1 NSB endpoint with NServiceBusMaxConcurrency = 1
Service 7 - 1 NSB endpoint with NServiceBusMaxConcurrency = 16
Service 8 - 1 NSB endpoint with NServiceBusMaxConcurrency = 16
Service 9 - 1 NSB endpoint with NServiceBusMaxConcurrency = 20

I can send the picture again via email.
The problem is that there are too many commits and rollbacks.
We have now reverted to an MSSQL Server instance, and in terms of the number of transactions / database activity it is nowhere near what we saw in PostgreSQL.

@SzymonPobiega
Member

Yes, the number of commits is off, but so is the ratio of calls to rows affected. In the ideal scenario, when queues always have some messages, the number of DELETE calls should be equal to the number of rows affected.

I am especially concerned about the Sub_ExportData endpoint. Is it one of the scaled-out endpoints (services 2, 4, or 6)?

Does the rows/s value of 50-80 match the expected number of messages per second these endpoints process?


crirusu commented Sep 4, 2024

Hi,
I don't think we process that much. This is the picture from the MSSQL Server where we are running the same endpoints as we did in PostgreSQL.
[image: MSSQL Server metrics for the same endpoints]

@tmasternak
Member

@crirusu it seems that the excessive number of DELETE statements is due to overestimating the size of the input queue (the number of rows in the input table), which in turn is caused by the SQL statement not skipping rows that are locked (held by other receive transactions).

We are working on the fix in #1443.
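For illustration only, a minimal sketch of the skip-locked approach (assumed table/column names; not necessarily the exact shape of the final fix):

```sql
-- Count only rows that are not locked by other receive transactions.
-- "endpoint_queue" and "seq" are hypothetical names; FOR UPDATE SKIP LOCKED
-- makes the row-locking clause skip rows another transaction already holds.
SELECT COUNT(*) AS receivable_count
FROM (
    SELECT seq
    FROM endpoint_queue
    FOR UPDATE SKIP LOCKED
) AS unlocked_rows;
```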

@tmasternak tmasternak changed the title High CPU using NSB with Postgres High CPU utilization due to large number of DELETE statements Oct 16, 2024
@tmasternak tmasternak added this to the 8.1.5 milestone Oct 16, 2024
@tmasternak
Member

@crirusu FYI, the fix for PostgreSQL is out in version 8.1.5 of the package.
